Unifying flow, stereo and depth estimation

Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, Andreas Geiger

Research output: Contribution to journal › Article › Research › peer-review

72 Citations (Scopus)

Abstract

We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images. Unlike previous specialized architectures for each specific task, we formulate all three tasks as a unified dense correspondence matching problem, which can be solved with a single model by directly comparing feature similarities. Such a formulation calls for discriminative feature representations, which we achieve using a Transformer, in particular the cross-attention mechanism. We demonstrate that cross-attention enables integration of knowledge from another image via cross-view interactions, which greatly improves the quality of the extracted features. Our unified model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks. We outperform RAFT with our unified model on the challenging Sintel dataset, and our final model that uses a few additional task-specific refinement steps outperforms or compares favorably to recent state-of-the-art methods on 10 popular flow, stereo and depth datasets, while being simpler and more efficient in terms of model design and inference speed.
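The core idea, posing flow, stereo and depth as one dense correspondence problem solved by comparing feature similarities, can be illustrated with a minimal NumPy sketch. This is not the paper's actual model (which uses Transformer features with cross-attention); `dense_match` is a hypothetical helper showing only the matching step: a global correlation between two feature maps followed by a softmax and soft-argmax over pixel coordinates, from which 2D flow or 1D disparity can be read off.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_match(feat1, feat2, temperature=1.0):
    """Match every pixel of feat1 to feat2 by feature similarity.

    feat1, feat2: (H, W, C) feature maps (e.g. from a shared backbone).
    Returns, for each pixel of image 1, the expected matched (x, y)
    coordinate in image 2 via softmax-weighted averaging (soft-argmax).
    """
    H, W, C = feat1.shape
    f1 = feat1.reshape(H * W, C)
    f2 = feat2.reshape(H * W, C)
    # global correlation: similarity between all pixel pairs
    corr = f1 @ f2.T / (C ** 0.5)                 # (H*W, H*W)
    prob = softmax(corr / temperature, axis=-1)   # matching distribution
    # pixel coordinate grid of image 2
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)
    coords = prob @ grid                          # expected match locations
    return coords.reshape(H, W, 2)

# Matching an image to itself with distinctive (one-hot) features
# recovers the identity grid, i.e. zero flow.
H, W = 2, 2
feat = np.zeros((H, W, H * W))
for i in range(H):
    for j in range(W):
        feat[i, j, i * W + j] = 1.0
coords = dense_match(feat, feat, temperature=0.01)
```

Optical flow would then be `coords` minus the identity pixel grid; rectified stereo restricts the same matching to the horizontal scanline, which is what makes a single shared model possible.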

Original language: English
Pages (from-to): 13941-13958
Number of pages: 18
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 45
Issue number: 11
Publication status: Published - 1 Nov 2023

Keywords

  • Costs
  • Cross-attention
  • Dense correspondence
  • Depth
  • Estimation
  • Optical flow
  • Solid modeling
  • Stereo
  • Task analysis
  • Three-dimensional displays
  • Transformers
