MonST3R

A Simple Approach for Estimating Geometry in the Presence of Motion (ICLR 2025)

MonST3R - A Simple Approach for Estimating Geometry in the Presence of Motion

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, Ming-Hsuan Yang

paper :
https://arxiv.org/abs/2410.03825
project website :
https://monst3r-project.github.io/

Contribution

static scene에 사용됐던 DUSt3R를 dynamic scene에 확장한 버전!

geometry-first approach that directly estimates per-timestep geometry (pointmap) of dynamic scene
- 이전까지의 논문들은 [1], [2], [3], [4], [5], [6], [7], [8], [9], [10] 처럼
  depth, optical flow, trajectory estimation을 사용하는 subtasks로 쪼갠 뒤
  global optimization 또는 multi-stage pipeline 등으로 합치는
  complex system을 쓰는데,
  이는 보통 느리고, 다루기 힘들고, prone-to-error at each step
- 이전까지의 논문들은 motion과 geometry를 함께 사용하여 dynamic scene을 다뤘는데,
  motion, depth label, camera pose 정보가 있는 GT dynamic video data는 거의 없다
  (그래서 다른 model(prior)를 쓰는데, 이는 부정확성이 쌓일 수 있음)
- 대신 본 논문은
  limited data로 only DUSt3R의 decoder and head만 fine-tuning하여 (small-scale fine-tuning)
  explicit motion representation 없이
  only geometry (pointmap)를 directly 예측하는 pipeline 제시!
- each timestep마다 DUSt3R 방식으로 pointmap (geometry) 예측한 뒤
  같은 camera coordinate frame (global pointmap)에 대해 3D align
- downstream tasks :
  예측한 pointmap (geometry) 를 바탕으로
  feed-forward 4D reconstruction 뿐만 아니라
  video depth estimation, camera pose estimation, video segmentation 등
  여러 downstream video-specific tasks에 적용
Final Loss :
\(\hat X = \text{argmin}_{X, P_{W}, \sigma} L_{align} (X, \sigma, P_{W}) + w_{smooth} L_{smooth} (X) + w_{flow} L_{flow} (X)\)
- 세 가지 loss :
  3D alignment loss \(L_{align}\), camera trajectory smoothness loss \(L_{smooth}\), flow projection loss \(L_{flow}\)
- learnable param. :
  global pointmap \(X\), global pointmap으로의 3D alignment transformation \(P_{W}\), scale factor \(\sigma\) 를 업데이트하는데,
  얘네는 본질적으로 re-parameterization에 의해 depthmap \(\hat D\), extrinsic \(\hat P\), intrinsic \(\hat K\) 로 구성되어 있음
  즉, MonST3R는 결국 jointly optimize video depthmap and camera pose (extrinsic, intrinsic)

DUSt3R :
DUSt3R를 바로 dynamic scene에 적용할 경우 두 가지 한계 발생
- 문제 1) (static scene인 것처럼) fg object에 align하여 bg가 misaligned
  DUSt3R는 static scene으로만 학습됐기 때문에
  dynamic scene의 pointmaps를 알맞게 align하지 못하여
  moving fg object가 가만히 있는 것처럼 align되고
  static bg element는 misaligned
- 문제 2) fg object의 geometry(depth)를 잘 예측하지 못하여 fb object를 bg에 둠
- 해결)
  domain mismatch이므로 다시 train!
  본 논문은 limited data를 최대한 사용하여 small-scale fine-tuning하는 training strategy 제시

motion mask :
DUSt3R는 static scene으로 훈련되었기 때문에 dynamic scene에 적용하기 위해
GT motion mask를 사용할 수도 있다
- inference할 때
  image의 dynamic region은 black pixels로 대체하고
  corresponding tokens는 mask tokens로 대체하여
  dynamic objects를 masking out 할 수도 있는데,
  black pixels와 mask tokens는 out-of-distribution w.r.t training 이므로
  pose estimation 결과가 안 좋아짐
- 본 논문은 그렇게 무작정 dynamic region을 mask out 하지 않고 이 문제 해결!

Architecture

Method

Main Idea

DUSt3R의 아이디어를 그대로 가져오고,
DUSt3R의 각 output pointmap \(X^{t} \in R^{W \times H \times 3}\) 이 time 정보 \(t\) 를 가지고 있음

Training Dataset

real-world dynamic scene은 보통 GT camera pose를 가지고 있지 않으므로
SfM 등 sensor measurement 또는 post-processing을 통해 추정하는데
이는 부정확할 수 있고 costly하므로
본 논문은 GT camera pose, depth 정보를 알 수 있는 synthetic datasets를
dynamic fine-tuning을 위한 training dataset으로 사용

Training Dataset for Dynamic Fine-Tuning :
PointOdyssey는 dynamic objects 많아서 많이 사용하고
TartanAir는 static scene이라서 적게 사용하고
Waymo는 specialized domain이라서 적게 사용
- 3 synthetic datasets :
  - PointOdyssey (Zheng et al.)
  - TartanAir (Wang et al.)
  - Spring (Mehl et al.)
- 1 real-world dataset :
  - Waymo (Sun et al.) with LiDAR

Training Strategy

dataset이 small-scale이므로
data efficiency를 극대화시키기 위해
다양한 training techniques 사용

Training Strategies :
- 전략 1)
  encoder는 freeze한 뒤
  network의 decoder와 prediction head만 fine-tune
  (encoder(CroCo)의 geometric knowledge는 유지)
- 전략 2)
  each video마다 temporal stride 1~9 만큼 떨어진 two frames를 sampling하여 input pair로 사용하는데,
  stride가 클수록 sampling prob.도 linearly 큼
  \(\rightarrow\)
  서로 더 멀리 떨어진 frame pair, 즉 large motion에 more weights 부여
- 전략 3)
  Field-of-View augmentation (center crop with various image scales) 사용하여
  다양한 camera intrinsics에도 일반화 가능하도록!
  (training videos에는 해당 variation이 흔하지 않음)

Dynamic Global Point Clouds and Camera Pose

frame 수가 많기 때문에
pairwise pointmap 들로부터 직접 하나의 dynamic global point cloud를 추출하는 건 어렵.
지금부터 pairwise model을 이용해서
dynamic global pcd \(\hat X\) 와 camera pose \(\hat K, \hat P = [\hat R | \hat T]\) 를 동시에 optimize하는 방법을 소개하겠다

Video Graph :
- DUSt3R의 경우
  global alignment를 위해 모든 frame pair에 대해 connectivity graph를 만드는데,
  dynamic scene video에 대해 이렇게 graph 만드려면 too expensive
- 계산량 줄이기 위해
  전체 frames에 대해 graph를 만드는 게 아니라
  video의 sliding temporal window 내에 있는 frames에 대해 국소적인 graph 만듦
- sliding temporal window 내에 있는 모든 each frame pair
  \((t, t') \in W^{t} = {(a, b) | a, b \in [t, \ldots, t + w], a \neq b}\) 에 대해
  (\(w\) : temporal window size)
  MonST3R로 pairwise pointmap을 구하고,
  off-the-shelf method로 optical flow 구함
- runtime 줄이기 위해 strided sampling 적용

Dynamic Global Point Cloud and Pose Optimization :
- goal :
  모든 pairwise pointmaps를 같은 global coordinate frame에 모아서 world-coordinate pointmap \(X^{t} \in R^{H \times W \times 3}\) 만들기
- re-parameterization :
  - notation :
    \(P^{t} = [R^{t} | T^{t}]\) : extrinsic camera pose
    \(K^{t}\) : intrinsic
    \(D^{t}\) : per-frame depthmap
  - global pointmap \(X^{t}\) :
    depthmap, intrinsic, extrinsic 을 이용하여 parameterize global pointmap
    \(X_{i,j}^{t} = P^{t^{-1}} h (K^{t^{-1}} [i D_{i,j}^{t} ; j D_{i,j}^{t} ; D_{i,j}^{t}])\)
    - intrinsic \(K^{t^{-1}}\) :
      depthmap 정보를 2D pixel-coordinate \((i, j)\) 에서 3D camera-coordinate으로 변환한 뒤
    - homogeneous mapping \(h(\cdot)\) :
      homogeneous-coordinate으로 변환한 뒤
      (\(R^{t}, T^{t}\) 를 하나의 행렬로 표현 가능하도록 하여 just 행렬 곱셈을 통해 변환 가능)
    - extrinsic \(P^{t^{-1}}\) :
      world-coordinate으로 변환
- loss :
  dynamic scene video이기 때문에 DUSt3R의 3D alignment loss 뿐만 아니라 두 가지 video-specific loss 추가
  - Loss 1) DUSt3R의 3D alignment loss :
    - goal :
      각 pairwise pointmap \(X^{t; t \leftarrow t'}\), \(X^{t'; t \leftarrow t'}\) 을
      world-coordinate의 global pointmap \(X^{t}\) 에 align시키는
      single rigid transformation \(P^{t;e}\)
      (\(X^{t; t \leftarrow t'}\) 와 \(X^{t'; t \leftarrow t'}\) 는 둘 다 이미 같은 camera-coordinate (\(t\) 의 frame) 에 align되어 있으므로
      \(X^{t; t \leftarrow t'}\) 을 global pointmap에 align시키는 \(P\) 와
      \(X^{t'; t \leftarrow t'}\) 을 global pointmap에 align시키는 \(P\) 는 같음)
    - how :
      \(L_{align}(X, \sigma, P_{W}) = \sum_{W^{i} \in W} \sum_{e \in W} \sum_{t \in e} \| C^{t; e} \cdot (X^{t} - \sigma^{e} P^{t;e} X^{t;e}) \|_{1}\)
      - notation :
        \(W^{i} \in W\) : each sliding temporal window
        \(e = (t, t') \in W^{i}\) : each frame pair within the window
        \(t \in e\) : each frame
        \(\sigma^{e}\) : frame 크기 차이를 보정하는 per-(frame pair) scale factor
        \(P_{W}\) : sliding temporal window 내의 여러 frame pair에 대한 3D alignment transformation 집합
  - Loss 2) camera trajectory smoothness loss :
    - goal :
      nearby timestep에 대해 \(R, T\) 가 크게 변하지 않도록 하여
      시간에 따라 camera motion (extrinsic) 이 smooth하도록
    - how :
      \(L_{smooth}(X) = \sum_{t=0}^{N} (\| R^{t^{T}} R^{t+1} - I \|_{f} + \| R^{t^{T}} (T^{t+1} - T^{t}) \|_{2})\)
  - Loss 3) flow projection loss :
    - goal :
      confident static region에 대해
      global pointmaps \(X^{t}\) 와 camera poses \(K^{t}, R^{t}, T^{T}\), 즉 camera motion만으로 계산한 optical flow가
      off-the-shelf method가 내놓은 optical flow와 consistent하도록
    - how :
      \(L_{flow}(X) = \sum_{W^{i} \in W} \sum_{t \rightarrow t' \in W^{i}} \| S^{global; t \rightarrow t'} \cdot (F_{cam}^{global; t \rightarrow t'} - F_{est}^{t \rightarrow t'}) \|_{1}\)
      - \(S^{global; t \rightarrow t'}\) : static region에 대해 loss term 걸어줌
        (static mask 구하는 방법 : 아래의 Confident Static Regions 섹션에서 설명!)
        (\(X^{t}\) 가 learnable 하므로 학습 중에 계속 updated)
      - \(F_{cam}^{global; t \rightarrow t'}\) : global pointmap \(X^{t}\) 에 camera motion (intrinsic, extrinsic)을 적용하여 계산한 optical flow field
      - \(F_{est}^{t \rightarrow t'}\) : off-the-shelf method가 내놓은 optical flow field
Final Loss :
\(\hat X = \text{argmin}_{X, P_{W}, \sigma} L_{align} (X, \sigma, P_{W}) + w_{smooth} L_{smooth} (X) + w_{flow} L_{flow} (X)\)
- learnable param. :
  global pointmap \(X\), global pointmap으로의 3D alignment transformation \(P_{W}\), scale factor \(\sigma\) 를 업데이트하는데,
  얘네는 본질적으로 re-parameterization에 의해 depthmap \(\hat D\), extrinsic \(\hat P\), intrinsic \(\hat K\) 로 구성되어 있음
  즉, MonST3R는 결국 jointly optimize video depthmap and camera pose (extrinsic, intrinsic)
- \(w_{smooth} = 0.01, w_{flow} = 0.01\)
  (\(L_{flow} \lt 20\) 일 때, 즉 camera poses를 roughly align한 뒤에 \(L_{flow}\) 활성화함)
  (\(L_{flow} \gt 50\) 일 때, 즉 초기에 motion mask is updated)

Downstream Applications

Intrinsics and Relative Pose Estimation

Intrinsic Pose Estimation :
time \(t\) 에서의 pointmap \(X^{t}\) 을 이용해서
2D image와 3D pointmap이 align되도록 하는
focal length \(f^{t}\) 를 추정함으로써
camera intrinsic \(K^{t}\) 추정
Relative Pose Estimation :
DUSt3R와 달리 dynamic objects는
epipolar matrix 또는 Procrustes alignment를 위한 가정(???)들에 위배
\(\rightarrow\)
대신 주어진 3D point와 corresponding 2D point를 바탕으로 추정하는 PnP algorithm(???)과
random sampling 방식의 RANSAC algorithm(???) 사용
(dynamic scene이어도 대부분의 pixels는 static할 것이므로
randomly-sampled points는 static elements에 더 가중치를 두기 때문에
relative pose는 inliers(static)로 robustly estimate 가능)

Confident Static Regions

Static Mask :
단순하게 두 optical flow field가 일치하는 (차이가 적은) 영역을 Static Region으로 간주!
static mask \(S^{t \rightarrow t'} = [\alpha \gt \| F_{cam}^{t \rightarrow t'} - F_{est}^{t \rightarrow t'} \|_{1}]\)
(이 confident static mask를 나중에 global pose optimization에도 사용할 거임!)
- \(F_{cam}^{t \rightarrow t'}\) :
  camera motion만으로 optical flow 추정
  - Step 1)
    frame pair \(I^{t}\), \(I^{t'}\) 로부터
    pointmaps \(X^{t; t \leftarrow t'}\), \(X^{t'; t \leftarrow t'}\) 와 pointmaps \(X^{t; t' \leftarrow t}\), \(X^{t'; t' \leftarrow t}\) 추정
  - Step 2)
    위에서 언급한 방법으로 Intrinsics \(K^{t}, K^{t'}\) 와 Relative Pose \(R^{t \rightarrow t'}, T^{t \rightarrow t'}\) 추정
  - Step 3)
    only camera motion \(t \rightarrow t'\) 이용해서
    optical flow field \(F_{cam}^{t \rightarrow t'}\) 추정
    \(F_{cam}^{t \rightarrow t'} = \pi (D^{t; t \leftarrow t'} K^{t'} R^{t \rightarrow t'} K^{t^{-1}} \hat x + K^{t'} T^{t \rightarrow t'}) - x\)
    - notation :
      \(x\) : pixel-coordinate
      \(\hat x\) : homogeneous coordinate
      \(\pi(\cdot)\) : projection operation \((x,y,z) \rightarrow (\frac{x}{z}, \frac{y}{z})\)
      \(D^{t; t \leftarrow t'}\) : estimated depth from pointmap \(X^{t; t \leftarrow t'}\)
    - Step 3-1)
      intrinsic 이용하여 frame \(t\) 에 대해 2D에서 3D로 backproject
    - Step 3-2)
      camera motion (relative camera pose) \(t \rightarrow t'\) 적용
      (\(R^{t \rightarrow t'} X + T^{t \rightarrow t'}\))
    - Step 3-3)
      intrinsic 이용하여 frame \(t'\) 에 대해 다시 3D에서 2D image coordinate으로 project
- \(F_{est}^{t \rightarrow t'}\) :
  off-the-shelf method [11] 이용해서 optical flow 추정

Video Depth

optimal global pointmap \(\hat X\) 자체가 re-parameterization에 의해
per-frame depthmap \(\hat D\) 로 이루어져 있고,
just \(\hat D\) 자체가 video depth

Experiment

Results

Single-Frame and Video Depth Estimation :
- baseline :
  - video depth method :
    NVDS [12]
    ChronoDepth [13]
    DepthCrafter [14]
  - single-frame depth method :
    Depth-Anything-V2 [15]
    Marigold [16]
    DUSt3R blog
  - joint video depth and pose estimation method :
    CasualSAM [17]
    Robust-CVD [18]
- benchmark dataset :
  - video depth :
    KITTI
    Sintel
    Bonn
  - monocular single-frame depth :
    NYU-v2
- metric : [14], [15] 에서처럼
  Abs Rel : absolute relative error
  \(\sigma \lt 1.25\) : percentage of inlier points ???
  (All methods output scale- and/or shift- invariant depth estimates. For video depth evaluation, we align a single scale and/or shift factor per each sequence, whereas the single-frame evaluation adopts per-frame median scaling, following DUSt3R) ???
- results :
  - video depth :
    MonST3R는 specialized video depth estimation techniques와 유사한 성능
  - single-frame depth :
    DUSt3R 구조를 dynamic scene’s video에 대해 fine-tuning했는데도
    single-frame depth estimation에 대해 여전히 DUSt3R와 유사한 성능

Camera Pose Estimation :
obtain camera trajectories
- baseline :
  - joint video depth and pose estimation method (경쟁자들) :
    CasualSAM [17]
    Robust-CVD [18]
  - learning-based visual odometry method :
    DROID-SLAM (GT intrinsic 필요)
    Particle-SfM (Ours보다 5배 느림)
    DPVO (GT intrinsic 필요)
    LEAP-VO (GT intrinsic 필요)
  - DUSt3R with GT motion mask :
    단순히 dynamic region의 pixel과 token을 mask out
    (Related Works 섹션의 motion mask에서 설명함)
- benchmark dataset :
  Sintel
  TUM-dynamics
  ScanNet
- metric : Particle-SfM, LEAP-VO에서처럼
  Sim(3) Umeyama alignment 적용한 뒤 ???
  Absolute Translation Error (ATE)
  Relative Translation Error (RPE trans)
  Relative Rotation Error (RPE rot)
- results :
  joint depth and pose estimation methods 중에 제일 성능 좋고
  GT intrinsic 필요 없는데도 pose-specific methods와 유사한 성능

Ablation Study

Ablation Study :
- training dataset :
  datasets 섞어 쓰면 camera pose estimation에 도움!
- fine-tuning strategy :
  only decoder and head만 fine-tuning하는 게 다른 training strategies보다 나음!
- loss :
  본 논문에서 언급한 세 가지 loss (\(L_{align}, L_{smooth}, L_{flow}\)) 는
  video depth accuracy를 크게 해치지 않으면서 pose estimation accuracy를 높임!

Limitation

Limitation :
- 이론적으로는 dynamic camera intrinsics를 estimate할 수 있지만,
  사실상 이는 careful hyperparameter tuning과 manual constraints를 필요로 함
- out-of-distribution inputs에 struggle
  e.g. 건물 내부 또는 도심 야외 등으로 훈련한 경우 넓은 공터 같은 새로운 scene에 대해서는 제대로 작동 안함
  - 해결법 :
    training set을 확장하면 MonST3R가 in-the-wild videos에 대해서도 더 robust해질 듯

Question

Q1 :
왜 static region에 대해서만 optical flow가 consistent하도록 하는 flow projection loss 적용함?
A1 :
ddd
Q2 :
Gaussian Marbles에서는 frame끼리 divide-and-conquer로 merge하면서 frame 전후 관계를 trajectory로 연결하여 temporal info.를 이용함.
MonST3R는 각 timestep의 pointmap을 global pointmap으로 align하는데 camera trajectory smoothness loss 말고 3D alignment loss에서 frame 전후 관계, 즉 temporal info.를 이용함?
A2 :
TBD