Feed Forward Bullet Time Reconstruction

Feed Forward Bullet Time Reconstruction of Dynamic Scenes from Monocular Videos (CVPR 2025)

Feed Forward Bullet Time Reconstruction of Dynamic Scenes from Monocular Videos

Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, Jiahui Huang

paper :
https://arxiv.org/abs/2412.03526
project website :
https://research.nvidia.com/labs/toronto-ai/bullet-timer/

Contribution

model :
- motion-aware feed-forward model for real-time recon. and novel-view-synthesis of dynamic scenes
- obtain scalability and generalization by using both static and dynamic scene datasets
  (static and dynamic recon.에 모두 사용 가능)
- Procedure :
  - Step 1) pre-train on large static scene dataset
  - Step 2) video duration or FPS에 구애받지 않고 scale effectively across datasets
  - Step 3) output multi-view volumetric video representation
- recon. a bullet-time scene within 150ms with SOTA performance on a single GPU
  from 12 context frames of \(256 \times 256\) resolution
- bullet (target) timestamp \(t_{b}\) 를 원하는 each timestamp in video로 설정하면
  full video를 recon.할 수 있으므로
  각 frame을 parallel하게 recon. 가능

BulletTimer (Novelty 1.) :
main model
- recon. at arbitrary target (bullet) timestamp and arbitrary novel-view
  by adding bullet-time embedding to all the context (input) frames
  and aggregating pred. from all the context frames
NTE Module (Novelty 2.) :
pre-processing (FPI)
- fast motion에 대응하기 위해
  model에 feed하기 전에
  intermediate (interpolated) frames를 predict
- inference할 때
  arbitrary target (bullet) timestamp에 대해
  recon.할 수 있도록 도움

Introduction

Dynamic scene recon. from monocular video :
still challenging
due to inherently ill-posed (해가 무수히 많음) nature of dynamic recon. from limited observations
Static scene recon. :
- optimization-based (per-scene) :
  NeRF, HyperNeRF
- learning-based (feed-forward) :
  MonoNeRF, GS-LRM

Dynamic scene recon. :
dynamic scene은 complex motion 때문에 ambiguity 존재
이를 해소하는 데 도움될 data prior 필요
- optimization-based (per-scene) :
  - use contraint (data prior) like depth and optical flow [1], [2], [3], [4]
    \(\rightarrow\)
    given data와 위의 data prior 간의 싱크를 맞추는 게 challenging [5], [6]
  - per-scene approach는 time-consuming and thus scale 어렵
- learning-based (feed-forward) :
  - directly predict recon. in feed-forward manner
    so, can learn strong inherent prior directly from data [7], [8], [9], [10], [11]
  - 근데 dynamic scene 모델링하기 복잡하고 4D supervision data 부족해서 적용하는데 한계 있었음
  - 지금 시점 기준 L4GM [11] 이 유일한 feed-forward dynamic recon. model인데,
    synthetic object-centric dataset으로 훈련돼서
    fixed camera view-point와 multi-view supervision을 필요로 한다는 한계와
    real-world scene에 generalize하기 어렵다는 한계가 있었음
Feed-Forward Dynamic scene recon. :
- 본 논문은
  pixel-aligned 3DGS [12] 를 기반으로
  novel BulletTimer and NTE module 제안

Dynamic 3D Representation :
- TBD
Novel-View-Synthesis :
- TBD
Feed-Forward Reconstruction :
- TBD

Overview

notation :
context frames \(I_{c} \subset I\)
camera poses \(P_{c} \subset P\)
context timestamps \(T_{c} \subset T\)
bullet timestamp \(t_{b} \in [\text{min}(T_{c}), \text{max}(T_{c})]\)
recon. at timestamp \(t \notin T\) by NTE module
Architecture :
- Training :
  BTimer와 NTE Module을 별도로 각각 train
  (not end-to-end)
- Inference :
  NTE Module로 pre-process한 뒤
  BTimer 사용

Method

BTimer (Bullet Timer)

Model Design :
- encode (input) :
  \(i\)-th frame \(I_{i} \in I_{c}\) 을 \(8 \times 8\) 짜리 patches로 나눈 뒤
  \(j\)-th patch에 대해
  per-patch input token \(f_{ij} |_{j=1}^{HW / 64} = f_{ij}^{rgb} + f_{ij}^{pose} + f_{i}^{time}\) 만든 뒤
  concatenate input tokens from all context frames
  and feed into Transformer
  - image encoder :
    GS-LRM [12] 에서 영감을 받아,
    ViT model을 backbone으로 사용
  - camera pose encoder :
    camera Plucker embedding [13]
  - time encoder :
    PE (Positional Encoding) 및 linear layer를 거쳐
    \(t_{i}\) 와 \(t_{b}\) 를 각각 \(f_{i}^{ctx}\) 와 \(f_{i}^{bullet}\) 으로 encode한 뒤
    \(f_{i}^{time} = f_{i}^{ctx} + f_{i}^{bullet}\)
    - context (input) timestamp \(t_{i}\) from context (input) frame \(I_{i}\)
    - bullet (target) timestamp \(t_{b}\) that is shared across context (input) frames
- decode (output) :
  transformer의 per-patch output token \(f_{ij}^{out}\) 을
  per-patch 3DGS param. at bullet timestamp \(G_{ij} \in R^{8 \times 8 \times 12}\) 로 regression
  - each Gaussian has 12 param. as color \(c \in R^{3}\), scale \(s \in R^{3}\), rotation unit quaternion \(q \in R^{4}\), opacity \(\sigma \in R\), and ray distance \(\tau \in R\)
  - 3D position is obtained by pixel-aligned unprojection \(\mu = o + \tau d\)
    (\(o\) and \(d\) are obtained from camera pose \(P_{i}\))
Loss :
\(L_{RGB} = L_{MSE} + \lambda L_{LPIPS}\) with \(\lambda = 0.5\)
Timestamp :
context (input) frames와 bullet (target supervision) frame 을 잘 고르는 게 중요
- In-context Supervision :
  - bullet timestamp is randomly selected from context frames
    \(t_{b} \in T_{c}\)
  - model이 context timestamp에 대해 정확히 recon. 가능하도록
- Interpolation Supervision :
  - bullet timestamp lies between two adjacent context frames
    \(t_{b} \notin T_{c}\)
  - model이 dynamic parts를 interpolate할 수 있도록
  - pixel-aligned 3DGS의 inductive bias 때문에
    motion이 복잡하고 빠를 때 intermediate timestamp에 대해 예측 잘 못 함
    \(\rightarrow\)
    먼저 NTE module의 도움을 받아 pre-process한 뒤
    BTimer 사용
  - local minimum 방지 및 view 간 consistency 상승
Inference :
- bullet (target) timestamp \(t_{b}\) 를 원하는 each timestamp in video로 설정하면
  full video를 recon.할 수 있으므로
  각 frame을 parallel하게 recon. 가능
- ???
  For a video longer than the number of training context views \(| I_{c} |\),
  at timestamp \(t\), apart from including this exact timestamp and setting \(t_{b} = t\),
  we uniformly distribute the remaining \(| I_{c} | − 1\) required context frames across the whole duration of the video
  to form the input batch with \(| I_{c} |\) frames

NTE Module (Novel Time Enhancer)

NTE Module Design :
decoder-only LVSM [13] 에서 영감을 받아,
BTimer model과 구조 똑같지만,
I/O가 다름
- input :
  - context frame :
    - context (input) timestamp embedding
      (BTimer model과 달리 bullet timestamp embedding은 안 넣음)
    - camera pose Plucker embedding
    - context (input) frame
  - intermediate frame :
    - bullet (target) timestamp embedding
    - target camera pose Plucker embedding
- output :
  - Transformer의 per-patch output tokens 중 target token을
    unpatchify and project directly to RGB values by linear layer
    \(\rightarrow\)
    RGB frame for any bullet (target) timestamp
    (this RGB frame은 NTE module network의 direct output이고, 3DGS로 rendering한 게 아님!!)
    \(\rightarrow\)
    NTE Module의 output은
    BTimer에서 bullet timestamp의 image로 쓰임
- Implementation :
  - LVSM [13] 에서처럼
    안정적인 훈련을 위해 QK-norm 사용
    (Q와 K의 내적 과정에서 값이 너무 크거나 작으면 gradient explode or vanish 발생할 수 있으므로
    Q와 K를 normalize)
  - target token에 attention할 수 있도록
    masked attention 사용
  - [14]
    에서처럼
    빠른 inference를 위해 KV-Cache 사용
    (training할 때는 전체 input sequence에 대해 K, Q, V를 계산하지만,
    inference할 때는 prev. token에서 계산한 K, V를 cache에 저장한 채 매번 Q만 새로 계산함으로써 input sequence 전체에 대해 K, V를 매번 계산할 필요 없어 계산 비용 감소)
  - NTE Module has negligible overhead on runtime
Loss :
\(L_{RGB} = L_{MSE} + \lambda L_{LPIPS}\) with \(\lambda = 0.5\)
BTimer and NTE Module :
- NTE Module로 직접 RGB image 예측하여
  novel-view-synthesis 할 수 있긴 한데, 그럼 성능 안 좋음
  (Ablation Study에 있음)
- feed-forward transformer (NTE Module)로 FPI pre-process한 뒤
  feed-forward transformer (BTimer)로 data info. 포착하여
  3DGS param. 예측한 뒤
  3DGS rasterization으로 novel-view-synthesis

Curriculum Training at Scale

Generalizability :
- data 다양성이 많을수록 model generalizability가 높아짐
  static dataset은 많이 존재하고
  dynamic dataset은 적게 존재하지만 motion awareness 및 temporal consistency 확보 가능
- 본 논문의 model인 BTimer는
  generalizable to both static and dynamic scenes
  - static scene : equalize all \(t_{b}\)
  - dynamic scene : recon. at arbitrary bullet \(t_{b}\)
  - different domain에서는 different model 필요로 하는
    GS-LRM [12] or MVSplat [8] 과는 다름
Curriculum Training :
- Stage 1) Low-res to High-res Static Pretraining
  - static dataset으로 pre-train
    - both synthetic and real-world
    - 390K training samples
    - normalize datasets to be bounded in \(10^{3}\) cube
    - 종류 :
      - Objaverse
      - RE10K
      - MVImgNet
      - DL3DV
  - no time embedding
    (static scene이니까)
  - data distribution이 복잡하기 때문에
    coarse 세팅 (low-resol.(\(128 \times 128\)) and few-view(\(| I_{c} | = 4\)))에서 시작해서
    점점 fine 세팅 (high-resol.(\(256 \times 256 \rightarrow 512 \times 512\)))으로 train
- Stage 2) Dynamic Scene Co-training
  - dynamic dataset으로 fine-tuning
    - 종류 :
      - Kubric
      - PointOdyssey
      - DynamicReplica
      - Spring
  - 4D dynamic dataset이 부족하기 때문에
    안정적인 훈련을 위해
    static dataset을 함께 사용하여 co-training
  - Internet video로부터 camera pose를 매기는 customized pipeline 구축하여
    real-world data에 대한 robustness 향상
    - 먼저 PANDA-70M dataset에서 random select한 video를 20s 길이의 clips로 자르기
    - SAM으로 video의 dynamic objects를 mask out
    - DROID-SLAM으로 video camera pose를 estimate
    - reprojection error 측정하여 low-quality의 video 및 pose는 필터링
    - 최종적으로 obtain 40K clips with high-quality camera trajectories
- Stage 3) Long-context Window Fine-tuning
  - NTE Module 말고 BTimer model에만 적용
  - context (input) image 수를
    \(| I_{c} | = 4\) 에서 \(| I_{c} | = 12\) 로 늘려서
    long video recon.하는 데 도움

Experiment

Implementation

Implementation :
- Backbone Transformer :
  FlashAttention-3 [15] and FlexAttention [16]
- 3DGS Rasterization :
  gsplat library [17]
- Training Schedule :
  - BTimer :
    totally 4 days on 64 A100 GPUs
    - Stage 1)
      \(128^{2}\) resol. 90K iter. init lr \(4 \times 10^{-4}\)
      \(\rightarrow\)
      \(256^{2}\) resol. 90K iter. init lr \(2 \times 10^{-4}\)
      \(\rightarrow\)
      \(512^{2}\) resol. 50K iter. init lr \(1 \times 10^{-4}\)
    - Stage 2)
      10K iter.
    - Stage 3)
      5K iter.
  - NTE Module :
    - Stage 1)
      \(128^{2}\) resol. 140K iter. init lr \(4 \times 10^{-4}\)
      \(\rightarrow\)
      \(256^{2}\) resol. 60K iter. init lr \(2 \times 10^{-4}\)
      \(\rightarrow\)
      \(512^{2}\) resol. 30K iter. init lr \(1 \times 10^{-4}\)
    - Stage 2)
      20K iter.
- Inference time :
  - BTimer :
    - 20 ms for 4-view \(256^{2}\) recon.
    - 150 ms for 12-view \(256^{2}\) recon.
    - 4.2 s for 12-view \(512 \times 896\) recon.
  - NTE :
    - 0.44 s for 4-view \(512 \times 896\) recon. w/o KV cache

Results

Dynamic Novel-View-Synthesis (Quantitative) :
- DyCheck Benchmark [18] :
  - dataset :
    DyCheck iPhone dataset (7 dynamic scenes by 3 synchronized cameras)
  - baseline :
    TiNeuVox, NSFF, T-NeRF, Nerfies, HyperNeRF, PGDVS, direct depth warp
    - BTimer는 per-scene optimization method에 competitive performance 달성
    - BTimer는 consistent depth estimate 없이도 PGDVS보다 성능 좋음
- NVIDIA Dynamic Scene Benchmark [19] :
  - dataset :
    NVIDIA Dynamic Scene dataset (9 dynamic scenes by 12 forward-facing synchronized cameras)
  - baseline :
    HyperNeRF, DynNeRF, NSFF, RoDynRF, MonoNeRF, 4D-GS, Casual-FVS
    - feed-forward 방식이므로 optimization time 필요 없음

Dynamic Novel-View-Synthesis (Qualitative) :
- test on real-world scene 위해
  DAVIS dataset의 monocular videos 이용하고,
  customized pipeline으로 camera pose estimate해서 사용

Static Novel-View-Synthesis :
- RealEstate10K Benchmark :
  - baseline : pixelSplat, MVSplat, GPNR, GS-LRM
- Tanks & Temples Benchmark :
  from InstantSplat Benchmark
  - baseline : GS-LRM (SOTA)

single dataset(Ours-Static)보다 mixed-dataset(Ours-Full) 사용하는 게 generalization 및 성능 훨씬 좋음

Ablation Study

Ablation 1) Context Frames :
- train할 때 context frames 더 많이 쓰면 3DGS prediction이 progressively 많아지므로 more complete scene recon. 가능
- inference할 때 서로 멀리 떨어진 context frames를 arbitrarily 골라서 커버하는 view 범위 넓힘

Ablation 2) Curriculum Training :
- Stage 1)
  single dataset 말고 multiple dataset 써야
  geometry와 sharp detail 잡는 데 도움
- Stage 2)
  static scene을 섞어서 co-train해야
  geometry 및 rich detail 잡는 데 도움

Ablation 3) Interpolation Supervision :
- temporal and multi-view consistency 챙기는 데 도움
- bullet timestamp가 input frames에 없을 때를 훈련하지 않으면
  white-edge artifacts 생김
  (interpolation loss를 cheat하려고 camera에 너무 가까운 3DGS를 만들기 때문)
Ablation 4) NTE Module :
- motion이 빠르고 복잡할 때 도움
  (ghosting artifacts 해소)
- BTimer 없이
  3D info. 쓰지 않는 NTE Module만으로 novel-view-synthesis 수행하면
  input camera trajectory와 먼 novel-view에 대해서는 잘 recon. 못 함

Conclusion

Limitation :
- geometry :
  - SOTA depth prediction model [20] 만큼 정확하게
    geometry (depth map)을 recover하지는 않음
- memory issue :
  transformer를 사용하다보니 memory 많이 소요
  - need 3 days on 64 A100 GPU (40GB VRAM)
  - up to \(512 \times 904\) spatial resol.
  - up to 12 context frames
- pose :
  - need camera pose param.
  - future work :
    DUSt3R, NoPoSplat처럼 pose-free일 순 없을까?
- non-generative :
  - 본 논문은 feed-forward Transformer 모델이고
    generative model이 아니기 때문에
    cannot generate unseen region
    (unseen view를 예측하는 view extrapolation 불가능)
  - future work :
    generative prior 사용하여 view extrapolation 수행
- novelty :
  - 사실 arbitrary bullet timestamp를 input token에 추가한 뒤
    모든 input frames를 transformer에 때려넣고 원하는 bullet timestamp에서의 frame을 뽑아내는 video interpolation 방식으로 보이고,
    다만 차이점은 각 frame image를 transformer output으로 구하는 게 아니라 각 frame의 3DGS param.를 transformer output으로 구하는 것이고..
    모델 자체의 novelty보다는 implementation을 잘 해서 결과 좋게 낸 것 같다..
  - (static, dynamic) data를 많이 쓰고 stage 별 training을 통해 높은 performance를 달성할 수 있었고
    feed-forward 방식을 통해 빠른 속도를 달성

Question

Q1 :
NTE Module이 마지막에 linear layer로 RGB value를 예측함으로써
pixel-space에서 RGB image at bullet timestamp 를 interpolate하고
이를 BTimer에 사용하는데,
latent-space에서 interpolation 다룬 뒤 BTimer에 넘기면 더 성능 좋아질 수 있지 않을까?
A1 :
TBD
Q2 :
NTE Module이 예측한 pixel-space RGB image가 BTimer의 input으로 들어가는데,
NTE Module output이 부정확하면 drift 연쇄적으로 BTimer의 결과에도 악영향 미칠 거 같아.
refinement, uncertainty(confidence) 등으로 NTE Module output의 부정확성을 감소시켜 성능 높일 수 있을까?
A2 :
TBD
Q3 :
limitation 중에 unseen view는 recon.하지 못한다는 게 있는데 (view extrapolation 불가능)
본 논문이 generalizability를 가진다는 말은
static and dynamic unseen dataset (scene)에 대응할 수 있어서인거지?
A3 :
TB
Q4 :
BTimer와 NTE Module을 각각 별도로 train하므로 not end-to-end인데
end-to-end training할 수는 없을까?
A4 :
TBD
Q5 :
pixelSplat에서는 a pair of images를 transformer의 input으로 넣어 둘의 관계를 파악하여 static scene recon.하는데
dynamic scene recon.에서는 motion 정보를 캡처해야 하기 때문에
transformer의 input으로 두 장이 아니라 여러 장의 image를 넣어주어야 하는거야?
A5 :
TBD
Q6 :
interpolation supervision으로 context (input) frame이 아닌 그 사이의 frame에 대해 rendering할 때 GT는 무엇으로 두나요?
A6 :
context (input) frame의 image와 camera pose를 직접 interpolate하여 사용 ???
Q7 :
다른 논문들을 보면 4DGS처럼 canonical time에 대한 시간에 따른 Gaussian 변화량을 MLP로 학습하거나,
또는 Dynamic Gaussian Marbles처럼 prev. frame의 GS가 next frame의 GS에 미치는 영향을 학습하기 위해 global adjustment해서 gaussian trajectory를 학습함으로써
GS끼리 정보를 주고받습니다.
본 논문에서는 모든 input frames를 BTimer에 parallel하게 넣어준 뒤 bullet (target) timestamp마다 3DGS param.를 따로 뽑아내는데
그럼 3DGS끼리는 정보를 공유하지 않는 건가요?
A7 :
네, 일단 BTimer 이 논문에서는 모든 input frames를 BTimer에 parallel하게 때려넣은 뒤 self-attention에 의존해서 t를 포함한 frames 간의 관계를 학습하는 것 같습니다.
Q8 :
어차피 3DGS끼리 정보를 공유하지 않는 거면 굳이 3DGS를 사용한 이유가 있나요?
A8 :
novel-view-synthesis task에서 novel camera pose에 대한 image를 뽑아내려면 3D info.를 이용해야 recon.이 잘 될 것이기 때문에 3DGS를 이용합니다.
논문에서 언급되어 있듯이 NTE Module만을 이용해서 from 2D to 2D로 novel-view-synthesis task를 수행하면 quality가 좋지 않았다고 합니다.
Q9 :
camera pose의 영향을 많이 받을 것 같아요. 만약에 input frame 3에서 보였던 물체가 frame 밖을 벗어나거나 occlusion 때문에 input frame 4에서 안 보이게 되었을 때에도 잘 recon.하려면 prev. frame의 3D info. 정보를 결합해서 반영하는 식이어야 할 것 같은데, 각 bullet timestamp의 3D info.끼리 어떻게 relate되는지에 대한 내용이 없으니까 이와 같은 상황에 잘 대응할 수 있는지 궁금합니다.
A9 :
TBD

Feed Forward Bullet Time Reconstruction

Feed Forward Bullet Time Reconstruction of Dynamic Scenes from Monocular Videos

Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, Jiahui Huang

Contribution

Introduction

Related Works

Overview

Method

BTimer (Bullet Timer)

NTE Module (Novel Time Enhancer)

Curriculum Training at Scale

Experiment

Implementation

Results

Ablation Study

Conclusion

Question