Difix3D+

Improving 3D Reconstructions with Single-Step Diffusion Models (CVPR 2025)

Difix3D+ - Improving 3D Reconstructions with Single-Step Diffusion Models (CVPR 2025)

Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling

paper :
https://arxiv.org/abs/2503.01774
project website :
https://research.nvidia.com/labs/toronto-ai/difix3d/

핵심 :
nearly real-time single-step 2D diffusion model을
3D artifacts removing task에 맞게 fine-tune한 뒤,
3D model에 distill하여 progressively update하거나
post-processing으로 씀!

Contribution

Difix3D+ 설명 :
- Stage 1)
  sparse view로 학습된 못난 3D model로 rendering한 novel-view의 artifacts를 제거하도록
  single-step 2D diffusion model (Difix)을 minimal fine-tuning
- Stage 2)
  fine-tuned Difix를 이용하여
  3D model로 distill하거나 post-processing
  - Difix 역할 1) reconstruction at training phase :
    fine-tuned Difix를 적용하여 clean pseudo-training views (training set)을 iteratively augment함으로써
    fine-tuned Difix의 prior를 3D model로 distill
  - Difix 역할 2) neural enhancer at inference phase :
    single-step 2D diffusion model은 inference speed 빠르므로
    남은 residual artifacts를 제거하기 위해 improved recon. output에 직접 Difix 적용
    (near real-time post-processing)
Difix3D+ Contribution :
- single-step diffusion model :
  ADD (Adversarial Diffusion Distillation)으로 학습된 single-step diffusion model은
  single-step만 U-Net을 query하므로 inference speed 빠름
  - ADD (Adversarial Diffusion Distillation) Link Link :
    SDS loss + GAN loss
  - DMD (Distribution Matching Distillation) Link
- minimal fine-tuning :
  그래서 fine-tuning하는 데 single GPU로 only a few hours만 필요
- general model :
  모든 3D models (NeRF, 3DGS 등)에 사용 가능한 single model
- metrics :
  - artifacts 제거(fix)하므로 3D consistency 유지한 채 FID score 2배 이상 및 PSNR 1dB 이상 향상
  - single-step diffusion model을 사용하므로 multi-step standard diffusion model보다 10배 이상 빠름

Introduction

현재 NVS methods의 한계 :
per-scene optimization framework
- input이 sparse하거나
  input camera poses로부터 먼 extreme novel view를 rendering하거나
  varying lighting conditions 또는 imperfect camera poses 상황에서
  spurious(가짜) geometry 또는 missing regions 등 artifacts 생김
- underlying geometry를 제대로 반영하지 못한 채 (incorrect shape)
  inherent smoothness에만 의존하는 3D representation으로도
  training images’ radiance에 overfitting되는
  shape-radiance ambiguity 문제
large 2D generative model :
- 2D diffusion model prior :
  large internet-scale data를 학습하여 real-world images의 distribution을 잘 이해하고 있으므로 diffusion priors는 여러 분야에 generalize 가능
  - inpainting [1] [2] [3]
  - outpainting [4] [5] [6]
- 2D diffusion model prior to 3D :
  - multi-step으로 U-Net을 query [7] [8] [9] [10] :
    object-centric scenes를 optimize하고, more expansive camera trajectories를 가진 larger env.로 scale하는 데 사용
    But,, time-consuming!
  - single-step만 U-Net을 query [11] [12] [13] [14] [15] [16] :
    inference speed 빠르고,
    minimal fine-tuning만으로도 extreme novel-view에서도 NeRF/3DGS rendering의 artifacts를 “fix”할 수 있음!

3D recon. 개선 :
imperfect noisy input data에 대응하기 위해
- optimize camera poses [17] [18] [19] [20] [21] [22]
- lighting variations 고려 [23] [24] [25]
- transient occlusions 완화 [26]
- 위 방법들의 한계 :
  완전히 artifacts를 해결하진 못함
Priors for NVS :
under-observed (잘 보지 못한) 영역들을 잘 recon.하지 못하는 문제를 해결하기 위해
- Geometric priors :
  noise에 민감하고, dense input일 때만 미미한 개선
  - by regularization term [27] [28] [29]
  - by pre-trained models which provide depth GT [30] [31] [32] [33] and normal GT [34]
- 여러 scenes’ data로 feed-forward neural network 훈련 :
  이웃한 reference views의 정보를 aggregate하여 reference views 근처에서는 잘 수행하지만, [35] [36] [37] [38] [39]
  rendering 분포가 inherently multi-mode를 가지는 ambiguous regions에서는 잘 못 함
Generative Priors for NVS :
- GAN :
  NeRF 개선하기 위해 per-scene GAN 훈련 [40]
- Diffusion :
  - diffusion model이 직접 novel view를 generate by minimal fine-tuning [41] [42] [43] [44] :
    - 단점 : 3D model을 사용하지 않으므로 generative nature leads to multi-view geometric inconsistency across different frames/poses, especially in under-observed and noisy regions
  - diffusion model이 3D model의 optimization을 guide [45] [46] [47] [48] [49]
    - 단점 : multi-step으로 U-Net을 query (denoise)해야 해서 훈련 많이 느림
  - diffusion model로 training image set을 augment하여 3D model을 fine-tuning [11] [50] [51] :
    (위의 두 방법을 합친 느낌?!)
    - 본 논문 : 어설픈 3D model의 output에 대해 2D diffusion model (U-Net)을 먼저 fine-tuning한 뒤
      diffusion model로 training image set을 augment하여 diffusion prior를 3D model로 distill (fine-tuning)하거나
      diffusion model로 post-processing

Overall Pipeline

Architecture :
- Stage 1)
  single-step 2D diffusion model (Difix)을 minimal fine-tuning
- Stage 2)
  - Step 1)
    Difix로 clean pseudo-training views (training set) 얻음 (augment!)
    이 때, novel camera pose는 reference view부터 target view까지의 경로를 따라 pose interpolation으로 얻음
  - Step 2)
    cleaned novel view를 이용하여 diffusion prior를 3D model로 distill
    - Step 1, 2를 반복하여 progressively update 3D representation
      (recon.하는 공간 크기를 키우고 diffusion model의 strong conditioning을 보장하기 위해)
  - Step 3)
    final updated 3D representation으로 rendering한 output을
    Difix로 real-time post-processing

Difix - from a Pretrained Diffusion Model to a 3D Artifact Fixer

based on SD-Turbo [14] :
SD-Turbo는 single-step diffusion model이라서 image-to-image translation task [52] 를 effectively 수행 가능

Fine Tuning

위의 Overall Pipeline에서 Stage 1)에 해당되는 내용!

Fine Tuning :
single-step 2D diffusion model인 SD-Turbo [14] 가 3D Artifact Fixer 역할을 할 수 있도록
Pix2pix-Turbo [52] 와 유사한 방식으로 fine-tune
- I/O :
  - input : clean reference view \(I_{ref}\) and degraded rendered target view \(\tilde I\)
  - output : clean reference view \(I_{ref}\) and clean target view \(\hat I\)
    (training 및 inference에는 clean reference view \(I_{ref}\) 만 사용)
- architecture :
  frozen VAE encoder - U-Net - reference mixing layer (self-attention layer) - LoRA fine-tuned decoder
  - frozen VAE encoder :
    reference view \(I_{ref}\) 와 degraded target novel view \(\tilde I\) 를 latent space로 encode한 뒤 concat
    as \(\epsilon (\tilde I, I_{ref}) = \boldsymbol z \in R^{(B \ast V) \times H \times W \times C}\)
    where \(V\) : the number of views (reference views and target views)
    where \(C\) : the number of latent channels
  - U-Net
  - reference-view conditioning by reference mixing layer (self-attention layer) :
    \(\boldsymbol z \in R^{B \times (V \ast H \ast W) \times C}\) 로 reshape한 뒤
    \(V \ast H \ast W\) dimension (dim=1)에 대해 self-attention layer를 적용하여
    reference view 간의 cross-view consistency를 포착
    (특히 rendered target novel view의 퀄리티가 좋지 않을 때
    함께 input으로 넣어주는 clean reference view로부터 objects, color, texture 등 key info.를 잘 포착할 수 있음!)
  - LoRA fine-tuned decoder
- 디테일 :
  lower noise level (e.g. \(\tau = 200\) instead of \(\tau = 1000\)) 부여
  s.t. diffusion model generates less-hallucinated results (덜 상상하며 생성)
- 아이디어 :
  a specific noise level \(\tau\) 에서
  3D model이 rendering한, artifacts를 가진 images의 분포는
  원래 diffusion model을 train하는 데 사용한, Gaussian noise를 가진 images의 분포와 유사할 것이다!
- loss :
  \(L = L_{Recon} + L_{LPIPS} + 0.5 L_{Gram}\)
  - \(L_{Recon}\) : L2 loss
  - \(L_{LPIPS}\) : perceptual loss (VGG-16 features끼리 비교)
  - \(L_{Gram} = \frac{1}{L} \sum_{l=1}^{L} \beta_{l} \| G_{l}(\hat I) - G_{l}(I) \|_{2}\) :
    style loss for sharper detail (VGG-16 features의 Gram matrix끼리 비교)
    where \(G_{l}(I) = \phi_{l}(I)^{T} \phi_{l}(I)\)
    where \(\hat I\) (rendered noisy image) - \(I\) (GT clean image) pair는 아래에서 설명할 Data Curation 방법으로 생성

noise level이 높으면 model은 artifacts를 잘 제거하지만 image context도 함께 바꿈,, noise level이 낮으면 model은 image를 거의 안 건드림

Data Curation

Data Curation :
single-step 2D diffusion model SD-Turbo 를 fine-tuning하기 위해
loss \(L = L_{Recon} + L_{LPIPS} + 0.5 L_{Gram}\) 를 사용하려면
novel-view synthesis artifacts를 가진 image \(\tilde I\) 와
이에 대응되는 깨끗해진 GT image \(I\) 를
pair로 가진 large dataset를 구축해야 함
- 방법 1) sparse reconstruction strategy :
  every \(n\)-th frame을 3D model training에 사용한 뒤
  나머지 frames를 GT로 삼고, 해당 pose에 대해 novel-view-synthesis 수행하여 \(I\) - \(\tilde I\) pair 구축
  - DL3DV dataset 처럼 camera trajectory가 있어서 novel views를 띄엄띄엄 sampling한 경우에는 잘 적용됨
  - MipNeRF360 또는 LLFF dataset 처럼 training에 사용할 every \(n\)-th frame이 거의 같은 영역을 보고 있는 경우에는 최적의 방법이 아님
    \(\rightarrow\) 그럼 training sample 수를 최대한 늘리려면 어떻게 할까?
    \(\rightarrow\) 아래의 방법들 사용
- 방법 2) cycle reconstruction :
  Internal RDS (real driving scene) dataset 처럼 거의 linear trajectory인 경우
  original trajectory를 \(T_{o}\)라 하고, 이를 horizontally 1-6 m 옮긴 trajectory를 \(T_{s}\) 라 하면
  NeRF-1을 \(I\) on \(T_{o}\) 에 대해 학습한 뒤 \(T_{s}\) 에 대해 NeRF-1을 rendering한 뒤
  these rendered views on \(T_{s}\) 에 대해 NeRF-2를 학습한 뒤 \(T_{o}\) 에 대해 NeRF-2를 rendering한 걸 \(\tilde I\) 로 사용
- 방법 3) model underfitting :
  artifacts 더 많은 \(\tilde I\) 를 generate하기 위해
  기존 training schedule epoch 수의 25%-75% 정도로만 훈련시켜서
  render \(\tilde I\) with underfitted recon.
- 방법 4) cross reference :
  multi-camera dataset의 경우에는
  그 중 하나의 camera로만 3D model을 학습시킨 뒤
  나머지 camera view \(I\) 에 해당되는 pose에 대해 render \(\tilde I\)

Difix3D+ - NVS with Diffusion Priors

Difix3D - Progressive 3D Update

위의 Overall Pipeline에서 Stage 2)의 Step 1), 2)에 해당되는 내용!

Progressive 3D Update and Reference-View Conditioning :
- desired novel-view trajectory가 input views로부터 너무 멀다면
  diffusion model의 reference-view conditioning signal이 약해서
  diffusion model은 더 상상하며(hallucinate) degraded rendered novel view를 깨끗하게 만듦
- Instruct-NeRF2NeRF와 비슷하게 iterative training scheme을 사용하여
  the set of 3D cues를 progressively 늘려가면서
  diffusion model의 reference-view conditioning 을 증가시키면
  diffusion model은 self-attention layer에서 clean reference view로부터 key info.를 많이 얻을 수 있음
Strategy :
- 처음에 sparse reference views만으로 optimize 3D model
- 1.5k iter.마다 GT novel camera pose를 조금씩 perturb 하여
  (by reference view부터 target view까지의 경로를 따라 pose interpolation)
  3D model로 novel view를 rendering한 뒤
  fine-tuned 2D diffusion model로 refine
  - the refined clean novel-view images를 다음 1.5k iter.에서 training set에 더함
  - sparse reference views 뿐만 아니라 the refined clean novel views까지 training set으로 사용하므로 3D model의 consistency와 quality가 좋아짐!
    (3D model로 distill)
- 위의 과정을 반복하면서 reference views와 target views 사이의 3D cues’ overlap을 progressively 증가시켜서
  target view에서의 artifact-free rendering을 보장할 수 있게 됨!

Difix3D+ - Real time Post Render Processing

위의 Overall Pipeline에서 Stage 2)의 Step 3)에 해당되는 내용!

diffusion prior를 3D model로 distill하더라도
3D recon. model의 limited capacity로 인해
under-observed regions에서는 sharp detail을 살리지 못함
\(\rightarrow\)
fine-tuned diffusion model (Difix)을 적용하여 final post-processing at render time 함으로써
consistency를 유지하면서 novel-view를 enhance
- fine-tuned diffusion model (Difix)는 single-step diffusion model이므로
  이로 인한 부가적인 rendering time은 only 76 ms on NVIDIA A100 GPU

Experiment

In-the-wild Artifact Removal

In-the-wild Artifact Removal :
DL3DV dataset and Nerfbusters dataset 사용
- Difix Training :
  - DL3DV dataset의 경우 scenes의 80% (112 scenes)를 randomly select
  - Data Curation Strategy로 80,000개의 noisy-clean image pair 만들어서 Difix 훈련
- Evaluation :
  - DL3DV dataset의 경우 나머지 20% (28 scenes) 이용
    reference view와 target view 간에 상당한 차이가 있도록
    camera position에 따라 reference view와 target view를 분류
  - Nerfbusters dataset의 경우 12 captures 이용
    Nerfbusters [47] 의 recommended protocol을 따라
    reference view와 target view를 분류
- Baseline :
  - Nerfbusters [47] :
    NeRF의 artifacts를 제거하기 위해 3D diffusion model 사용
  - GANeRF [40] :
    NeRF의 artifacts 제거하기 위해 per-scene GAN 훈련
  - NeRFLiX [35] :
    feed-forward network를 사용하여 근처 reference views의 정보를 aggregate하여 퀄리티 향상
  - 3DGS-based methods :
    gsplat library Link 사용
- Metric : PSNR, SSIM, LPIPS, FID

Automotive Scene Enhancement

Automotive Scene Enhancement :
RDS (real driving scene) dataset 사용
(multi-view dataset이라서 서로 40도의 overlaps를 가지고 있는 세 개의 cameras 있음)
- Difix Training :
  - 세 개의 cameras 중 center camera를 reference 및 target view로 사용하고,
    40 scenes 사용하여 훈련
  - Data Curation Strategy로 100,000개의 noisy-clean image pair 만들어서 Difix 훈련
- Evaluation :
  - 나머지 두 개의 cameras를 novel view로 사용하고,
    20 scenes 사용하여 평가

Ablation Study

Ablation Study on Pipeline :
하나씩 요소를 추가해가며 진행!
- (a) 3D model update 없이
  rendered views에 그냥 직접 Difix 적용
  - reference views에서 먼 less observed regions에서는 별로고, consistency 유지하지 못해서 flickering 발생
- (b) non-incremental (not progressive) 3D model update
  (pseudo-views를 한 번에 모두 training set에 추가하여 3D model update(distill))
- (c) progressive 3D model update
- (d) post-rendering으로도 Difix 적용

Ablation Study on Difix Training :
- low noise value \(\tau = 200\) 을 사용한 Difix3D+
  versus high noise value \(\tau = 1000\) 을 사용한 Pix2pix-Turbo [52]
  - high noise value 를 사용할 경우
    diffusion model이 more hallucinated (더 상상하며) GT와 다른 결과를 generate하기 때문에
    poorer generalization on the test dataset
- reference view conditioning (self-attention layer)의 유무
- Gram style loss의 유무

Conclusion

Limitation :
- inherently limited by initial 3D model의 성능
Future Work :
- initial 3D model의 성능에 의존하는 문제를
  modern diffusion priors 사용하여 극복
- Difix는 single-step 2D diffusion model을 fine-tuning한 model인데,
  single-step video diffusion model로 확장하여
  long-context 3D consistency까지도 향상

Question

Q1 :
single-step 2D diffusion model인 Difix는
novel-view rendering이나 camera pose selection에 관여하지 않고
그냥 더러운 image를 깨끗하게 만드는 역할일 뿐인건가요?
A1 :
맞는 말씀인데, 깨끗하게 만듦으로써 distillation으로 3D model param.를 업데이트하는 역할도 있습니다.
다시 한 번 설명해보도록 하겠습니다.
Difix는 더러운 image를 깨끗하게 만들 수 있도록 fine-tuning 되었습니다.
근데, 더러운 image를 깨끗하게 만듦으로써 두 가지 역할을 수행합니다.
- 첫 번째는 3D model이 render한 더러운 image를 깨끗하게 만들어서 이를 다시 training image로 사용하는 과정에서 diffusion prior를 3D model로 distill하여 3D model을 update하는 데 사용할 수 있습니다.
- 두 번째는 final updated 3D model이 render한 image를 마지막으로 좀 더 깨끗하게 만드는 post-processing에 사용합니다.
Q2 :
한 번에 3D model을 update하는 non-incremental 방법에 비해
progressively 3D model을 update하는 방법의 성능이 더 좋은 이유는 무엇이라고 생각하시나요?
A2 :
핵심은
reference view부터 target view까지의 경로를 따라 novel camera pose를 조금씩 perturb한다는 것과,
Difix의 self-attention layer에 있다고 생각합니다.
- progressive update의 경우에는
  먼저 reference view와 가까이 있는 novel view에서 더러운 image를 render한 뒤 이를 Difix에 넣어서
  self-attention layer에 의해 가까운 clean reference view의 도움으로 더러운 novel view를 쉽게 깨끗하게 만들 수 있습니다.
  그리고 깨끗해진 novel view를 다시 training set (reference view)에 넣고, novel view를 reference view부터 target view까지 조금씩만 이동시키므로
  마찬가지로 그 다음 novel view에 대해서도 nearby clean reference view의 도움으로 계속해서 novel view를 쉽게 깨끗하게 만들 수 있습니다.
- non-incremental update의 경우에는
  reference view부터 target view까지의 경로에 있는 임의의 모든 novel views에 대해 더러운 image를 한꺼번에 render한 뒤 이를 한꺼번에 Difix에 넣어서 한꺼번에 깨끗하게 만들어야 하는데,
  만약 reference view로부터 멀리 있는 novel view의 경우
  둘의 3D content 차이가 커서 self-attention layer가 둘의 관계를 잘 포착할 수 없고 clean reference view의 도움으로 novel view를 깨끗하게 만들기가 쉽지 않습니다.