Quark: Real-time, High-resolution, and General Neural View Synthesis (SIGGRAPH 2024)
John Flynn, Michael Broxton, Lukas Murmann, Lucy Chai, Matthew DuVall, Clément Godard, Kathryn Heal, Srinivas Kaza, Stephen Lombardi, Xuan Luo, Supreeth Achar, Kira Prabhu, Tiancheng Sun, Lynn Tsai, Ryan Overbeck
input : sparse multi-view images (\(\in R^{M \times H \times W \times 3}\)) (sensitive to view selection; camera poses required)
output : novel view image
Quark takes a model pretrained with 8 input views per scene (on Spaces, RFF, Nex-Shiny, and SWORD) and synthesizes novel target views of unseen scenes via refinement (generalizable).
The pretrained model is used as-is: the layered depth map is updated not by gradient-descent fine-tuning but by a feed-forward refinement that uses the input-view features.
The structure resembles U-Net skip connections, but the Update & Fuse stage is novel (described separately below).
Upsample & Activate :
Upsample to the image resolution, then obtain the layered depth map (LDM) at the target view:
Depth \(d \in R^{L \times H \times W \times 1}\) (layer depths are spaced linearly in disparity, i.e., denser in nearby, high-frequency regions)
Opacity \(\sigma \in R^{L \times H \times W \times 1}\)
Blend Weights \(\beta \in R^{L \times H \times W \times M}\), obtained as attention softmax weights
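A minimal sketch (PyTorch, assumed tensor shapes) of how the upsampled output could be split into the three LDM quantities. Only the disparity-linear layer spacing, the tanh offset around a depth anchor, and the softmax blend weights are stated above; the sigmoid opacity and the specific shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def activate_ldm(feat, near=1.0, far=100.0):
    """Split an upsampled feature map into the Layered Depth Map outputs.

    feat: (L, H, W, 2 + M) raw network output for L layers and M input views.
    Returns depth (L,H,W,1), opacity (L,H,W,1), blend weights (L,H,W,M).
    Activation choices here are assumptions, not the paper's exact ones.
    """
    L, H, W, C = feat.shape
    assert C > 2  # last M channels are per-view blend logits

    # Anchor depths spaced linearly in disparity (denser near the camera).
    disp = torch.linspace(1.0 / near, 1.0 / far, L)            # (L,)
    anchor = (1.0 / disp).view(L, 1, 1, 1).expand(L, H, W, 1)  # (L,H,W,1)

    # Per-pixel offset around the anchor, bounded by tanh.
    depth = anchor + torch.tanh(feat[..., 0:1])

    # Opacity in [0, 1]; blend weights are a softmax over the M input views.
    opacity = torch.sigmoid(feat[..., 1:2])
    blend = F.softmax(feat[..., 2:], dim=-1)
    return depth, opacity, blend
```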
Rendering :
Back-project the input images (\(\in R^{M \times H \times W \times 3}\)) onto the layered depth map at the target view, then blend them with the blend weights \(\beta\) to obtain per-layer RGB.
Alpha-composite the per-layer RGB with the opacity \(\sigma\) to get the final RGB image at the target view, and alpha-composite the depth \(d\) with \(\sigma\) to get the depth map (sketched below).
Training uses standard differentiable rendering, but inference uses a CUDA-optimized renderer to reach 1080p at 1.3 ms per frame.
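A minimal compositing sketch, assuming the input images have already been back-projected onto the target-view layers and that layers are ordered front to back; tensor names and layer ordering are assumptions.

```python
import torch

def composite_ldm(warped_rgb, blend, opacity, depth):
    """Render the target view from an LDM (front-to-back compositing sketch).

    warped_rgb: (L, H, W, M, 3) input images back-projected onto each layer.
    blend:      (L, H, W, M)    per-layer blend weights (softmax over views).
    opacity:    (L, H, W, 1)    per-layer alpha.
    depth:      (L, H, W, 1)    per-layer depth.
    Layers are assumed ordered front (l=0) to back (l=L-1).
    """
    # Per-layer RGB: weighted blend of the warped input views.
    layer_rgb = (warped_rgb * blend.unsqueeze(-1)).sum(dim=-2)   # (L,H,W,3)

    # Standard front-to-back alpha compositing over the L layers.
    alpha = opacity                                              # (L,H,W,1)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:1]),
                                     1.0 - alpha[:-1]], dim=0), dim=0)
    weights = alpha * trans                                      # (L,H,W,1)
    rgb = (weights * layer_rgb).sum(dim=0)                       # (H,W,3)
    depth_map = (weights * depth).sum(dim=0)                     # (H,W,1)
    return rgb, depth_map
```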
Method
Update & Fuse :
Step 1) Render to Input Views
From layer space (target view) to image space (input view), so the result can be combined with the feature pyramid \(I_{\downarrow k}\).
From the feature volume \(V^{(n)}\), decode appearance \(a\), density \(\sigma\), and depth map \(d\) (where \(d = \delta + \tanh(\cdot)\) is a bounded offset around the depth anchor \(\delta\)) \(\rightarrow\) project from the target view into each input view by \(P_{\theta}\) \(\rightarrow\) alpha-composite (\(O\)) at the input view to obtain the rendered feature \(\tilde I\) (\(\tilde I\) : an intermediate LDM, layered depth map).
Step 2) Update Block
The rendered feature \(\tilde I\) is fused with information about the input image, such as the feature pyramid \(I_{\downarrow k}\) and the input view direction \(\gamma\).
The input view direction is injected via the ray encoding \(\gamma\) (sketched below):
Compute a difference vector (see the figure below); its magnitude grows as the input view moves farther from the target view \(\rightarrow\) apply tanh and a sinusoidal positional encoding.
Because of the tanh, gradients are largest when the difference vector is near 0, i.e., when the input view is close to the target view.
Even when an input view's ray leaves the frustum, its intersections with the near and far planes can still be computed, so the ray encoding remains well-defined.
Injecting the view direction enables view-dependent color, which is needed to reproduce reflections and non-Lambertian surfaces.
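A minimal sketch of the ray encoding. Only the tanh followed by a sinusoidal positional encoding is stated above; how the difference vector is constructed and the number of frequencies are assumptions.

```python
import torch

def ray_encoding(diff, num_freqs=4):
    """Ray encoding sketch: squash the ray difference vector with tanh, then
    apply a sinusoidal positional encoding (frequency count is an assumption).

    diff: (..., 3) difference vector between the input-view and target-view
          rays; its magnitude grows as the input view moves away from the
          target view.
    """
    x = torch.tanh(diff)  # gradients concentrate near diff == 0 (nearby views)
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype)        # (F,)
    angles = x.unsqueeze(-1) * freqs                             # (..., 3, F)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return torch.cat([x, enc.flatten(-2)], dim=-1)               # (..., 3 + 6F)
```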
Step 3) Back-project
From image space (input view) back to layer space (target view), so the result can be combined with the feature volume \(V^{(n)}\).
Back-project from the input view into the target view by \(P_{\theta}^{T}(I, d)\) \(\rightarrow\) obtain the residual feature \(\Delta\).
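A gather-style warping sketch for this step, assuming a pinhole camera model and hypothetical intrinsics/extrinsics arguments. The paper applies the transpose of its projection operator \(P_{\theta}^{T}\), so this is only an approximation of the idea, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def back_project(input_feat, depth, K_tgt, K_in, T_tgt2in):
    """Gather input-view features onto the target-view layers (sketch).

    input_feat: (C, h, w)  feature map of one input view.
    depth:      (L, H, W)  per-layer depth d at the target view.
    K_tgt, K_in: (3, 3)    camera intrinsics (assumed pinhole model).
    T_tgt2in:   (4, 4)     rigid transform from target to input camera.
    Returns residual features Delta of shape (L, C, H, W).
    """
    L, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)          # (3,H,W)

    # Unproject target pixels to 3D with each layer's depth, move to input view.
    rays = torch.linalg.inv(K_tgt) @ pix.reshape(3, -1)              # (3,HW)
    pts = depth.reshape(L, 1, -1) * rays                             # (L,3,HW)
    pts_h = torch.cat([pts, torch.ones(L, 1, H * W)], dim=1)         # (L,4,HW)
    pts_in = (T_tgt2in @ pts_h)[:, :3]                               # (L,3,HW)

    # Project into the input image and normalize to [-1, 1] for grid_sample.
    uvw = K_in @ pts_in                                              # (L,3,HW)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                    # (L,2,HW)
    h, w = input_feat.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1)         # (L,HW,2)
    grid = grid.reshape(L, H, W, 2)

    feat = input_feat.unsqueeze(0).expand(L, -1, -1, -1)             # (L,C,h,w)
    return F.grid_sample(feat, grid, align_corners=True)             # (L,C,H,W)
```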
Step 4) One-to-Many Attention
With the feature volume \(V^{(n)}\) as the query and the residual features \(\Delta\) from Steps 1-3 as keys and values, one-to-many attention produces the updated feature volume \(V^{(n+1)}\). The target view can then aggregate features from the input views, i.e., it learns how much attention to pay to each input view.
query : target-view information, in target-view space
key, value : input-view information, in target-view space
One-to-Many attention :
Similar to cross-attention, but it removes redundant matrix multiplications, reducing complexity and contributing to real-time reconstruction.
Writing out \(\text{MultiHead}(Q, K, V) = \text{concat}(\text{head}_{1}, \cdots, \text{head}_{h}) W^{O}\) with \(\text{head}_{i} = \text{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})\): in the \(W_{i}^{Q} (W_{i}^{K})^{T}\) term, \(W^{Q}\) and \(W^{K}\) are redundant, and in the \(\text{concat}(\cdots W_{i}^{V}) W^{O}\) term, \(W^{V}\) and \(W^{O}\) are redundant. The heads are therefore simplified to \(\text{head}_{i} = \text{Attention}(QW_{i}^{Q}, K, V)\), using only \(W^{Q}\) and \(W^{O}\).
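A minimal sketch of one-to-many attention as described above: only \(W^{Q}\) and \(W^{O}\) are learned projections, and the keys and values (the residuals \(\Delta\)) are used unprojected, shared across heads. Tensor shapes and the per-head arrangement are assumptions.

```python
import torch
import torch.nn.functional as F

def one_to_many_attention(query, kv, w_q, w_o):
    """One-to-many attention sketch: head_i = Attention(Q W_i^Q, K, V),
    with K = V = the unprojected residual features Delta.

    query: (N, 1, D)   one target-view feature per spatial location.
    kv:    (N, M, D)   M input-view residual features at the same location.
    w_q:   (h, D, D)   per-head query projections (the only per-head weights).
    w_o:   (h * D, D)  output projection applied to the concatenated heads.
    """
    h, D, _ = w_q.shape
    q = torch.einsum("nqd,hde->nhqe", query, w_q)                  # (N,h,1,D)
    k = v = kv.unsqueeze(1)                                        # (N,1,M,D), shared by all heads

    attn = F.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)   # (N,h,1,M)
    heads = attn @ v                                               # (N,h,1,D)
    out = heads.transpose(1, 2).reshape(-1, 1, h * D)              # concat heads
    return out @ w_o                                               # (N,1,D)
```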
(Figure) Difference vector for the input view direction (ray encoding).
Result
Training :
Dataset : Spaces, RFF, Nex-Shiny, SWORD
Loss : \(10 \cdot L_{1} + \text{LPIPS}\) (see the sketch after this list)
Input : 8 views (randomly sampled from the 16 nearest views)
Inference time : 33 ms total per frame at 1080p on a single A100 GPU, including reconstruction
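A sketch of the training loss above (\(10 \cdot L_{1} + \text{LPIPS}\)) using the third-party `lpips` package; the VGG backbone and the assumed input range are not specified in these notes.

```python
import torch
import lpips

# LPIPS perceptual loss; the VGG backbone is an assumption.
lpips_fn = lpips.LPIPS(net="vgg")

def quark_loss(pred, target):
    """Training loss sketch: 10 * L1 + LPIPS.

    pred, target: (B, 3, H, W) images, assumed in [-1, 1] (the range LPIPS expects).
    """
    l1 = torch.abs(pred - target).mean()
    perceptual = lpips_fn(pred, target).mean()
    return 10.0 * l1 + perceptual
```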
Comparison with generalizable methods
Comparison with non-generalizable methods
Discussion
Limitation :
view selection : training uses sparse (8) input views, and the method is very sensitive to how they are selected (important); the selection rule is heuristic
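A sketch of the kind of nearest-view heuristic referred to above (keep the 16 cameras nearest to the target, then randomly sample 8, as in the training setup); using Euclidean distance between camera centers is an assumption.

```python
import torch

def select_input_views(target_pos, input_pos, num_nearest=16, num_sample=8):
    """View-selection heuristic sketch (not necessarily the paper's exact rule).

    target_pos: (3,)   target camera center.
    input_pos:  (N, 3) candidate input camera centers.
    Returns indices of the selected input views.
    """
    dist = torch.linalg.norm(input_pos - target_pos, dim=-1)        # (N,)
    nearest = torch.topk(dist, k=num_nearest, largest=False).indices
    perm = torch.randperm(num_nearest)[:num_sample]
    return nearest[perm]
```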