pixelSplat : reconstructs a 3D radiance field parameterized by 3DGS primitives from only two images
Introduction
Problem :
scale ambiguity : camera poses recovered by SfM carry an arbitrary scale factor
local minima : directly optimizing primitive params from random initialization runs into the local minima problem
Contribution :
two-view image encoder : overcomes the scale ambiguity problem via two-view Epipolar Sampling and Epipolar Attention
pixel-aligned Gaussian param. prediction module : overcomes the local minima problem because depth is sampled from a predicted distribution rather than regressed directly
Solution :
a feed-forward model learns to reconstruct a 3D radiance field parameterized by 3DGS primitives from a pair of images
model :
per-scene model : iteratively updates one fixed point set to fit each individual scene \(\rightarrow\) suffers from local minima and related issues; 3DGS tries to address this with the non-differentiable Adaptive Density Control heuristic, which cannot generalize across scenes
feed-forward model : predicts each scene's point set in a single feed-forward pass, so training stays differentiable and the model generalizes across scenes
related work worth attention : MASt3R(-SfM), Spann3R, Splatt3R, DUSt3R (unfamiliar; needs further reading)
Background
local minima :
When does it occur? : when Gaussian primitives initialized at random locations must move more than a few standard deviations to reach their final destination, the gradient vanishes; local minima also arise when the loss does not decrease monotonically along the way to that destination (see the sketch after this list)
How is it handled? : 3DGS uses Adaptive Density Control, a non-differentiable pruning-and-splitting heuristic, whereas this paper introduces a differentiable parameterization of Gaussian primitives
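A minimal 1-D toy sketch (my illustration, not from the paper) of the vanishing-gradient claim: the gradient of a Gaussian's rendered contribution with respect to its mean decays exponentially with its distance to the target pixel, so a primitive initialized more than a few standard deviations away receives almost no signal.

```python
# Toy demo: gradient of the photometric loss w.r.t. a Gaussian mean
# vanishes as exp(-d^2 / 2*sigma^2) when the mean starts d away.
import torch

sigma = 1.0
target_x = 0.0  # pixel where the loss wants the Gaussian's mass

for dist_in_sigmas in [1.0, 3.0, 10.0]:
    mu = torch.tensor(dist_in_sigmas * sigma, requires_grad=True)
    # value this Gaussian contributes at the target pixel
    rendered = torch.exp(-((target_x - mu) ** 2) / (2 * sigma ** 2))
    loss = (rendered - 1.0) ** 2  # want full contribution at target_x
    loss.backward()
    print(f"init {dist_in_sigmas:>4.1f} sigma away -> |dL/dmu| = {mu.grad.abs().item():.2e}")
```

At 1 sigma the gradient is O(1); at 10 sigma it is ~1e-21, i.e. the primitive is effectively stuck.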
Scale Ambiguity
Scale Ambiguity 문제 :
in the SfM stage, camera poses are recovered only up to scale: instead of the metric (real-world-scale) pose \(T_{j}^{m}\), SfM yields a scaled pose \(s_{i} T_{j}^{m}\)
recovering the arbitrary scale factor \(s_{i}\) from the camera pose \(s_{i} T_{j}^{m}\) of a single image is impossible (illustrated in the sketch below)
Scale Ambiguity 해결 :
the two-view encoder resolves the arbitrary scale factor \(s_{i}\) from a pair of images
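A small numpy demonstration (my own, not from the paper) of why \(s_{i}\) is unrecoverable from one view: scaling every 3D point and the camera translation by the same factor leaves all pixel projections unchanged.

```python
# Scale ambiguity demo: project(X) and project(s*X with s*t) are identical,
# so no image-based method can recover s from a single view.
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # intrinsics
R = np.eye(3)                                                # rotation
t = np.array([0.1, 0.0, 2.0])                                # translation
X = rng.normal(size=(5, 3))                                  # 3-D points

def project(X, K, R, t):
    x_cam = X @ R.T + t   # world -> camera
    x_img = x_cam @ K.T   # camera -> image (homogeneous)
    return x_img[:, :2] / x_img[:, 2:3]

s = 3.7  # arbitrary scale factor
print(np.allclose(project(X, K, R, t), project(s * X, K, R, s * t)))  # True
```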
Two-view encoder Overview :
Step 1) Per-Image Encoder
Step 2) Epipolar Sampling
Step 3) Epipolar Attention
Step 1) Per-Image Encoder
Step 1) Per-Image Encoder : encodes each of the two views into features \(F\) and \(\tilde F\), respectively
Step 2) Epipolar Sampling
Step 2) Epipolar Sampling : build queries from the rays of Features 1 (from Image 1), and keys/values from the epipolar samples of Features 2 (from Image 2) together with their depths, so that attention can learn the depth scale (attention : given the depth information, which sample on Image 2's epipolar line does a ray from Image 1 attend to most?) (the epipolar line is not learned; it is computed in closed form from the two camera poses)
Query : \(q = Q \cdot F [u]\) where \(F\) : Features 1 from Image 1, \(F [u]\) : ray feature at each pixel \(u\) (in pixel coordinates)
Key, Value : \(s_{l} = \tilde F [\tilde u_{l}] \oplus \gamma (\tilde d_{\tilde u_{l}})\) where \(\tilde F\) : Features 2 from Image 2, \(\tilde F [\tilde u_{l}]\) : feature at the \(l\)-th sample on the epipolar line, \(\tilde d_{\tilde u_{l}}\) : distance of the sample to Image 2's camera origin, \(\gamma\) : positional encoding
\[k_{l} = K \cdot s_{l}\]
\[v_{l} = V \cdot s_{l}\]
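A hedged PyTorch sketch of the query/key/value construction above; the feature dimensions, the sinusoidal form of \(\gamma\), and the module names are my assumptions, not the official pixelSplat implementation.

```python
# Build q from a ray feature of image 1, and (k_l, v_l) from epipolar
# samples of image 2 concatenated with a positional encoding of depth.
import torch
import torch.nn as nn

C, L, C_pe = 64, 32, 16  # feature dim, epipolar samples per ray, encoding dim

def gamma(d, num_freqs=C_pe // 2):
    """Sinusoidal positional encoding of per-sample depth d: [L] -> [L, C_pe]."""
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi
    angles = d[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

Q = nn.Linear(C, C, bias=False)         # q   = Q . F[u]
K = nn.Linear(C + C_pe, C, bias=False)  # k_l = K . s_l
V = nn.Linear(C + C_pe, C, bias=False)  # v_l = V . s_l

F_u = torch.randn(C)            # ray feature F[u] from image 1
F2_samples = torch.randn(L, C)  # features F~[u~_l] sampled on the epipolar line
depths = torch.rand(L) * 10     # distances d~ of each sample to camera 2's origin

s_l = torch.cat([F2_samples, gamma(depths)], dim=-1)  # s_l = F~[u~_l] (+) gamma(d~)
q, k_l, v_l = Q(F_u), K(s_l), V(s_l)  # shapes: [C], [L, C], [L, C]
```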
Step 3) Epipolar Attention : Epipolar Cross-Attention
Step 3) Epipolar Attention :
Epipolar Cross-Attention : performs cross-attention with the \(q, k_{l}, v_{l}\) built above to find per-pixel correspondences between a ray and the epipolar samples, updating the per-pixel feature \(F [u]\) so that it now encodes a scaled depth consistent with the arbitrary scale factor \(s\)
\(F [u] \mathrel{+}= Att(q, k_{l}, v_{l})\) where \(+=\) denotes a skip connection and \(Att\) is softmax attention
Per-Image Self-Attention : after the cross-attention, a final per-image self-attention propagates the scaled depth estimates to parts of the image feature maps that may not have any epipolar correspondences
\[F += SelfAttention(F)\]
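A self-contained PyTorch sketch of the two attention updates above; the shapes, the single-head cross-attention, and the use of torch's MultiheadAttention for the per-image step are illustrative assumptions, not the paper's exact architecture.

```python
# Epipolar cross-attention for one ray, then per-image self-attention
# over the whole feature map, both with skip connections.
import torch
import torch.nn.functional as Fn

C, L, HW = 64, 32, 100  # feature dim, epipolar samples, pixels per image

# Epipolar cross-attention: F[u] += Att(q, k_l, v_l)
q = torch.randn(C)
k_l, v_l = torch.randn(L, C), torch.randn(L, C)
F_u = torch.randn(C)
att = Fn.softmax(k_l @ q / C ** 0.5, dim=0)  # [L], weight per epipolar sample
F_u = F_u + att @ v_l                        # skip-connection update

# Per-image self-attention: F += SelfAttention(F), propagating scaled-depth
# estimates to pixels that may lack epipolar correspondences.
F_map = torch.randn(HW, C)
attn = torch.nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
out, _ = attn(F_map[None], F_map[None], F_map[None])  # query = key = value = F
F_map = F_map + out[0]
```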
Gaussian Parameter Prediction
thanks to the preceding steps, the scale-aware feature maps \(F, \tilde F\) can be used to predict the Gaussian params \(g_{k} = (\mu_{k}, \Sigma_{k}, \alpha_{k}, S_{k})\)
every pixel of each 2D image corresponds to a point in 3D, so the final set of Gaussian primitives is just the union of the per-image sets
3D position Prediction :
Approach 1) Baseline : predict a point estimate of the 3D position \(\mu\)
\(\boldsymbol \mu = \boldsymbol o + d_{u} \boldsymbol d\) where \(u\) : 2D pixel coordinate, \(\boldsymbol o\) : camera origin, \(\boldsymbol d = R K^{-1} [u, 1]^{T}\) : ray direction, \(d_{u} = g_{\theta}(F [u])\) : depth predicted by a neural network (sketched below)
Problem : estimating depth directly with a neural network is prone to the local minima problem
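A short numpy sketch of the baseline unprojection above; the camera values and the predicted depth \(d_{u}\) are placeholders (in the paper \(d_{u}\) comes from \(g_{\theta}(F[u])\)).

```python
# Baseline: unproject pixel u along its ray to a single 3-D point mu.
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # intrinsics
R = np.eye(3)                  # camera-to-world rotation
o = np.array([0.0, 0.0, 0.0])  # camera origin in world space

u = np.array([400.0, 260.0])                  # 2-D pixel coordinate
d = R @ np.linalg.inv(K) @ np.append(u, 1.0)  # ray direction R K^-1 [u, 1]^T
d_u = 2.5                                     # depth, stand-in for g_theta(F[u])
mu = o + d_u * d                              # point estimate of the 3-D position
```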
Approach 2) This paper : predict a probability density over the 3D position \(\mu\)
Key idea : instead of predicting depth itself with a neural network, predict a differentiable probability distribution over the likelihood of depth along the ray
Step 1) predict a discrete probability distribution \(\phi\) over depth buckets along the ray with a neural network
Step 2) sample an index \(z\) from the discrete probability \(\phi\) : \(z \sim p_{\phi}(z)\)
Step 3) unproject along the ray to compute the Gaussian mean \(\mu\) : \(\boldsymbol \mu = \boldsymbol o + (b_{z} + \delta_{z}) \boldsymbol d\) where \(b_{z}\) : depth of the \(z\)-th bucket, \(\phi\) : depth probability obtained by a neural network, \(\delta_{z}\) : depth offset obtained by a neural network
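A hedged PyTorch sketch of Steps 1-3; \(\phi\), \(\delta\), and the bucket depths \(b\) are placeholder values standing in for network outputs, and the last lines show how setting \(\alpha = \phi_{z}\) lets gradients reach the depth distribution even though sampling \(z\) itself is non-differentiable.

```python
# Sample a depth bucket z from phi, unproject to mu, use phi_z as opacity.
import torch

Z = 8                                 # number of depth buckets along the ray
logits = torch.randn(Z, requires_grad=True)
phi = torch.softmax(logits, dim=0)    # per-bucket depth probability (network)
delta = torch.rand(Z)                 # per-bucket depth offset (network)
b = torch.linspace(0.5, 8.0, Z)       # bucket depths between near and far

z = torch.multinomial(phi, 1).item()  # Step 2) sample an index z ~ p_phi(z)

o = torch.zeros(3)                    # camera origin
d = torch.tensor([0.0, 0.0, 1.0])     # ray direction
mu = o + (b[z] + delta[z]) * d        # Step 3) mu = o + (b_z + delta_z) d
alpha = phi[z]                        # opacity = probability of sampled depth
alpha.backward()                      # gradients flow into the distribution
print(z, mu, logits.grad)
```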
Gaussian Parameter Prediction :
using the scale-aware feature maps \(F, \tilde F\) and a neural network \(f\) : \(\phi, \delta, \Sigma, S = f(F [u])\) where \(\phi, \delta, \Sigma, S\) : depth probability, depth offset, covariance, spherical harmonics coefficients
3D position (mean) : from \(\phi, \delta\), compute \(\boldsymbol \mu = \boldsymbol o + (b_{z} + \delta_{z}) \boldsymbol d\)
Covariance : \(\Sigma\)
Spherical Harmonics Coeff. : \(S\)
Opacity : from \(\phi\), set \(\alpha = \phi_{z}\), i.e. the probability of the sampled depth \(z\) (this is what makes the depth sampling differentiable)
since every pixel corresponds to one 3DGS point, these are called pixel-aligned Gaussians (a sketch of such a prediction head follows below)
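A speculative sketch of what a per-pixel prediction head \(f\) could look like; the layer sizes, the scale-plus-quaternion covariance parameterization (borrowed from 3DGS), and the SH degree are my assumptions, not the paper's exact head.

```python
# Per-pixel head f(F[u]) -> (phi, delta, covariance params, SH coeffs);
# the final Gaussian set is the union over both images' pixels.
import torch
import torch.nn as nn

C, Z, SH = 64, 8, 27  # feature dim, depth buckets, SH coeffs (3 colors x 9)

class GaussianHead(nn.Module):
    def __init__(self):
        super().__init__()
        out = Z + Z + 3 + 4 + SH  # phi, delta, scale, rotation quat, SH
        self.mlp = nn.Sequential(nn.Linear(C, 128), nn.ReLU(), nn.Linear(128, out))

    def forward(self, F_u):  # F_u: [N_pixels, C]
        phi_logits, delta, scale, quat, sh = self.mlp(F_u).split(
            [Z, Z, 3, 4, SH], dim=-1)
        return (torch.softmax(phi_logits, -1),  # depth probability phi
                delta,                          # per-bucket depth offsets
                (scale, quat),                  # covariance parameterization
                sh)                             # spherical harmonics coeffs

head = GaussianHead()
F1, F2 = torch.randn(100, C), torch.randn(100, C)  # scale-aware features F, F~
# pixel-aligned: one Gaussian per pixel; final set is the union of both images
gaussians = [head(F) for F in (F1, F2)]
```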
Experiments
Setup :
Dataset : camera poses are computed by SfM
RealEstate10K
ACID
Baseline :
pixelNeRF
GPNR
Method of Du et al.
Metric :
PSNR
SSIM
LPIPS
Result :
much better rendering quality (PSNR, SSIM, LPIPS) than the baselines
faster inference time
less memory per ray
Ablation Study
Ablation Study :
Epipolar Encoder : ablates the Epipolar Sampling and Epipolar Attention components