NeRF : Representing Scenes as Neural Radiance Fields for View Synthesis
paper :
https://arxiv.org/abs/2003.08934
project website :
https://www.matthewtancik.com/nerf
pytorch code :
https://github.com/yenchenlin/nerf-pytorch
https://github.com/csm-kr/nerf_pytorch?tab=readme-ov-file
tiny tensorflow code :
https://colab.research.google.com/github/bmild/nerf/blob/master/tiny_nerf.ipynb
referenced blog :
https://csm-kr.tistory.com/64
https://yconquesty.github.io/blog/ml/nerf/nerf_rendering.html#the-rendering-formula
The code review is uploaded in a separate post! Blog
Key summary :
- From camera centers at multiple viewpoints, cast a ray (r = o + td) toward each input image pixel.
- Sample discrete points along each ray.
- Positionally encode the 3D coordinate x and the viewing direction d into r(x) and r(d).
- Feed r(x) into the MLP to obtain the volume density, then additionally feed r(d) to obtain the RGB color.
- For both the coarse network and the fine network (hierarchical sampling), compute the rendered pixel color of each ray by volume rendering with the predicted volume densities and colors.
(a) march camera rays through the scene and generate a sampling of 5D coordinates
(b) represent the static volumetric scene by optimizing a continuous 5D function (fully-connected network) that maps each 5D coordinate to
- volume density (differential opacity : how much radiance is accumulated by a ray)
- view-dependent RGB color (emitted radiance) \(c = (r, g, b)\)
(c) synthesize a novel view by classic (differentiable) volume rendering techniques that accumulate (project, composite) the color/density samples into a 2D image along rays
(d) loss between synthesized and GT observed images
Problem : a basic MLP scene representation does not converge to a sufficiently high-resolution result and is inefficient in the number of samples required per camera ray
Solution :
- positional encoding for the MLP to represent higher-frequency functions
- hierarchical sampling to reduce the number of queries
deep networks that map \(xyz\) coordinates to signed distance functions or occupancy fields
=> limit : need GT 3D geometry
Niemeyer et al.
=> input : directly find the surface intersection for each ray (can calculate exact derivatives)
=> output : diffuse color at each ray intersection location
Sitzmann et al.
=> input : each 3D coordinate
=> output : feature vector and RGB color at each 3D coordinate
=> rendering by RNN that marches along each ray to decide where the surface is.
Limit : oversmoothed renderings, so limited to simple shapes with low geometric complexity
Given dense sampling of views, novel view synthesis is possible by simple light field sample interpolation
Given sparser sampling of views, there are 2 ways : mesh-based representation and volumetric representation
Mesh-based representation with either diffuse or view-dependent appearance :
Directly optimize mesh representations with differentiable rasterizers or pathtracers so that we can reproject and reconstruct images
Limit :
- gradient-based optimization is often difficult because of local minima, discontinuities, or a poor loss landscape (it is hard to perform gradient-based optimization while preserving the mesh structure)
- needs a template mesh with fixed topology for initialization, which is unavailable in real-world settings
Volumetric representation :
Limit : good results, but limited by poor time and space complexity due to discrete sampling
\(\rightarrow\) discrete sampling : rendering a higher-resolution image requires a finer sampling of 3D space
Author’s solution :
encode the continuous volume into the network’s parameters
=> higher quality rendering + requires only a fraction of the storage cost of those sampled volumetric representations
represent continuous scene by 5D MLP : (x, d) => (c, \(\sigma\))
Here, there are 2 key-points!
multiview consistent :
c is dependent on both x and d, but \(\sigma\) is only dependent on location x
3D coordinate x => 8 fc-layers => volume-density and 256-dim. feature vector
Lambertian reflection = diffuse reflection vs. specular reflection = mirror-like reflection
non-Lambertian effects : view-dependent color change to represent specular reflection
feature vector is concatenated with direction d => 1 fc-layer => view-dependent RGB color
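As a rough illustration of this architecture, here is a minimal PyTorch sketch (class and variable names are my own; the official model also feeds r(x) back in through a skip connection at the 5th layer, which is omitted here):

```python
import torch
import torch.nn as nn

class TinyNeRFMLP(nn.Module):
    """Simplified NeRF MLP: r(x) -> (sigma, feature); [feature, r(d)] -> RGB."""
    def __init__(self, x_dim=60, d_dim=24, width=256):
        super().__init__()
        # 8 fc-layers that process the positionally encoded 3D coordinate r(x)
        layers = [nn.Linear(x_dim, width), nn.ReLU()]
        for _ in range(7):
            layers += [nn.Linear(width, width), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.sigma_head = nn.Linear(width, 1)      # volume density (view-independent)
        self.feature = nn.Linear(width, width)     # 256-dim feature vector
        # 1 additional fc-layer that mixes the feature with the encoded direction r(d)
        self.color_head = nn.Sequential(
            nn.Linear(width + d_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid())  # view-dependent RGB in [0, 1]

    def forward(self, rx, rd):
        h = self.trunk(rx)
        sigma = torch.relu(self.sigma_head(h))     # sigma depends only on location x
        rgb = self.color_head(torch.cat([self.feature(h), rd], dim=-1))  # c depends on x and d
        return rgb, sigma
```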
We use rays to synthesize continuous-viewpoint images from discrete input images
\(r(t) = o + td\)
o : camera’s center of projection
d : viewing direction
t \(\in [ t_n , t_f ]\) : distance from the camera center, between the camera’s predefined near and far planes
How to calculate viewing direction d??
- 2D pixel-coordinate : \(\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\)
- 2D normalized-coordinate (\(z = 1\)) by intrinsic matrix :
\(\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}\) = \(K^{-1}\) \(\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\) = \(\begin{bmatrix} 1/f_x & 0 & -\frac{1}{f_x}\frac{W}{2} \\ 0 & 1/f_y & -\frac{1}{f_y}\frac{H}{2} \\ 0 & 0 & 1 \end{bmatrix}\) \(\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\)
Since the \(y, z\) axes point in opposite directions in the real-world coordinate and the pixel coordinate, we multiply by (-1)
\(\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}\) = \(\begin{bmatrix} 1/f_x & 0 & -\frac{1}{f_x}\frac{W}{2} \\ 0 & -1/f_y & \frac{1}{f_y}\frac{H}{2} \\ 0 & 0 & -1 \end{bmatrix}\) \(\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\)
Here, the focal length in the intrinsic matrix K is usually calculated from the camera angle \(\alpha\) as \(\tan{(\alpha / 2)} = \frac{h/2}{f}\)
- 3D world-coordinate by extrinsic matrix :
For extrinsic matrix \([R \vert t']\),
\(o = t'\)
\(d = R * \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}\)
Therefore, we can obtain \(r(t) = o + td\)
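Putting the intrinsic/extrinsic steps together, here is a minimal PyTorch sketch of the ray construction (assuming a camera-to-world matrix c2w = [R | t'] and a single focal length f_x = f_y = focal; the helper name get_rays is mine), in the same spirit as the referenced nerf-pytorch code:

```python
import torch

def get_rays(H, W, focal, c2w):
    # pixel coordinates (x, y) for every pixel
    y, x = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing='ij')
    # normalized camera coordinates [u, v, 1] with the y/z sign flips from above
    dirs = torch.stack([(x - 0.5 * W) / focal,
                        -(y - 0.5 * H) / focal,
                        -torch.ones_like(x)], dim=-1)
    # rotate the directions into world coordinates: d = R @ [u, v, 1]^T
    rays_d = torch.sum(dirs[..., None, :] * c2w[:3, :3], dim=-1)
    # every ray starts at the camera's center of projection: o = t'
    rays_o = c2w[:3, 3].expand(rays_d.shape)
    return rays_o, rays_d
```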
We use differentiable classical volume rendering
Let ray \(r\) (traced through desired virtual camera) have near and far bounds \(t_n, t_f\)
expected color of ray \(r\) = \(C(r) = \int_{t_n}^{t_f} T(t) \sigma (r(t)) c(r(t), d) dt\)
- transmittance \(T(t)\) : CDF that a ray does not hit any particles from \(t_n\) to \(t\)
- volume density \(\sigma(r(t))\) along the ray (learned by MLP) : opacity = extinction coefficient = alpha value for alpha-compositing
- color \(c(r(t), d)\) along the ray (learned by MLP)
Derivation of the volume rendering formula :
occluding objects are modeled as spherical particles with radius \(r\)
There are \(A \cdot \Delta z \cdot \rho (z)\) particles in a slice of cross-section \(A\) and thickness \(\Delta z\), where \(\rho (z)\) is the particle density (the number of particles per unit volume)
Since solid particles do not overlap for \(\Delta z \rightarrow 0\),
an area of \(A \cdot \Delta z \cdot \rho (z) \cdot \pi r^2\) is occluded
That is, within the cross section \(A\), the fraction \(\frac{A \cdot \Delta z \cdot \rho (z) \cdot \pi r^2}{A} = \pi r^2 \cdot \rho (z) \cdot \Delta z\) is occluded
If the fraction \(\frac{A \cdot \Delta z \cdot \rho (z) \cdot \pi r^2}{A}\) of the rays is occluded, the light intensity decreases as
\(I(z + \Delta z) = (1 - \pi r^2 \rho (z) \Delta z) \times I(z)\)
Then the light density difference \(\Delta I = I(z + \Delta z) - I(z) = - \pi r^2 \rho (z) \Delta z \cdot I(z)\)
That is, \(dI(z) = - \pi r^2 \rho (z) I(z) dz = - \sigma (z) I(z) dz\)
where volume density (or opacity)
is \(\sigma(z) = \pi r^2 \rho (z)\)
This makes sense because the larger the particle cross-section and the particle density (number of particles), the larger the attenuation of the ray (the volume density)
Solving the ODE, \(I(z) = I(z_0)\exp(- \int_{z_0}^{z} \sigma (r(s)) ds)\)
Let’s define transmittance \(T(z) = \exp(- \int_{z_0}^{z} \sigma (r(s)) ds)\)
where \(I(z) = I(z_0)T(z)\) means the remaining intensity after rays travel from \(z_0\) to \(z\)
where transmittance
\(T(z)\) means CDF
that a ray does not hit
any particles from \(z_0\) to \(z\)
If a ray passes empty space, there is no color
If a ray hits particles, there exists color (radiance is emitted
)
Let’s define \(H(z) = 1 - T(z)\), which means CDF that a ray hits
particles from \(z_0\) to \(z\)
Since differentiating a CDF gives a PDF,
Then PDF is \(p_{hit}(z) = \frac{dH}{dz} = - \frac{dT}{dz} = \exp(- \int_{z_0}^{z} \sigma (r(s)) ds) \sigma (z) = T(z) \sigma (z)\)
Let a random variable \(R\) be the emitted radiance.
Then PDF \(p_R(ray) = P[R = c(z)] = p_{hit}(z) = T(z) \sigma (z)\)
Then the color of a pixel is expected radiance for ray
bounded from \(t_n\) to \(t_f\)
\(C(ray) = E[R] = \int_{t_n}^{t_f} R \cdot p_R dz = \int_{t_n}^{t_f} c \cdot p_{hit} dz = \int_{t_n}^{t_f} T(z) \sigma (z) c(z) dz\)
\(t_n, t_f = 0., 1.\) for scaled-bounded and front-facing scenes after conversion to NDC (normalized device coordinates)
For an explanation of NDC, see the separate blog post How NDC Works?
To apply the equation to our model by numerical quadrature,
we have to sample discrete points from continuous ray
Instead of deterministic quadrature(typically used for rendering voxel grids, but may limit resolution),
author divides a ray \(\left[t_n, t_f\right]\) into N = 64 bins(intervals), and chooses one point \(t_i\) for each bin by uniform sampling
\(t_i\) ~ \(U \left[t_n + \frac{i-1}{N}(t_f - t_n), t_n + \frac{i}{N}(t_f - t_n)\right]\)
Although we use N discrete samples, stratified sampling enables the MLP to be evaluated at continuous positions over the course of optimization
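A short sketch of this stratified sampling (the helper name and tensor shapes are my assumptions):

```python
import torch

def stratified_sample(t_n, t_f, N, n_rays):
    # split [t_n, t_f] into N equal bins and draw one uniform sample per bin per ray
    edges = torch.linspace(t_n, t_f, N + 1)      # (N+1,) bin edges
    lower, upper = edges[:-1], edges[1:]         # (N,) lower/upper bin bounds
    u = torch.rand(n_rays, N)                    # one U[0, 1) draw per bin per ray
    return lower + (upper - lower) * u           # (n_rays, N) sample distances t_i
```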
Discretized version for N samples by Numerical Quadrature :
expected color \(\hat{C}(r) = \sum_{i=1}^{N} T_i (1 - \exp(- \sigma_{i} \delta_{i})) c_i\)
or equivalently,
\(p_{hit}(z_i) = \frac{dH}{dz} |_{z_i} ~~ \rightarrow ~~ H(z_{i+1}) - H(z_i) = (1 - T(z_{i+1})) - (1 - T(z_i)) = T(z_i) - T(z_{i+1}) = e^{- \sum_{j=1}^{i-1} \sigma_{j} \delta_{j}} - e^{- \sum_{j=1}^{i} \sigma_{j} \delta_{j}} = T(z_i)(1 - e^{- \sigma_{i} \delta_{i}})\)
Then the color of a pixel is expected radiance for ray
bounded from \(t_n\) to \(t_f\)
\(\hat{C}(ray) = E[R] = \int_{t_n}^{t_f} R \cdot p_R dz ~~ \rightarrow ~~ \sum_{i=1}^{N} c_i \cdot p_{hit}(z_i) = \sum_{i=1}^{N} c_i T_i (1 - \exp(- \sigma_{i} \delta_{i}))\)
Final version
:
expected color \(\hat{C}(r) = \sum_{i=1}^{N} T_i \alpha_{i} c_i\)
where \(T_i = \prod_{j=1}^{i-1} (1-\alpha_{j})\) and \(\alpha_{i} = 1 - \exp(-\sigma_{i} \delta_{i})\)
which reduces to the traditional alpha-compositing problem
Note that this volume rendering equation is differentiable, so end-to-end learning is possible!!
For a sequence of samples \(\boldsymbol t = \{t_1, t_2, \ldots, t_N\}\),
\(\frac{d\hat{C}}{dc_i} |_{\boldsymbol t} = T_i \alpha_{i}\)
\(\frac{d\hat{C}}{d \sigma_{i}} |_{\boldsymbol t} = c_i \times (\frac{dT_i}{d \sigma_{i}} \alpha_{i} + \frac{d \alpha_{i}}{d \sigma_{i}} T_i) = c_i \times (0 + \delta_{i}e^{-\sigma_{i}\delta_{i}} T_i) = \delta_{i} T_i c_i e^{- \sigma_{i} \delta_{i}}\)
alpha-compositing :
the process of compositing multiple frames into a single image; each frame pixel has an alpha value (opacity, 0~1) that determines the pixel value in overlapping regions
By divide-and-conquer approach (tail recursion),
\(c = \alpha_{1}c_{1} + (1 - \alpha_{1})(\alpha_{2}c_{2} + (1 - \alpha_{2})(\cdots)) = \alpha_{1}c_{1} + (1 - \alpha_{1})\alpha_{2}c_{2} + (1 - \alpha_{1})(1 - \alpha_{2})(\cdots) = \cdots = \sum_{i=1}^{N}(\alpha_{i}c_{i}\prod_{j=1}^{i-1}(1-\alpha_{j}))\) where \(\alpha_{0} = 0\)
If \(\alpha_{i} = 1 - \exp(-\sigma_{i} \delta_{i})\),
the NeRF volume rendering equation \(\hat{C}(r) = \sum_{i=1}^{N} T_i \alpha_{i} c_i\) and
the alpha-compositing equation \(c = \sum_{i=1}^{N}(\alpha_{i}c_{i}\prod_{j=1}^{i-1}(1-\alpha_{j}))\) are the
SAME!!
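A minimal sketch of this discrete volume rendering / alpha-compositing sum, assuming per-ray sample distances t of shape (n_rays, N), densities sigma of shape (n_rays, N), and colors c of shape (n_rays, N, 3) (the function name is mine):

```python
import torch

def composite(sigma, c, t):
    # delta_i = t_{i+1} - t_i; the last interval is treated as effectively infinite
    delta = torch.cat([t[..., 1:] - t[..., :-1],
                       1e10 * torch.ones_like(t[..., :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)                        # alpha_i
    # T_i = prod_{j < i} (1 - alpha_j): exclusive cumulative product
    T = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                 1.0 - alpha[..., :-1]], dim=-1), dim=-1)
    weights = T * alpha                                            # w_i = T_i * alpha_i
    rgb = torch.sum(weights[..., None] * c, dim=-2)                # C_hat(r) = sum_i w_i c_i
    return rgb, weights
```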
Because an MLP essentially performs kernel regression (dot products and additions),
If we use input directly, MLP is biased to learn low-frequency function
(oversmoothed appearance) (no detail) (spectral bias)
So the MLP struggles to learn a high-frequency output that changes rapidly with small changes of the low-dimensional input
Here, Fourier features (sinusoidal signals can represent the input signal in an orthogonal space) let the MLP learn high-frequency functions in a low-dimensional domain
If we map the input into a higher-dimensional space that contains both low- and high-frequency information via Fourier features, the MLP can fit data with high-frequency variation
Due to positional encoding, MLP can behave as interpolation function
where \(L\) determines the bandwidth of the interpolation kernel
\(r : R \rightarrow R^{2L}\)
\(r(p) = (sin(2^0\pi p), cos(2^0\pi p), \cdots, sin(2^{L-1}\pi p), cos(2^{L-1}\pi p))\)
\(L=10\) for \(r(x)\) where x has three coordinates
\(L=4\) for \(r(d)\) where d has three components of the cartesian viewing direction unit vector
In addition, to reflect low-dimensional input information in the high-frequency output, many orthogonal eigenvalues must survive after passing through the kernel; a stationary kernel or Spherical Harmonics can serve this role
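A minimal sketch of the positional encoding r(p) defined above, applied element-wise to every coordinate (L = 10 for x, L = 4 for d); note that the official implementation additionally concatenates the raw input p, which the formula here omits:

```python
import math
import torch

def positional_encoding(p, L):
    # p: (..., C) raw coordinates -> (..., 2 * L * C) encoded features
    freqs = (2.0 ** torch.arange(L, dtype=p.dtype)) * math.pi   # 2^0*pi, ..., 2^(L-1)*pi
    scaled = p[..., None] * freqs                                # (..., C, L)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return enc.flatten(start_dim=-2)   # ordering of sin/cos terms differs from the formula, same features
```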
Densely evaluating N points by stratified sampling is inefficient
=> We don’t need many samples in free space or occluded regions
=> Hierarchical sampling enables us to allocate more samples to regions we expect to contain visible content
We simultaneously optimize 2 networks with different sampling : coarse
and fine
coarse sampling of \(N_c\) points : as described above
author divides a ray into \(N_c\) = 64 bins(intervals), and chooses one point \(t_i\) for each bin by uniform sampling
\(t_i\) ~ \(U \left[t_n + \frac{i-1}{N_c}(t_f - t_n), t_n + \frac{i}{N_c}(t_f - t_n)\right]\)
fine sampling of \(N_f\) points : the new part
the coarse model’s output is a weighted sum of all coarse-sampled colors
\(\hat{C}(r) = \sum_{i=1}^{N_c} T_i \alpha_{i} c_i = \sum_{i=1}^{N_c} w_i c_i\)
where we define \(w_i = T_i \alpha_{i} = T_i (1 - \exp(-\sigma_{i} \delta_{i}))\) for \(i=1,\cdots,N_c\)
=> Given the output of the coarse network, we perform a more informed (better) sampling where samples are biased toward the relevant parts of the scene volume
=> We sample \(N_f\)=128 fine points following a piecewise-constant PDF of the normalized weights \(\frac{w_i}{\sum_{j=1}^{N_c} w_j}\)
=> Here, we use the Inverse CDF Method for sampling the fine points
Inverse transform sampling = Inverse CDF Method :
=> PDF (probability density function) : \(f_X(x)\)
=> CDF (cumulative distribution function) : \(F_X(x) = P(X \leq x) = \int_{-\infty}^x f_X(x) dx\)
idea : for any probability distribution, the CDF value \(F_X(X)\) follows a Uniform distribution!
=> So, if we uniformly sample y values of the CDF, the corresponding x values realize sampling from the desired PDF!
=> That is, if we can compute the inverse of the CDF, we can sample from the distribution of X using only a basic uniform random number generator!
\(F_X(x)\) ~ \(U\left[0, 1\right]\)
\(X\) ~ \(F^{-1}(U\left[0, 1\right])\)
=> Here, the steeper the slope of the CDF, the more likely it is that an object exists at that location, and the Inverse CDF Method samples that region more densely
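A simplified sketch of fine-point sampling by the Inverse CDF Method, assuming coarse weights w and coarse sample positions z, both of shape (n_rays, N_c); unlike the full sample_pdf in the referenced repos, this version picks bin positions without interpolating inside each bin:

```python
import torch

def sample_fine(z, w, N_f):
    # normalized piecewise-constant PDF over the coarse bins
    pdf = w / (torch.sum(w, dim=-1, keepdim=True) + 1e-10)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)   # (n_rays, N_c + 1), starts at 0
    u = torch.rand(cdf.shape[0], N_f)                                # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True)                     # invert the CDF: find the bin of each u
    idx = torch.clamp(idx - 1, 0, z.shape[-1] - 1)
    return torch.gather(z, -1, idx)                                  # fine sample positions along each ray
```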
We evaluate coarse
network using \(N_c\)=64 points per ray
We evaluate fine
network using \(N_c+N_f\)=64+128=192 points per ray where we sample 128 fine points following PDF of coarse
sampled points
As a result, we use a total of 64+192=256 samples per ray to compute the final rendering color \(\hat{C_f}(r)\)
Here, we use the squared L2 norm for the loss
\(L = \sum_{r \in R} \left[{\left\|\hat{C_c}(r)-C(r)\right\|}^2+{\left\|\hat{C_f}(r)-C(r)\right\|}^2\right]\)
\(C(r)\) : GT pixel RGB color
\(\hat{C_c}(r)\) : rendering RGB color from coarse network : to allocate better samples in fine network
\(\hat{C_f}(r)\) : rendering RGB color from fine network : our goal
\(R\) : the set of all pixels(rays) across all images
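A tiny sketch of this loss over a sampled batch of rays (tensor shapes (n_rays, 3) are my assumption):

```python
import torch

def nerf_loss(c_coarse, c_fine, c_gt):
    # sum of squared errors of both the coarse and fine renderings against the GT pixel colors
    return torch.sum((c_coarse - c_gt) ** 2) + torch.sum((c_fine - c_gt) ** 2)
```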
Synthetic renderings of objects
Real images of complex scenes
LLFF exhibits banding and ghosting artifacts
SRN produces blurry and distorted renderings
NV cannot capture the details
NeRF captures fine details in both geometry and appearance
LLFF may have repeated edges because of blending between multiple renderings
NeRF also correctly reconstructs partially-occluded regions
SRN does not capture any high-frequency fine detail
prior : MLP outputs discretized voxel representations
author : MLP outputs volume density and view-dependent emitted radiance
efficiency :
Rather than hierarchical sampling, there is still much more progress to be made for efficient optimization and rendering of neural radiance fields
interpretability :
voxel grids or meshes admit reasoning about the expected rendering quality, but it is unclear how to analyze these issues when we encode scenes into the weights of an MLP