NeRF

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, et al.

paper :
https://arxiv.org/abs/2003.08934
project website :
https://www.matthewtancik.com/nerf
pytorch code :
https://github.com/yenchenlin/nerf-pytorch
https://github.com/csm-kr/nerf_pytorch?tab=readme-ov-file
tiny tensorflow code :
https://colab.research.google.com/github/bmild/nerf/blob/master/tiny_nerf.ipynb
referenced blog :
https://csm-kr.tistory.com/64
https://yconquesty.github.io/blog/ml/nerf/nerf_rendering.html#the-rendering-formula

The code review has been uploaded as a separate post! Blog

Key summary :

  1. From each camera center (from multiple viewpoints), cast a ray \(r = o + td\) toward each input image pixel.
  2. Sample the ray into discrete points.
  3. Positionally encode the 3D coordinate x and the viewing direction d into \(\gamma(x)\) and \(\gamma(d)\).
  4. Feed \(\gamma(x)\) into the MLP to obtain the volume density, then additionally feed \(\gamma(d)\) to obtain the RGB color.
  5. For both the coarse network and the fine network (hierarchical sampling), use the volume density and color in volume rendering to compute the rendered pixel color of each ray.

Introduction

Pipeline

  1. input: single continuous 5D coordinate
    3D location \(x, y, z\)
    2D direction \(\theta, \phi\)
  2. output:
    volume density (differential opacity, controlling how much radiance is accumulated by a ray passing through)
    view-dependent RGB color (emitted radiance) \(c = (r, g, b)\)
(Figure) Pipeline of the NeRF architecture

Problem & Solution

Problem :

  1. the basic representation is not of sufficiently high resolution
  2. it is inefficient in the required number of samples per camera ray

Solution :

  1. positional encoding of the inputs so the MLP can represent higher-frequency functions
  2. hierarchical sampling to reduce the number of queries

Contribution

Neural 3D shape representation

Limit : oversmoothed renderings, so limited to simple shapes with low geometric complexity

View synthesis and image-based rendering

Limit :
gradient-based optimization is often difficult because of local minima, discontinuities, or a poor loss landscape
(i.e., it is hard to run gradient-based optimization while preserving the mesh structure)
needs a template mesh with fixed topology for initialization, which is typically unavailable for real-world scenes

Limit :
good results, but limited by poor time and space complexity due to discrete sampling
\(\rightarrow\) discrete sampling : rendering a higher-resolution image requires a finer sampling of 3D space

Author’s solution :
encode the continuous volume into the network's parameters
=> higher-quality rendering + requires only a fraction of the storage cost of those sampled volumetric representations

Neural Radiance Field Scene Representation

represent a continuous scene by an MLP over 5D inputs : (x, d) => (c, \(\sigma\))

Here, there are two key points!

Multi-view consistency :
c depends on both x and d, but \(\sigma\) depends only on the location x
3D coordinate x => 8 fc layers => volume density and a 256-dim feature vector

Lambertian reflection : diffuse reflection vs. specular reflection : mirror-like reflection

non-Lambertian effects : view-dependent color changes that represent specular reflection
the feature vector is concatenated with the encoded direction d => 1 fc layer => view-dependent RGB color
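
As a concrete illustration, here is a minimal PyTorch sketch of this two-branch MLP (a simplified sketch, not the exact released architecture: the official model also re-injects \(\gamma(x)\) through a skip connection at the 5th layer; the input sizes 63 and 27 assume the encoding concatenates the raw coordinates to the sinusoidal features):

```python
import torch
import torch.nn as nn

class TinyNeRFMLP(nn.Module):
    """Simplified NeRF MLP: sigma depends only on gamma(x); color also uses gamma(d)."""
    def __init__(self, x_dim=63, d_dim=27, width=256):
        super().__init__()
        # 8 fully connected layers on the encoded 3D location gamma(x)
        layers = [nn.Linear(x_dim, width), nn.ReLU()]
        for _ in range(7):
            layers += [nn.Linear(width, width), nn.ReLU()]
        self.backbone = nn.Sequential(*layers)
        self.sigma_head = nn.Linear(width, 1)      # view-independent volume density
        self.feature = nn.Linear(width, width)     # 256-dim feature vector
        # feature concatenated with encoded direction gamma(d) => 1 fc layer => RGB
        self.color_head = nn.Sequential(
            nn.Linear(width + d_dim, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid())

    def forward(self, gamma_x, gamma_d):
        h = self.backbone(gamma_x)
        sigma = torch.relu(self.sigma_head(h)).squeeze(-1)   # density >= 0
        rgb = self.color_head(torch.cat([self.feature(h), gamma_d], dim=-1))
        return rgb, sigma
```

Usage: `rgb, sigma = TinyNeRFMLP()(gamma_x, gamma_d)` with `gamma_x` of shape (..., 63) and `gamma_d` of shape (..., 27).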

Volume Rendering with Radiance Fields

Ray from input image (pre-processing)

We use rays to synthesize images at continuous viewpoints from a discrete set of input images

\(r(t) = o + td\)
o : camera’s center of projection
d : viewing direction
t \(\in [ t_n , t_f ]\) : distance along the ray between the camera's predefined near and far planes

How do we calculate the viewing direction d?

  • 2D pixel-coordinate : \(\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\)
  • 2D normalized-coordinate (\(z = 1\)) by intrinsic matrix :
    \(\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}\) = \(K^{-1}\) \(\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\) = \(\begin{bmatrix} 1/f_x & 0 & -\frac{1}{f_x}\frac{W}{2} \\ 0 & 1/f_y & -\frac{1}{f_y}\frac{H}{2} \\ 0 & 0 & 1 \end{bmatrix}\) \(\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\)
    Since the \(y\) and \(z\) axes point in opposite directions between the camera/world coordinates and the pixel coordinates, we multiply them by (-1)
    \(\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}\) = \(\begin{bmatrix} 1/f_x & 0 & -\frac{1}{f_x}\frac{W}{2} \\ 0 & -1/f_y & \frac{1}{f_y}\frac{H}{2} \\ 0 & 0 & -1 \end{bmatrix}\) \(\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\)
    Here, the focal length in the intrinsic matrix K is usually computed from the camera's field-of-view angle \(\alpha\) as \(\tan{\alpha / 2} = \frac{h/2}{f}\)
  • 3D world-coordinate by the camera-to-world (pose) matrix :
    For the camera-to-world matrix \([R \vert t']\) (the inverse of the world-to-camera extrinsic),
    \(o = t'\)
    \(d = R * \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}\)
    Therefore, we can obtain \(r(t) = o + td\)
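
A minimal NumPy sketch of this ray construction, following the camera convention used in the released NeRF code (x right, y up, z backward); `c2w` is the 3x4 camera-to-world matrix \([R \vert t']\), and a single shared focal length \(f = f_x = f_y\) is assumed for simplicity:

```python
import numpy as np

def get_rays(H, W, focal, c2w):
    """Compute ray origin o and direction d for every pixel, so that r(t) = o + t*d."""
    # pixel grid -> normalized camera coordinates (note the sign flips on y and z)
    i, j = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32), indexing='xy')
    dirs = np.stack([(i - 0.5 * W) / focal,
                     -(j - 0.5 * H) / focal,
                     -np.ones_like(i)], axis=-1)              # (H, W, 3)
    # rotate camera-frame directions into the world frame: d = R @ [u, v, 1]^T
    rays_d = np.sum(dirs[..., None, :] * c2w[:3, :3], axis=-1)
    # the origin is the camera center, i.e. the translation column t'
    rays_o = np.broadcast_to(c2w[:3, -1], rays_d.shape)
    return rays_o, rays_d

# focal length from a field-of-view angle alpha via tan(alpha/2) = (W/2) / f:
# focal = 0.5 * W / np.tan(0.5 * alpha)
```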

Volume Rendering from MLP output

We use classical volume rendering, which is differentiable

Let ray \(r\) (traced through desired virtual camera) have near and far bounds \(t_n, t_f\)
expected color of ray \(r\) = \(C(r) = \int_{t_n}^{t_f} T(t) \sigma (r(t)) c(r(t), d) dt\)

Derivation of the volume rendering equation

Occluding objects are modeled as spherical particles with radius \(r\)
In a slice of thickness \(\Delta z\) with cross-section \(A\), there are \(A \cdot \Delta z \cdot \rho (z)\) particles, where \(\rho (z)\) is the particle density (the number of particles per unit volume)

Since solid particles do not overlap as \(\Delta z \rightarrow 0\),
an area of \(A \cdot \Delta z \cdot \rho (z) \cdot \pi r^2\) is occluded
That is, within the cross section \(A\), the occluded fraction is \(\frac{A \cdot \Delta z \cdot \rho (z) \cdot \pi r^2}{A} = \pi r^2 \cdot \rho (z) \cdot \Delta z\)

If a fraction \(\frac{A \cdot \Delta z \cdot \rho (z) \cdot \pi r^2}{A}\) of the rays is occluded, the light intensity decreases as
\(I(z + \Delta z) = (1 - \pi r^2 \rho (z) \Delta z) \times I(z)\)

Then the light intensity difference is \(\Delta I = I(z + \Delta z) - I(z) = - \pi r^2 \rho (z) \Delta z \cdot I(z)\)
That is, \(dI(z) = - \pi r^2 \rho (z) I(z) dz = - \sigma (z) I(z) dz\)
where the volume density (or opacity) is \(\sigma(z) = \pi r^2 \rho (z)\)
This makes sense: the larger the particle cross-section and the particle density (number of particles), the larger the attenuation of the ray, i.e., the larger the volume density
Solving this ODE gives \(I(z) = I(z_0)\exp(- \int_{z_0}^{z} \sigma (r(s)) ds)\)
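To make the solution step explicit (separation of variables; here \(\sigma(s)\) denotes the density \(\sigma(r(s))\) at depth \(s\) along the ray):
\(\frac{dI}{I} = - \sigma (z) dz ~~ \Rightarrow ~~ \int_{I(z_0)}^{I(z)} \frac{dI}{I} = - \int_{z_0}^{z} \sigma (s) ds ~~ \Rightarrow ~~ \ln \frac{I(z)}{I(z_0)} = - \int_{z_0}^{z} \sigma (s) ds ~~ \Rightarrow ~~ I(z) = I(z_0) \exp(- \int_{z_0}^{z} \sigma (s) ds)\)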

Let’s define transmittance \(T(z) = \exp(- \int_{z_0}^{z} \sigma (r(s)) ds)\)
where \(I(z) = I(z_0)T(z)\) means the remaining intensity after rays travel from \(z_0\) to \(z\)
where the transmittance \(T(z)\) is the probability that a ray does not hit any particle while traveling from \(z_0\) to \(z\)

If a ray passes through empty space, no color is emitted
If a ray hits particles, radiance (color) is emitted
Let's define \(H(z) = 1 - T(z)\), the CDF of the event that a ray hits a particle somewhere in \([z_0, z]\)
Differentiating the CDF gives the PDF, so
the PDF is \(p_{hit}(z) = \frac{dH}{dz} = - \frac{dT}{dz} = \exp(- \int_{z_0}^{z} \sigma (r(s)) ds) \sigma (z) = T(z) \sigma (z)\)

Let a random variable \(R\) be the emitted radiance.
Then its density is \(p_R(z) = P[R = c(z)] = p_{hit}(z) = T(z) \sigma (z)\)
Then the color of a pixel is expected radiance for ray bounded from \(t_n\) to \(t_f\)
\(C(ray) = E[R] = \int_{t_n}^{t_f} R \cdot p_R dz = \int_{t_n}^{t_f} c \cdot p_{hit} dz = \int_{t_n}^{t_f} T(z) \sigma (z) c(z) dz\)

\(t_n, t_f = 0., 1.\) for scaled-bounded and front-facing scenes after conversion to NDC (normalized device coordinates)
For an explanation of NDC, see the separate blog post How NDC Works?

To apply the equation to our model by numerical quadrature,
we have to sample discrete points from continuous ray

Instead of deterministic quadrature (typically used for rendering voxel grids, but which would limit resolution),
the authors divide the ray \(\left[t_n, t_f\right]\) into N = 64 evenly-spaced bins (intervals) and choose one point \(t_i\) in each bin by uniform sampling
\(t_i\) ~ \(U \left[t_n + \frac{i-1}{N}(t_f - t_n), t_n + \frac{i}{N}(t_f - t_n)\right]\)

Although we use N discrete samples, stratified sampling means that over the course of optimization the MLP is evaluated at continuous positions

Discretized version for N samples by numerical quadrature :
expected color \(\hat{C}(r) = \sum_{i=1}^{N} T_i (1 - \exp(- \sigma_{i} \delta_{i})) c_i\)
where \(\delta_{i} = t_{i+1} - t_i\) is the distance between adjacent samples and \(T_i = \exp(- \sum_{j=1}^{i-1} \sigma_{j} \delta_{j})\)

Alternatively,

\(p_{hit}(z_i)\,dz = \frac{dH}{dz} |_{z_i} dz ~~ \rightarrow ~~ H(z_{i+1}) - H(z_i) = (1 - T(z_{i+1})) - (1 - T(z_i)) = T(z_i) - T(z_{i+1}) = e^{- \sum_{j=1}^{i-1} \sigma_{j} \delta_{j}} - e^{- \sum_{j=1}^{i} \sigma_{j} \delta_{j}} = T(z_i)(1 - e^{- \sigma_{i} \delta_{i}})\)
Then the color of a pixel is the expected radiance for the ray bounded from \(t_n\) to \(t_f\)
\(\hat{C}(ray) = E[R] = \int_{t_n}^{t_f} R \cdot p_R dz ~~ \rightarrow ~~ \sum_{i=1}^{N} c_i \cdot (H(z_{i+1}) - H(z_i)) = \sum_{i=1}^{N} c_i T_i (1 - \exp(- \sigma_{i} \delta_{i}))\)

Final version :
expected color \(\hat{C}(r) = \sum_{i=1}^{N} T_i \alpha_{i} c_i\)
where \(T_i = \prod_{j=1}^{i-1} (1-\alpha_{j})\) and \(\alpha_{i} = 1 - \exp(-\sigma_{i} \delta_{i})\)
which reduces to the traditional alpha-compositing problem

Note that this volume rendering equation is differentiable, so end-to-end learning is possible!
For a sequence of samples \(\boldsymbol t = \{t_1, t_2, \ldots, t_N\}\),
\(\frac{d\hat{C}}{dc_i} |_{\boldsymbol t} = T_i \alpha_{i}\)
\(\frac{d\hat{C}}{d \sigma_{i}} |_{\boldsymbol t} = c_i \times (\frac{dT_i}{d \sigma_{i}} \alpha_{i} + \frac{d \alpha_{i}}{d \sigma_{i}} T_i) = c_i \times (0 + \delta_{i}e^{-\sigma_{i}\delta_{i}} T_i) = \delta_{i} T_i c_i e^{- \sigma_{i} \delta_{i}}\) (considering only the \(i\)-th term of the sum; the \(T_k\) with \(k > i\) also depend on \(\sigma_i\))

alpha compositing :
the process of compositing multiple frames (layers) into a single image; each frame pixel has an alpha value (opacity, 0~1) that determines the pixel value where the frames overlap

Expanding the front-to-back compositing recursively (a divide-and-conquer / tail-recursive view),
\(c = \alpha_{1}c_{1} + (1 - \alpha_{1})(\alpha_{2}c_{2} + (1 - \alpha_{2})(\cdots)) = \alpha_{1}c_{1} + (1 - \alpha_{1})\alpha_{2}c_{2} + (1 - \alpha_{1})(1 - \alpha_{2})(\cdots) = \cdots = \sum_{i=1}^{N}(\alpha_{i}c_{i}\prod_{j=1}^{i-1}(1-\alpha_{j}))\) where the empty product (for \(i = 1\)) equals 1

If \(\alpha_{i} = 1 - \exp(-\sigma_{i} \delta_{i})\),
the NeRF volume rendering equation \(\hat{C}(r) = \sum_{i=1}^{N} T_i \alpha_{i} c_i\)
and the alpha-compositing equation \(c = \sum_{i=1}^{N}(\alpha_{i}c_{i}\prod_{j=1}^{i-1}(1-\alpha_{j}))\)
are the SAME!!
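
A minimal PyTorch sketch of this discrete rendering / alpha-compositing step (close in spirit to the rendering routine in the linked PyTorch implementations, but simplified); `rgb` and `sigma` are the per-sample MLP outputs along each ray, and because every operation is differentiable, gradients flow to both \(c_i\) and \(\sigma_i\) as derived above:

```python
import torch

def volume_render(rgb, sigma, t_vals):
    """Discrete volume rendering: C_hat(r) = sum_i T_i * alpha_i * c_i.
    rgb:    (..., N, 3) per-sample colors c_i
    sigma:  (..., N)    per-sample densities sigma_i (non-negative)
    t_vals: (..., N)    sample positions t_i along each ray
    """
    # delta_i = t_{i+1} - t_i (the last interval is set to a very large value)
    deltas = t_vals[..., 1:] - t_vals[..., :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[..., :1])], dim=-1)

    # alpha_i = 1 - exp(-sigma_i * delta_i)
    alpha = 1.0 - torch.exp(-sigma * deltas)

    # T_i = prod_{j<i} (1 - alpha_j): exclusive cumulative product along the ray
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)

    weights = trans * alpha                           # w_i = T_i * alpha_i
    color = (weights[..., None] * rgb).sum(dim=-2)    # expected color C_hat(r)
    return color, weights                             # weights are reused for fine sampling
```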

Optimizing a Neural Radiance Field

Positional encoding (pre-processing)

Because an MLP essentially performs kernel regression (dot products and additions),
if we feed the input directly, the MLP is biased toward learning low-frequency functions (oversmoothed appearance, no detail): the spectral bias
So it struggles to learn high-frequency outputs that change rapidly under small changes of the low-dimensional input

Here, Fourier features (sinusoids can represent the input signal in an orthogonal basis) let the MLP learn high-frequency functions over a low-dimensional domain [1]
If we map the input into a higher-dimensional space that contains both low- and high-frequency information via Fourier features, the MLP can fit data with high-frequency variation
Due to positional encoding, the MLP can behave as an interpolation function where \(L\) determines the bandwidth of the interpolation kernel [1]
\(\gamma : \mathbb{R} \rightarrow \mathbb{R}^{2L}\)
\(\gamma(p) = (\sin(2^0\pi p), \cos(2^0\pi p), \cdots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p))\)
\(L=10\) for \(\gamma(x)\), where x has three coordinates
\(L=4\) for \(\gamma(d)\), where d has the three components of the Cartesian viewing direction unit vector

Additionally, for the low-dimensional input information to carry through to a high-frequency output, many orthogonal eigenvalues must remain non-negligible after passing through the kernel; stationary kernels or Spherical Harmonics can play this role
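
A minimal sketch of the encoding \(\gamma\) (the sin/cos features are grouped per coordinate rather than interleaved exactly as in the formula, which makes no difference to the MLP; many implementations also concatenate the raw input \(p\) itself):

```python
import math
import torch

def positional_encoding(p, L):
    """gamma(p): per-coordinate features sin(2^k pi p), cos(2^k pi p) for k = 0..L-1.
    p: (..., D) tensor; returns (..., 2*L*D)."""
    freqs = (2.0 ** torch.arange(L, dtype=p.dtype, device=p.device)) * math.pi  # 2^k * pi
    angles = p[..., None] * freqs                                    # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., D, 2L)
    return enc.flatten(-2)                                           # (..., 2*L*D)

# gamma(x) with L=10 for the 3D location, gamma(d) with L=4 for the unit view direction:
# x_enc = positional_encoding(x, L=10)   # (..., 60)
# d_enc = positional_encoding(d, L=4)    # (..., 24)
```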

Hierarchical volume sampling

Densely evaluating N points by stratified sampling alone is inefficient
=> We do not need many samples in free space or in occluded regions
=> Hierarchical sampling lets us allocate more samples to regions we expect to contain visible content

We simultaneously optimize two networks with different sampling : coarse and fine

Coarse sampling (\(N_c\) points) : as described above
the authors divide the ray into \(N_c\) = 64 bins (intervals), and choose one point \(t_i\) in each bin by uniform sampling
\(t_i\) ~ \(U \left[t_n + \frac{i-1}{N_c}(t_f - t_n), t_n + \frac{i}{N_c}(t_f - t_n)\right]\)

Fine sampling (\(N_f\) points) : the new part
the coarse network's output is a weighted sum of all coarse-sampled colors
\(\hat{C}(r) = \sum_{i=1}^{N_c} T_i \alpha_{i} c_i = \sum_{i=1}^{N_c} w_i c_i\)
where we define \(w_i = T_i \alpha_{i} = T_i (1 - \exp(-\sigma_{i} \delta_{i}))\) for \(i=1,\cdots,N_c\)
=> Given the output of coarse network, we try more informed (better) sampling where samples are biased toward the relevant parts of the scene volume
=> We sample \(N_f\)=128 fine points following a piecewise-constant PDF of normalized \(\frac{w_i}{\sum_{j=1}^{N_c} w_j}\)
=> Here, we use Inverse CDF Method for sampling fine points

Inverse transform sampling = inverse CDF method :
=> PDF (probability density function) : \(f_X(x)\)
=> CDF (cumulative distribution function) : \(F_X(x) = P(X \leq x) = \int_{-\infty}^x f_X(t) dt\)
Idea : plugging any random variable into its own CDF yields a Uniform distribution!
=> So if we uniformly sample a y-value of the CDF, the corresponding x is a sample that follows the target PDF!
=> In other words, if we can compute the inverse of the CDF, we can sample from the distribution of X using only a basic uniform random number generator
\(F_X(X)\) ~ \(U\left[0, 1\right]\)
\(X\) ~ \(F_X^{-1}(U\left[0, 1\right])\)
=> The steeper the CDF is at a location, the more likely an object exists there, and the inverse CDF method samples that region more densely
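
A minimal PyTorch sketch of this inverse-CDF sampling from the piecewise-constant PDF defined by the coarse weights (simplified relative to the reference implementations, which for example use bin midpoints and add a small constant to the weights for numerical stability):

```python
import torch

def sample_fine_points(bins, weights, n_fine=128):
    """Inverse transform sampling from the piecewise-constant PDF given by `weights`.
    bins:    (..., N_c + 1) bin edges along the ray
    weights: (..., N_c)     coarse weights w_i = T_i * alpha_i
    """
    # normalize the weights into a PDF, then build the (piecewise-linear) CDF
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-10)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)   # (..., N_c + 1)

    # uniform samples u ~ U[0, 1], then invert the CDF: t = F^{-1}(u)
    u = torch.rand(*cdf.shape[:-1], n_fine, device=cdf.device)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

    # linear interpolation within the selected bin
    cdf_lo, cdf_hi = torch.gather(cdf, -1, idx - 1), torch.gather(cdf, -1, idx)
    bin_lo, bin_hi = torch.gather(bins, -1, idx - 1), torch.gather(bins, -1, idx)
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-10)
    return bin_lo + frac * (bin_hi - bin_lo)                         # (..., n_fine)
```

The returned fine positions are then merged and sorted with the coarse samples before the fine network is evaluated.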

We evaluate the coarse network using \(N_c\)=64 points per ray
We evaluate the fine network using \(N_c+N_f\)=64+128=192 points per ray, where the 128 fine points are sampled from the PDF defined by the coarse-sample weights
As a result, a total of 64+192=256 network queries per ray are used to compute the final rendered color \(\hat{C}_f(r)\)

Implementation details & Loss

  1. Prepare RGB images, corresponding camera poses, intrinsic parameters and scene bounds (use COLMAP structure-from-motion package to estimate these parameters)
  2. From the H x W input images, randomly sample a batch of 4096 pixels (rays)
  3. calculate a continuous ray from each pixel \(r(t) = o + td\)
  4. coarse sampling of \(N_c\)=64 points per each ray \(t_i\) ~ \(U \left[t_n + \frac{i-1}{N_c}(t_f - t_n), t_n + \frac{i}{N_c}(t_f - t_n)\right]\)
  5. positional encoding \(\gamma(x)\) and \(\gamma(d)\) of the inputs
  6. obtain the volume density \(\sigma\) from the MLP with \(\gamma(x, y, z)\) as input
  7. obtain the color \(c\) from the MLP with \(\gamma(x, y, z)\) and \(\gamma(d)\) as input
  8. obtain the rendered color of each ray by volume rendering \(\hat{C}(r) = \sum_{i=1}^{N} T_i (1 - \exp(- \sigma_{i} \delta_{i})) c_i\) from the two networks, 'coarse' and 'fine'
  9. compute loss
  10. Adam optimizer with learning rate decayed exponentially from \(5 \times 10^{-4}\) to \(5 \times 10^{-5}\)
  11. optimization for a single scene typically takes around 100-300k iterations (1~2 days) to converge on a single NVIDIA V100 GPU

Here, we use L2 norm for loss
\(L = \sum_{r \in R} \left[{\left\|\hat{C}_c(r)-C(r)\right\|}_2^2+{\left\|\hat{C}_f(r)-C(r)\right\|}_2^2\right]\)
\(C(r)\) : GT pixel RGB color
\(\hat{C}_c(r)\) : rendered RGB color from the coarse network, used to allocate better samples in the fine network
\(\hat{C}_f(r)\) : rendered RGB color from the fine network (our final output)
\(R\) : the set of all pixels(rays) across all images
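
A minimal sketch of this loss for one batch of rays (averaged over the batch here, whereas the paper writes a sum; `color_coarse`, `color_fine`, and `target_rgb` are (B, 3) tensors):

```python
import torch

def nerf_loss(color_coarse, color_fine, target_rgb):
    """Squared-error photometric loss on both the coarse and the fine rendered colors."""
    loss_coarse = ((color_coarse - target_rgb) ** 2).sum(dim=-1).mean()  # ||C_c(r) - C(r)||^2
    loss_fine = ((color_fine - target_rgb) ** 2).sum(dim=-1).mean()      # ||C_f(r) - C(r)||^2
    return loss_coarse + loss_fine

# optimizer = torch.optim.Adam(parameters, lr=5e-4)   # decayed exponentially toward 5e-5
```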

Results

Datasets

Synthetic renderings of objects

Real images of complex scenes

Measurement

Comparisons

Test on scenes from the authors' new synthetic dataset

LLFF exhibits banding and ghosting artifacts
SRN produces blurry and distorted renderings
NV cannot capture the details
NeRF captures fine details in both geometry and appearance

Test on real-world scenes

LLFF may show repeated edges because of blending between multiple renderings
NeRF also correctly reconstructs partially-occluded regions
SRN does not capture any high-frequency fine detail

Discussion

Ablation Studies

Conclusion

prior work : the MLP outputs discretized voxel representations
this work : the MLP outputs volume density and view-dependent emitted radiance

Future Work

efficiency :
Although hierarchical sampling helps, there is still much more progress to be made toward efficient optimization and rendering of neural radiance fields

interpretability :
voxel grids or meshes admit reasoning about the expected rendering quality and failure modes, but it is unclear how to analyze these issues when we encode scenes into the weights of an MLP