NeRF-Code

NeRF Code Review

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Ben Mildenhall, Pratul P.Srinivasan, Matthew Tancik

paper :
https://arxiv.org/abs/2003.08934
project website :
https://www.matthewtancik.com/nerf
pytorch code :
https://github.com/yenchenlin/nerf-pytorch
https://github.com/csm-kr/nerf_pytorch?tab=readme-ov-file
tiny tensorflow code :
https://colab.research.google.com/github/bmild/nerf/blob/master/tiny_nerf.ipynb
Overview image reference :
https://yconquesty.github.io/blog/ml/nerf/nerf_ndc.html#dataflow

NeRF code는 빠른 실행을 위해 lower-level framework인 jax와 jit compile로 짜여진 버전도 있는데,
본 포스팅에서는 좀 더 익숙한 numpy, Pytorch framework로 코드 리뷰를 진행하였다

Train Code Flow Overview

Load Data

load data :
- load_llff.py
- load_blender.py
- load_LINEMOD.py
- load_deepvoxels.py

load_llff_data()

LLFF dataset : real dataset
return images, poses, bds, render_poses, i_test
- images : np (N, H, W, C)
- poses : np (N, 3, 5)
  camera poses
  poses[:, 0:3, 0:3] : 3-by-3 rotation matrix
  poses[:, 0:3, 3:4] : 3-by-1 translation matrix
  poses[:, 0:3, 4:5] : H, W, focal-length for intrinsic matrix
- bds : np (N, 2)
  scene bounds
  dim=1 : 2 = 1(near bound) + 1(far bound)
- render_poses : np (M, 3, 5)
  dim=0 : the number of generated poses for novel view synthesis
  generate new pose along sphere or spiral path
- i_test : int
  index of holdout-view (avg pose랑 가장 비슷한 pose를 갖는 view)
  training에서 제외하여 test할 때 사용
- near, far = 0., 1. if ndc is true else near, far = 0.9 * bds.min(), 1. * bds.max()

load_blender_data()

Blender dataset : synthetic dataset
return images, poses, render_poses, hwf, i_split
- images : np (N, H, W, C)
  blender dataset은 RGB-A channel을 가지고 있어 C = 4
- i_train, i_val, i_test = i_split
- near, far = 2., 6.
  (blender synthetic dataset은 통제된 환경에서 수집된 data이므로 ndc 사용하지 않고 frustum의 near, far plane 고정)
- 투명한 배경을 흰 배경으로 만들려면
  RGB * opacity + (1 - opacity) 를 통해
  RGB 값을 opacity만큼 반영하고 opacity가 작을수록(투명할수록) 색상이 흰색(1.)에 가까워지도록 함
  images = images[…,:3]*images[…,-1:] + (1.-images[…,-1:])
- 그냥 투명한 배경 그대로 쓰려면
  RGB-A channel에서 RGB channel만 가져와서 씀
  images = images[…,:3]

load_LINEMOD_data()

LINEMOD dataset : real dataset
return images, poses, render_poses, hwf, K, i_split, near, far

load_dv_data()

Deepvoxels dataset : synthetic dataset
return images, poses, render_poses, hwf, i_split
- near, far = hemi_R - 1., hemi_R + 1.
  where hemi_R = np.mean(np.linalg.norm(poses[:,:3,-1], axis=-1))
  camera center들로 이루어진 반구의 평균 반지름

Create NeRF Model

args.N_importance : fine-MLP에서 추가적으로 사용할 fine-sample 개수
- args.N_importance > 0 : fine-MLP 사용함
- args.N_importance <= 0 : fine-MLP 사용 안함
network_query_fn : 추후에 run_network() 사용하기 위한 함수
- input : position info., view-direction info., model
- output : model output

render_kwargs_train : dict for rendering
- network_query_fn : 추후에 run_network() 사용하기 위한 함수
- perturb : 일반화 위해 stratified ray-sampling할 때 randomness 추가할지 여부
  (test할 때는 False)
- network_fine, network_fn : fine-MLP, coarse-MLP
- N_importance, N_samples : number of fine-sampling, coarse-sampling
- white_bkgd : rendering에서 alpha-channel 사용할 때 투명한 부분이 흰색으로 채워지도록 할지 여부
- raw_noise_std : regularize(artifacts 완화) 위해 raw2ouputs()에서 model output 중 opacity에 추가할 noise의 std값
  (test할 때는 0.)
- lindisp :
  - NDC를 사용하는 front-unbounded llff dataset의 경우 lindisp = False로 설정하여
    linearly sampling in depth, 즉 depth를 균등하게 sampling하여
    먼 거리의 scene도 적절히 표현
  - NDC를 사용하지 않는 나머지 dataset의 경우 lindisp = True로 설정하여
    linearly sampling in inverse-depth, 즉 가까운 depth를 더 많이 sampling하여
    가까운 scene의 디테일을 잘 포착

Positional Encoding

get_embedder() input :
PE freq. 개수 \(L\) 과 PE 쓸지말지 여부
get_embedder() output :
PE-function과 PE 결과의 dim.

self.embed_fns :
각 frequency(\(0 \sim 2^{L-1}\))와 각 period function(\(sin, cos\))에 대한
list of lambda functions
\([sin(2^0x), cos(2^0x), \ldots sin(2^{L-1}x), cos(2^{L-1}x)]\)
Embedder.embed(x) :
self.embed_fns의 각 PE-function을 input x에 적용하여 dim=-1에 대해 concat

NeRF model

input_ch : position info. dim. : 3
input_ch_views : view-direction info. dim. : 3
use_viewdirs : MLP input으로 view-direction info.를 사용할지 말지 여부
(view-direction info.를 사용하면 RGB color 계산에 도움됨)
output_ch : output(RGB, opacity) dim. : 4 use_viewdirs가 False일 때만 사용하는 값

input x를 position info.와 view-direction info.로 쪼갬
self.use_viewdirs가 True일 때(view-direction info. 사용할 때) :
position info.만 넣어서 opacity를 뽑은 뒤
view-direction info.를 추가로 넣어서 RGB 뽑고
dim=-1에 대해 concat
self.use_viewdirs가 False일 때(view-direction info. 사용 안 할 때) :
position info.만 넣어서 output_ch만큼 한 번에 뽑음

run_network

flatten position and flatten view-direction \(\rightarrow\) each positional encoding and concat \(\rightarrow\) batchify model and apply model \(\rightarrow\) reshape again output

batchify

input이 주어지면 chunk만큼씩 쪼개서 적용하는 model 반환

Get Ray with batch

rays : shape (N, 2, H, W, 3)
- dim=1 : rays_o, rays_d
- dim=2, 3 : for H*W개의 pixels
- dim=4 : 3d
rays_rgb : shape (N, 3, H, W, 3) after concat with images
- dim=1 : rays_o, rays_d, images
rays_rgb : shape (N, H, W, 3, 3) \(\rightarrow\) (N_train, H, W, 3, 3) \(\rightarrow\) (N_train * H * W, 3, 3) \(\rightarrow\) shuffle along dim=0
- dim=0 : the number of rays(pixels)
- dim=1 : rays_o, rays_d, images
- dim=2 : 3d for rays and rgb for images

batch : N_train * H * W 개의 ray를 batch size = N_rand-개씩 묶어서 전부 사용
shape (N_train * H * W, 3, 3) \(\rightarrow\) shape (N_rand, 3, 3) \(\rightarrow\) (3, N_rand, 3)
batch_rays : shape (2, N_rand, 3)
- dim=0 : rays_o, rays_d
- dim=1 : the number of rays
target_s : shape (N_rand, 3)
- dim=0 : the number of pixels
- dim=1 : target pixel RGB
shuffle rays_rgb by torch.randperm() for every epoch

get_rays_np

parameter :
K : intrinsic matrix of shape (3, 3)
c2w : extrinsic matrix of shape (3, 4)
line 1 :
np.meshgrid([0, …, W-1], [0, …, H-1], indexing=’xy’)
- indexing=’xy’ : 첫 번째 array를 row-방향으로 반복하고, 두 번째 array를 column-방향으로 반복
- i, j : both shape (H, W) : 2D-pixel-coordinate (x, y)
line 2 :
- apply intrinsic matrix
  NeRF-Blog 의 Ray from input image (pre-processing) 참고
- dirs : shape (H, W, 3) : 2D-normalized-coordinate
line 4 :
- apply extrinsic matrix to calculate ray-direction
- dirs[…, np.newaxis, :] : shape (H, W, 1, 3) \(\rightarrow\) (H, W, 3, 3) by broad-casting
- c2w[:3, :3] : shape (3, 3) \(\rightarrow\) (H, W, 3, 3) by broad-casting
- ray_d : shape (H, W, 3)
  “elementwise-multiplication 후 sum”은 “matrix-multiplication”과 동일한 계산

* 오타 정정 : 2. matrix multiplication에서 [u, v, 1; u, v, 1; u, v, 1] 대신 [u; v; 1]

line 6 :
- apply extrinsic matrix to calculate ray-origin
- rays_o : shape (3,) \(\rightarrow\) (H, W, 3) by broad-casting

Get Ray without batch

차이점 :
Get Ray with batch에서는 N_train * H * W 개의 ray를 batch size = N_rand-개씩 묶어서 전부 사용했다면
Get Ray without batch에서는 N_train 중 training view 하나를 randomly 고른 뒤 H * W 개의 ray 중 N_rand-개를 randomly 골라서 사용
target : shape (N, H, W, C) \(\rightarrow\) (H, W, C)
\(\rightarrow\) target_s : shape (N_rand, C)
coords : H * W 개의 ray를 H-axis와 W-axis에서 인덱싱하기 위해 meshgrid of shape (H, W, 2) 생성
- 초반부 iter. : 중심부 crop해서 meshgrid of shape (2 * dH, 2 * dW, 2) 생성
- 후반부 iter. : meshgrid of shape (H, W, 2) 생성
- dim=2 : coords[:, :, 0]은 H-coord이고, coords[:, :, 1]은 W-coord
select_coords : shape (N_rand, 2)
H * W 개의 ray 중 N_rand-개를 randomly 고름
batch_rays : shape (2, N_rand, 3)
target_s : shape (N_rand, 3)

Render

input :
- chunk : 동시에 처리할 수 있는 최대 ray 수 (due to maximum memory usage)
- c2w_staticcam : view-direction의 영향을 확인하고자 할 때 사용
  기존 c2w는 view-direction MLP input 만드는 데만 사용하고
  c2w_staticcam으로 rendering 위한 rays_o, rays_d 다시 계산
output :
- rgb_map : shape (B, 3)
  predicted RGB values for B개의 rays
- disp_map : shape (B,)
  disparity map (inverse of depth)
- acc_map : shape (B,)
  sum of sample weights along each ray
- extras : 나머지 dict from render_rays()
  fine-MLP를 사용하는 경우에만 존재
  - rgb0, disp0, acc0 : from coarse-MLP
  - z_std : shape (B,)
    std of distances (\(t\) 값) of fine samples for each ray

rays :
- if use_viewdirs = True : shape (N_rand, 8)
  dim=1 : 3(rays_o) + 3(rays_d) + 1(near) + 1(far)
- if use_viewdirs = False : shape (N_rand, 11)
  dim=1 : 3(rays_o) + 3(rays_d) + 1(near) + 1(far) + 3(viewdirs)
all_ret : dict
- rgb_map : shape (N_rand, 3)
- disp_map : shape (N_rand,)
- acc_map : shape (N_rand,)
- raw : MLP raw output (raw2outputs() 안 한 것)
- rgb0, disp0, acc0 : from coarse-MLP
- z_std : shape (N_rand,)
render() output :
rgb_map, disp_map, acc_map, (나머지 모아놓은)-dict

ndc_rays

shift ray origin to near plane :
NDC를 적용하기 전에 3D ray origin \(o\) 을 near plane 위 \(o_n\) 으로 옮긴다
(world-coordinate에서 ray가 near plane에서 출발하도록)
by \(o_n = o + t_nd\)
where z-axis에서는 \(-n = o_z + t_nd_z\) 이므로 \(t_n = \frac{-(n+o_z)}{d_z}\)
where n은 argument(near)
project ray to NDC-space :
ray \(r = o_n + td\) 를 NDC로 projection했을 때
projected ray \(r^{\ast} = o^{\ast} + t^{\ast} d^{\ast}\) 에서
\(o^{\ast} = \begin{bmatrix} -\frac{f_{cam}}{\frac{W}{2}}\frac{o_{n_x}}{o_{n_z}} \\ -\frac{f_{cam}}{\frac{H}{2}}\frac{o_{n_y}}{o_{n_z}} \\ 1 + \frac{2n}{o_{n_z}} \end{bmatrix}\) where n은 argument(near)
and
\(t^{\ast} = \frac{td_z}{o_{n_z} + td_z} = 1 - \frac{o_{n_z}}{o_{n_z} + td_z}\)
and
\(d^{\ast} = \begin{bmatrix} -\frac{f_{cam}}{\frac{W}{2}}(\frac{d_x}{d_z} - \frac{o_{n_x}}{o_{n_z}}) \\ -\frac{f_{cam}}{\frac{H}{2}}(\frac{d_y}{d_z} - \frac{o_{n_y}}{o_{n_z}}) \\ -2n\frac{1}{o_{n_z}} \end{bmatrix}\) where n은 argument(near)

batchify_rays

Out-of-Memory를 방지하기 위해 N_rand-개의 rays를 더 작은 chunk (B개)로 쪼개서 rendering
ret : render_rays()의 output
dict
- rgb_map : shape (B, 3)
  predicted RGB values by alpha-compositing
- disp_map : shape (B,)
  disparity map (inverse of depth)
- acc_map : shape (B,)
  sum of sample weights along each ray
- raw : MLP raw output (raw2outputs() 안 한 것)
- rgb0, disp0, acc0 : from coarse-MLP
- z_std : shape (B,)
  std of distances (\(t\) 값) of fine-samples for each ray
all_ret : B-개씩 쪼개서 rendering한 걸 다시 N_rand-개로 합침
dict
- rgb_map : shape (N_rand, 3)
- disp_map : shape (N_rand,)
- acc_map : shape (N_rand,)
- raw : MLP raw output (raw2outputs() 안 한 것)
- rgb0, disp0, acc0 : from coarse-MLP
- z_std : shape (N_rand,)

render_rays

ray_batch of shape (B, 8) or (B, 11)로부터
rays_o, rays_d, near, far, viewdirs 분리

Stratified Sampling of distance \(t\) for coarse-MLP :
z_vals : shape (B, N_samples) = (N_rays, N_samples)
stratified sampled distance \(t\)
- Let 균등한 간격을 나타내는 \(t_{vals} \in [0, 1]\) has shape (N_samples,)
- if lindisp = False:
  sample linearly in depth
  \(z_{vals} = near \cdot (1-t_{vals}) + far \cdot (t_{vals})\)
- if lindisp = True:
  sample linearly in inverse-depth
  \(z_{vals} = \frac{1}{\frac{1}{near} \cdot (1-t_{vals}) + \frac{1}{far} \cdot (t_{vals})}\)
- if perturb = True:
  add randomness

perturb=False이면 맨 윗줄을 coarse-samples로 쓰고, perturb=True이면 맨 아랫줄을 coarse-samples로 쓴다

pts, viewdirs : coarse-MLP input
- pts : position info. \(r = o + td\) of shape (B, N_samples, 3)
- viewdirs : view-direction info. of shape (B, 3)
raw : coarse-MLP output
shape (B, N_samples, 4) where 4 : for RGB, opacity

Inverse-transform Sampling of distance \(t\) for fine-MLP :
- coarse-samples :
  z_vals : shape (B, N_samples) = (N_rays, N_samples)
- fine-samples :
  coarse-MLP의 MLP output raw에 대해 raw2outputs()로 구한 weights 값을 Fine-Sampling에 사용
  z_samples : shape (B, N_importance)
- total sorted samples for fine-MLP :
  z_vals : shape (B, N_samples + N_importance)
pts, viewdirs : fine-MLP input
- pts : position info. \(r = o + td\) of shape (B, N_samples + N_importance, 3)
- viewdirs : view-direction info. of shape (B, 3)
raw : fine-MLP output
shape (B, N_samples + N_importance, 4) where 4 : RGB, opacity

sample_pdf

input :
- z_vals_mid : shape (B, N_samples - 1)
  stratified samples 사이의 중점
- weights[…, 1:-1] : shape (B, N_samples - 2)
  시작점, 끝점 빼고 weight of each stratified sample
- det : stratified samples에 randomness 부여했다면 False

pdf : shape (B, N_samples - 2)
\(\frac{w_i}{\sum_{j=1}^{num_{N_samples - 2}} w_j}\)
cdf : shape (B, N_samples - 1)
\(F_i = \sum_{j=1}^{i-1} f_j\)
by torch.cumsum()
각 row는 0 ~ 1 에서 점점 증가하는 수로 이루어져 있음
u : shape (B, N_importance)
- det가 True (no randomness)일 경우 :
  \(\begin{bmatrix} 0 & \frac{1}{N_{importance}-1} & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \end{bmatrix}\)
- det가 False (randomness)일 경우 :
  0 ~ 1 사이의 random float로 이루어져 있음
inds : shape (B, N_importance)
u를 cdf의 어디에 끼워넣을 수 있는지에 대한 index
by torch.searchsorted()
below : shape (B, N_importance)
max(0, inds - 1)
above : shape (B, N_importance)
min(N_samples - 2, inds)
inds_g : shape (B, N_importance, 2) and range [0, N_samples - 1)
u가 위치할 수 있는 cdf의 두 경계의 index를 의미
cdf_g : shape (B, N_importance, 2)
torch.gather(cdf.expand(), 2, inds_g)
inds_g에 따라 cdf의 값(확률값)을 추출해옴
where cdf.expand() : shape (B, N_importance, N_samples - 1)
where inds_g : shape (B, N_importance, 2) and range [0, N_samples - 1)
bins_g : shape (B, N_importance, 2)
torch.gather(bins.expand(), 2, inds_g)
inds_g에 따라 bins의 값(coarse-samples 사이의 중점 \(t\) 값)을 추출해옴
where bins.expand() : shape (B, N_importance, N_samples - 1)
where inds_g : shape (B, N_importance, 2) and range [0, N_samples - 1)
denom : shape (B, N_importance)
u가 위치할 수 있는 구간의 cdf 값 차이
t : shape (B, N_importance)
u가 구간 내에서 차지하는 상대적인 위치
samples : shape (B, N_importance)
fine samples의 \(t\) 값
bins_g[…, 0]과 bins_g[…, 1] 사이의 값

CDF 가로축의 empty circles는 coarse(stratified) samples 사이의 중점(mid-point)의 t값

render_rays() output : dict
- rgb_map : shape (B, 3)
  predicted RGB values by alpha-compositing
- disp_map : shape (B,)
  disparity map (inverse of depth)
- acc_map : shape (B,)
  sum of sample weights along each ray
- raw : MLP raw output (raw2outputs() 안 한 것)
- rgb0, disp0, acc0 : from coarse-MLP
- z_std : shape (B,)
  std of distances (\(t\) 값) of fine-samples for each ray

raw2outputs

input :
- raw : shape (B, num_samples, 4)
dists : shape (B, num_samples)
\(\delta_{i}\) : sample 간의 간격 in world-coordinate
sample 간의 간격 in t-coordinate 에 \(\| d \|\) 곱해서 구함
(dists[:, -1]은 마지막 sample부터 inf까지의 간격을 의미하는 매우 큰 수 1e10)
rgb : shape (B, num_sample, 3)
\(c_i\) : raw-RGB에 sigmoid 씌운 값
sigmoid(raw[…, :3])
so that \(c_i \in (0, 1)\)
alpha : shape (B, num_samples)
\(\alpha_{i} = 1 - \exp(- \sigma_{i} \delta_{i})\)
where \(\sigma_{i}\) : raw-opacity에 noise 더하고 relu 씌운 값
so that \(\sigma_{i} \in [0, \infty)\)
weights : shape (B, num_samples)
\(w_i = \alpha_{i} \times T_i\)
where \(T_i = \prod_{j=1}^{i-1} (1-\alpha_{j}+1e-10)\) is obtained by torch.cumprod()
output :
- rgb_map : shape (B, 3)
  predicted RGB values
  by volume rendering \(\hat{C}(r) = \sum_{i=1}^{num_{samples}} T_i \alpha_{i} c_i = \sum_{i=1}^{num_{samples}} w_i c_i\)
  - if white_bkgd (투명한 배경 대신 흰색) :
    \(\hat{C}(r) = \sum_{i=1}^{num_{samples}} w_i c_i + (1 - \sum_{i=1}^{num_{samples}} w_i)\)
    so that 투명해서 \(\sum_{i=1}^{num_{samples}} w_i\) 가 작을 때 RGB-color가 흰색(1.)에 가깝도록
- disp_map : shape (B,)
  disparity map (inverse of depth)
  by \(\frac{1}{max(1e-10, \frac{\sum_{i=1}^{num_{samples}} w_i t_i}{\sum_{i=1}^{num_{samples}} w_i})}\)
- acc_map : shape (B,)
  sum of sample weights along each ray
  by \(\sum_{i=1}^{num_{samples}} w_i\)
- weights : shape (B, num_samples)
  weight of each sample
  \(w_i = \alpha_{i} \times T_i\)
- depth_map : shape (B,)
  depth map (estimated distance to object)
  by \(\sum_{i=1}^{num_{samples}} w_i t_i\)
  (weight가 높은 깊이 값 \(t_i\) 을 더 많이 반영하는 식으로 weighted sum)

Evaluation

img2mse for loss and mse2psnr for psnr

loss = coarse-MLP-loss + fine-MLP-loss
where each is MSE loss b.w. predicted RGB and GT RGB of shape (N_rand, 3)
PSNR : \(PSNR = -10 * log_{10}(loss)\)
to8b : 0. ~ 1.에서 0 ~ 255 (8-bit)로 변환

test

args.i_video iter.마다
novel view(render_poses)에 대해 rendering해서
여러 장의 rgb_map과 disp_map을 동영상으로 저장
args.i_testset iter.마다
test view에 대해 rendering해서
한 장의 rgb_map을 사진으로 저장

render_path

inference rendering (한 장씩)
빠른 rendering을 위해 H, W, focal을 downsample

Question

Q1 : 왜 ndc_rays() 호출할 때 near bound n 값에 near = 1.으로 하드코딩해서 넣어주지?
A1 : ????
Q2 : 왜 blender dataset에서 render_poses 만들 때 phi=-30. 으로 하드코딩해서 넣어주지?
A2 : ????