3DGS Code Review

3D Gaussian Splatting for Real-Time Radiance Field Rendering (SIGGRAPH 2023)

code :
https://github.com/graphdeco-inria/gaussian-splatting
reference :
https://charlieppark.kr
NeRF and 3DGS Study
https://arxiv.org/abs/2401.03890

Code Flow Diagram

reference : https://charlieppark.kr

reference : NeRF and 3DGS Study 방장

train.py

Fig 1. train.py Algorithm

gaussians = GaussianModel(dataset.sh_degree)
scene = Scene(dataset, gaussians)

Gaussian Initialize

Fig 1.의 빨간 박스 : SfM pcd로부터 Gaussian param. 초기화

class Scene:
  def __init__(...):
    ...
    self.gaussians.create_from_pcd(scene_info.point_cloud, self.cameras_extent)  

scene 폴더의 \(\text{__init__.py}\) 에서 \(\text{class Scene.__init__()}\) :
- scene_info :
  Colmap 또는 Blender의 pcd, camera info.를 받아옴
  - pcd : scene_info.point_cloud
  - camera : scene_info.train_cameras, scene_info.test_cameras
- self.gaussians.create_from_pcd() :
  SfM pcd로부터 Gaussian param.들을 initialize

class Scene.__init__()

sceneLoadTypeCallbacks(dict) > readColmapSceneInfo

GaussianModel.create_from_pcd()

Densification

Fig 1.의 초록 박스 : densification (clone and split)

train.py

GaussianModel.add_densification_stats()

GaussianModel.densify_and_prune()

GaussianModel.densify_and_clone(), GaussianModel.densify_and_split()

GaussianModel.densification_postfix()

GS Rasterize

Fig 2. GS Rasterize Algorithm

Fig 2.의 노란 박스 : GS rasterization by CUDA

gaussian_renderer 폴더의 \(\text{__init__.py}\) 에서 \(\text{render()}\) :

render()

GaussianRasterizer(nn.Module) :
- C++/CUDA rasterization routine을 invoke
- nn.Module을 override하는 class의 forward()를 호출하려면
  model(…)
  vs.
  torch.autograd.Function를 override하는 _RasterizeGaussians의 forward()를 호출하려면
  _RasterizeGaussians.apply(…)

diff_gaussian_rasterization.__init__.py.GaussianRasterizer(nn.Module)

_RasterizeGaussians(torch.autograd.Function) :
- goal :
  - MLP weight가 아니라 3DGS param.를 gradient-based back-progapagation으로 update해야 함
  - torch.autograd.Function 함수를 override하여
    @staticmethod로 forward()와 backward() 정의
  - ctx.save_for_backward()와 ctx.saved_tensors를 이용해서
    forward()에서 backward()로 정보 전달
  - backward()에서 자동 미분(autograd) 말고 gradient를 manually 지정함으로써 효율적인 계산 가능
- torch.autograd.Function.forward()
  _C.rasterize_gaussians() 호출해서 rasterization 결과 반환
- torch.autograd.Function.backward()
  _C.rasterize_gaussians_backward() 호출해서 gradient 반환

diff_gaussian_rasterization.__init__.py._RasterizeGaussians(torch.autograd.Function).forward()

diff_gaussian_rasterization.__init__.py._RasterizeGaussians(torch.autograd.Function).backward()

setup.py 에서 setup() : C++/CUDA 확장
- ext_modules : pytorch용 CUDA Extension 정의
  - CUDAExtension() : pytorch에서 CUDA Extension 패키지를 빌드하는 module
  - name : compiled CUDA Extension 패키지 이름
  - sources : 컴파일할 C++/CUDA source codes

C++/CUDA code를 컴파일하여 diff_gaussian_rasterization._C 패키지를 만들기 위한 setup.py

ext.cpp 컴파일 :
PYBIND11_MODULE()로 python 함수와 rasterize_points.cu 의 CUDA 함수를 연결

python 함수와 rasterize_points.cu 의 CUDA 함수를 PYBIND11_MODULE()로 연결

rasterize_points.cu :
- RasterizeGaussiansCUDA()
  color, radii 저장을 위한 tensor 만들고 geometry, bin, image 저장을 위한 buffer 만든 뒤
  CudaRasterizer::Rasterizer::forward() 호출
  (namespace::class::static function 형식)
- RasterizeGaussiansBackwardCUDA()
  each param. gradient 저장을 위한 tensor 만든 뒤
  CudaRasterizer::Rasterizer::backward() 호출
- markVisible()
  present tensor 만든 뒤
  CudaRasterizer::Rasterizer::markVisible() 호출
buffer 만들기 : obtain() 함수 사용

rasterizer_impl.cu :
- int CudaRasterizer::Rasterizer::forward()
  (아래의 ### FORWARD section에서 설명)
  - FORWARD::preprocess() 호출
    \(\rightarrow\) \(\text{preprocessCUDA<> <<<,>>>()}\) 호출
  - cub::DeviceScan::InclusiveSum() 호출
  - \(\text{duplicatedWithKeys<<<,>>>()}\) 호출
  - cub::DeviceRadixSort::SortPairs() 호출
  - \(\text{identifyTileRanges<<<,>>>()}\) 호출
  - FORWARD::render() 호출
    \(\rightarrow\) \(\text{renderCUDA<> <<<,>>>()}\) 호출

// create a tile grid (a group of blocks)
dim3 tile_grid((width + BLOCK_X - 1) / BLOCK_X, (height + BLOCK_Y - 1) / BLOCK_Y, 1);
// create a block (a group of threads(pixels))
dim3 block(BLOCK_X, BLOCK_Y, 1);

rasterizer_impl.cu :
- void CudaRasterizer::Rasterizer::backward()
  (아래의 ### BACKWARD section에서 설명)
  - BACKWARD::render() 호출
  - BACKWARD::preprocess() 호출

rasterizer_impl.cu :
- void CudaRasterizer::Rasterizer::markVisible()
  - \(\text{checkFrustum<<<,>>>()}\) 호출
    \(\rightarrow\) \(\text{in_frustum()}\) 호출

FORWARD

\(\text{preprocessCUDA<NUMCHANNELS> <<<(P + 255) / 256, 256 >>>()}\)
where P : # of Gaussians
where one thread per each Gaussian

/* preprocessCUDA()의 main code flow */

// global rank of current thread (Gaussian)
auto idx = cg::this_grid().thread_rank();

in_frustum(/*...*/) // near culling

// compute 3D covariance in world-coord.
computeCov3D(/*...*/)
// compute 2D covariance in image-coord.(z=1) by EWA Splatting
computeCov2D(/*...*/)

// compute inverse of 2D covariance
float3 conic = { cov.z * det_inv, -cov.y * det_inv, cov.x * det_inv }; 

// image from ndc to pixel-coord.
float2 point_image = { ndc2Pix(p_proj.x, W), ndc2Pix(p_proj.y, H) };

// overlap 기준 : 반지름 = 3 * sigma
float my_radius = ceil(3.f * sqrt(max(lambda1, lambda2)));

// idx-th Gaussian과 겹치는 가장 왼/오른쪽, 위/아래쪽 tile의 index를 계산
getRect(point_image, my_radius, rect_min, rect_max, grid);

// idx-th Gaussian의 depth, radius 저장
depths[idx] = p_view.z; radii[idx] = my_radius;
// pixel-coord. image 저장
points_xy_image[idx] = point_image; 
// idx-th Gaussian의 inverse covariance, opacity 저장
conic_opacity[idx] = { conic.x, conic.y, conic.z, opacities[idx] }; 
// idx-th Gaussian과 겹치는 tile 개수 저장
tiles_touched[idx] = (rect_max.y - rect_min.y) * (rect_max.x - rect_min.x);

__global__ void preprocessCUDA()

__device__ bool in_frustum()

__device__ void computeCov3D()

// world-to-camera of 3D mean to obtain J
float3 t = transformPoint4x3(mean, viewmatrix);

__device__ float3 computeCov2D()

\(\text{__device__ void getRect()}\) :
- uint2 rect_min 에서는 idx-th Gaussian의 가장 왼쪽 (p.x - \(3 \sigma\))과 겹치는 tile의 index를 계산
  - \(3 \sigma\) : 99.7% confidence (Gaussian의 max. 반지름으로 간주)
  - p.x - max_radius 는 pixel-coord.인데
    tile 너비 당 pixel 수인 BLOCK_X 로 나누면
    tile index가 됨
  - grid.x 는 x축 상의 grid 내 tile(block) 개수
- uint2 rect_max 에서는 idx-th Gaussian의 가장 오른쪽 (p.x + \(3 \sigma\))과 겹치는 tile의 index를 계산
  - BLOCK_X - 1 을 분자에 더해주는 이유는
    C++에서 / BLOCK_X 계산하는 게 내림이기 때문

__device__ float ndc2Pix() and __device__ void getRect()

cub::DeviceScan::InclusiveSum()
- Gaussian instance를 몇 개 만들어야 하는지 계산하기 위해
  inclusive(prefix) sum 수행
- GPU parallel computing을 지원하는 CUDA library인 CUB 사용

/*
// 각 Gaussian이 touch한 tile 개수
geomState.tiles_touched : [2, 3, 0, 2, 1]의 주소

// inclusive sum 수행한 후의 output array
geomState.point_offsets : [2, 5, 5, 7, 8]의 주소

// Gaussian이 touch한 총 tile 개수 또는 duplicated Gaussian instance 수
num_rendered = *(geomState.point_offsets + P-1) : 8
*/
CHECK_CUDA(cub::DeviceScan::InclusiveSum(geomState.scanning_space, geomState.scan_size, geomState.tiles_touched, geomState.point_offsets, P), debug)

\(\text{duplicatedWithKeys<<<(P+255)/256, 256>>>()}\)
one thread per each Gaussian
- idx-th Gaussian과 겹치는 tile 개수만큼 (idx: current thread rank)
  duplicated Gaussian instance의 key-value pair 만들기
  - 64-bit key : tileID-depth
    - tileID : y * grid.x + x
    - depth : depths[idx]
  - value : GaussianID
    - GaussianID : idx

cub::DeviceRadixSort::SortPairs()
- key를 기준으로 정렬 (in parallel)

from collections import deque

# bit sequence 말고 자연수를 정렬한다고 할 때
# 1의 자릿수 기준으로 정렬한 뒤
# 10의 자릿수 기준으로 정렬한 뒤
# ...
def radixSort():
    nums = list(map(int, input().split(' ')))
    buckets = [deque() for _ in range(10)] # 각 자릿수(0~9)에 대응되는 10개의 empty deque()
    
    max_val = max(nums)
    queue = deque(nums) # 정렬할 숫자들
    digit = 1 # 정렬 기준이 되는 자릿수
    
    while (max_val >= digit): # 가장 큰 수의 자릿수일 때까지만 실행
        while queue:
            num = queue.popleft() # 정렬할 숫자
            buckets[(num // digit) % 10].append(num) # 각 자릿수(0~9)에 따라 buckets에 num을 넣는다.
        
        # 해당 정렬 기준 자릿수에서 buckets에 다 넣었으면, buckets에 담겨있는 순서대로 꺼내와서 queue에 넣음
        for bucket in buckets:
            while bucket:
                queue.append(bucket.popleft())

        digit *= 10 # 정렬 기준이 되는 자릿수 증가시키기
    
    print(list(queue))

\(\text{identifyTileRanges<<<,>>>()}\)
one thread per each duplicated Gaussian instance
- 아래 코드 설명 :
  tile1-depth1 (Gaussian0), tile1-depth2 (Gaussian1), tile2-depth1 (Gaussian2), tile2-depth3 (Gaussian3) 으로 sort되었고
  idx = 2 (Gaussian2) 일 때
  - key : tile2-depth1
  - currtile : tile2
  - prevtile : tile1
  - currtile과 prevtile이 다르므로, 즉 idx는 처음으로 tileID가 바뀐 Gaussian instance에 해당하므로
    prevtile의 TileRange의 끝 지점과
    currtile의 TileRange의 시작 지점을
    GaussianID인 idx로 설정

\(\text{renderCUDA<NUMCHANNELS> <<<grid, block>>>()}\)
where one block for each tile
where one thread for each pixel

/* renderCUDA()의 main code flow */

// current block
// block.group_index() : current block의 (x, y) index
auto block = cg::this_thread_block();

// current block's first pixel
uint2 pix_min = { block.group_index().x * BLOCK_X, block.group_index().y * BLOCK_Y };
// current thread
// pixf : current thread의 (x, y) index
float2 pixf = { (float)(pix_min.x + block.thread_index().x), (float)(pix_min.y + block.thread_index().y) };

// check if current thread corresponds to a valid pixel
bool inside = pix.x < W&& pix.y < H;

// current tile의 Gaussians 수 (range.y - range.x)를 처리하기 위한 batch 수
const int rounds = ((range.y - range.x + BLOCK_SIZE - 1) / BLOCK_SIZE);

// current block 내 모든 threads가 __syncthreads_count()라는 barrier에 도달해서
// done==True를 만족하는 threads 수 반환
int num_done = __syncthreads_count(done);

for (int i = 0; i < rounds; i++, toDo -= BLOCK_SIZE)
{
    /* fetch step */
    // current block 내 모든 threads가 각자 progress-th Gaussian을 병렬적으로 fetch하여
    // BLOCK_SIZE=256개(batch)만큼씩 fetch into shared memory
    // block.thread_rank() : current thread의 local rank
    int progress = i * BLOCK_SIZE + block.thread_rank();
    if (range.x + progress < range.y){
        int coll_id = point_list[range.x + progress];
        collected_id[block.thread_rank()] = coll_id;
        collected_xy[block.thread_rank()] = points_xy_image[coll_id];
        collected_conic_opacity[block.thread_rank()] = conic_opacity[coll_id];
    }
    // current block 내 모든 threads가 여기 barrier에 도달할 때까지 대기
    block.sync();

    /* rasterization step */
    // thread(pixel)마다 병렬적으로 BLOCK_SIZE=256개(batch)의 Gaussians를 앞에서부터 alpha-compositing
    // accumulated opacity T 가 너무 작아지면 해당 threads는 alpha-compositing 종료
    for (int ch = 0; ch < CHANNELS; ch++)
	    C[ch] += features[collected_id[j] * CHANNELS + ch] * alpha * T;
    for (int ch = 0; ch < CHANNELS; ch++)
		// 마지막에 bg color까지 alpha-compositing
        out_color[ch * H * W + pix_id] = C[ch] + T * bg_color[ch];
}

__global__ void __launch_bounds__(BLOCK_X * BLOCK_Y) renderCUDA()

BACKWARD

BACKWARD::render()
BACKWARD::preprocess()
nn.Module의 MLP weight가 아니라 3DGS param.를 gradient-based back-progapagation으로 update해야 하므로
autograd(자동미분)에 의존하지 않고 각 3DGS param.의 gradient를 계산하는 과정을 직접 implement
- 자세한 과정 설명은 생략하겠음!
  공부하고 싶다면 backward.cu 와 blog 의 param. gradient 직접 유도 (Appendix A.) 부분 참고!!