CUDA Programming

.cu coding

Lecture :
24F EE514 Parallel Computing
by KAIST Minsoo Rhu VIA Research Group

SPMD Programming

GPGPU Programming :
- serial part : in CPU C code (host)
- parallel part : in GPU SPMD kernel code (device)
SPMD (Single Program Multiple Data) :
- grid (kernel) \(\supset\) block (SM) \(\supset\) warp \(\supset\) thread
  - gridDim : grid 내 block 개수
  - blockIdx : block index
  - blockDim : block 내 thread 개수
  - threadIdx : thread index
- 보통
  1 warp = 32 threads
  1 block = 256 threads
- block 내 threads끼리 shared memory를 공유
memory address space :
- 1 address에 1 Byte를 저장하므로
  memory address가 32-bit일 때
  \(2^{32}\) Byte 저장 가능
- linear memory address space를 implement하는 것은 복잡
Shared Memory Model
- shared var. in shared address space에 저장함으로써 threads끼리 communicate
- atomicity : threads끼리 겹치지 않도록 mutual exclusion
  - semaphore
  - mutex : \(\text{LOCK(mylock); //critical section UNLOCK(mylock);}\)
  - atomic : \(\text{atomic{//critical section}}\) 또는 \(\text{atomicAdd(x, 10);}\)
- efficient implementation을 위해 hardware support 필요
  processor 수가 많으면 costly할 수 있음
- e.g. OpenMP

Message Passing Model
- thread는 각자 private address space를 가지고 있고
  threads끼리 message 주고받음으로써 communicate
- system-wide load/store를 위한 hardware implementation 필요 없음
- e.g. Open MPI

CUDA Programming

CUDA APIs :
- \(\text{cudaMalloc()}\) : device(GPU) global memory에 allocate
- \(\text{cudaFree()}\) : device(GPU) global memory free
- \(\text{cudaMemcpy()}\) : data transfer between host(CPU) and device(GPU)

CUDA Function :
- \(\text{__global__}\) :
  - kernel function 정의할 때 사용 (return void)
  - host에서 call해서 device에서 execute
- \(\text{__device__}\) :
  - device에서 call해서 device에서 execute
- \(\text{__host__}\) :
  - host에서 call해서 host에서 execute
- \(\text{__global__}\) and \(\text{__device__}\) :
  device(GPU)에서 execute해야 하므로
  - 정의 :
    e.g. \(\text{template<uint32_t C> __global__ void __launch_bounds__(BLOCK_X * BLOCK_Y) renderCUDA()}\)
    - \(\text{template<uint32_t C>}\) :
      최적화를 위해 runtime 말고 compile-time에 미리 값이 결정되는 param.
    - \(\text{__launch_bounds__(BLOCK_X * BLOCK_Y)}\) :
      최적화를 위해 CUDA kernel의 block 당 thread 개수(e.g. 256)를 명확하게 지정
  - 호출 :
    e.g. \(\text{vecAddKernel<C> <<<grid, block>>> (param.)}\)
    - \(\text{uint32_t C}\) : template var.
    - \(\text{dim3 grid}\) : grid 내 block 개수
    - \(\text{dim3 block}\) : block 내 thread 개수
- function 정의할 때 param.를 copy-by-reference 하려면 (변수를 참조해서 수정하려면)
  \(\text{float3& p}\) 처럼 param. 자료형 뒤에 & 사용

// define kernel func.
__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
  // global rank
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n)
    C[i] = A[i] + B[i];
}

// n : global rank
dim3 DimGrid((n-1)/256 + 1, 1, 1);
// 256 threads per block
// DimBlock.x = 256
dim3 DimBlock(256, 1, 1);

// call kernel func.
// kernel func.<<<#block, #thread>>>(param.)
vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);

limitation : bottlenecked by global memory bandwidth
\(\rightarrow\)
solution : scratchpad memory (shared memory)
- cache : transparent to programmer (it just works)
- scratchpad : programmer has to manually manage data movement

CUDA Variable :
- \(\text{int LocalVar;}\) :
  register에 저장하여 thread 혼자서 사용
- \(\text{(__device__) __shared__ int SharedVar;}\) :
  shared memory에 저장하여 block 내 threads끼리 공유
- \(\text{__device__ int GlobalVar;}\) :
  global memory에 저장하여 grid 내 모든 threads가 공유
- \(\text{(__device__) __constant__ int SharedVar;}\) :
  constant memory에 저장하여 grid 내 모든 threads가 공유

Tiled Matrix Multiplication

Matrix Multiplication without shared memory :
each thread has to access global memory,
so performance is bottlenecked by global memory bandwidth

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width){
  int row = blockIdx.y * blockDim.y + threadIdx.y;

  int col = blockIdx.x * blockDim.x + threadIdx.x;

  if ((row < Width) && (col < Width)){
    float value = 0;
    // each thread computes one element of output matrix
    for (int k = 0; k < Width; ++k){
      value += M[row * Width + k] * N[k * Width + col];
    }
    
    P[row * Width + col] = value;
  }
}

Tiling Algorithm :
- 17p TBD