CUDA 5

Presenter Notes

SGEMM cont'd:

Mapeos no 1-a-1.
SGEMM con shmem, 2 versiones.
ILP, "Volkov style".
CUBLAS.

20180612

Presenter Notes

Mapeos no 1-a-1

Presenter Notes

Mapeo 1 hilo, muchos `c[i][j]`

Kernel

1 __global__ void sgemm_multi(const uint N, const uint MULTI, float * __restrict__ a, float * __restrict__ b, float * __restrict__ c) {
2     const uint i = blockIdx.y*blockDim.y + threadIdx.y;
3     const uint j = MULTI*(blockIdx.x*blockDim.x + threadIdx.x);
4     if (i<N && j<N)
5         for (uint jj=j; jj<j+MULTI; ++jj)
6             for (uint k=0; k<N; ++k)
7                 c[IX(i,jj)] += a[IX(i,k)] * b[IX(k,jj)];
8 }

Host

1 dim3 grid_size_sgemm(DIV_CEIL(N,BX*MULTI), DIV_CEIL(N,BY));
2 dim3 block_size_sgemm(BX,BY);
3 sgemm_multi<<<grid_size_sgemm, block_size_sgemm>>>(N, MULTI, d_a, d_b, d_c);

Mapeo de hilos a celdas

(i,j) <-> { (i,j*MULTI),...,(i,j*MULTI+MULTI-1) }

Presenter Notes

Corrección y Performance

1 $ for i in {1,2,4,8}; do ./sgemm-multi 1024 16 16 $i; done
2 max_diff: 0.000092
3 max_diff: 0.000092
4 max_diff: 0.000092
5 max_diff: 0.000092

Performance

1 $ for i in {1,2,4,8}; do nvprof ./sgemm-multi 1024 16 16 $i 2>&1 | grep sgemm_multi; done
2  6.8601ms         1  6.8601ms  6.8601ms  6.8601ms  sgemm_multi(unsigned int, unsigned int, float*, float*, float*)
3  7.0656ms         1  7.0656ms  7.0656ms  7.0656ms  sgemm_multi(unsigned int, unsigned int, float*, float*, float*)
4  16.195ms         1  16.195ms  16.195ms  16.195ms  sgemm_multi(unsigned int, unsigned int, float*, float*, float*)
5  19.135ms         1  19.135ms  19.135ms  19.135ms  sgemm_multi(unsigned int, unsigned int, float*, float*, float*)

No hay mejora de performance.
Pero si en la comunicación, usé cudaMallocHost() y pasé de 2.5 GiB/s a 12 GiB/s.

1 $ nvprof --print-gpu-trace ./sgemm-multi 1024 16 16 1
2 ...
3 402.54ms  327.01us  4.0000MB  11.945GB/s  GeForce GTX TIT  [CUDA memcpy DtoH]
4 402.89ms  323.52us  4.0000MB  12.074GB/s  GeForce GTX TIT  [CUDA memcpy DtoH]
5 403.22ms  323.52us  4.0000MB  12.074GB/s  GeForce GTX TIT  [CUDA memcpy DtoH]

Presenter Notes

Memoria compartida en bloque

Presenter Notes

SGEMM usando shared: estrategia

Figure 3-2. Matrix Multiplication with Shared Memory

NVIDIA, NVIDIA CUDA C Programming Guide, 2018.

Presenter Notes

SGEMM usando shared: el kernel

 1 __global__ void sgemm(float * __restrict__ a, float * __restrict__ b, float * __restrict__ c) {
 2     __shared__ float sa[B][B];
 3     __shared__ float sb[B][B];
 4 
 5     const uint ti = threadIdx.y; const uint tj = threadIdx.x;
 6     const uint i = blockIdx.y*B + ti;
 7     const uint j = blockIdx.x*B + tj;
 8     float cij = 0.0f;
 9     for (uint m=0; m<N/B; ++m) {
10         sa[ti][tj] = a[IX(i,m*B+tj)];
11         sb[ti][tj] = b[IX(m*B+ti,j)];
12         __syncthreads();
13 
14         for (uint k=0; k<B; ++k)
15             cij += sa[ti][k] * sb[k][tj];
16         __syncthreads();
17     }
18     c[IX(i,j)] = cij;
19 }

» Mucho más complejo de entender.
» Usa las características de cooperación entre hilos dentro de un bloque: shared, synchronization.
» Parecido a OpenMP, el mismo código tiene variables en 3 niveles de alcance: reg, shmem, global.

Presenter Notes

Otra estrategia para las constantes

Cambiamos de una estrategia de const a #define.
Simplifica algunas cosas, requiere recompilar.

 1 #ifndef N
 2 #define N 1024
 3 #endif
 4 
 5 #ifndef B
 6 #define B 32
 7 #endif
 8 
 9 #define SIZE (N*N)
10 assert(0 == N%B); // evenly split
11 ...
12 
13 dim3 grid_size(N/B, N/B);
14 dim3 block_size(B,B);
15 ...
16 sgemm<<<grid_size, block_size>>>(d_a, d_b, d_c);

Notar que no tenemos el típico if en el kernel:

1 if (i<N && j<N) {
2     ...
3 }

Presenter Notes

Tamaños de shared per SM (C2070, 2012)

N=1024, C2070, potencia pico 1.030 TFLOPS single precision.

1 Block   Shared   gputime   GFLOPS    Eff
2 8×8     0.5 KB   20.0 ms   100       10%
3 16×16   2 KB     12.1 ms   165       16%
4 32×32   8 KB     12.2 ms   163       16%

La performance trepó a 165 GFLOPS.

1 >>> ((2*1024**3)/0.0121) / (1<<30)
2 165.28925619834712

C2070, resumen

1 Version  GFLOPS EFICIENCIA
2 Trivial      60         6%
3 Shared      165        16%

Presenter Notes

Resumen shared (K40, 2014)

N=1024, K40, potencia pico 4.291 TFLOPS single precision.

1 Block   Shared   gputime   GFLOPS    Eff
2 8×8     0.5 KB   8.20 ms      243    5%
3 16×16   2 KB     5.30 ms      377    8%
4 32×32   8 KB     5.18 ms      386    8.5%

Aumenté mucho los GFLOPs, pero la eficiencia bajó dramáticamente.

Típico de "mejoré para Fermi, pero para Kepler no camina".

Hay una base de 2x de speedup de full Fermi a full Kepler.
Pero la eficiencia 0.5x.
Moraleja: Kepler Tuning Guide.

Hay más!

Presenter Notes

Resumen shared (GTX Titan X, 2016)

N=1024, GTX Titan X, potencia pico 6.144 TFLOPS single precision.

1 Block   Shared   gputime   GFLOPS    Eff
2 8×8     0.5 KB   3.07 ms      651    10.5%
3 16×16   2 KB     2.65 ms      754    12%
4 32×32   8 KB     2.67 ms      749    12%

La eficiencia mejoró.

Presenter Notes

Resumen shared (GTX 1080 Ti, 2018)

N=1024, GTX 1080 Ti, potencia pico 10609 TFLOPS single precision.

1 Block   Shared   gputime   GFLOPS    Eff
2 8×8     0.5 KB   1.71 ms     1169    11%
3 16×16   2 KB     1.48 ms     1351    12.7%
4 32×32   8 KB     1.46 ms     1369    12.9%

La eficiencia mejoró un pelín respecto a Maxwell.

Presenter Notes

Incremento de granularidad

Hwu trick, sección 6.6.
Reutilizar el block sa (Md) para dos o más bloques sb (Nd).

Increased thread granularity with rectangular tiles

Presenter Notes

El código

 1 __global__ void sgemm_shared2(float * __restrict__ a, float * __restrict__ b, float * __restrict__ c) {
 2     __shared__ float sa[B][B];
 3     __shared__ float sb0[B][B];
 4     __shared__ float sb1[B][B];
 5 
 6     const uint ti = threadIdx.y;
 7     const uint tj = threadIdx.x;
 8     const uint i = blockIdx.y*B + ti;
 9     const uint j0 = (2*blockIdx.x+0)*B + tj;
10     const uint j1 = (2*blockIdx.x+1)*B + tj;
11     float cij0 = 0.0f, cij1 = 0.0f;
12     for (uint m=0; m<N/B; ++m) {
13         sa[ti][tj] = a[IX(i,m*B+tj)];
14         sb0[ti][tj] = b[IX(m*B+ti,j0)];
15         sb1[ti][tj] = b[IX(m*B+ti,j1)];
16         __syncthreads();
17 
18         for (uint k=0; k<B; ++k) {
19             cij0 += sa[ti][k] * sb0[k][tj];
20             cij1 += sa[ti][k] * sb1[k][tj];
21         }
22         __syncthreads();
23     }
24     c[IX(i,j0)] = cij0;
25     c[IX(i,j1)] = cij1;
26 }

Se lanzan grid_size.x /= 2 bloques.

Presenter Notes

Medición K40 (2014)

Para N=1024, B=16 corriendo en K40 con nvcc-6.0.
Los otros tamaños de bloque posible (8 y 32) dan peores resultados.

Performance

1 Gránulo  Tiempo  GFLOPS
2       1  5.36ms     373
3       2  4.36ms     458
4       4  3.92ms     510

Consumo de memoria

1 Gránulo  Regs  ShMem
2       1    25   2048
3       2    32   3072
4       4    37   5120

Mejoramos pero estamos aun lejos de una eficiencia decente.

Presenter Notes

Medición GTX Titan X (2016)

Para N=1024, B=16 corriendo en GTX Titan X con nvcc-7.5 -arch=sm_52.
Los otros tamaños de bloque posible (8 y 32) dan peores resultados.

Performance

1 Gránulo  Tiempo  GFLOPS
2       1  2.65ms     754
3       2  2.10ms     952
4       4  1.89ms    1058

Consumo de memoria (casi idem)

1 Gránulo  Regs  ShMem
2       1    23   2048
3       2    32   3072
4       4    32   5120

¡Pasamos la barrera del TFLOP!

Presenter Notes

Medición GTX 1080 Ti (2018)

Para N=1024, B=16 corriendo en GTX 1080 Ti con nvcc-8 -arch=sm_61.
Los otros tamaños de bloque posible (8 y 32) dan peores resultados.

Performance

1 Gránulo  Tiempo  GFLOPS     Eff
2       1  1.49ms    1342   12.6%
3       2  1.17ms    1709   16.0%
4       4  1.02ms    1960   18.5%

Consumo de memoria (casi idem)

1 Gránulo  Regs  ShMem
2       1    23   2048
3       2    31   3072
4       4    32   5120

Estamos llegando a los 2TFLOPS y 20% de eficiencia.

Presenter Notes

ILP y Volkov style

Presenter Notes

Better performance at lower occupancy

Menos hilos, pero más bloques por SMX.
Vasily Volkov, Better performance at lower occupancy, GTC, 2010.

Código

La primera parte es igual al código de shared.

1 __global__ void sgemm_shared(float * __restrict__ a, float * __restrict__ b, float * __restrict__ c) {
2     __shared__ float sa[B][B];
3     __shared__ float sb[B][B];
4 
5     const uint ti = threadIdx.y;
6     const uint tj = threadIdx.x;
7     const uint i = blockIdx.y*B + ti;
8     const uint j = blockIdx.x*B + tj;

Idea: hacer más trabajo por hilo, aumentar la granularidad, pero ...

En vez de lanzar una grilla la mitad de alta, se lanzan bloques la mitad de altos: block_size.y /= 2

Presenter Notes

El código

 1     float cij[2] = {0.0f, 0.0f};
 2     for (uint m=0; m<N/B; ++m) {
 3         sa[ti][tj] = a[IX(i,m*B+tj)];
 4         sb[ti][tj] = b[IX(m*B+ti,j)];
 5         sa[ti+B/2][tj] = a[IX(i+B/2,m*B+tj)];
 6         sb[ti+B/2][tj] = b[IX(m*B+ti+B/2,j)];
 7         __syncthreads();
 8 
 9         for (uint k=0; k<B; ++k) {
10             cij[0] += sa[ti][k] * sb[k][tj];
11             cij[1] += sa[ti+B/2][k] * sb[k][tj];
12         }
13         __syncthreads();
14     }
15     c[IX(i,j)] = cij[0];
16     c[IX(i+B/2,j)] = cij[1];
17 }

El mismo trabajo se hace con bloques más pequeños.
Entran más bloques por SM.
Se explota mejor el ILP.

Tradeoff entre TLP (thread level paralelism) e ILP (instruction level paralelism).

Presenter Notes

Medición (K40, 2014)

Para N=1024, B=32 en una K40 compilado con nvcc-6.0.

1 Grano  Regs  ShMem  Time(ms)  GFLOPS
2     1    21   8192      5.18     386
3     2    31   8192      3.46     578
4     4    42   8192      2.73     732
5     8    49   8192      2.53     790
6    16    40   8192     22.89      87

Volvimos a tener un ~17% de eficiencia.

Presenter Notes

Medición (GTX Titan X, 2016)

Para N=1024, B=32 en una GTX Titan compilado con nvcc-7.5 -arch=sm_52.

1 Grano  Regs  ShMem  Time(ms)  GFLOPS
2     1    21   8192      2.61     766
3     2    31   8192      1.67    1197
4     4    42   8192      1.37    1459
5     8    49   8192      1.17    1709
6    16    40   8192      1.79    1117

Llegamos a una eficiencia del 28%.
¡Wowwww!

Presenter Notes

Medición (GTX 1080 Ti, 2018)

Para N=1024, B=32 en una GTX 1080 Ti compilado con nvcc-8 -arch=sm_61.
Cuidado: le tuve que sacar unos unroll para que compile!

1 Grano  Regs  ShMem  Time(ms)  GFLOPS
2     1    23   8192     1.47     1360
3     2    32   8192     0.758    2638
4     4    35   8192     0.622    3215
5     8    40   8192     0.586    3412
6    16    32   8192     5.56      359

Llegamos a 3.4 TFLOPS y una eficiencia del 32%.

Presenter Notes

cuBLAS

Presenter Notes

El código

 1 ...
 2 #include <cublas_v2.h>
 3 ...
 4 float beta = 0.0f;
 5 cublasHandle_t handle;
 6 cublasStatus_t status;
 7 status = cublasCreate(&handle);
 8 assert(status == CUBLAS_STATUS_SUCCESS);
 9 status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_a, N, d_b, N, &beta, d_c, N);
10 assert(status == CUBLAS_STATUS_SUCCESS);
11 
12 checkCudaErrors(cudaMemcpy(h_a, d_a, SIZE*sizeof(float), cudaMemcpyDefault));
13 checkCudaErrors(cudaMemcpy(h_b, d_b, SIZE*sizeof(float), cudaMemcpyDefault));
14 checkCudaErrors(cudaMemcpy(h_c, d_c, SIZE*sizeof(float), cudaMemcpyDefault));
15 status = cublasDestroy(handle);
16 assert(status == CUBLAS_STATUS_SUCCESS);
17 ...

Compilamos con

1 $ nvcc blablebli -lcublas

Presenter Notes

Mediciones GTX Titan X Maxwell

1 $ ./sgemm-cublas 1024
2 ...
3  26.04%  530.28us         1  530.28us  530.28us  530.28us  maxwell_sgemm_128x64_nn

3373 GFLOPS.

55% de eficiencia.

Moraleja: library'em!.

Ojo, para N={2048, 4096} la eficiencia sube al 79%.

Pregunta

¿Cuál es más eficiente para SGEMM una CPU o una GPU?

Comparar con lo hecho en CPU.

Presenter Notes

Mediciones GTX 1080 Ti

Compilamos con CUDA 9 y linkeamos CUBLAS 9

1 $ ./sgemm-cublas 1024
2 ...
3 15.73%  215.41us         1  215.41us  215.41us  215.41us  maxwell_sgemm_128x64_nn

Notar que usa el kernel Maxwell por más que está en Pascal.

9302 GFLOPS.

87% de eficiencia.

¡Alevoso!

Ojo: hay algo fishy, en 2048 me dá 10 TF y en 4096 12 TF ... baaadddd.

Presenter Notes

Bibliografía

NVIDIA, NVIDIA CUDA C Programming Guide, 2018.
David B. Kirk, Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 2nd ed., Morgan Kaufmann, 2012.
Rob Farber, CUDA Application Design and Development, Morgan Kaufmann, 2011.
NVIDIA, CUDA-Memcheck User Manual, 2018.
NVIDIA, CUDA Profiler User's Guide, 2018.
NVIDIA, CUPTI User's Guide, 2018.
NVIDIA, Parallel Thread Execution ISA, 2018.
NVIDIA, NVIDIA CUDA Compiler Driver NVCC, 2018.
NVIDIA, NVIDIA cuBLAS, 2018.
Hennessy, Patterson, Computer Architecture: A Quantitative Approach, Fifth Edition, Morgan Kaufmann, 2011.

Presenter Notes

Bibliografía

Vasily Volkov, Better performance at lower occupancy, GTC, 2010.
NVIDIA, Kepler Tuning Guide, 2018.
NVIDIA, Maxwell Tuing Guide, 2018.
NVIDIA, Pascal Tuing Guide, 2018.
NVIDIA, Volta Tuing Guide, 2018.

Presenter Notes

La clase que viene

Arquitectura interna GPUs.

Table of Contents	t
Exposé	ESC
Full screen slides	e
Presenter View	p
Source Files	s
Slide Numbers	n
Toggle screen blanking	b
Show/hide slide context	c
Notes	2
Help	h

Kernel

Host

Mapeo de hilos a celdas

Performance

C2070, resumen

Performance

Consumo de memoria

Performance

Consumo de memoria (casi idem)

Performance

Consumo de memoria (casi idem)

Código

Pregunta

Table of Contents

Help