Ejemplos de Optimizaciones CPU

Presenter Notes

Resumen:

Límites.
- Intensidad aritmética.
Que cosas si tiene sentido hacer, ejemplos:
- Transposición de matrices.
- Multiplicación matriz por matriz.

Nicolás Wolovick 20140325

Presenter Notes

Límites

Presenter Notes

Problemas y límites

Obtener lo máximo de un núcleo.

Operaciones de punto flotante (GFLOPS).
Ancho de banda de memoria (GBps).

Dependiendo de la aplicación los límites pueden ser:

CPU-bound
Memory-bound

(usualmente memory-bound, excepto tests para lucirse: Top500)

Presenter Notes

Performance pico

Usualmente referido a GFLOPS.

#FPU x Clock x #Core

Supuestos

Ejecución OoO es capaz de tirar 1op por clock.
Hay #FPU unidades de punto flotante independientes.

¿Punto flotante simple f32 o doble f64?

Top500 usa f64.

Presenter Notes

Ejemplos performance pico `f64`

CPU

Alpha 21264 (1 GHz, 2 FPU, 2000): 2 GFLOPS.
Tegra2 (1 GHz, 1 FPU, 2 cores, 2010): 2 GFLOPS.
Core 2 Duo P8880 (2.66 GHz, 2 FPU, 2 cores, 2009): 10.64 GFLOPS.
Core i7 960 (3.2 GHz, 2 FPU, 4 cores, 2009): 26 GFLOPS.
- Core i7 960 con operaciones vectoriales: 102 GFLOPS.
Xeon E5 2680 (2.7 GHz, 4 FPU, 8 cores, 2012): 86 GFLOPS.

GPU

GTX 480: 168 GFLOPS.
Tesla C2075: 515 GFLOPS.

Extraídos de: Alpha [Parello'02], Tegra2 [Ramirez'11], Core i7 960 [Hennessy'11], Xeon E5 2680 [guesstimated], GTX 480 [Wikipedia + 1/8 of f32 GFLOPS], Tesla C2075 [NVIDIA].

Presenter Notes

Ejemplos de ancho de banda de memoria

CPU

Core i7 960 (triple channel, DDR3-1333) : 25.6 GBps.
Intel Core i7 3960X (quad channel, DDR3-1600): 51 GBps.

GPU

GTX 480 (6 canales, 384 bits, GDDR5@924MHz): 177 GBps.
Tesla C2075 (6 canales, 384 bits, GDDR5@1.5GHz): 144GBps.

(¿Cómo se calculará?)

Extraídos de: Core i7 960 [Intel], Core i7 3960X [Anandtech, Intel], GTX 480 [Hennessy'11], Tesla C2075 [NVIDIA].

Presenter Notes

¿Qué hacer?

CPU

Aumentar el ILP, favorecer ejecución OoO.
Disminuir las instrucciones ejecutadas.

Memoria

Aumentar localidad temporal y espacial.
Tanto para caché como para TLB.

Presenter Notes

El modelo: intensidad aritmética

q = flop/bytes

Cuanto más, mejor!
Recordar memory wall.

FIGURE 6.17 Arithmetic intensity, specified as the number of float-point operations to
run the program divided by the number of bytes accessed in main memory [Williams,
Waterman, and Patterson 2009]

Presenter Notes

¿Qué cosas SI tiene sentido hacer?

Presenter Notes

¿Qué cosas si tiene sentido hacer?

Un poco de loop unrolling.
- Desarma dependencias usando registros para lograr más ILP.
- La ganancia no está en ahorrarse un branch (99% de predicción).
- Tienen un límite: register spilling.
Exponer paralelismo dependiente del problema.
Optimizaciones de Caché y TLB.
- Reorganización de loops.
- Cache blocking o loop tiling.
- Alineamiento de memoria.
Indicar que no hay aliasing.

Presenter Notes

Aliasing

Problema de "C".

 1 void updatePtrs(int *ptrA, int *ptrB, int *val) { // se puede optimizar
 2     *ptrA += *val;
 3     *ptrB += *val;
 4 }
 5 
 6 
 7 struct node {
 8     struct node *next, *prev;
 9 };
10 
11 void foo(struct node *n) { // no se puede optimizar
12     n->next->prev->next=n;
13     n->next->next->prev=n;
14 }

Usamos la palabra clave restrict

1 void updatePtrs(int *restrict ptrA, int *restrict ptrB,
2                 int *restrict val) {

Y el compilador puede hacer lo esperable.

Presenter Notes

Alineamiento de memoria

Podemos pedirle al compilador que ponga las cosas alineadas (GNU).

1 float a[L][L] __attribute__((aligned(64))),
2       b[L] __attribute__((aligned(64))),
3       c[L] __attribute__((aligned(64)));

Ningún efecto sobre variables globales.
Originalmente alineadas a 32 bytes.

Usar memalign() para memoria dinámica.
Ejemplo pensando en líneas de caché de 64 bytes.

1 #include <malloc.h>
2 double *a = memalign(64, N*N*sizeof(double))

Tampoco parece producir efectos.

Presenter Notes

Algoritmos caché-aware

Tener en cuenta:

Localidad espacial y temporal.
Parámetros del caché.
- line size.
- associativity.

AoS vs. SoA

Figure 10. Shuffling an AoS into a SoA.

Presenter Notes

Ejemplo

AoS (array of structures)

1 struct point {
2     float dx, dy, dz, dw;
3 };
4 point p[N];
5 
6 for (int i=0; i<N; ++i) {
7     dist = sqrtf(p[i].dx*p[i].dx + p[i].dy*p[i].dy + p[i].dz*p[i].dz + p[i].dw*p[i].dw)
8 }

SoA (structure of arrays)

1 struct p {
2     float dx[N], dy[N], dz[N], dw[N];
3 };
4 
5 for (int i=0; i<N; ++i) {
6     dist = sqrtf(p.dx[i]*p.dx[i] + p.dy[i]*p.dy[i] + p.dz[i]*p.dz[i] + p.dw[i]*p.dw[i])
7 }

Presenter Notes

Ejemplo B = A^t

Presenter Notes

Transposición de matrices

Intensidad aritmética

q = 0 / (2 ∗ L^2) = 0

Memory-bound ʘ_ʘ

Matriz en la memoria

Organización de la matriz en memoria

Presenter Notes

Naïve

 1 #define L (1<<11)
 2 
 3 float a[L][L], b[L][L];
 4 
 5 int main(void) {
 6     unsigned int i=0, j=0;
 7     for (i=0; i<L; ++i)
 8             for (j=0; j<L; ++j)
 9                 b[j][i] = a[i][j];
10     return (int)b[(int)b[0][2]][(int)b[2][0]];
11 }

Presenter Notes

Medición `gcc -O3`

perf stat -r 4 -e instructions,cycles,cache-references,cache-misses ./mtxtransp1

1 Performance counter stats for './mtxtransp1' (4 runs):
2 
3     39.047.948 instructions              #    0,07  insns per cycle          ( +-  0,08% )
4    574.531.905 cycles                    #    0,000 GHz                      ( +-  0,80% )
5      4.958.588 cache-references                                              ( +-  0,10% )
6      4.246.447 cache-misses              #   85,638 % of all cache refs      ( +-  0,02% )
7 
8    0,217418170 seconds time elapsed                                          ( +-  0,94% )

Notar el bajísimo IPC.
Se ven claramente las razones.

Intel Core2 Duo P8800@2.66GHz, gcc-4.7.3, Linux 3.8.0 x86_64

Presenter Notes

Blocking

Usamos cache blocking o loop tiling.

 1 #define L (1<<11)
 2 
 3 const unsigned int BX=1<<4;
 4 const unsigned int BY=1<<4;
 5 
 6 float a[L][L], b[L][L];
 7 
 8 int main(void) {
 9     unsigned int i=0, j=0, k=0, l=0;
10     assert(0==L%BX && 0==L%BY);
11     for (i=0; i<L; i+=BY)
12         for (j=0; j<L; j+=BX)
13             for (k=i; k<i+BY; ++k)
14                 for (l=j; l<j+BX; ++l)
15                     b[l][k] = a[k][l];
16     return (int)b[(int)b[0][2]][(int)b[2][0]];
17 }

Presenter Notes

Medición `gcc -O3`

perf stat -r 4 -e instructions,cycles,cache-references,cache-misses ./mtxtransp2

1 Performance counter stats for './a.out' (4 runs):
2 
3     43.536.821 instructions              #    0,35  insns per cycle          ( +-  1,30% )
4    124.940.642 cycles                    #    0,000 GHz                      ( +-  1,02% )
5      4.238.619 cache-references                                              ( +-  0,10% )
6        566.880 cache-misses              #   13,374 % of all cache refs      ( +-  0,16% )
7 
8    0,050019460 seconds time elapsed                                          ( +-  5,06% )

Speedup: 0.21/0.05 = 4.1x

clap, clap, clap ...

Intel Core2 Duo P8800@2.66GHz, gcc-4.7.3, Linux 3.8.0 x86_64

Presenter Notes

¿Cuál es el mejor tamaño de bloque?

Exploración tamaño de bloque

De los 0.21s, llegamos a 0.031s.
6.77x de speedup!

Presenter Notes

Ejemplo `dgemm`

Presenter Notes

Multiplicación matriz por matriz

Jerarquía BLAS

Dot product: L data, 2L flops.
Matrix-vector multiply: L^2 data, 2L^2 flops.
Matrix-matrix multiply: 2L^2 data, 2L^3 flops.

`dgemm` intensidad aritmética

q = flops/bytes = O(L)

`dgemm` uso de memoria

Memoria total: O(L^2).
Memoria accedida: O(L^3).
Reuso: O(L).

Presenter Notes

Muchísimas mediciones en consola

dgemm.c
dgemm_unroll.c
dgemm_blocking_CoAD.c
dgemm_blocking.f
dgemm_blocking.c
dgemm_blocking_unrolling_CoAD.c
dgemm_blas.c

Análisis de:

Scaling con el tamaño.
IPC.
Cache misses.
Tamaño del blocking.

Presenter Notes

Naïve vs. Blocking vs. Vendor

(CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning)

Core2, 10GFLOPS performance pico usando los 2 cores. El vendor está al 70% de la performance pico.

Presenter Notes

Espacio de búsqueda del tamaño del bloque

(SDSC, CS260, Matrix Matrix Multiply)

Presenter Notes

`sgemm` y `dgemm` de BLAS3

Problema muy estudiado.
Núcleo de la medición de Top500.
Caso paradigmático: [Parello'02]
- Alpha 21264, 1GHz, 2 FP, 2 GFLOPS-
- A través de 9 iteraciones logran 95% de la performance pico.
- We have also shown that cache tiling, on which a large share of research works focus, only accounts for 12% of the total performance improvement ...
Mucho esfuerzo en los compiladores de vendors.
Infinito esfuerzo en las bibliotecas: Magma, GotoBLAS, OpenBLAS, ATLAS, MLK, ACML.

Presenter Notes

Consejos

Best case: good algorithm, efficient design, obvious code
Tradeoff: speed vs readability, debuggability, maintainability...
Only optimize when needful
Go for low-hanging fruit first: data layouts, libraries, compiler flags
Concentrate on the bottleneck
Concentrate on inner loops
Get correctness (and a test framework) first

(CS5220, David Bindel, Lecture 2: Tiling matrix-matrix multiply, code tuning)

Presenter Notes

Más consejos

Usar buenas herramientas.
Usar bibliotecas.
Flags de compilación.
- -O3: Aggressive optimization
- -march=core2: Tune for specific architecture
- -ftree-vectorize: Automatic use of SSE (supposedly)
- -funroll-loops: Loop unrolling
- -ffast-math: Unsafe floating point optimizations
- profile-guided optimization (PGO).
Prestarle atención al memory layout.
Usar la menor cantidad de memoria posible.
- Bit arrays.
- Mixed precision.
Evitar dependencias falsas.
Loop unrolling para operar sobre registros.

Presenter Notes

Conclusión

Optimizar es complejo

"Therefore my best advice is to avoid loop unrolling. An unrolled loop takes more space in the μop cache, and the advantage of loop unrolling is minimal."

(8.16, Bottlenecks in Sandy Bridge, The microarchitecture of ..., Agner Fog)

"Note that we are not making any absolute predictions on code performance for these implementations, or even relative comparison of their runtimes. Such predictions are very hard to make. However, the above discussion identifies issues that are relevant for a wide range of classical CPUs."

(p.33, Eijkhout, Intro ...)

Presenter Notes

Bibliografía

Presenter Notes

Bibliografía

Charles Severance, Kevin Dowd, "High Performance Computing", Connexions, 2010, CC3.0.
Patterson, Hennessy Computer Organization and Design, Fifth Edition, Morgan Kaufmann, 2012.
Hennessy, Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition, Morgan Kaufmann, 2011.
Chatterjeey, Senz, Cache-Efficient Matrix Transposition, HPCA-6, 2000.
Parello, Temam, Verdun, On Increasing Architecture Awareness in Program Optimizations to Bridge the Gap between Peak and Sustained Processor Performance -- Matrix-Multiply Revisited, SC02.
Agner Fog, Optimizing software in C++. An optimization guide for Windows, Linux and Mac platforms, 2012.
Felix von Leitne, Source Code Optimization, 2009.
Mike Acton, Demystifying The Restrict Keyword, 2006.

Presenter Notes

La clase que viene

Ya veré :)

Table of Contents	t
Exposé	ESC
Full screen slides	e
Presenter View	p
Source Files	s
Slide Numbers	n
Toggle screen blanking	b
Show/hide slide context	c
Notes	2
Help	h

Supuestos

CPU

GPU

CPU

GPU

CPU

Memoria

AoS vs. SoA

AoS (array of structures)

SoA (structure of arrays)

Intensidad aritmética

Matriz en la memoria

Jerarquía BLAS

dgemm intensidad aritmética

dgemm uso de memoria

Optimizar es complejo

Table of Contents

Help

`dgemm` intensidad aritmética

`dgemm` uso de memoria