readme tweaks
@@ -128,43 +128,37 @@ $ pytest
|
|||||||
|
|
||||||
## performance
|
## performance
|
||||||
|
|
||||||
*(NOTE: this guide is not great because I personally spend a lot of my time in Python land and don't have an amazing understanding of a lot of these features and flags. If someone does and is willing to help document and briefly describe some of these and their tradeoffs, I'd welcome a PR)*
|
There are many ways to potentially speed up this code depending on your system. Have a look at the [Makefile](Makefile), which contains a lot of notes. The `make run` command currently uses the `-O3` optimization by default, i.e.:
|
||||||
|
|
||||||
There are many ways to potentially speed up this code depending on your system. Here we document a few together with a high-level guide on what they do. Here's again the default way to compile, but using -O3:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
gcc -O3 -o run run.c -lm
|
gcc -O3 -o run run.c -lm
|
||||||
```
|
```
|
||||||
|
|
||||||
`-O3` enables optimizations that are expensive in terms of compile time and memory usage, including vectorization, loop unrolling, and branch prediction. Here are a few more to try.
|
`-O3` enables optimizations that are expensive in terms of compile time and memory usage, including vectorization, loop unrolling, and branch prediction.
|
||||||
|
|
||||||
`-Ofast` Run additional optimizations which may break compliance with the C/IEEE specifications, in addition to `-O3`. See [the GCC docs](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) for more information.
|
For much better performance, try compiling with `make runfast`. This turns on the `-Ofast` flag, which enables everything in `-O3` plus additional optimizations that may break compliance with the C/IEEE specifications. See [the GCC docs](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) for more information.
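To make the "may break compliance" caveat concrete, here is a small sketch (not taken from `run.c`; the function names are made up for illustration). Under strict IEEE rules float addition is not associative, so the two functions below can return different results for the same inputs. With `-Ofast` the compiler is allowed to reorder such sums, so code that relies on a particular order may behave differently.

```c
/* Hypothetical helpers, for illustration only. */

/* Adds three floats left to right: (a + b) + c. */
float sum_left(float a, float b, float c) {
    float t = a + b;
    return t + c;
}

/* Adds the same three floats right to left: a + (b + c). */
float sum_right(float a, float b, float c) {
    float t = b + c;
    return a + t;
}
```

For example, calling both with `1e20f, -1e20f, 1.0f` and comparing the results, once compiled with `-O3` and once with `-Ofast`, is a quick way to see whether your build is affected.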
|
||||||
|
|
||||||
`-march=native` Compile the program to use the architecture of the machine you're compiling on rather than a more generic CPU. This may enable additional optimizations and hardware-specific tuning such as improved vector instructions/width.
|
Try `-march=native` to compile the program to use the architecture of the machine you're compiling on rather than a more generic CPU. This may enable additional optimizations and hardware-specific tuning such as improved vector instructions/width.
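One rough way to see what `-march=native` changes is to check which vector-extension macros the compiler predefines. The sketch below is an assumption-laden illustration, not part of `run.c`: `simd_level` is a made-up name, and it only checks a handful of common GCC/clang macros.

```c
/* Hypothetical helper: reports the widest vector extension the compiler
   was told it may use. Only a few common macros are checked, so this is
   not an exhaustive list. */
const char *simd_level(void) {
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__ARM_NEON)
    return "NEON";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected";
#endif
}
```

Printing the result from a build with and without `-march=native` should show whether the flag unlocks wider vector instructions on your machine.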
|
||||||
|
|
||||||
The fastest throughput I saw so far on my MacBook Air (M1) is with:
|
The fastest throughput I have seen so far on my MacBook Air (M1) is with `make runfast`.
|
||||||
|
|
||||||
```bash
|
|
||||||
gcc -Ofast -o run run.c -lm
|
|
||||||
```
|
|
||||||
|
|
||||||
You can also experiment with replacing `gcc` with `clang`.
|
You can also experiment with replacing `gcc` with `clang`.
|
||||||
|
|
||||||
### OpenMP
|
### OpenMP
|
||||||
Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors.
|
Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors.
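As a minimal sketch of what that pragma does (an illustration, not necessarily the exact code in `run.c`): each output row of a matrix-vector product is independent, so the outer loop can be divided among threads. When compiled without `-fopenmp` the pragma is ignored and the loop simply runs serially.

```c
/* Computes xout = W * x, where W is a d-by-n matrix stored row-major. */
void matmul(float *xout, const float *x, const float *w, int n, int d) {
    /* Rows do not depend on each other, so OpenMP can hand different
       values of i to different threads. */
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```

Parallelizing the outer loop rather than the inner one avoids any shared accumulator, so no reduction clause or locking is needed.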
|
||||||
You'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on ubuntu). Then you can compile e.g. like so:
|
I was not able to get improvements from OpenMP on my MacBook, though. To try it, install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on Ubuntu), then compile with `make runomp`, which does:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
clang -Ofast -fopenmp -march=native run.c -lm -o run
|
clang -Ofast -fopenmp -march=native run.c -lm -o run
|
||||||
```
|
```
|
||||||
|
|
||||||
You can try swapping clang/gcc, and may try to leave out -march=native. However, when you run inference make sure to use OpenMP flags to set the number of threads, e.g.:
|
When you run inference, make sure to set the number of threads with the `OMP_NUM_THREADS` environment variable, e.g.:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
OMP_NUM_THREADS=4 ./run out/model.bin
|
OMP_NUM_THREADS=4 ./run out/model.bin
|
||||||
```
|
```
|
||||||
|
|
||||||
Depending on your system resources you may want to tweak these hyperparameters. (TODO: I am not intimately familiar with OpenMP and its configuration, if someone would like to flesh out this section I would welcome a PR).
|
Depending on your system resources you may want to tweak these hyperparameters and use more threads. More is not always better, though: run time as a function of thread count is usually a bit U-shaped.
|
||||||
|
|
||||||
## platforms
|
## platforms
|
||||||
|
|
||||||
|
|||||||