diff --git a/README.md b/README.md
index f1c6923..c8e94fd 100644
--- a/README.md
+++ b/README.md
@@ -144,6 +144,8 @@ The fastest throughput I saw so far on my MacBook Air (M1) so far is with `make
 
 You can also experiment with replacing `gcc` with `clang`.
 
+If compiling with gcc, try experimenting with `-funroll-all-loops`, see PR [#183](https://github.com/karpathy/llama2.c/pull/183)
+
 ### OpenMP
 
 Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors. You'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on ubuntu). I was not able to get improvements from OpenMP on my MacBook, though. Then you can compile with `make runomp`, which does:
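
Note on the hunk above: both the new `-funroll-all-loops` hint and the OpenMP paragraph concern the same kind of hot loop. As a rough illustration only, here is a minimal sketch of a matrix-vector product of the sort the README describes. The name `matmul_sketch`, its signature, and the row-major layout are assumptions made for this example and are not claimed to match the repository's code.

```c
#include <assert.h>

/* Hypothetical helper (name and signature are assumptions for illustration):
 * computes xout = W * x, where W is d x n in row-major order,
 * x has n elements and xout has d elements. */
void matmul_sketch(float *xout, const float *x, const float *w, int n, int d) {
    assert(n > 0 && d > 0);
    /* When built with -fopenmp, the iterations of the outer loop are split
     * across threads. Without that flag the pragma is ignored and the loop
     * runs serially. Each iteration writes a distinct xout[i], so the
     * iterations are independent. */
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        /* The inner dot-product loop is the kind of loop an unrolling flag
         * such as gcc's -funroll-all-loops may affect. */
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```

Assuming the sketch is saved as `sketch.c` (a hypothetical file name), something like `gcc -O3 -funroll-all-loops -c sketch.c` or `clang -O3 -fopenmp -c sketch.c` would exercise the two options. Whether either one helps depends on the machine, so it is worth measuring rather than assuming.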