From e270c6eb3c8a398f71091ea0ca969a8f0e088834 Mon Sep 17 00:00:00 2001
From: Andrej
Date: Tue, 1 Aug 2023 08:59:00 -0700
Subject: [PATCH] Update README.md: add mention of -f unroll loops option for gcc

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index f1c6923..c8e94fd 100644
--- a/README.md
+++ b/README.md
@@ -144,6 +144,8 @@ The fastest throughput I saw so far on my MacBook Air (M1) so far is with `make
 
 You can also experiment with replacing `gcc` with `clang`.
 
+If compiling with gcc, try experimenting with `-funroll-all-loops`, see PR [#183](https://github.com/karpathy/llama2.c/pull/183)
+
 ### OpenMP
 
 Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors. You'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on ubuntu). I was not able to get improvements from OpenMP on my MacBook, though. Then you can compile with `make runomp`, which does: