Problem:
- exp and sqrt operate in double precision, which is not required
here.
Solution:
- Use expf and sqrtf instead.
Notes:
- Although the math now runs in single precision, it doesn't seem to
affect the result.
Results: ~ 8.5% improvement
- before: 940 tok/s
- after: 1020 tok/s
Problem:
- clock measures CPU time, not wall-clock time, so it doesn't work
properly with parallel execution.
- performance is bound by the matmul against the weights.
Solution:
- use gettimeofday instead.
- utilize openmp to parallelize matmul.
Note:
- if not compiled with -fopenmp, the #pragma is ignored and execution
is single-threaded.
- there are additional env variables that can optionally be set for
openmp to configure the number of threads, the scheduler, etc.
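A rough sketch of the two changes, under stated assumptions: the helper
name time_in_ms, the matmul signature, and the row-major weight layout
are illustrative and may not match run.c exactly:

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in milliseconds (hypothetical helper name).
   Unlike clock, gettimeofday reports elapsed time rather than CPU time. */
long long time_in_ms(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000LL + tv.tv_usec / 1000;
}

/* xout (d) = W (d x n, row-major) * x (n). Assumed signature. */
void matmul(float *xout, const float *x, const float *w, int n, int d) {
    int i;
    /* Each output row is independent, so rows can be split across
       threads. Without -fopenmp this pragma is ignored and the loop
       runs serially. */
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```

The thread count can then be chosen at run time through the standard
OpenMP environment variable OMP_NUM_THREADS, and the loop schedule
through OMP_SCHEDULE when the loop uses schedule(runtime).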
Benchmarks:
```
clang -Ofast -march=native run.c -lm -o run // achieved tok/s: 340.878828
clang -Ofast -fopenmp -march=native run.c -lm -o run // achieved tok/s: 524.590164
```