ok this first version works but i don't think it is ready to merge, have to think on it more

ok this works but is super slow because we are doing all the work in fp32 still
small improvements to comments and warnings and increase header size during model export
2023-08-18 15:44:02 +00:00 · 2023-08-18 03:40:18 +00:00 · 2023-08-17 14:32:22 +00:00 · 2023-08-17 05:56:20 +00:00 · 2023-08-17 04:13:13 +00:00 · 2023-08-16 20:09:32 -07:00
16 changed files with 1109 additions and 426 deletions
@@ -4,10 +4,12 @@ on:
  push:
    branches:
      - master
-    paths: ['.github/workflows/**', '**/Makefile', '**/*.c', '**/*.h']
+    paths: ['.github/workflows/**', '**/Makefile', '**/*.c', '**/*.h', '**/*.py']
  pull_request:
    types: [opened, synchronize, reopened]
-    paths: ['**/Makefile', '**/*.c', '**/*.h']
+    paths: ['**/Makefile', '**/*.c', '**/*.h', '**/*.py']
+  # for manual triggering
+  workflow_dispatch:

 env:
  BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
@@ -15,7 +17,7 @@ env:
 jobs:
  # check basic builds to avoid breaking changes
  ubuntu-focal-make:
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-latest

    steps:
      - name: Clone
@@ -28,6 +30,16 @@ jobs:
          sudo apt-get update
          sudo apt-get install build-essential -y

+      - name: Set up Python 3.10
+        uses: actions/setup-python@v3
+        with:
+          python-version: "3.10"
+
+      - name: Pip setup
+        run: |
+          python -m pip install --upgrade pip
+          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+
      - name: Build
        id: make_build
        run: |
@@ -38,6 +50,10 @@ jobs:
        run: |
          make runfast

+      - name: Test with pytest
+        run: |
+          pytest
+
  macOS-latest-make:
    runs-on: macos-latest

@@ -52,6 +68,21 @@ jobs:
        run: |
          brew update

+      - name: Set up Python 3.10
+        uses: actions/setup-python@v3
+        with:
+          python-version: "3.10"
+
+      - name: Pip setup
+        run: |
+          python -m pip install --upgrade pip
+          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+
+      - name: Build clang
+        id: make_build_clang
+        run: |
+          make run CC=clang
+
      - name: Build
        id: make_build
        run: |
@@ -62,15 +93,17 @@ jobs:
        run: |
          make runfast

-      - name: Build clang
-        id: make_build_clang
-        run: |
-          make run CC=clang
+      - name: Test with pytest
+        run: pytest
+
+
+

  windows-latest-make:
    runs-on: windows-latest

    strategy:
+      fail-fast: false  #necessary, otherwise the matrix breaks
      matrix:
        arch:
          - amd64
@@ -90,11 +123,30 @@ jobs:
        with:
          arch: ${{ matrix.arch }}

+      - name: Set up Python 3.10
+        if: matrix.arch != 'amd64_arm64'
+        uses: actions/setup-python@v3
+        with:
+          python-version: "3.10"
+
+      - name: Pip setup
+        if: matrix.arch != 'amd64_arm64'
+        run: |
+          python -m pip install --upgrade pip
+          if (Test-Path requirements.txt) {
+            pip install -r requirements.txt
+          }
+
      - name: Build ${{ matrix.arch }}
        id: build_msvc
        run: |
          .\build_msvc.bat

+      # cross-compiled, cannot be run on host
+      - name: Test with pytest
+        if: matrix.arch != 'amd64_arm64'
+        run: pytest
+
  windows-latest-mingw:
    runs-on: windows-latest

@@ -122,3 +174,20 @@ jobs:
        id: build_mingw
        run: |
          make win64
+
+      - name: Set up Python 3.10
+        uses: actions/setup-python@v3
+        with:
+          python-version: "3.10"
+
+      - name: Pip setup
+        shell: powershell
+        run: |
+          python -m pip install --upgrade pip
+          if (Test-Path requirements.txt) {
+            pip install -r requirements.txt
+          }
+
+      - name: Test with pytest
+        shell: powershell
+        run: pytest
@@ -45,6 +45,16 @@ rungnu:
 runompgnu:
 	$(CC) -Ofast -fopenmp -std=gnu11 run.c  -lm  -o run

+# run all tests
+.PHONY: test
+test:
+	pytest
+
+# run only tests for run.c C implementation (is a bit faster if only C code changed)
+.PHONY: testc
+testc:
+	pytest -k runc
+
 .PHONY: clean
 clean:
 	rm -f run
@@ -4,9 +4,11 @@
  <img src="assets/llama_cute.jpg" width="300" height="300" alt="Cute Llama">
 </p>

-With the code in this repo you can train the Llama 2 LLM architecture from scratch in PyTorch, then export the weights to a binary file, and load that into one ~simple 500-line C file ([run.c](run.c)) that inferences the model. Alternatively, you can load, finetune, and inference Meta's Llama 2 (but this is still being actively fleshed out). Hence, this repo is a "fullstack" train + inference solution for Llama 2 LLM, with a focus on minimalism and simplicity. You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough. I recommend looking at the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) paper for inspiration.
+Train the Llama 2 LLM architecture in PyTorch then inference it with one simple 700-line C file ([run.c](run.c)). You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough (ref: [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) paper). This repo is a "fullstack" train + inference solution for Llama 2 LLM, with focus on minimalism and simplicity.

-Please note that this started recently as just a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. I wanted something super minimal so I chose to hard-code the Llama 2 architecture, stick to fp32, and just roll one inference file of pure C with no dependencies.
+As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B. Work on model quantization is currently ongoing.
+
+Please note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. Compared to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.

 ## feel the magic

@@ -56,7 +58,9 @@ You can also prompt the model with a prefix or a number of additional command li

 > One day, Lily met a Shoggoth. He was very shy, but was also very generous. Lily said “Hello Shoggy! Can I be your friend?” Shoggy was happy to have a friend and said “Yes, let’s explore the universe together!” So they set off on a journey to explore the universe. As they travelled, Shoggy was happy to explain to Lily about all the wonderful things in the universe. At the end of the day, Lily and Shoggy had gathered lots of wonderful things from the universe, and they both felt very proud. They promised to explore the universe as one big pair and to never stop being generous to each other.

-There is also an even better 110M param model available, see [models](#models). Quick note on sampling, the recommendation for good results is to use `-t 1.0 -p 0.9`, i.e. top-p sampling at 0.9 with temperature 1.0 (this is the default). To control the diversity of samples use either the temperature (i.e. vary `-t` between 0 and 1 and keep top-p off with `-p 0`) or the top-p value (i.e. vary `-p` between 0 and 1 and keep `-t 1`), but not both. Nice explainers on LLM sampling strategies include [this](https://peterchng.com/blog/2023/05/02/token-selection-strategies-top-k-top-p-and-temperature/), [this](https://docs.cohere.com/docs/controlling-generation-with-top-k-top-p) or [this](https://huggingface.co/blog/how-to-generate).
+There is also an even better 110M param model available, see [models](#models).
+
+Quick note on sampling, the recommendation for ~best results is to sample with `-t 1.0 -p 0.9`, i.e. temperature 1.0 (default) but also top-p sampling at 0.9 (default). Intuitively, top-p ensures that tokens with tiny probabilities do not get sampled, so we can't get "unlucky" during sampling, and we are less likely to go "off the rails" afterwards. More generally, to control the diversity of samples use either the temperature (i.e. vary `-t` between 0 and 1 and keep top-p off with `-p 0`) or the top-p value (i.e. vary `-p` between 0 and 1 and keep `-t 1`), but not both. Nice explainers on LLM sampling strategies include [this](https://peterchng.com/blog/2023/05/02/token-selection-strategies-top-k-top-p-and-temperature/), [this](https://docs.cohere.com/docs/controlling-generation-with-top-k-top-p) or [this](https://huggingface.co/blog/how-to-generate).
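
To make the top-p intuition above concrete, here is a minimal Python sketch of temperature plus top-p (nucleus) sampling. It is an illustration only, not the code in `run.c`: the function names `softmax` and `sample_top_p` are made up for this example, it assumes a temperature greater than 0, and it does not model the `-p 0` convention for switching top-p off.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; assumes temperature > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_p(probs, top_p, rng=random):
    """Sample a token id from the smallest set of most likely tokens
    whose cumulative probability exceeds top_p."""
    # token ids sorted by descending probability
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative > top_p:
            break  # everything after this point is the discarded tail
    # sample inside the truncated set, scaled by the mass we kept
    r = rng.random() * cumulative
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]  # guard against floating point rounding


if __name__ == "__main__":
    probs = softmax([5.0, 4.0, 0.0, -3.0], temperature=1.0)
    rng = random.Random(0)
    print([sample_top_p(probs, 0.9, rng) for _ in range(10)])
```

With these example logits the two most likely tokens already carry more than 0.9 of the probability mass, so the two low-probability tokens can never be drawn, which is the "can't get unlucky" behavior described above.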

 ## Meta's Llama 2 models

@@ -83,11 +87,12 @@ base models... ¯\\_(ツ)_/¯. Since we can inference the base model, it should

 For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). The 110M took around 24 hours. I am hosting them on huggingface hub [tinyllamas](https://huggingface.co/karpathy/tinyllamas), both in the original PyTorch .pt, and also in the llama2.c format .bin:

-| model | dim | n_layers | n_heads | max context length | parameters | val loss | download
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| OG | 288 | 6 | 6 | 256 | 15M | 1.072 | [stories15M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin) |
-| 42M| 512 | 8 | 8 | 1024 | 42M | 0.847 | [stories42M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin) |
-| 110M| 768 | 12 | 12 | 1024 | 110M | 0.760 | [stories110M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin) |
+| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | [stories260K](https://huggingface.co/karpathy/tinyllamas/tree/main/stories260K) |
+| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | [stories15M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin) |
+| 42M| 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | [stories42M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin) |
+| 110M| 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | [stories110M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin) |

 You'll notice that the 110M model is equivalent to GPT-1 in size. Alternatively, this is also the smallest model in the GPT-2 series (`GPT-2 small`), except the max context length is only 1024 instead of 2048. The only notable changes from the GPT-1/2 architecture are that Llama uses RoPE relative positional embeddings instead of absolute/learned positional embeddings, a slightly fancier SwiGLU non-linearity in the MLP, RMSNorm instead of LayerNorm, bias=False on all Linear layers, and is optionally multiquery (but this is not yet supported in llama2.c).

@@ -130,15 +135,53 @@ Watch the tokens stream by, fun! We can also run the PyTorch inference script fo

 ```bash
 wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.pt -P out15M
-mv out15M/stories15M.pt out15M/ckpt.pt # sorry the sample script current assumes this directory structure / filename...
-python sample.py --out_dir=out15M
+python sample.py --checkpoint=out15M/stories15M.pt
 ```

-Which gives the same results. More detailed testing will be done in `test_all.py`. Currently you will need two files to test or sample: both the .bin file, and the .ckpt file inside a directory (see `test_all.py` for details). Sorry this is a bit janky right now, I have to think through running the tests without having to download 200MB of data. But run the tests with pytest:
+Which gives the same results.
+
+## custom tokenizers
+
+In everything above, we've assumed the custom Llama 2 tokenizer with 32,000 tokens. However, in many boutique LLMs, using a vocabulary this big might be overkill. If you have a small application in mind, you might be much better off training your own tokenizer. This can make everything nicer: with smaller vocabs your model has fewer parameters (because the token embedding table is a lot smaller), the inference is faster (because there are fewer tokens to predict), and your average sequence length per example could also get smaller (because the compression is a lot more efficient on your data). So let's see how we train a custom tokenizer.
+
+By default, to pretokenize the tinystories dataset we had to run, in order:

-```bash
-$ pytest
 ```
+python tinystories.py download
+python tinystories.py pretokenize
+```
+
+The `pretokenize` stage here loads the Llama 2 tokenizer (vocab size 32,000) and uses it to convert the downloaded text into integers, and saves that to file. We now change this as follows, to train an example 4096-token tokenizer:
+
+```
+python tinystories.py download
+python tinystories.py train_vocab --vocab_size=4096
+python tinystories.py pretokenize --vocab_size=4096
+```
+
+The `train_vocab` stage will call the `train_vocab.sh` script, which calls the `sentencepiece` library to train the tokenizer, storing it in a new file `data/tok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size.
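
The merge loop described above can be sketched in a few lines of Python. This is a toy illustration of byte-level Byte Pair Encoding, not what `sentencepiece` does internally (the real library has many more options and details); the names `most_common_pair`, `merge_pair` and `train_bpe` are hypothetical and exist only in this example.

```python
from collections import Counter


def most_common_pair(seqs):
    """Return the most frequent pair of adjacent tokens, or None if there is none."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of pair in seq with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(texts, vocab_size):
    """Start from raw UTF-8 bytes (ids 0..255) and add merges until vocab_size."""
    seqs = [list(t.encode("utf-8")) for t in texts]
    merges = {}
    next_id = 256
    while next_id < vocab_size:
        pair = most_common_pair(seqs)
        if pair is None:
            break  # nothing left to merge
        merges[pair] = next_id
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs


if __name__ == "__main__":
    merges, seqs = train_bpe(["the cat sat on the mat"], vocab_size=260)
    print(merges)
    print(seqs)
```

Each merge adds one token to the vocabulary and shortens the encoded sequences, which is why a vocabulary trained on your own data can compress that data well even at a small size.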
+
+A quick note of interest is that vocab size of 4096 trained specifically on tinystories creates integer sequences with about the same sequence length per example as the default Llama 2 tokenizer of 32000 tokens! This means that our custom, tailored tokenizer is a lot better adapted to our specific text, and can compress it very effectively. So our trained models are smaller and faster.
+
+Now that we have pretokenized the dataset with our custom tokenizer, we can train the model. The training script `train.py` doesn't care about the exact tokens, it only cares about the vocabulary size so it can correctly initialize the model. So when training your model, make sure to pass in
+
+```
+python train.py --vocab_source=custom --vocab_size=4096
+```
+
+(The defaults are `llama2` and `32000` respectively, which indicates the default Llama 2 tokenizer). This trains the model. Finally we are ready to run inference with our `run.c` script. For that we need two things. Number one, we have to export our tokenizer in the `.bin` format, do that with:
+
+```
+python tokenizer.py --tokenizer-model=data/tok4096.model
+```
+
+This writes the tokenizer to `data/tok4096.bin`. Now we can run inference, pointing it to this tokenizer using the `-z` flag:
+
+```
+./run out/model.bin -z data/tok4096.bin
+```
+
+This should print the samples. If you leave out the `-z` flag, it will use the default Llama 2 tokenizer, which would generate a good sequence of integers, but they would get translated using a different vocabulary to text, so it would look like gibberish.

 ## performance

@@ -162,7 +205,7 @@ If compiling with gcc, try experimenting with `-funroll-all-loops`, see PR [#183

 ### OpenMP
 Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors.
-You'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on ubuntu). I was not able to get improvements from OpenMP on my MacBook, though. Then you can compile with `make runomp`, which does:
+You'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on ubuntu). Then you can compile with `make runomp`, which does:

 ```bash
 clang -Ofast -fopenmp -march=native run.c  -lm  -o run
@@ -182,6 +225,19 @@ On **Windows**, use `build_msvc.bat` in a Visual Studio Command Prompt to build

 On **Centos 7**, **Amazon Linux 2018** use `rungnu` Makefile target: `make rungnu` or `make runompgnu` to use openmp.

+On **Mac**, use clang from brew for openmp build. Install clang as `brew install llvm` and use the installed clang binary to compile with openmp: `make runomp CC=/opt/homebrew/opt/llvm/bin/clang`
+
+## tests
+
+You can run tests simply with pytest:
+
+```bash
+$ pip install pytest
+$ pytest
+```
+
+This will currently invoke two tests inside `test_all.py`, which forward the model in both C and Python for 200 steps and check the output against a known good expected output. The tests currently run in only a few seconds, but will have to download and cache the stories260K models in a temporary `test` directory (only ~2MB download).
+
 ## ack

 I trained the llama2.c storyteller models on a 4X A100 40GB box graciously provided by the excellent [Lambda labs](https://lambdalabs.com/service/gpu-cloud), thank you.
@@ -214,6 +270,7 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
  - [llama2.rs](https://github.com/gaxler/llama2.rs) by @[gaxler](https://github.com/gaxler): a Rust port of this project
  - [llama2.rs](https://github.com/leo-du/llama2.rs) by @[leo-du](https://github.com/leo-du): A Rust port of this project
  - [llama2-rs](https://github.com/danielgrittner/llama2-rs) by @[danielgrittner](https://github.com/danielgrittner): a Rust port of this project
+  - [llama2.rs](https://github.com/lintian06/llama2.rs) by @[lintian06](https://github.com/lintian06): A Rust port of this project
 - Go
  - [go-llama2](https://github.com/tmc/go-llama2) by @[tmc](https://github.com/tmc): a Go port of this project
  - [llama2.go](https://github.com/nikolaydubina/llama2.go) by @[nikolaydubina](https://github.com/nikolaydubina): a Go port of this project
@@ -226,6 +283,7 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
  - [llama2.cpp](https://github.com/leloykun/llama2.cpp) by @[leloykun](https://github.com/leloykun): a C++ port of this project
 - JavaScript
  - [llama2.js](https://github.com/epicure/llama2.js) by @[epicure](https://github.com/epicure): a JavaScript port of this project
+  - [llama2.ts](https://github.com/wizzard0/llama2.ts) by @[oleksandr_now](https://twitter.com/oleksandr_now): a TypeScript port of this project. Full Llama2-7B capable.
  - [llama2.c-emscripten](https://github.com/gohai/llama2.c-emscripten) by @[gohai](https://github.com/gohai): Emscripten (JavaScript) port, based on @ggerganov's initial prototype
 - Zig
  - [llama2.zig](https://github.com/cgbur/llama2.zig) by @[cgbur](https://github.com/cgbur): A Zig port of this project
@@ -243,16 +301,17 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
  - [llama2.py](https://github.com/tairov/llama2.py) by @[tairov](https://github.com/tairov): a simple one file pure Python port of this project with zero dependencies
 - C#
  - [llama2.cs](https://github.com/trrahul/llama2.cs) by @[trrahul](https://github.com/trrahul): a C# port of this project
+- WebAssembly
+  - [icpp-llm](https://github.com/icppWorld/icpp-llm): LLMs for the Internet Computer
 - [llama2.c - Llama 2 Everywhere](https://github.com/trholding/llama2.c) by @[trholding](https://github.com/trholding): Standalone, Bootable & Portable Binary Llama 2
+- [llama2.c-zh - Bilingual Chinese and English](https://github.com/chenyangMl/llama2.c-zh) by @[chenyangMl](https://github.com/chenyangMl): Expand tokenizer to support training and inference in both Chinese and English

 ## unsorted todos

- add multiquery support into run.c
- add custom bpe training code and the ability to train a smaller vocabulary (32K is too much)
+- make it easier to add a new dataset with not too much pain
 - should calculate freq_cis online in the script run.c instead of loading them
 - int4/8 quantization
 - export the model in a more sensible output format with a proper header, etc.
- train a tiny Llama test model (committed to repo) and use it as reference in unit tests
 - support Llama 2 7B Chat models and tune run.c to Chat UI/UX
 - llama2.cu investigate and merge
 - (LoRA) finetuning and export of Llama 2 models
@@ -0,0 +1,113 @@
+"""
+This script exports the Llama 2 weights in llama2c.bin format.
+"""
+import os
+import sys
+import struct
+from pathlib import Path
+import json
+
+import torch
+
+from model import precompute_freqs_cis
+
+
+def export(p, state_dict, filepath='model.bin'):
+    """export the model weights in fp32 into .bin file to be read from C"""
+    f = open(filepath, 'wb')
+
+    def serialize(key):
+        print(f"writing {key}...")
+        t = state_dict[key].contiguous().view(-1).type(torch.float32).numpy()
+        f.write(memoryview(t))
+        del state_dict[key]
+
+    # first write out the header
+    hidden_dim = state_dict['model.layers.0.mlp.gate_proj.weight'].shape[0]
+    p['vocab_size'] = 32000
+    p['max_seq_len'] = 2048
+
+    n_kv_heads = p.get('n_kv_heads') or p['n_heads']
+    header = struct.pack(
+        'iiiiiii',
+        p['dim'], hidden_dim, p['n_layers'], p['n_heads'],
+        n_kv_heads, -p['vocab_size'], p['max_seq_len']
+    )
+    # NOTE ABOVE: -ve vocab_size is indicating that the classifier weights are present
+    # in the checkpoint and should be loaded.
+    f.write(header)
+
+    # next write out the embedding weights
+    print("writing tok_embeddings...")
+    serialize('model.embed_tokens.weight')
+
+    # now all the layers
+    # attention weights
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.input_layernorm.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.q_proj.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.k_proj.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.v_proj.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.o_proj.weight')
+    # ffn weights
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.post_attention_layernorm.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.mlp.gate_proj.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.mlp.down_proj.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.mlp.up_proj.weight')
+
+    # final rmsnorm
+    serialize('model.norm.weight')
+    # freqs_cos, freqs_sin
+    freqs_cos, freqs_sin = precompute_freqs_cis(p['dim'] // p['n_heads'], p['max_seq_len'] * 2)
+    state_dict['freqs_cos'] = freqs_cos[:p['max_seq_len']]
+    state_dict['freqs_sin'] = freqs_sin[:p['max_seq_len']]
+    # check if this requires additional conversion
+    serialize('freqs_cos')
+    serialize('freqs_sin')
+
+    # finally write the output weights
+    serialize('lm_head.weight')
+
+    f.close()
+    print(f"wrote {filepath}")
+
+
+def concat_weights(models):
+    state_dict = {}
+    for name in list(models[0]):
+        tensors = [model[name] for model in models]
+        if len(tensors) == 1 or len(tensors[0].shape) == 1:
+            state_dict[name] = tensors[0]
+            continue
+        is_axis_1 = (
+            name.startswith('model.embed_tokens.weight')
+            or name.endswith('.self_attn.o_proj.weight')
+            or name.endswith('.mlp.down_proj.weight')
+        )
+        axis = 1 if is_axis_1 else 0
+        state_dict[name] = torch.cat(tensors, dim=axis)
+        for model in models:
+            del model[name]
+    return state_dict
+
+
+def load_and_export(model_path, output_path):
+    params_path = os.path.join(model_path, 'params.json')
+    with open(params_path) as f:
+        params = json.load(f)
+        print(params)
+
+    model_paths = sorted(list(Path(model_path).glob('consolidated.*.pth')))
+    models = [torch.load(p, map_location='cpu') for p in model_paths]
+    state_dict = concat_weights(models)
+    del models
+    export(params, state_dict, output_path)
+
+
+if __name__ == '__main__':
+    if len(sys.argv) != 3:  # both the model folder and the output path are required
+        print('[Llama model folder path] [output path]')
+        exit()
+
+    model_path = sys.argv[1]
+    output_path = sys.argv[2]
+    load_and_export(model_path, output_path)
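
For readers following the header layout written by this export script, here is a minimal sketch of how the 7-int header could be read back in Python. It assumes a little-endian machine wrote the file (the script itself uses native byte order), and it assumes that a positive `vocab_size` means the classifier is shared with the embedding table, which is the counterpart of the negative-`vocab_size` note in the script. The function name `read_legacy_header` is hypothetical.

```python
import struct

# 7 consecutive int32 values, assumed little-endian, no padding
HEADER = struct.Struct("<7i")
FIELDS = ("dim", "hidden_dim", "n_layers", "n_heads",
          "n_kv_heads", "vocab_size", "max_seq_len")


def read_legacy_header(buf):
    """Parse the 28-byte header from the start of buf into a dict."""
    header = dict(zip(FIELDS, HEADER.unpack_from(buf)))
    # a negative vocab_size signals that separate classifier weights follow
    header["shared_classifier"] = header["vocab_size"] > 0
    header["vocab_size"] = abs(header["vocab_size"])
    return header


if __name__ == "__main__":
    # example values only, packed the same way the export script packs them
    raw = struct.pack("<7i", 4096, 11008, 32, 32, 32, -32000, 2048)
    print(read_legacy_header(raw))
```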
@@ -11,12 +11,13 @@ from torch import nn

@dataclass
 class ModelArgs:
+    # default hyperparameters for the Llama 7B model
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
-    vocab_size: int = -1  # defined later by tokenizer
-    multiple_of: int = 256  # make SwiGLU hidden layer size multiple of large power of 2
+    vocab_size: int = 32000
+    multiple_of: int = 256  # MLP hidden layer size will be a multiple of this
    norm_eps: float = 1e-5
    max_seq_len: int = 2048
    dropout: float = 0.0
@@ -93,6 +94,7 @@ class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
+        assert args.n_heads % self.n_kv_heads == 0
        model_parallel_size = 1
        self.n_local_heads = args.n_heads // model_parallel_size
        self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
@@ -338,53 +340,125 @@ class Transformer(nn.Module):
        return idx

    def export(self, filepath='model.bin'):
-        """export the model weights in fp32 into .bin file to be read from C"""
-        f = open(filepath, 'wb')
+        """export the model weights in Q8_0 into .bin file to be read from C"""
+        out_file = open(filepath, 'wb')

-        def serialize(t):
+        # find the largest group size that evenly divides dim, halving from the default
+        group_size = 64 # a good desired group size default
+        while self.params.dim % group_size != 0:
+            group_size //= 2
+        print(f"using group size {group_size} for quantization")
+
+        def serialize_fp32(t):
+            """ writes one fp32 tensor to file """
            d = t.detach().cpu().view(-1).numpy().astype(np.float32)
            b = struct.pack(f'{len(d)}f', *d)
-            f.write(b)
+            out_file.write(b)

-        # first write out the header
-        hidden_dim = self.layers[0].feed_forward.w1.weight.shape[0]
+        def serialize_int8(t):
+            """ writes one int8 tensor to file """
+            d = t.detach().cpu().view(-1).numpy().astype(np.int8)
+            b = struct.pack(f'{len(d)}b', *d)
+            out_file.write(b)
+
+        def quantize_q80(w):
+            """
+            takes a tensor and returns the Q8_0 quantized version
+            i.e. symmetric quantization into int8, range [-127,127]
+            """
+            assert w.numel() % group_size == 0
+            ori_shape = w.shape
+            w = w.float() # convert to float32
+            w = w.reshape(-1, group_size)
+            # find the max in each group
+            wmax = torch.abs(w).max(dim=1).values
+            # calculate the scaling factor such that float = quant * scale
+            scale = wmax / 127.0
+            # scale into range [-127, 127]
+            quant = w / scale[:,None]
+            # round to nearest integer
+            int8val = torch.round(quant).to(torch.int8)
+            # dequantize by rescaling
+            fp32val = (int8val.float() * scale[:,None]).view(-1)
+            fp32valr = fp32val.reshape(-1, group_size)
+            # calculate the max error in each group
+            err = torch.abs(fp32valr - w).max(dim=1).values
+            # find the max error across all groups
+            maxerr = err.max().item()
+            return int8val, scale, maxerr
+
+        # first write out the header. the header will be 256 bytes
+        nbytes = 0
+        # 1) write magic, which will be uint32 of "ak42" in ASCII
+        out_file.write(struct.pack('I', 0x616b3432))
+        nbytes += 4
+        # 2) write version, which will be int
+        out_file.write(struct.pack('i', 1))
+        nbytes += 4
+        # 3) write the params, which will be 7 ints
        p = self.params
+        hidden_dim = self.layers[0].feed_forward.w1.weight.shape[0]
        n_kv_heads = p.n_heads if p.n_kv_heads is None else p.n_kv_heads
        header = struct.pack('iiiiiii', p.dim, hidden_dim, p.n_layers, p.n_heads,
                                       n_kv_heads, p.vocab_size, p.max_seq_len)
-        f.write(header)
+        out_file.write(header)
+        nbytes += 7*4
+        # 4) write some other flags
+        shared_classifier = 1 # we do share a classifier, write flag as a byte
+        out_file.write(struct.pack('B', shared_classifier))
+        nbytes += 1
+        out_file.write(struct.pack('i', group_size)) # group size used for quantization
+        nbytes += 4
+        pad = 256 - nbytes # pad the rest with zeros
+        assert pad >= 0
+        out_file.write(b'\0' * pad)
+        # now that the header is done, let's write out the model

-        # next write out the embedding weights
-        serialize(self.tok_embeddings.weight)
+        # first let's write out all the params that we are keeping in fp32: the norms
+        for layer in self.layers: # attention norms
+            serialize_fp32(layer.attention_norm.weight)
+        for layer in self.layers: # MLP norms
+            serialize_fp32(layer.ffn_norm.weight)
+        serialize_fp32(self.norm.weight) # final pre-classifier norm

-        # now all the layers
-        # attention weights
-        for layer in self.layers:
-            serialize(layer.attention_norm.weight)
-        for layer in self.layers:
-            serialize(layer.attention.wq.weight)
-        for layer in self.layers:
-            serialize(layer.attention.wk.weight)
-        for layer in self.layers:
-            serialize(layer.attention.wv.weight)
-        for layer in self.layers:
-            serialize(layer.attention.wo.weight)
-        # ffn weights
-        for layer in self.layers:
-            serialize(layer.ffn_norm.weight)
-        for layer in self.layers:
-            serialize(layer.feed_forward.w1.weight)
-        for layer in self.layers:
-            serialize(layer.feed_forward.w2.weight)
-        for layer in self.layers:
-            serialize(layer.feed_forward.w3.weight)
-        # final rmsnorm
-        serialize(self.norm.weight)
-        # note: no need to write final classifier weights due to weight sharing
-        # freqs_cis
-        serialize(self.freqs_cos[:p.max_seq_len])
-        serialize(self.freqs_sin[:p.max_seq_len])
+        # now let's write out all the params that we are quantizing to Q8_0
+        # note we skip classifier weights, which are shared with the embedding
+        weights = [
+            self.tok_embeddings.weight,
+            *[layer.attention.wq.weight for layer in self.layers],
+            *[layer.attention.wk.weight for layer in self.layers],
+            *[layer.attention.wv.weight for layer in self.layers],
+            *[layer.attention.wo.weight for layer in self.layers],
+            *[layer.feed_forward.w1.weight for layer in self.layers],
+            *[layer.feed_forward.w2.weight for layer in self.layers],
+            *[layer.feed_forward.w3.weight for layer in self.layers],
+        ]
+
+        ew = []
+        scales = []
+        for i, w in enumerate(weights):
+            assert w.numel() % group_size == 0, f"weight {i} has numel {w.numel()}, not a multiple of group_size {group_size}"
+
+            # quantize this weight
+            q, s, err = quantize_q80(w)
+
+            # save to file
+            serialize_int8(q) # save the tensor in int8
+            scales.append(s)  # we'll do all the scales after all the qs
+
+            # logging
+            ew.append((err, w.shape))
+            print(f"{i+1}/{len(weights)} quantized {tuple(w.shape)} to Q8_0 with max error {err}")
+
+        # save the scaling factors in fp32 here
+        # this is done to keep all the weights contiguous, making pointer arithmetic easier in C
+        for s in scales:
+            serialize_fp32(s)
+
+        # print the highest error across all weights, should be very small, e.g. O(~0.001)
+        ew.sort(reverse=True)
+        print(f"max quantization group error across all weights: {ew[0][0]}")

        # write to binary file
-        f.close()
+        out_file.close()
        print(f"wrote {filepath}")
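Reviewer note on the export loop above: `quantize_q80` itself is not part of this hunk, so the following is only a sketch of what a symmetric per-group int8 (Q8_0-style) quantizer could look like. It uses plain Python lists instead of torch tensors, and the name `quantize_q80_sketch` is an assumption made for illustration, not the function used in the export script.

```python
def quantize_q80_sketch(values, group_size):
    """Symmetric int8 quantization in groups; returns (q, scales, max_err).

    Hypothetical stand-in for quantize_q80: works on a flat list of floats.
    """
    assert len(values) % group_size == 0, "numel must be a multiple of group_size"
    q, scales = [], []
    max_err = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # one fp32 scale per group, chosen so the largest |value| maps to 127
        wmax = max(abs(v) for v in group)
        scale = wmax / 127.0
        scales.append(scale)
        for v in group:
            # an all-zero group has scale 0; map it to 0 instead of dividing
            qi = round(v / scale) if scale > 0.0 else 0
            q.append(qi)
            # track the worst reconstruction error, as the export log does
            max_err = max(max_err, abs(qi * scale - v))
    return q, scales, max_err
```

The reconstruction error of each element is bounded by half of its group's scale, which is why the logged maximum error is expected to stay small for well-behaved weights.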
@@ -2,7 +2,6 @@ numpy==1.23.5
 pytest==7.4.0
 Requests==2.31.0
 sentencepiece==0.1.99
-tiktoken==0.3.3
 torch==2.0.1
 tqdm==4.64.1
 wandb==0.15.5
@@ -1,15 +1,9 @@
-/*
-Inference for Llama-2 Transformer model in pure C.
-
-Example compile: (see README for more details)
-$ gcc -O3 -o run run.c -lm
-
-Then run with:
-$ ./run
-*/
+/* Inference for Llama-2 Transformer model in pure C */

+#include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <ctype.h>
 #include <time.h>
 #include <math.h>
 #include <string.h>
@@ -20,41 +14,49 @@ $ ./run
    #include <unistd.h>
    #include <sys/mman.h>
 #endif
+
+// ----------------------------------------------------------------------------
+// Globals
+
+int GS = 0;  // group size global for quantization of weights
+
 // ----------------------------------------------------------------------------
 // Transformer and RunState structs, and related memory management

 typedef struct {
-    int dim; // transformer dimension
-    int hidden_dim; // for ffn layers
-    int n_layers; // number of layers
-    int n_heads; // number of query heads
-    int n_kv_heads; // number of key/value heads (can be < query heads because of multiquery)
-    int vocab_size; // vocabulary size, usually 256 (byte-level)
-    int seq_len; // max sequence length
+    int dim;        // transformer dimension
+    int hidden_dim; // dimension of the inner layer in the MLP
+    int n_layers;   // number of layers
+    int n_heads;    // number of query heads
+    int n_kv_heads; // number of key & value heads (can be < query heads because of multiquery)
+    int vocab_size; // vocabulary size (size of the classifier weights)
+    int seq_len;    // max sequence length the model was trained with
 } Config;

+typedef struct {
+    int8_t* q;    // quantized values
+    float* s; // scaling factors
+} QuantizedTensor;
+
 typedef struct {
    // token embedding table
-    float* token_embedding_table;    // (vocab_size, dim)
+    QuantizedTensor token_embedding_table; // (vocab_size, dim)
    // weights for rmsnorms
    float* rms_att_weight; // (layer, dim) rmsnorm weights
    float* rms_ffn_weight; // (layer, dim)
-    // weights for matmuls
-    float* wq; // (layer, dim, dim)
-    float* wk; // (layer, dim, dim)
-    float* wv; // (layer, dim, dim)
-    float* wo; // (layer, dim, dim)
+    // weights for matmuls. note dim == n_heads * head_size
+    QuantizedTensor wq; // (layer, dim, n_heads * head_size)
+    QuantizedTensor wk; // (layer, dim, n_kv_heads * head_size)
+    QuantizedTensor wv; // (layer, dim, n_kv_heads * head_size)
+    QuantizedTensor wo; // (layer, n_heads * head_size, dim)
    // weights for ffn
-    float* w1; // (layer, hidden_dim, dim)
-    float* w2; // (layer, dim, hidden_dim)
-    float* w3; // (layer, hidden_dim, dim)
+    QuantizedTensor w1; // (layer, hidden_dim, dim)
+    QuantizedTensor w2; // (layer, dim, hidden_dim)
+    QuantizedTensor w3; // (layer, hidden_dim, dim)
    // final rmsnorm
    float* rms_final_weight; // (dim,)
-    // freq_cis for RoPE relatively positional embeddings
-    float* freq_cis_real; // (seq_len, head_size/2)
-    float* freq_cis_imag; // (seq_len, head_size/2)
    // (optional) classifier weights for the logits, on the last layer
-    float* wcls;
+    QuantizedTensor wcls; // (dim, vocab_size)
 } TransformerWeights;

 typedef struct {
@@ -69,6 +71,8 @@ typedef struct {
    float *xb2; // an additional buffer just for convenience (dim,)
    float *hb; // buffer for hidden dimension in the ffn (hidden_dim,)
    float *hb2; // buffer for hidden dimension in the ffn (hidden_dim,)
+    QuantizedTensor xq; // quantized x (dim,)
+    QuantizedTensor hq; // quantized hb (hidden_dim,)
    float *q; // query (dim,)
    float *k; // key (dim,)
    float *v; // value (dim,)
@@ -82,19 +86,22 @@ typedef struct {

 void malloc_run_state(RunState* s, Config* p) {
    // we calloc instead of malloc to keep valgrind happy
+    int kv_dim = (p->dim * p->n_kv_heads) / p->n_heads;
    s->x = calloc(p->dim, sizeof(float));
    s->xb = calloc(p->dim, sizeof(float));
    s->xb2 = calloc(p->dim, sizeof(float));
    s->hb = calloc(p->hidden_dim, sizeof(float));
    s->hb2 = calloc(p->hidden_dim, sizeof(float));
+    s->xq = (QuantizedTensor) { .q = calloc(p->dim, sizeof(int8_t)), .s = calloc(p->dim, sizeof(float)) };
+    s->hq = (QuantizedTensor) { .q = calloc(p->hidden_dim, sizeof(int8_t)), .s = calloc(p->hidden_dim, sizeof(float)) };
    s->q = calloc(p->dim, sizeof(float));
-    s->k = calloc(p->dim, sizeof(float));
-    s->v = calloc(p->dim, sizeof(float));
+    s->k = calloc(kv_dim, sizeof(float));
+    s->v = calloc(kv_dim, sizeof(float));
    s->att = calloc(p->n_heads * p->seq_len, sizeof(float));
    s->logits = calloc(p->vocab_size, sizeof(float));
    s->probindex = calloc(p->vocab_size, sizeof(ProbIndex));
-    s->key_cache = calloc(p->n_layers * p->seq_len * p->dim, sizeof(float));
-    s->value_cache = calloc(p->n_layers * p->seq_len * p->dim, sizeof(float));
+    s->key_cache = calloc(p->n_layers * p->seq_len * kv_dim, sizeof(float));
+    s->value_cache = calloc(p->n_layers * p->seq_len * kv_dim, sizeof(float));
    // ensure all mallocs went fine
    if (!s->x || !s->xb || !s->xb2 || !s->hb || !s->hb2 || !s->q
     || !s->k || !s->v || !s->att || !s->logits || !s->key_cache
@@ -110,6 +117,10 @@ void free_run_state(RunState* s) {
    free(s->xb2);
    free(s->hb);
    free(s->hb2);
+    free(s->xq.q);
+    free(s->xq.s);
+    free(s->hq.q);
+    free(s->hq.s);
    free(s->q);
    free(s->k);
    free(s->v);
@@ -123,47 +134,72 @@ void free_run_state(RunState* s) {
 // ----------------------------------------------------------------------------
 // initialization: read from checkpoint

-void checkpoint_init_weights(TransformerWeights *w, Config* p, float* f, int shared_weights) {
-    float* ptr = f;
-    w->token_embedding_table = ptr;
-    ptr += p->vocab_size * p->dim;
-    w->rms_att_weight = ptr;
-    ptr += p->n_layers * p->dim;
-    w->wq = ptr;
-    ptr += p->n_layers * p->dim * p->dim;
-    w->wk = ptr;
-    ptr += p->n_layers * p->dim * p->dim;
-    w->wv = ptr;
-    ptr += p->n_layers * p->dim * p->dim;
-    w->wo = ptr;
-    ptr += p->n_layers * p->dim * p->dim;
-    w->rms_ffn_weight = ptr;
-    ptr += p->n_layers * p->dim;
-    w->w1 = ptr;
-    ptr += p->n_layers * p->dim * p->hidden_dim;
-    w->w2 = ptr;
-    ptr += p->n_layers * p->hidden_dim * p->dim;
-    w->w3 = ptr;
-    ptr += p->n_layers * p->dim * p->hidden_dim;
-    w->rms_final_weight = ptr;
-    ptr += p->dim;
-    w->freq_cis_real = ptr;
+void checkpoint_init_weights(TransformerWeights *w, Config* p, void* ptr, uint8_t shared_classifier) {
    int head_size = p->dim / p->n_heads;
-    ptr += p->seq_len * head_size / 2;
-    w->freq_cis_imag = ptr;
-    ptr += p->seq_len * head_size / 2;
-    w->wcls = shared_weights ? w->token_embedding_table : ptr;
+
+    // first are the parameters that are kept in fp32 (the rmsnorm (1D) weights)
+    float* fptr = (float*) ptr; // cast our pointer to float*
+    w->rms_att_weight = fptr;
+    fptr += p->n_layers * p->dim;
+    w->rms_ffn_weight = fptr;
+    fptr += p->n_layers * p->dim;
+    w->rms_final_weight = fptr;
+    fptr += p->dim;
+
+    // now read all the quantized weights
+    int8_t* qptr = (int8_t*) fptr; // now cast the pointer to int8_t*
+    w->token_embedding_table.q = qptr;
+    qptr += p->vocab_size * p->dim;
+    w->wq.q = qptr;
+    qptr += p->n_layers * p->dim * (p->n_heads * head_size);
+    w->wk.q = qptr;
+    qptr += p->n_layers * p->dim * (p->n_kv_heads * head_size);
+    w->wv.q = qptr;
+    qptr += p->n_layers * p->dim * (p->n_kv_heads * head_size);
+    w->wo.q = qptr;
+    qptr += p->n_layers * (p->n_heads * head_size) * p->dim;
+    w->w1.q = qptr;
+    qptr += p->n_layers * p->dim * p->hidden_dim;
+    w->w2.q = qptr;
+    qptr += p->n_layers * p->hidden_dim * p->dim;
+    w->w3.q = qptr;
+    qptr += p->n_layers * p->dim * p->hidden_dim;
+    if (shared_classifier) {
+        w->wcls.q = w->token_embedding_table.q;
+    } else {
+        w->wcls.q = qptr;
+        qptr += p->dim * p->vocab_size;
+    }
+
+    // and finally all the associated scaling factors
+    float* sptr = (float*) qptr; // cast pointer back to float*
+    w->token_embedding_table.s = sptr;
+    sptr += p->vocab_size * p->dim / GS;
+    w->wq.s = sptr;
+    sptr += p->n_layers * p->dim * (p->n_heads * head_size) / GS;
+    w->wk.s = sptr;
+    sptr += p->n_layers * p->dim * (p->n_kv_heads * head_size) / GS;
+    w->wv.s = sptr;
+    sptr += p->n_layers * p->dim * (p->n_kv_heads * head_size) / GS;
+    w->wo.s = sptr;
+    sptr += p->n_layers * (p->n_heads * head_size) * p->dim / GS;
+    w->w1.s = sptr;
+    sptr += p->n_layers * p->dim * p->hidden_dim / GS;
+    w->w2.s = sptr;
+    sptr += p->n_layers * p->hidden_dim * p->dim / GS;
+    w->w3.s = sptr;
+    sptr += p->n_layers * p->dim * p->hidden_dim / GS;
+    if (shared_classifier) {
+        w->wcls.s = w->token_embedding_table.s;
+    } else {
+        w->wcls.s = sptr;
+        sptr += p->dim * p->vocab_size / GS;
+    }
 }

 // ----------------------------------------------------------------------------
 // neural net blocks

-void accum(float *a, float *b, int size) {
-    for (int i = 0; i < size; i++) {
-        a[i] += b[i];
-    }
-}
-
 void rmsnorm(float* o, float* x, float* weight, int size) {
    // calculate sum of squares
    float ss = 0.0f;
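Reviewer note on `checkpoint_init_weights` above: the pointer walk implies a file layout of header, fp32 norms, all int8 tensors, then all fp32 scales. The sketch below recomputes those byte offsets in Python as a cross-check. The function name `checkpoint_offsets` and the dictionary keys are illustrative assumptions; the 256-byte header size is taken from the version 1 loader in this diff.

```python
def checkpoint_offsets(dim, hidden_dim, n_layers, n_heads, n_kv_heads,
                       vocab_size, group_size, shared_classifier=True,
                       header_size=256):
    """Return ({name: byte offset}, total_bytes) for the quantized layout."""
    head_size = dim // n_heads
    # fp32 section: attention norms, ffn norms, final norm
    fp32_count = 2 * n_layers * dim + dim
    # element counts of the int8 tensors, in the order they are written
    counts = {
        "token_embedding_table": vocab_size * dim,
        "wq": n_layers * dim * (n_heads * head_size),
        "wk": n_layers * dim * (n_kv_heads * head_size),
        "wv": n_layers * dim * (n_kv_heads * head_size),
        "wo": n_layers * (n_heads * head_size) * dim,
        "w1": n_layers * dim * hidden_dim,
        "w2": n_layers * hidden_dim * dim,
        "w3": n_layers * dim * hidden_dim,
    }
    if not shared_classifier:
        counts["wcls"] = dim * vocab_size
    offsets = {}
    pos = header_size + 4 * fp32_count
    for name, count in counts.items():   # int8 values: 1 byte each
        offsets[name + ".q"] = pos
        pos += count
    for name, count in counts.items():   # fp32 scales: one per group
        offsets[name + ".s"] = pos
        pos += 4 * (count // group_size)
    return offsets, pos
```

Comparing the returned total against the size of an exported file is a cheap sanity check that exporter and loader agree on the layout.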
@@ -199,35 +235,80 @@ void softmax(float* x, int size) {
    }
 }

-void matmul(float* xout, float* x, float* w, int n, int d) {
+void matmul(float* xout, int8_t* xq, float* xs, int8_t* wq, float* ws, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    // by far the most amount of time is spent inside this little function
+    // inputs to this function are both quantized
+
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
+
        float val = 0.0f;
-        for (int j = 0; j < n; j++) {
-            val += w[i * n + j] * x[j];
+        int32_t ival = 0;
+        int in = i * n;
+
+        // do the matmul in groups of GS
+        int j;
+        for (j = 0; j <= n - GS; j += GS) {
+            for (int k = 0; k < GS; k++) {
+                ival += ((int32_t) xq[j + k]) * ((int32_t) wq[in + j + k]);
+            }
+            val += ((float) ival) * ws[(in + j) / GS] * xs[j / GS];
+            ival = 0;
        }
+
        xout[i] = val;
    }
 }

+void dequantize(int8_t* q, float* s, float* x, int n) {
+    for (int i = 0; i < n; i++) {
+        x[i] = q[i] * s[i / GS];
+    }
+}
+
+void quantize(float* x, int8_t* q, float* s, int n) {
+    int num_groups = n / GS;
+    float Q_MAX = 127.0f;
+
+    for (int group = 0; group < num_groups; group++) {
+
+        // find the max absolute value in the current group
+        float wmax = 0.0;
+        for (int i = 0; i < GS; i++) {
+            float val = fabs(x[group * GS + i]);
+            if (val > wmax) {
+                wmax = val;
+            }
+        }
+
+        // calculate and write the scaling factor
+        float scale = wmax / Q_MAX;
+        s[group] = scale;
+
+        // calculate and write the quantized values
+        for (int i = 0; i < GS; i++) {
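+            float quant_value = scale > 0.0f ? x[group * GS + i] / scale : 0.0f; // scale (an all-zero group has scale 0)
+            int8_t quantized = (int8_t) round(quant_value); // round; |quant_value| <= Q_MAX by construction, so no clamp is needed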
+            q[group * GS + i] = quantized;
+        }
+    }
+}
+
 void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights* w) {

    // a few convenience variables
    float *x = s->x;
    int dim = p->dim;
+    int kv_dim = (p->dim * p->n_kv_heads) / p->n_heads;
+    int kv_mul = p->n_heads / p->n_kv_heads; // integer multiplier of the kv sharing in multiquery
    int hidden_dim =  p->hidden_dim;
    int head_size = dim / p->n_heads;

-    // copy the token embedding into x
-    float* content_row = &(w->token_embedding_table[token * dim]);
-    memcpy(x, content_row, dim*sizeof(*x));
-
-    // pluck out the "pos" row of freq_cis_real and freq_cis_imag
-    float* freq_cis_real_row = w->freq_cis_real + pos * head_size / 2;
-    float* freq_cis_imag_row = w->freq_cis_imag + pos * head_size / 2;
+    // dequantize the token embedding into a float x
+    QuantizedTensor tok = w->token_embedding_table;
+    dequantize(tok.q + token * dim, tok.s + token * dim / GS, x, dim);

    // forward all the layers
    for(int l = 0; l < p->n_layers; l++) {
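Reviewer note on the new `matmul` above: the inner loop accumulates int8 products in an integer per group and applies the two scales once per group. A minimal Python sketch of the same arithmetic follows, assuming flat row-major lists; `quantize_groups` and `matmul_q8` are hypothetical helper names, not functions from run.c.

```python
def quantize_groups(x, gs):
    """Per-group symmetric int8 quantization of a flat list of floats."""
    q, s = [], []
    for start in range(0, len(x), gs):
        group = x[start:start + gs]
        scale = max(abs(v) for v in group) / 127.0
        s.append(scale)
        q.extend(round(v / scale) if scale > 0.0 else 0 for v in group)
    return q, s


def matmul_q8(xq, xs, wq, ws, n, d, gs):
    """W (d, n) @ x (n,) -> out (d,), with both operands quantized."""
    out = []
    for i in range(d):
        val = 0.0
        base = i * n
        for j in range(0, n - gs + 1, gs):
            # integer dot product over one group
            ival = sum(xq[j + k] * wq[base + j + k] for k in range(gs))
            # rescale once per group using the weight and activation scales
            val += ival * ws[(base + j) // gs] * xs[j // gs]
        out.append(val)
    return out
```

As in the C version, the group index of a weight is computed from its flat position, so each row length `n` has to be a multiple of the group size for the groups to line up with row boundaries.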
@@ -236,30 +317,34 @@ void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights*
        rmsnorm(s->xb, x, w->rms_att_weight + l*dim, dim);

        // qkv matmuls for this position
-        matmul(s->q, s->xb, w->wq + l*dim*dim, dim, dim);
-        matmul(s->k, s->xb, w->wk + l*dim*dim, dim, dim);
-        matmul(s->v, s->xb, w->wv + l*dim*dim, dim, dim);
+        quantize(s->xb, s->xq.q, s->xq.s, dim);
+        matmul(s->q, s->xq.q, s->xq.s, w->wq.q + l*dim*dim,    w->wq.s + l*dim*dim/GS, dim, dim);
+        matmul(s->k, s->xq.q, s->xq.s, w->wk.q + l*dim*kv_dim, w->wk.s + l*dim*kv_dim/GS, dim, kv_dim);
+        matmul(s->v, s->xq.q, s->xq.s, w->wv.q + l*dim*kv_dim, w->wv.s + l*dim*kv_dim/GS, dim, kv_dim);

-        // RoPE relative positional encoding: complex-valued rotate q and k by freq_cis in each head
+        // RoPE relative positional encoding: complex-valued rotate q and k in each head
        for (int i = 0; i < dim; i+=2) {
-            float q0 = s->q[i];
-            float q1 = s->q[i+1];
-            float k0 = s->k[i];
-            float k1 = s->k[i+1];
-            float fcr = freq_cis_real_row[(i % head_size) / 2];
-            float fci = freq_cis_imag_row[(i % head_size) / 2];
-            s->q[i]   = q0 * fcr - q1 * fci;
-            s->q[i+1] = q0 * fci + q1 * fcr;
-            s->k[i]   = k0 * fcr - k1 * fci;
-            s->k[i+1] = k0 * fci + k1 * fcr;
+            int head_dim = i % head_size;
+            float freq = 1.0f / powf(10000.0f, head_dim / (float)head_size);
+            float val = pos * freq;
+            float fcr = cosf(val);
+            float fci = sinf(val);
+            int rotn = i < kv_dim ? 2 : 1; // how many vectors? 2 = q & k, 1 = q only
+            for (int v = 0; v < rotn; v++) {
+                float* vec = v == 0 ? s->q : s->k; // the vector to rotate (query or key)
+                float v0 = vec[i];
+                float v1 = vec[i+1];
+                vec[i]   = v0 * fcr - v1 * fci;
+                vec[i+1] = v0 * fci + v1 * fcr;
+            }
        }

        // save key,value at this time step (pos) to our kv cache
-        int loff = l * p->seq_len * dim; // kv cache layer offset for convenience
-        float* key_cache_row = s->key_cache + loff + pos * dim;
-        float* value_cache_row = s->value_cache + loff + pos * dim;
-        memcpy(key_cache_row, s->k, dim*sizeof(*key_cache_row));
-        memcpy(value_cache_row, s->v, dim*sizeof(*value_cache_row));
+        int loff = l * p->seq_len * kv_dim; // kv cache layer offset for convenience
+        float* key_cache_row = s->key_cache + loff + pos * kv_dim;
+        float* value_cache_row = s->value_cache + loff + pos * kv_dim;
+        memcpy(key_cache_row, s->k, kv_dim * sizeof(*key_cache_row));
+        memcpy(value_cache_row, s->v, kv_dim * sizeof(*value_cache_row));

        // multihead attention. iterate over all heads
        int h;
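Reviewer note on the RoPE change above: the precomputed `freq_cis` tables are replaced by computing the rotation angle on the fly from `pos` and the index within the head. The sketch below shows that rotation for a single vector in Python. `rope_rotate` is a hypothetical name; unlike the C loop it does not handle the case where the key vector is shorter than the query vector (`kv_dim < dim`).

```python
import math


def rope_rotate(vec, pos, head_size):
    """Rotate consecutive pairs of vec by a position-dependent angle."""
    out = list(vec)
    for i in range(0, len(vec), 2):
        head_dim = i % head_size                      # index within the head
        freq = 1.0 / (10000.0 ** (head_dim / head_size))
        angle = pos * freq
        fcr, fci = math.cos(angle), math.sin(angle)
        v0, v1 = vec[i], vec[i + 1]
        # complex multiplication (v0 + i*v1) * (fcr + i*fci)
        out[i] = v0 * fcr - v1 * fci
        out[i + 1] = v0 * fci + v1 * fcr
    return out
```

Because each pair is rotated rather than scaled, the norm of every pair is unchanged, which gives a simple property to test against.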
@@ -272,7 +357,7 @@ void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights*
            // iterate over all timesteps, including the current one
            for (int t = 0; t <= pos; t++) {
                // get the key vector for this head and at this timestep
-                float* k = s->key_cache + loff + t * dim + h * head_size;
+                float* k = s->key_cache + loff + t * kv_dim + (h / kv_mul) * head_size;
                // calculate the attention score as the dot product of q and k
                float score = 0.0f;
                for (int i = 0; i < head_size; i++) {
@@ -291,7 +376,7 @@ void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights*
            memset(xb, 0, head_size * sizeof(float));
            for (int t = 0; t <= pos; t++) {
                // get the value vector for this head and at this timestep
-                float* v = s->value_cache + loff + t * dim + h * head_size;
+                float* v = s->value_cache + loff + t * kv_dim + (h / kv_mul) * head_size;
                // get the attention weight for this timestep
                float a = att[t];
                // accumulate the weighted value into xb
@@ -302,18 +387,22 @@ void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights*
        }

        // final matmul to get the output of the attention
-        matmul(s->xb2, s->xb, w->wo + l*dim*dim, dim, dim);
+        quantize(s->xb, s->xq.q, s->xq.s, dim);
+        matmul(s->xb2, s->xq.q, s->xq.s, w->wo.q + l*dim*dim, w->wo.s + l*dim*dim/GS, dim, dim);

        // residual connection back into x
-        accum(x, s->xb2, dim);
+        for (int i = 0; i < dim; i++) {
+            x[i] += s->xb2[i];
+        }

        // ffn rmsnorm
        rmsnorm(s->xb, x, w->rms_ffn_weight + l*dim, dim);

        // Now for FFN in PyTorch we have: self.w2(F.silu(self.w1(x)) * self.w3(x))
        // first calculate self.w1(x) and self.w3(x)
-        matmul(s->hb, s->xb, w->w1 + l*dim*hidden_dim, dim, hidden_dim);
-        matmul(s->hb2, s->xb, w->w3 + l*dim*hidden_dim, dim, hidden_dim);
+        quantize(s->xb, s->xq.q, s->xq.s, dim);
+        matmul(s->hb, s->xq.q, s->xq.s, w->w1.q + l*dim*hidden_dim, w->w1.s + l*dim*hidden_dim/GS, dim, hidden_dim);
+        matmul(s->hb2, s->xq.q, s->xq.s, w->w3.q + l*dim*hidden_dim, w->w3.s + l*dim*hidden_dim/GS, dim, hidden_dim);

        // F.silu; silu(x)=x*σ(x),where σ(x) is the logistic sigmoid
        for (int i = 0; i < hidden_dim; i++) {
@@ -326,45 +415,107 @@ void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights*
        }

        // final matmul to get the output of the ffn
-        matmul(s->xb, s->hb, w->w2 + l*dim*hidden_dim, hidden_dim, dim);
+        quantize(s->hb, s->hq.q, s->hq.s, hidden_dim);
+        matmul(s->xb, s->hq.q, s->hq.s, w->w2.q + l*dim*hidden_dim, w->w2.s + l*dim*hidden_dim/GS, hidden_dim, dim);

        // residual connection
-        accum(x, s->xb, dim);
+        for (int i = 0; i < dim; i++) {
+            x[i] += s->xb[i];
+        }
    }

    // final rmsnorm
    rmsnorm(x, x, w->rms_final_weight, dim);

    // classifier into logits
-    matmul(s->logits, x, w->wcls, p->dim, p->vocab_size);
+    quantize(x, s->xq.q, s->xq.s, dim);
+    matmul(s->logits, s->xq.q, s->xq.s, w->wcls.q, w->wcls.s, dim, p->vocab_size);
 }

 // ----------------------------------------------------------------------------
 // byte pair encoding (BPE) tokenizer, encodes strings into tokens so we can prompt

-int str_lookup(char *str, char **vocab, int vocab_size) {
-    // find the first perfect match for str in vocab, return its index or -1 if not found
-    for (int i = 0; i < vocab_size; i++) {
-        if (strcmp(str, vocab[i]) == 0) {
-            return i;
-        }
-    }
-    return -1;
+typedef struct {
+    char *str;
+    int id;
+} TokenIndex;
+
+int compare_tokens(const void *a, const void *b) {
+    return strcmp(((TokenIndex*)a)->str, ((TokenIndex*)b)->str);
+}
+
+int str_lookup(char *str, TokenIndex *sorted_vocab, int vocab_size) {
+    // efficiently find the perfect match for str in vocab, return its index or -1 if not found
+    TokenIndex tok = { .str = str }; // acts as the key to search for
+    TokenIndex *res = bsearch(&tok, sorted_vocab, vocab_size, sizeof(TokenIndex), compare_tokens);
+    return res != NULL ? res->id : -1;
 }

 void bpe_encode(char *text, char **vocab, float *vocab_scores, int vocab_size, unsigned int max_token_length, int *tokens, int *n_tokens) {

-    // a temporary buffer to merge two consecutive tokens
-    char* str_buffer = malloc((max_token_length*2+1) * sizeof(char)); // *2 for concat, +1 for null terminator
+    // sort vocabulary
+    TokenIndex *sorted_vocab = malloc(vocab_size * sizeof(TokenIndex));
+    for (int i = 0; i < vocab_size; i++) {
+        sorted_vocab[i].str = vocab[i];
+        sorted_vocab[i].id = i;
+    }
+    qsort(sorted_vocab, vocab_size, sizeof(TokenIndex), compare_tokens);

-    // first encode every individual byte in the input string
-    *n_tokens = 0; // the number of tokens
+    // create a temporary buffer that will store merge candidates of always two consecutive tokens
+    char* str_buffer = malloc((max_token_length*2 +1 +2) * sizeof(char)); // *2 for concat, +1 for null terminator, +2 for UTF8 (in case max_token_length is 1)
+    size_t str_len = 0;
+
+    // add_dummy_prefix is true by default
+    tokens[0] = str_lookup(" ", sorted_vocab, vocab_size);
+    *n_tokens = 1; // the number of tokens
+
+    // Okay UTF-8 time. This will get messy. Here is the reference from Wikipedia:
+    // Code point ↔ UTF-8 conversion
+    // First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
+    // U+0000	U+007F	    0xxxxxxx
+    // U+0080	U+07FF	    110xxxxx	10xxxxxx
+    // U+0800	U+FFFF	    1110xxxx	10xxxxxx	10xxxxxx
+    // U+10000	U+10FFFF    11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
+
+    // process the raw (UTF-8) byte sequence of the input string
    for (char *c = text; *c != '\0'; c++) {
-        sprintf(str_buffer, "%c", *c);
-        int id = str_lookup(str_buffer, vocab, vocab_size);
-        if (id == -1) { fprintf(stderr, "not good\n"); exit(EXIT_FAILURE); }
-        tokens[*n_tokens] = id;
-        (*n_tokens)++;
+
+        // reset buffer if the current byte is ASCII or a leading byte
+        // 0xC0 is 11000000, so (*c & 0xC0) keeps the first 2 bits and zeros the rest
+        // 0x80 is 10000000
+        // in UTF-8, all continuation bytes start with "10" in first two bits
+        // so in English this is: "if this byte is not a continuation byte"
+        if ((*c & 0xC0) != 0x80) {
+            // this byte must be either a leading byte (11...) or an ASCII char (0...)
+            // => reset our location, as we're starting a new UTF-8 codepoint
+            str_len = 0;
+        }
+
+        // append the current byte to the buffer
+        str_buffer[str_len++] = *c; // ++ is post-increment, incremented after this line
+        str_buffer[str_len] = '\0';
+
+        // while the next character is a continuation byte, continue appending
+        // but if there are too many of them, just stop to avoid overrunning str_buffer size.
+        if ((*(c+1) & 0xC0) == 0x80 && str_len < 4) {
+            continue;
+        }
+
+        // ok c+1 is not a continuation byte, so we've read in a full codepoint
+        int id = str_lookup(str_buffer, sorted_vocab, vocab_size);
+
+        if (id != -1) {
+            // we found this codepoint in vocab, add it as a token
+            tokens[(*n_tokens)++] = id;
+        } else {
+            // byte_fallback encoding: just encode each byte as a token
+            // +3 is here because the first 3 vocab elements are <unk>, <s>, </s>
+            // so the individual bytes only start at index 3
+            for (int i=0; i < str_len; i++) {
+                tokens[(*n_tokens)++] = (unsigned char)str_buffer[i] + 3;
+            }
+        }
+        str_len = 0; // protect against a sequence of stray UTF8 continuation bytes
    }

    // merge the best consecutive pair each iteration, according the scores in vocab_scores
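Reviewer note on the UTF-8 loop above: the logic groups raw bytes into codepoints by testing for continuation bytes, looks each codepoint up in the vocabulary, and falls back to one token per byte (offset by 3) when the lookup fails. A Python sketch of that first pass is shown below. The names `split_utf8_codepoints` and `encode_first_pass`, and the use of a `dict` keyed by bytes as the vocabulary, are assumptions for illustration; the dummy-prefix token and the later merge loop are left out.

```python
def split_utf8_codepoints(data):
    """Group raw bytes into codepoint-sized chunks (at most 4 bytes each)."""
    chunks = []
    buf = bytearray()
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:
            # ASCII or leading byte: start a new codepoint
            buf.clear()
        buf.append(b)
        nxt = data[i + 1] if i + 1 < len(data) else 0
        if (nxt & 0xC0) == 0x80 and len(buf) < 4:
            continue  # next byte is a continuation byte, keep appending
        chunks.append(bytes(buf))
        buf.clear()   # guards against runs of stray continuation bytes
    return chunks


def encode_first_pass(text, vocab):
    """Map each codepoint to a token id, with byte fallback (+3 offset)."""
    tokens = []
    for chunk in split_utf8_codepoints(text.encode("utf-8")):
        if chunk in vocab:
            tokens.append(vocab[chunk])
        else:
            # first 3 vocab entries are <unk>, <s>, </s>; bytes start at 3
            tokens.extend(b + 3 for b in chunk)
    return tokens
```

For example, with a vocabulary that only knows `b"a"`, the two bytes of "é" (0xC3, 0xA9) come out as the byte-fallback ids 198 and 172.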
@@ -376,7 +527,7 @@ void bpe_encode(char *text, char **vocab, float *vocab_scores, int vocab_size, u
        for (int i=0; i < (*n_tokens-1); i++) {
            // check if we can merge the pair (tokens[i], tokens[i+1])
            sprintf(str_buffer, "%s%s", vocab[tokens[i]], vocab[tokens[i+1]]);
-            int id = str_lookup(str_buffer, vocab, vocab_size);
+            int id = str_lookup(str_buffer, sorted_vocab, vocab_size);
            if (id != -1 && vocab_scores[id] > best_score) {
                // this merge pair exists in vocab! record its score and position
                best_score = vocab_scores[id];
@@ -399,6 +550,7 @@ void bpe_encode(char *text, char **vocab, float *vocab_scores, int vocab_size, u
    }

    free(str_buffer);
+    free(sorted_vocab);
 }

 // ----------------------------------------------------------------------------
@@ -465,17 +617,24 @@ int sample_topp(float* probabilities, int n, float topp, ProbIndex* probindex) {
    // tokens that exceed probability topp. This way we never sample tokens that
    // have very low probabilities and are less likely to go "off the rails".

+    int n0 = 0;
    // quicksort indices in descending order of probabilities
+    // values smaller than (1 - topp) / (n - 1) cannot be part of the result
+    // so for efficiency we crop these out as candidates before sorting
+    const float cutoff = (1.0f - topp) / (n - 1);
    for (int i = 0; i < n; i++) {
-        probindex[i].index = i;
-        probindex[i].prob = probabilities[i];
+        if (probabilities[i] >= cutoff) {
+            probindex[n0].index = i;
+            probindex[n0].prob = probabilities[i];
+            n0++;
+        }
    }
-    qsort(probindex, n, sizeof(ProbIndex), compare);
+    qsort(probindex, n0, sizeof(ProbIndex), compare);

    // truncate the list where cumulative probability exceeds topp
    float cumulative_prob = 0.0f;
-    int last_idx = 0;
-    for (int i = 0; i < n; i++) {
+    int last_idx = n0 - 1; // in case of rounding errors consider all elements
+    for (int i = 0; i < n0; i++) {
        cumulative_prob += probindex[i].prob;
        if (cumulative_prob > topp) {
            last_idx = i;
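Reviewer note on the `sample_topp` change above: candidates below `(1 - topp) / (n - 1)` are dropped before sorting, and `last_idx` now defaults to the last candidate in case rounding keeps the cumulative sum from exceeding `topp`. The sketch below reproduces just the candidate selection in Python so the filtered and unfiltered variants can be compared. `nucleus_indices` and its `prefilter` flag are hypothetical, and the sketch assumes `n > 1`.

```python
def nucleus_indices(probs, topp, prefilter=True):
    """Return the indices kept by top-p truncation, most probable first."""
    n = len(probs)
    # tokens below this cutoff are cropped before sorting (requires n > 1)
    cutoff = (1.0 - topp) / (n - 1) if prefilter else 0.0
    cand = [(p, i) for i, p in enumerate(probs) if p >= cutoff]
    cand.sort(key=lambda t: -t[0])        # descending by probability
    cumulative = 0.0
    last = len(cand) - 1                  # fall back to all candidates
    for k, (p, _) in enumerate(cand):
        cumulative += p
        if cumulative > topp:
            last = k
            break
    return [i for _, i in cand[:last + 1]]
```

On a small distribution the two variants select the same tokens, while the filtered one sorts a shorter list; with a large vocabulary most entries fall below the cutoff, which is where the saving comes from.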
@@ -504,10 +663,11 @@ void error_usage() {
    fprintf(stderr, "Example: run model.bin -n 256 -i \"Once upon a time\"\n");
    fprintf(stderr, "Options:\n");
    fprintf(stderr, "  -t <float>  temperature, default 1.0\n");
-    fprintf(stderr, "  -p <float>  p value in top-p (nucleus) sampling. default 0.9, 0 = off\n");
+    fprintf(stderr, "  -p <float>  p value in top-p (nucleus) sampling. default 0.9\n");
    fprintf(stderr, "  -s <int>    random seed, default time(NULL)\n");
    fprintf(stderr, "  -n <int>    number of steps to run for, default 256. 0 = max_seq_len\n");
    fprintf(stderr, "  -i <string> input prompt\n");
+    fprintf(stderr, "  -z <string> optional path to custom tokenizer\n");
    exit(EXIT_FAILURE);
 }

@@ -515,8 +675,9 @@ int main(int argc, char *argv[]) {

    // default inits
    char *checkpoint = NULL;  // e.g. out/model.bin
+    char *tokenizer = "tokenizer.bin";
    float temperature = 1.0f; // 0.0 = greedy deterministic. 1.0 = original. don't set higher
-    float topp = 0.9f;        // top-p in nucleus sampling
+    float topp = 0.9f;        // top-p in nucleus sampling. 1.0 = off. 0.9 works well, but slower
    rng_seed = 0; // seed rng with time by default
    int steps = 256;          // number of steps to run for
    char *prompt = NULL;      // prompt string
@@ -534,46 +695,63 @@ int main(int argc, char *argv[]) {
        else if (argv[i][1] == 's') { rng_seed = atoi(argv[i + 1]); }
        else if (argv[i][1] == 'n') { steps = atoi(argv[i + 1]); }
        else if (argv[i][1] == 'i') { prompt = argv[i + 1]; }
+        else if (argv[i][1] == 'z') { tokenizer = argv[i + 1]; }
        else { error_usage(); }
    }
-    if(rng_seed == 0) { rng_seed =  (unsigned int)time(NULL);}
+    // input validations
+    // our rng cannot accept 0 as a seed, so might as well use time(NULL) as default
+    if(rng_seed == 0) { rng_seed = (unsigned int)time(NULL); }

    // read in the model.bin file
    Config config;
    TransformerWeights weights;
    int fd = 0;         // file descriptor for memory mapping
-    float* data = NULL; // memory mapped data pointer
-    ssize_t file_size;     // size of the checkpoint file in bytes
+    void* data = NULL; // memory mapped data pointer
+    ssize_t file_size;  // size of the checkpoint file in bytes
    {
+        // first "peek" at the checkpoint and extract metadata
        FILE *file = fopen(checkpoint, "rb");
        if (!file) { fprintf(stderr, "Couldn't open file %s\n", checkpoint); return 1; }
-        // read in the config header
+        // read in magic number (uint32), has to be 0x616b3432, i.e. "ak42" in ASCII
+        uint32_t magic_number;
+        if (fread(&magic_number, sizeof(uint32_t), 1, file) != 1) { return 1; }
+        if (magic_number != 0x616b3432) { fprintf(stderr, "Bad magic number\n"); return 1; }
+        // read in the version number (int32), has to be 1
+        int version;
+        if (fread(&version, sizeof(int), 1, file) != 1) { return 1; }
+        if (version != 1) { fprintf(stderr, "Bad version number\n"); return 1; }
+        int header_size = 256; // the header size for version 1 in bytes
+        // read in the Config
        if (fread(&config, sizeof(Config), 1, file) != 1) { return 1; }
-        // negative vocab size is hacky way of signaling unshared weights. bit yikes.
-        int shared_weights = config.vocab_size > 0 ? 1 : 0;
-        config.vocab_size = abs(config.vocab_size);
-        // figure out the file size
+        // read in flags
+        uint8_t shared_classifier; // a byte to indicate if the classifier is shared
+        if (fread(&shared_classifier, sizeof(uint8_t), 1, file) != 1) { return 1; }
+        int group_size; // the group size used in quantization
+        if (fread(&group_size, sizeof(int), 1, file) != 1) { return 1; }
+        GS = group_size; // set as global, as it will be used in many places
+        // seek all the way to the end to figure out the full file size
        fseek(file, 0, SEEK_END); // move file pointer to end of file
        file_size = ftell(file); // get the file size, in bytes
        fclose(file);
-        // memory map the Transformer weights into the data pointer
+
+        // now memory map the Transformer weights into the data pointer
        fd = open(checkpoint, O_RDONLY); // open in read only mode
        if (fd == -1) { fprintf(stderr, "open failed!\n"); return 1; }
        data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { fprintf(stderr, "mmap failed!\n"); return 1; }
-        float* weights_ptr = data + sizeof(Config)/sizeof(float);
-        checkpoint_init_weights(&weights, &config, weights_ptr, shared_weights);
+        void* weights_ptr = (char*)data + header_size; // skip header bytes. char is 1 byte
+        checkpoint_init_weights(&weights, &config, weights_ptr, shared_classifier);
    }
-    // right now we cannot run for more than config.seq_len steps
+    // we should not run for more than config.seq_len steps
    if (steps <= 0 || steps > config.seq_len) { steps = config.seq_len; }

-    // read in the tokenizer.bin file
+    // read in the tokenizer .bin file
    char** vocab = (char**)malloc(config.vocab_size * sizeof(char*));
    float* vocab_scores = (float*)malloc(config.vocab_size * sizeof(float));
    unsigned int max_token_length;
    {
-        FILE *file = fopen("tokenizer.bin", "rb");
-        if (!file) { fprintf(stderr, "couldn't load tokenizer.bin\n"); return 1; }
+        FILE *file = fopen(tokenizer, "rb");
+        if (!file) { fprintf(stderr, "couldn't load %s\n", tokenizer); return 1; }
        if (fread(&max_token_length, sizeof(int), 1, file) != 1) { fprintf(stderr, "failed read\n"); return 1; }
        int len;
        for (int i = 0; i < config.vocab_size; i++) {
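
Note on the version 1 header read in main() above: the reads happen back to back (magic, version, Config, one flag byte, group size), and the weights start at byte 256. Below is a minimal Python sketch of that layout, useful for sanity-checking an exported file. It assumes little-endian data and a Config of seven int32 fields; the field names and their order are illustrative assumptions, since the struct is not shown in this hunk. parse_header and build_header are hypothetical helpers, not part of the repo.

```python
import struct

MAGIC = 0x616b3432   # magic number checked in run.c above
HEADER_SIZE = 256    # version 1 header size in bytes, per the diff
# ASSUMPTION: Config is seven int32 fields; names and order are illustrative only
CONFIG_FIELDS = ("dim", "hidden_dim", "n_layers", "n_heads",
                 "n_kv_heads", "vocab_size", "seq_len")

def parse_header(buf: bytes) -> dict:
    """Parse the fixed-size header; raises ValueError on a bad file."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("file shorter than header")
    magic, version = struct.unpack_from("<Ii", buf, 0)
    if magic != MAGIC:
        raise ValueError("Bad magic number")
    if version != 1:
        raise ValueError("Bad version number")
    offset = 8
    values = struct.unpack_from("<7i", buf, offset)
    offset += 7 * 4
    (shared_classifier,) = struct.unpack_from("<B", buf, offset)  # one flag byte
    offset += 1
    (group_size,) = struct.unpack_from("<i", buf, offset)         # quantization group size
    return {
        "config": dict(zip(CONFIG_FIELDS, values)),
        "shared_classifier": bool(shared_classifier),
        "group_size": group_size,
        "weights_offset": HEADER_SIZE,  # weights begin right after the padded header
    }

def build_header(config_values, shared_classifier, group_size) -> bytes:
    """Hypothetical writer used only to exercise parse_header."""
    body = struct.pack("<Ii7iBi", MAGIC, 1, *config_values,
                       int(shared_classifier), group_size)
    return body + b"\x00" * (HEADER_SIZE - len(body))  # zero-pad to 256 bytes
```

The unpadded "<" formats mirror the sequential fread calls, which do not insert alignment padding between fields.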
@@ -594,7 +772,7 @@ int main(int argc, char *argv[]) {
    int *prompt_tokens = NULL;
    int num_prompt_tokens = 0;
    if (prompt != NULL) {
-        prompt_tokens = (int*)malloc(strlen(prompt) * sizeof(int));
+        prompt_tokens = (int*)malloc((strlen(prompt)+1) * sizeof(int));
        bpe_encode(prompt, vocab, vocab_scores, config.vocab_size, max_token_length, prompt_tokens, &num_prompt_tokens);
    }

@@ -623,7 +801,7 @@ int main(int argc, char *argv[]) {
                // apply softmax to the logits to get the probabilities for next token
                softmax(state.logits, config.vocab_size);
                // we sample from this distribution to get the next token
-                if (topp <= 0) {
+                if (topp <= 0 || topp >= 1) {
                    // simply sample from the predicted probability distribution
                    next = sample(state.logits, config.vocab_size);
                } else {
@@ -639,7 +817,20 @@ int main(int argc, char *argv[]) {

        // following BOS (1) token, sentencepiece decoder strips any leading whitespace (see PR #89)
        char *token_str = (token == 1 && vocab[next][0] == ' ') ? vocab[next]+1 : vocab[next];
-        printf("%s", token_str);
+        // careful, some tokens designate raw bytes, and look like e.g. '<0x01>'
+        unsigned char byte_val;
+        if (sscanf(token_str, "<0x%02hhX>", &byte_val) == 1) {
+            // ok this token is a raw byte token, careful to only print printable chars or whitespace
+            // some of the other bytes can be various control codes, backspace, etc. => skip
+            if (isprint(byte_val) || isspace(byte_val)) {
+                char byte_piece[2];
+                byte_piece[0] = byte_val;
+                byte_piece[1] = '\0';
+                printf("%s", byte_piece);
+            }
+        } else {
+            printf("%s", token_str);
+        }
        fflush(stdout);
        token = next;

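
Note on the raw byte token handling above: the same filtering can be sketched in Python. This is a hedged illustration rather than the repo's code; piece_to_text and is_printable_or_space are hypothetical names. The regex here is stricter than the sscanf call, because it requires exactly two hex digits and the closing '>'. The printable and whitespace ranges are those of the C locale.

```python
import re

# matches pieces such as '<0x0A>'; stricter than the sscanf pattern in run.c
_BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")

def is_printable_or_space(b: int) -> bool:
    """C-locale isprint() (0x20..0x7E) or isspace() (tab, LF, VT, FF, CR, space)."""
    return 0x20 <= b <= 0x7E or b in (0x09, 0x0A, 0x0B, 0x0C, 0x0D)

def piece_to_text(token_str: str) -> str:
    """Return the text to print for one token piece."""
    m = _BYTE_TOKEN.fullmatch(token_str)
    if m is None:
        return token_str              # ordinary token, print as-is
    b = int(m.group(1), 16)
    # control codes (backspace, bell, ...) are skipped entirely
    return chr(b) if is_printable_or_space(b) else ""
```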
@@ -89,6 +89,27 @@
        "cmd = f'./run {model_file} -t {temperature} -p {top_p} -n {max_token} -i \"{prompt}\"'\n",
        "!{cmd}"
      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "#@title Run Meta's Llama 2 models\n",
+        "\n",
+        "#@markdown input your huggingface [access token](https://huggingface.co/settings/tokens) to download Meta's Llama 2 models.\n",
+        "\n",
+        "from huggingface_hub import snapshot_download\n",
+        "\n",
+        "token = \"replace your huggingface access token\" #@param {type:\"string\"}\n",
+        "path = snapshot_download(repo_id=\"meta-llama/Llama-2-7b\",cache_dir=\"Llama-2-7b\", use_auth_token=token)\n",
+        "\n",
+        "!python export_meta_llama_bin.py $path llama2_7b.bin\n",
+        "\n",
+        "print(\"./run llama2_7b.bin\\n\")\n",
+        "!./run llama2_7b.bin"
+      ]
    }
  ],
  "metadata": {
@@ -5,17 +5,19 @@ import os
 import pickle
 from contextlib import nullcontext
 import torch
-import tiktoken
 from model import ModelArgs, Transformer
 from tokenizer import Tokenizer

+from tinystories import get_tokenizer_model_path
+
 # -----------------------------------------------------------------------------
-out_dir = 'out' # ignored if init_from is not 'resume'
+checkpoint = 'out/ckpt.pt'
 start = "" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
 num_samples = 1 # number of samples to draw
 max_new_tokens = 100 # number of tokens generated in each sample
 temperature = 1.0 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
 top_k = 300 # retain only the top_k most likely tokens, clamp others to have 0 probability
+tokenizer = "" # override the tokenizer model path
 seed = 1337
 device = 'cuda' if torch.cuda.is_available() else 'cpu' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
 #dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32' or 'bfloat16' or 'float16'
@@ -33,11 +35,10 @@ ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torc
 ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

 # init from a model saved in a specific directory
-ckpt_path = os.path.join(out_dir, 'ckpt.pt')
-checkpoint = torch.load(ckpt_path, map_location=device)
-gptconf = ModelArgs(**checkpoint['model_args'])
+checkpoint_dict = torch.load(checkpoint, map_location=device)
+gptconf = ModelArgs(**checkpoint_dict['model_args'])
 model = Transformer(gptconf)
-state_dict = checkpoint['model']
+state_dict = checkpoint_dict['model']
 unwanted_prefix = '_orig_mod.'
 for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
@@ -51,7 +52,16 @@ if compile:
    model = torch.compile(model) # requires PyTorch 2.0 (optional)

 # load the tokenizer
-enc = Tokenizer()
+vocab_source = checkpoint_dict.get("vocab_source", "llama2")
+vocab_size = gptconf.vocab_size
+if tokenizer:
+    # a specific tokenizer is provided, use it
+    tokenizer_model = tokenizer
+else:
+    # let's try to find the tokenizer model automatically. bit gross here...
+    query_vocab_size = 0 if vocab_source == "llama2" else vocab_size
+    tokenizer_model = get_tokenizer_model_path(vocab_size=query_vocab_size)
+enc = Tokenizer(tokenizer_model=tokenizer_model)

 # encode the beginning of the prompt
 if start.startswith('FILE:'):
@@ -4,37 +4,71 @@ $ pytest
 """
 import os
 import pytest # pip install pytest
+import requests
 import subprocess

+
 import torch
 from model import ModelArgs, Transformer
+from tokenizer import Tokenizer

-def test_argmax_inference():
-    """
-    Only the simplest test for now: run inference with temperature 0 
-    (for determinism) in both C and PyTorch, and see that the sampled tokens 
-    are the same.
-    """
-    test_ckpt_dir = "out" # TODO create a dummy test checkpoint for this?
+# -----------------------------------------------------------------------------
+# test utilities

-    # run C version
-    model_path = os.path.join(test_ckpt_dir, "model.bin")
-    command = ["./run", model_path, "0.0"]
-    proc = subprocess.Popen(command, stdout=subprocess.PIPE)
-    c_tokens = []
-    for line in proc.stdout:
-        token = int(line.decode('utf-8').strip())
-        c_tokens.append(token)
-    proc.wait()
-    #print(c_tokens)
+test_ckpt_dir = "test"

-    # run PyTorch version
-    device = "cuda" if torch.cuda.is_available() else "cpu"
-    ckpt_path = os.path.join(test_ckpt_dir, "ckpt.pt")
-    checkpoint = torch.load(ckpt_path, map_location=device)
-    gptconf = ModelArgs(**checkpoint['model_args'])
+def download_file(url, filename):
+    print(f"Downloading {url} to {filename}")
+    response = requests.get(url, stream=True)
+    response.raise_for_status() # Raise an HTTPError on bad status code
+    with open(filename, 'wb') as file:
+        for chunk in response.iter_content(chunk_size=8192):
+            file.write(chunk)
+
+def attempt_download_files():
+    os.makedirs(test_ckpt_dir, exist_ok=True)
+    root_url = "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K"
+    need = ["stories260K.bin", "stories260K.pt", "tok512.bin", "tok512.model"]
+    for file in need:
+        url = root_url + '/' + file   #os.path.join inserts \\ on windows
+        filename = os.path.join(test_ckpt_dir, file)
+        if not os.path.exists(filename):
+            download_file(url, filename)
+
+expected_stdout = b'Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, she saw a big, red ball. She wanted to play with it, but it was too high.\nLily\'s mom said, "Lily, let\'s go to the park." Lily was sad and didn\'t know what to do. She said, "I want to play with your ball, but I can\'t find it."\nLily was sad and didn\'t know what to do. She said, "I\'m sorry, Lily. I didn\'t know what to do."\nLily didn\'t want to help her mom, so she'
+
+# -----------------------------------------------------------------------------
+# actual tests
+
+def test_runc():
+    """ Forwards a model against a known-good desired outcome in run.c for 200 steps"""
+    attempt_download_files()
+
+    model_path = os.path.join(test_ckpt_dir, "stories260K.bin")
+    tokenizer_path = os.path.join(test_ckpt_dir, "tok512.bin")
+    command = ["./run", model_path, "-z", tokenizer_path, "-t", "0.0", "-n", "200"]
+    with open('err.txt', mode='wb') as fe:
+        with open('stdout.txt', mode='wb') as fo:
+            proc = subprocess.Popen(command, stdout=fo, stderr=fe)  #pipe in windows terminal does funny things like replacing \n with \r\n
+            proc.wait()
+
+    with open('stdout.txt', mode='r') as f:
+        stdout = f.read()
+    # strip the very last \n that is added by run.c for aesthetic reasons
+    stdout = stdout[:-1].encode('ascii')
+
+    assert stdout == expected_stdout
+
+def test_python():
+    """ Forwards a model against a known-good desired outcome in sample.py for 200 steps"""
+    attempt_download_files()
+
+    device = "cpu" # stories260K is small enough to just breeze through it on CPU
+    checkpoint = os.path.join(test_ckpt_dir, "stories260K.pt")
+    checkpoint_dict = torch.load(checkpoint, map_location=device)
+    gptconf = ModelArgs(**checkpoint_dict['model_args'])
    model = Transformer(gptconf)
-    state_dict = checkpoint['model']
+    state_dict = checkpoint_dict['model']
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
@@ -44,10 +78,12 @@ def test_argmax_inference():
    model.to(device)
    x = torch.tensor([[1]], dtype=torch.long, device=device) # 1 is BOS
    with torch.inference_mode():
-        y = model.generate(x, max_new_tokens=gptconf.max_seq_len, temperature=0.0)
+        y = model.generate(x, max_new_tokens=200, temperature=0.0)
    pt_tokens = y[0].tolist()
-    pt_tokens = pt_tokens[1:] # remove BOS
-    #print(pt_tokens)

-    # compare
-    assert c_tokens == pt_tokens
+    tokenizer_model = os.path.join(test_ckpt_dir, "tok512.model")
+    enc = Tokenizer(tokenizer_model=tokenizer_model)
+    text = enc.decode(pt_tokens)
+    text = text.encode('ascii') # turn into bytes
+
+    assert text == expected_stdout
@@ -1,140 +0,0 @@
-"""
-Download, preprocess and serve the TinyShakespeare dataset as a DataLoader.
-
-Follows the same interface as the TinyStories dataset.
-"""
-
-import argparse
-import os
-import random
-
-import numpy as np
-import requests
-import torch
-import torch.distributed as dist
-from tqdm import tqdm
-
-from tokenizer import Tokenizer
-
-DATA_CACHE_DIR = "data"
-
-def download_file(url: str, fname: str, chunk_size=1024):
-    """Helper function to download a file from a given url"""
-    resp = requests.get(url, stream=True)
-    total = int(resp.headers.get("content-length", 0))
-    with open(fname, "wb") as file, tqdm(
-        desc=fname,
-        total=total,
-        unit="iB",
-        unit_scale=True,
-        unit_divisor=1024,
-    ) as bar:
-        for data in resp.iter_content(chunk_size=chunk_size):
-            size = file.write(data)
-            bar.update(size)
-
-
-def download():
-    """Downloads the dataset to disk."""
-    os.makedirs(DATA_CACHE_DIR, exist_ok=True)
-
-    # download the TinyShakespeare dataset, unless it's already downloaded
-    data_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
-    data_filename = os.path.join(DATA_CACHE_DIR, "tinyshakespeare.txt")
-    if not os.path.exists(data_filename):
-        print(f"Downloading {data_url} to {data_filename}...")
-        download_file(data_url, data_filename)
-    else:
-        print(f"{data_filename} already exists, skipping download...")
-
-    print("Download done.")
-
-def pretokenize():
-    enc = Tokenizer()
-
-    data_file = os.path.join(DATA_CACHE_DIR, "tinyshakespeare.txt")
-
-    all_tokens = []
-    with open(data_file, "r") as f:
-        for line in f:
-            text = line.strip()
-            tokens = enc.encode(text, bos=True, eos=False)
-            all_tokens.extend(tokens)
-    all_tokens = np.array(all_tokens, dtype=np.uint16)
-    print(f"Total tokens: {len(all_tokens)}")
-    with open(data_file.replace(".txt", ".bin"), "wb") as f:
-        f.write(all_tokens.tobytes())
-    print(f"Saved {data_file.replace('.txt', '.bin')}")
-    print("Done.")
-
-
-class PretokDataset(torch.utils.data.IterableDataset):
-    """Loads pretokenized examples from disk and yields them as PyTorch tensors."""
-
-    def __init__(self, split, max_seq_len):
-        super().__init__()
-        self.split = split
-        self.max_seq_len = max_seq_len
-
-    def __iter__(self):
-        # get worker info within a DataLoader
-        worker_info = torch.utils.data.get_worker_info()
-        worker_id = worker_info.id if worker_info else 0
-        # get DDP rank info
-        rank = dist.get_rank() if dist.is_initialized() else 0
-        # combine the worker_id and worker_rank to create a unique seed for rng
-        seed = 42 + worker_id + 1337 * rank
-        rng = random.Random(seed)
-        print(f"Created a PretokDataset with rng seed {seed}")
-        data_file = os.path.join(DATA_CACHE_DIR, "tinyshakespeare.bin")
-        m_all = np.memmap(data_file, dtype=np.uint16, mode="r")
-
-        # split out 10% of the data for validation
-        split_ix = int(len(m_all) * 0.9)
-        if self.split == "train":
-            m = m_all[:split_ix]
-        else:
-            m = m_all[split_ix:]
-
-        num_batches = len(m) // self.max_seq_len
-        num_batches -= 1  # drop the last partial batch
-        assert num_batches > 0, "this split is way too small? investigate."
-
-        while True:
-            ixs = list(range(num_batches))
-            rng.shuffle(ixs)
-            for ix in ixs:
-                start = ix * self.max_seq_len
-                end = start + self.max_seq_len + 1
-                # calling .astype will copy the data into a new numpy array, now in RAM
-                chunk = torch.from_numpy((m[start:end]).astype(np.int64))
-                x = chunk[:-1]
-                y = chunk[1:]
-                yield x, y
-
-
-class ShakespeareTask:
-
-    @staticmethod
-    def iter_batches(split, batch_size, max_seq_len, device, num_workers=0):
-        ds = PretokDataset(split, max_seq_len)
-        dl = torch.utils.data.DataLoader(
-            ds, batch_size=batch_size, pin_memory=True, num_workers=num_workers
-        )
-        for x, y in dl:
-            x = x.to(device, non_blocking=True)
-            y = y.to(device, non_blocking=True)
-            yield x, y
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument("stage", type=str, choices=["download", "train_tokenizer", "pretokenize"])
-    args = parser.parse_args()
-
-    # depending on the stage call the appropriate function
-    fun = {
-        "download": download,
-        "pretokenize": pretokenize,
-    }
-    fun[args.stage]()
@@ -9,6 +9,7 @@ import os
 import random
 from typing import List
 from concurrent.futures import ProcessPoolExecutor
+from functools import partial

 import numpy as np
 import requests
@@ -37,7 +38,7 @@ def download_file(url: str, fname: str, chunk_size=1024):


 def download():
-    """Downloads the dataset to disk."""
+    """Downloads the TinyStories dataset to DATA_CACHE_DIR"""
    os.makedirs(DATA_CACHE_DIR, exist_ok=True)

    # download the TinyStories dataset, unless it's already downloaded
@@ -66,10 +67,61 @@ def download():
    print(f"Number of shards: {len(shard_filenames)}")
    print(f"Example story:\n{data[0]}")

+def train_vocab(vocab_size):
+    """
+    Trains a custom sentencepiece tokenizer on the TinyStories dataset.
+    The custom tokenizer files will be saved in DATA_CACHE_DIR/tok{N} directories,
+    where N is the vocab size. This is also where the pretok .bin files will go.
+    """
+    assert vocab_size > 0, "Vocab size must be positive"

-def process_shard(args):
+    # output file prefix path for sentencepiece
+    prefix = os.path.join(DATA_CACHE_DIR, f"tok{vocab_size}")
+
+    # how many shards we'll use for vocab training, kept low for efficiency
+    num_shards = 10
+
+    # 1) export a large chunk of text as a single text file tiny.txt
+    tiny_file = os.path.join(DATA_CACHE_DIR, "tiny.txt")
+    data_dir = os.path.join(DATA_CACHE_DIR, "TinyStories_all_data")
+    shard_filenames = sorted(glob.glob(os.path.join(data_dir, "*.json")))
+
+    print(f"Writing temporary file {tiny_file} with {num_shards} shards...")
+    with open(tiny_file, "w") as of:
+        for shard in tqdm(shard_filenames[:num_shards]):
+            with open(shard, "r") as f:
+                data = json.load(f)
+            for example in data:
+                text = example["story"]
+                text = text.strip()
+                of.write(text + "\n")
+    print(f"Size is: {os.path.getsize(tiny_file) / 1024 / 1024:.2f} MB")
+
+    # 2) run the train_vocab.sh script that trains the sentencepiece model
+    print("Will now train the vocab with:")
+    cmd = f"bash train_vocab.sh {tiny_file} {prefix} {vocab_size}"
+    print(cmd)
+    print("OK? [y/N] ")
+    dec = input()
+    if dec.lower() != "y":
+        print("Exiting...")
+        return
+    os.system(cmd)
+
+    # 3) optional cleanup, ask the user if they'd like to delete tiny.txt
+    dec = input(f"Delete the temporary file {tiny_file}? [y/N] ")
+    if dec.lower() == "y":
+        os.remove(tiny_file)
+        print(f"Deleted {tiny_file}")
+
+    print(f"Trained tokenizer is in {prefix}.model")
+    print("Done.")
+
+
+def process_shard(args, vocab_size):
    shard_id, shard = args
-    enc = Tokenizer()
+    tokenizer_model = get_tokenizer_model_path(vocab_size)
+    enc = Tokenizer(tokenizer_model)
    with open(shard, "r") as f:
        data = json.load(f)
    all_tokens = []
@@ -80,31 +132,49 @@ def process_shard(args):
        all_tokens.extend(tokens)
    # convert to uint16 nparray
    all_tokens = np.array(all_tokens, dtype=np.uint16)
-    # write to disk
-    tokenized_filename = shard.replace(".json", ".bin")
+    # calculate the output filename
+    if vocab_size == 0:
+        # if we're using Llama 2, just save the tokenized file in the same dir
+        tokenized_filename = shard.replace(".json", ".bin")
+    else:
+        # save .bin files into a new tok{N} directory
+        bin_dir = os.path.join(DATA_CACHE_DIR, f"tok{vocab_size}")
+        shard_basename = os.path.basename(shard)
+        bin_basename = shard_basename.replace(".json", ".bin")
+        tokenized_filename = os.path.join(bin_dir, bin_basename)
+    # write the bytes
    with open(tokenized_filename, "wb") as f:
        f.write(all_tokens.tobytes())
-    print(f"Saved {tokenized_filename}")
+    # calculate the average sequence length (they are separated by BOS=1)
+    avg_seq_len = all_tokens.size / ((all_tokens == 1).sum())
+    print(f"Saved {tokenized_filename}, average seqlen: {avg_seq_len:.2f}")


-def pretokenize():
+def pretokenize(vocab_size):
    # iterate the shards and tokenize all of them one by one
    data_dir = os.path.join(DATA_CACHE_DIR, "TinyStories_all_data")
    shard_filenames = sorted(glob.glob(os.path.join(data_dir, "*.json")))
+    if vocab_size > 0:
+        # .bin files will be saved into tok{N} directory, create it once here
+        bin_dir = os.path.join(DATA_CACHE_DIR, f"tok{vocab_size}")
+        os.makedirs(bin_dir, exist_ok=True)

    # process all the shards in a process pool
+    fun = partial(process_shard, vocab_size=vocab_size)
    with ProcessPoolExecutor() as executor:
-        executor.map(process_shard, enumerate(shard_filenames))
+        executor.map(fun, enumerate(shard_filenames))
    print("Done.")


 class PretokDataset(torch.utils.data.IterableDataset):
    """Loads pretokenized examples from disk and yields them as PyTorch tensors."""

-    def __init__(self, split, max_seq_len):
+    def __init__(self, split, max_seq_len, vocab_size, vocab_source):
        super().__init__()
        self.split = split
        self.max_seq_len = max_seq_len
+        self.vocab_size = vocab_size
+        self.vocab_source = vocab_source

    def __iter__(self):
        # get worker info within a DataLoader
@@ -116,8 +186,14 @@ class PretokDataset(torch.utils.data.IterableDataset):
        seed = 42 + worker_id + 1337 * rank
        rng = random.Random(seed)
        print(f"Created a PretokDataset with rng seed {seed}")
-        data_dir = os.path.join(DATA_CACHE_DIR, "TinyStories_all_data")
-        shard_filenames = sorted(glob.glob(os.path.join(data_dir, "*.bin")))
+        if self.vocab_source == "llama2":
+            # the .bin files are right along the .json files
+            bin_dir = os.path.join(DATA_CACHE_DIR, "TinyStories_all_data")
+            shard_filenames = sorted(glob.glob(os.path.join(bin_dir, "*.bin")))
+        elif self.vocab_source == "custom":
+            # the .bin files are in tok{N} directory
+            bin_dir = os.path.join(DATA_CACHE_DIR, f"tok{self.vocab_size}")
+            shard_filenames = sorted(glob.glob(os.path.join(bin_dir, "*.bin")))
        # train/test split. let's use only shard 0 for test split, rest train
        shard_filenames = shard_filenames[1:] if self.split == "train" else shard_filenames[:1]
        while True:
@@ -139,12 +215,25 @@ class PretokDataset(torch.utils.data.IterableDataset):
                    y = chunk[1:]
                    yield x, y

+# -----------------------------------------------------------------------------
+# public interface functions
+
+def get_tokenizer_model_path(vocab_size):
+    """
+    Returns path to the sentencepiece tokenizer model for a given vocab size
+    vocab_size = 0 designates the default Llama 2 tokenizer, in that case
+    None is returned.
+    """
+    if vocab_size == 0:
+        return None
+    else:
+        return os.path.join(DATA_CACHE_DIR, f"tok{vocab_size}.model")

 class Task:

    @staticmethod
-    def iter_batches(split, batch_size, max_seq_len, device, num_workers=0):
-        ds = PretokDataset(split, max_seq_len)
+    def iter_batches(batch_size, device, num_workers=0, **dataset_kwargs):
+        ds = PretokDataset(**dataset_kwargs)
        dl = torch.utils.data.DataLoader(
            ds, batch_size=batch_size, pin_memory=True, num_workers=num_workers
        )
@@ -153,16 +242,33 @@ class Task:
            y = y.to(device, non_blocking=True)
            yield x, y

+# -----------------------------------------------------------------------------
+# CLI for constructing the dataset

 if __name__ == "__main__":
+    """
+    These stages are designed to be run in order.
+
+    To tokenize data with the Llama 2 tokenizer:
+    python tinystories.py download
+    python tinystories.py pretokenize
+
+    To tokenize data with a custom tokenizer we train ourselves with sentencepiece, e.g.:
+    python tinystories.py download
+    python tinystories.py train_vocab --vocab_size=2048
+    python tinystories.py pretokenize --vocab_size=2048
+    """
    parser = argparse.ArgumentParser()
-    parser.add_argument("stage", type=str, choices=["download", "train_tokenizer", "pretokenize"])
+    parser.add_argument("stage", type=str, choices=["download", "pretokenize", "train_vocab"])
+    parser.add_argument("--vocab_size", type=int, default=0, help="pretokenization vocab size. 0 = use Llama 2 tokenizer.")
    args = parser.parse_args()

    # depending on the stage call the appropriate function
-    fun = {
-        "download": download,
-        "pretokenize": pretokenize,
-    }
-    fun[args.stage]()
-
+    if args.stage == "download":
+        download()
+    elif args.stage == "train_vocab":
+        train_vocab(vocab_size=args.vocab_size)
+    elif args.stage == "pretokenize":
+        pretokenize(vocab_size=args.vocab_size)
+    else:
+        raise ValueError(f"Unknown stage {args.stage}")
@@ -4,20 +4,19 @@

 import os
 import struct
-from logging import getLogger
+import argparse
 from typing import List

 from sentencepiece import SentencePieceProcessor

 TOKENIZER_MODEL = "tokenizer.model" # the llama sentencepiece tokenizer model
-TOKENIZER_BIN = "tokenizer.bin" # binary version of the tokenizer for inference in C

 class Tokenizer:
-    def __init__(self):
-        model_path = TOKENIZER_MODEL
+    def __init__(self, tokenizer_model=None):
+        model_path = tokenizer_model if tokenizer_model else TOKENIZER_MODEL
        assert os.path.isfile(model_path), model_path
        self.sp_model = SentencePieceProcessor(model_file=model_path)
-        #print(f"Loaded SentencePiece model from {model_path}")
+        self.model_path = model_path

        # BOS / EOS token IDs
        self.n_words: int = self.sp_model.vocab_size()
@@ -52,24 +51,28 @@ class Tokenizer:
                t = '\n<s>\n'
            elif i == self.eos_id:
                t = '\n</s>\n'
-            elif len(t) == 6 and t.startswith('<0x') and t.endswith('>'):
-                t = chr(int(t[3:5], 16)) # e.g. make '<0x01>' into '\x01'
            t = t.replace('▁', ' ') # sentencepiece uses this character as whitespace
            b = t.encode('utf-8') # bytes of this token, utf-8 encoded

            tokens.append(b)
            scores.append(s)
-        
+
        # record the max token length
        max_token_length = max(len(t) for t in tokens)

        # write to a binary file
-        with open(TOKENIZER_BIN, 'wb') as f:
+        # the tokenizer.bin file is the same as .model file, but .bin
+        tokenizer_bin = self.model_path.replace('.model', '.bin')
+        with open(tokenizer_bin, 'wb') as f:
            f.write(struct.pack("I", max_token_length))
            for bytes, score in zip(tokens, scores):
                f.write(struct.pack("fI", score, len(bytes)))
                f.write(bytes)

 if __name__ == "__main__":
-    t = Tokenizer()
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-t", "--tokenizer-model", type=str, help="optional path to custom tokenizer model")
+    args = parser.parse_args()
+
+    t = Tokenizer(args.tokenizer_model)
    t.export()
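
Note on the tokenizer .bin layout written by export() above: the file is a uint32 max_token_length followed, per token, by a float32 score, a uint32 byte length, and the raw bytes. The token count is not stored, so a reader has to know vocab_size up front (run.c takes it from the Config). Below is a minimal round-trip sketch, assuming little-endian data; write_tokenizer_bin and read_tokenizer_bin are hypothetical helpers, not functions in the repo.

```python
import io
import struct

def write_tokenizer_bin(f, tokens, scores):
    """Write tokens (list of bytes) and scores (list of float) in the export() layout."""
    f.write(struct.pack("<I", max(len(t) for t in tokens)))
    for t, s in zip(tokens, scores):
        f.write(struct.pack("<fI", s, len(t)))  # score, then byte length
        f.write(t)

def read_tokenizer_bin(f, vocab_size):
    """Read back vocab_size entries; the count is not stored in the file."""
    (max_token_length,) = struct.unpack("<I", f.read(4))
    vocab, scores = [], []
    for _ in range(vocab_size):
        score, length = struct.unpack("<fI", f.read(8))
        vocab.append(f.read(length))
        scores.append(score)
    return max_token_length, vocab, scores

if __name__ == "__main__":
    buf = io.BytesIO()
    write_tokenizer_bin(buf, [b"<unk>", b" the", b"a"], [0.0, -1.0, -2.5])
    buf.seek(0)
    print(read_tokenizer_bin(buf, vocab_size=3))
```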
@@ -29,7 +29,6 @@ from torch.distributed import destroy_process_group, init_process_group
 from torch.nn.parallel import DistributedDataParallel as DDP

 from tinystories import Task
-from tinyshakespeare import ShakespeareTask

 # -----------------------------------------------------------------------------
 # I/O
@@ -47,11 +46,13 @@ wandb_run_name = "run" + datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
 # data
 batch_size = 128  # if gradient_accumulation_steps > 1, this is the micro-batch size
 max_seq_len = 256
-dataset = "tinystories"  # tinystories|tinyshakespeare
+vocab_source = "llama2" # llama2|custom; use Llama 2 vocab from Meta, or custom trained
+vocab_size = 32000 # the Llama 2 tokenizer has 32K tokens
 # model
 dim = 288
 n_layers = 6
 n_heads = 6
+n_kv_heads = 6
 multiple_of = 32
 dropout = 0.0
 # adamw optimizer
@@ -83,6 +84,10 @@ config = {k: globals()[k] for k in config_keys}  # will be useful for logging
 lr_decay_iters = max_iters  # should be ~= max_iters per Chinchilla
 min_lr = 0.0  # minimum learning rate, should be ~= learning_rate/10 per Chinchilla

+# validation checks
+assert vocab_source in ["llama2", "custom"]
+assert vocab_source == "custom" or vocab_size == 32000, "The vocab from Meta has 32K tokens"
+
 # various inits, derived attributes, I/O setup
 ddp = int(os.environ.get("RANK", -1)) != -1  # is this a ddp run?
 if ddp:
@@ -123,11 +128,12 @@ ctx = (
 )

 # task-specific setup
-task = {'tinystories': Task, 'tinyshakespeare': ShakespeareTask}[dataset]
 iter_batches = partial(
-    task.iter_batches,
+    Task.iter_batches,
    batch_size=batch_size,
    max_seq_len=max_seq_len,
+    vocab_size=vocab_size,
+    vocab_source=vocab_source,
    device=device,
    num_workers=0,
 )
@@ -141,8 +147,8 @@ model_args = dict(
    dim=dim,
    n_layers=n_layers,
    n_heads=n_heads,
-    n_kv_heads=n_heads,
-    vocab_size=32000,
+    n_kv_heads=n_kv_heads,
+    vocab_size=vocab_size,
    multiple_of=multiple_of,
    max_seq_len=max_seq_len,
    dropout=dropout,
@@ -206,7 +212,7 @@ def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
-        batch_iter = iter_batches(split)
+        batch_iter = iter_batches(split=split)
        losses = torch.zeros(eval_iters)  # keep on CPU
        for k in range(eval_iters):
            X, Y = next(batch_iter)
@@ -238,7 +244,7 @@ if wandb_log and master_process:
    wandb.init(project=wandb_project, name=wandb_run_name, config=config)

 # training loop
-train_batch_iter = iter_batches("train")
+train_batch_iter = iter_batches(split="train")
 X, Y = next(train_batch_iter)  # fetch the very first batch
 t0 = time.time()
 local_iter_num = 0  # number of iterations in the lifetime of this process
@@ -0,0 +1,126 @@
+#!/bin/bash
+
+# Trains a sentencepiece tokenizer model on a bunch of given data, my best
+# effort attempt to replicate how Meta trained their Llama 2 tokenizer.
+
+# usage: $ train_vocab.sh <input> <model_prefix> <vocab_size>
+# example:
+# ./train_vocab.sh tiny.txt tokenizer_tiny 1024
+# requirements:
+# install https://github.com/google/sentencepiece
+
+# check if the correct number of arguments are provided
+if [ $# -ne 3 ]; then
+    echo "Usage: $0 <input> <model_prefix> <vocab_size>"
+    exit 1
+fi
+
+# assign command-line arguments to variables
+input=$1
+model_prefix=$2
+vocab_size=$3
+
+# check if input file exists
+if [ ! -f "$input" ]; then
+    echo "Usage: $0 <input> <model_prefix> <vocab_size>"
+    echo "input '$input' not found."
+    exit 1
+fi
+
+# check if vocab_size is a positive integer
+if ! [[ "$vocab_size" =~ ^[0-9]+$ ]] || [ "$vocab_size" -lt 1 ]; then
+    echo "Usage: $0 <input> <model_prefix> <vocab_size>"
+    echo "vocab_size must be a positive integer."
+    exit 1
+fi
+
+# Print the processed inputs
+echo "Input: $input"
+echo "Model Prefix: $model_prefix"
+echo "Vocabulary Size: $vocab_size"
+
+# train a sentencepiece tokenizer model
+# Llama 2 config can be printed as follows:
+
+# import sentencepiece.sentencepiece_model_pb2
+# mp = sentencepiece.sentencepiece_model_pb2.ModelProto()
+# mp.ParseFromString(open("tokenizer.model", "rb").read())
+# print(mp.trainer_spec)
+# print(mp.normalizer_spec)
+
+# this gives:
+
+# trainer_spec {
+#   input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
+#   model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
+#   model_type: BPE
+#   vocab_size: 32000
+#   self_test_sample_size: 0
+#   input_format: "text"
+#   character_coverage: 0.9999499917030334
+#   input_sentence_size: 200000000
+#   seed_sentencepiece_size: 1000000
+#   shrinking_factor: 0.75
+#   num_threads: 80
+#   num_sub_iterations: 2
+#   max_sentence_length: 4192
+#   shuffle_input_sentence: true
+#   max_sentencepiece_length: 16
+#   split_by_unicode_script: true
+#   split_by_whitespace: true
+#   split_by_number: true
+#   treat_whitespace_as_suffix: false
+#   split_digits: true
+#   allow_whitespace_only_pieces: true
+#   vocabulary_output_piece_score: true
+#   hard_vocab_limit: true
+#   use_all_vocab: false
+#   byte_fallback: true
+#   required_chars: ""
+#   unk_id: 0
+#   bos_id: 1
+#   eos_id: 2
+#   pad_id: -1
+#   unk_surface: " \342\201\207 "
+#   unk_piece: "<unk>"
+#   bos_piece: "<s>"
+#   eos_piece: "</s>"
+#   pad_piece: "<pad>"
+#   train_extremely_large_corpus: false
+#   enable_differential_privacy: false
+#   differential_privacy_noise_level: 0.0
+#   differential_privacy_clipping_threshold: 0
+# }
+# normalizer_spec {
+#   name: "identity"
+#   precompiled_charsmap: ""
+#   add_dummy_prefix: true
+#   remove_extra_whitespaces: false
+#   normalization_rule_tsv: ""
+# }
+
+# let's now use spm_train to train this exact model
+# options docs: https://github.com/google/sentencepiece/blob/master/doc/options.md
+
+# we'll depart on a few settings:
+# character_coverage -> 1.0
+
+# other important notes:
+# --split_digits is true, per the paper
+# --allow_whitespace_only_pieces is true, default in spm is false
+# --byte_fallback is true, default in spm is false
+# --normalization_rule_name is identity, default in spm is nmt_nfkc
+
+spm_train --input="$input" \
+          --model_prefix="$model_prefix" \
+          --model_type=bpe \
+          --vocab_size="$vocab_size" \
+          --self_test_sample_size=0 \
+          --input_format="text" \
+          --character_coverage=1.0 \
+          --num_threads="$(nproc)" \
+          --split_digits=true \
+          --allow_whitespace_only_pieces=true \
+          --byte_fallback=true \
+          --unk_surface=$' \342\201\207 ' \
+          --normalization_rule_name=identity
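For readers who would rather drive the same training run from Python, here is a minimal sketch that assembles the `spm_train` argument list from the settings in the script above. The helper name `build_spm_train_cmd` is hypothetical and not part of this change. `os.cpu_count()` stands in for `nproc`, the input-file existence check is left to the caller, and `--unk_surface` is omitted for brevity.

```python
import os
import shlex


def build_spm_train_cmd(input_path, model_prefix, vocab_size, num_threads=None):
    """Build the spm_train argument list (hypothetical helper, not in the repo)."""
    # Mirror the script's check: vocab_size must be a positive integer.
    if isinstance(vocab_size, bool) or not isinstance(vocab_size, int) or vocab_size < 1:
        raise ValueError("vocab_size must be a positive integer.")
    if num_threads is None:
        num_threads = os.cpu_count() or 1  # assumption: stand-in for $(nproc)

    # Flag names and values are copied from the shell script above.
    options = {
        "input": input_path,
        "model_prefix": model_prefix,
        "model_type": "bpe",
        "vocab_size": vocab_size,
        "self_test_sample_size": 0,
        "input_format": "text",
        "character_coverage": 1.0,
        "num_threads": num_threads,
        "split_digits": "true",
        "allow_whitespace_only_pieces": "true",
        "byte_fallback": "true",
        "normalization_rule_name": "identity",
    }
    return ["spm_train"] + [f"--{name}={value}" for name, value in options.items()]


if __name__ == "__main__":
    cmd = build_spm_train_cmd("tiny.txt", "tokenizer_tiny", 1024)
    # Print a shell-quoted command line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument intact if it is later handed to `subprocess.run`, so paths with spaces need no extra quoting.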
Author	SHA1	Message	Date
Andrej Karpathy	039a9713c2	ok this first version works but i don't think is ready to merge, have to think on more	2023-08-18 15:44:02 +00:00
Andrej Karpathy	591f1353c7	ok this works but is super slow because we are doing all the work in fp32 still	2023-08-18 03:40:18 +00:00
Andrej Karpathy	e9cbe3e84f	small improvements to comments and warnings and increase header size during model export	2023-08-17 14:32:22 +00:00
Andrej Karpathy	5e2e5b28f4	re-write the model export to do int8 quantization in groups, with group size fallback, and also change the header to be much better	2023-08-17 05:56:20 +00:00
Andrej Karpathy	bd182289c5	calculate the freq_cis online, no need to write/read them to/from checkpoints	2023-08-17 04:13:13 +00:00
Andrej	b68a6d2ab5	Merge pull request #307 from madroidmaq/master Jupter Notebook: Add run Meta's Llama 2 models	2023-08-16 20:09:32 -07:00
Andrej	57bf0e9ee4	Merge pull request #306 from rdentato/patch-utf8-no-validation minimal protection against invalid UTF8 encoding.	2023-08-16 09:51:11 -07:00
madroid	9fbe96fc2e	Jupter Notebook: Add run Meta's Llama 2 models	2023-08-16 20:27:28 +08:00
rdentato	55e60740f5	Added space to str_buffer in case max_token_length is 1.	2023-08-16 07:58:07 +00:00
rdentato	befe4867b3	minimal protection against invalid UTF8 encoding.	2023-08-16 07:42:53 +00:00
Andrej	df6557a10d	Merge pull request #267 from krrishnarraj/master Update readme for openmp on mac	2023-08-15 19:26:34 -07:00
Andrej Karpathy	65c899314c	Merge branch 'Majdoddin-ci-tiny-model'	2023-08-16 02:22:26 +00:00
Andrej Karpathy	62a6d69d86	style changes and remove spurious runc test call at the bottom	2023-08-16 02:22:13 +00:00
Andrej Karpathy	d47fc41b6a	Merge branch 'ci-tiny-model' of https://github.com/Majdoddin/llama2.c into Majdoddin-ci-tiny-model	2023-08-16 02:20:34 +00:00
Andrej Karpathy	ca67253f28	smallfix: not sure what the point of this indirection was	2023-08-15 16:09:33 +00:00
Andrej Karpathy	4c63c5608d	shorten top comment on run.c file	2023-08-15 16:07:48 +00:00
Andrej Karpathy	a47f9b3969	collapsing copy paste code because it's driving my ocd crazy	2023-08-15 16:03:11 +00:00
Ruhollah Majdoddin	87b11edf27	modifiying test_all so it can safely run on windows	2023-08-15 16:01:53 +00:00
Ruhollah Majdoddin	66c9f5e6c8	Adding pytest with the tiny model to macOS and windows (except amd64_arm64) runners	2023-08-15 15:58:04 +00:00
Andrej Karpathy	88eb238255	add tests into Makefile convenience	2023-08-15 15:57:27 +00:00
Andrej	600cedb33d	Merge pull request #297 from karpathy/feature/utf8 Add UTF-8 support to prompts	2023-08-14 19:54:49 -07:00
Andrej Karpathy	fe2de68688	fix sample.py from tokenizer changes before	2023-08-15 02:33:01 +00:00
Andrej Karpathy	a9a0628c92	thoroughly commented the UTF-8 byte reading code	2023-08-15 02:18:49 +00:00
Andrej Karpathy	d459fd4243	add back careful processing of the byte tokens	2023-08-15 01:42:33 +00:00
Andrej Karpathy	4bf36ecc17	get rid of the special byte decoding logic	2023-08-15 01:04:10 +00:00
Andrej Karpathy	8417cb438d	Merge branch 'utf8' of https://github.com/atamurad/llama2.c into feature/utf8	2023-08-15 00:18:53 +00:00
Andrej Karpathy	94a3a5e0a5	Merge branch 'master' of github.com:karpathy/llama2.c	2023-08-14 14:52:15 +00:00
Andrej Karpathy	32c1ff97fb	missed p->dim to kv_dim for k,v vectors. we're not doing anything wrong we're just being wasteful with memory. thanks @xefoci7612 for pointing out	2023-08-14 14:52:07 +00:00
Andrej	013e012b87	Merge pull request #286 from Nick-infinity/master [Feat]: Add support for meta llama hf model conversion	2023-08-14 07:46:39 -07:00
Andrej	50f970d170	Merge pull request #289 from chenyangMl/update_readme Update readme to introduce llama2.c-zh	2023-08-14 07:41:13 -07:00
chenyang	2a9a4c4e14	update readme wiht a simple line to introduce llama2.c-zh	2023-08-14 15:12:30 +08:00
chenyang	79900ff68e	update readme wiht a simple line to introduce llama2.c-zh	2023-08-14 15:00:33 +08:00
Krishnaraj Bhat	eec9ad5a5b	Merge remote-tracking branch 'upstream/master'	2023-08-14 12:02:40 +05:30
Andrej Karpathy	82ad2ba34e	remove tiktoken as dependency	2023-08-14 05:53:57 +00:00
Nikhil Gupta	c39f19f1a9	[Feat]: Add support for meta llama hf model conversion Description: Llama 2 hf models have weights stored with diff name Signed-off-by: Nikhil Gupta <nikhilg.me@gmail.com>	2023-08-14 10:18:51 +05:30
Andrej	bae0bcf484	Small tweaks to Readme intro	2023-08-13 20:03:00 -07:00
Andrej Karpathy	45afa91dca	the accum function has been bothering me, there is no real need to add a function here, it does something trivial and is only used twice, scrap	2023-08-14 02:54:27 +00:00
Andrej Karpathy	854c97b660	turn topp 0.9 back on by default thanks to recent PR contributions truncating before quicksort	2023-08-14 00:12:45 +00:00
Andrej	4a2c375df9	Merge pull request #276 from jrudolph/improve-top-p optimize sample_topp by filtering out small value elements up front	2023-08-13 17:05:38 -07:00
Andrej	b3d6a9e6b5	Merge pull request #285 from karpathy/feature/civ2 Upgrading CI to run our new pytest	2023-08-13 16:55:01 -07:00
Andrej	091c799653	Merge branch 'master' into feature/civ2	2023-08-13 16:54:24 -07:00
Andrej Karpathy	c970f69334	oops i should probably call this function lol	2023-08-13 23:48:01 +00:00
Andrej Karpathy	223a67048a	add optional manual dispatch of actions	2023-08-13 23:39:37 +00:00
Andrej Karpathy	86325bf7e8	attempt to upgrade the CI to run our pytest	2023-08-13 23:35:29 +00:00
Andrej	b51c63b9f2	Merge pull request #283 from wizzard0/wizzard0-mention-1 Add TypeScript port	2023-08-13 14:36:10 -07:00
Andrej Karpathy	8506036185	remove 'revive tests' as a todo from the readme	2023-08-13 21:23:27 +00:00
Andrej Karpathy	f0024cfc88	revive tests. now that we have a tiny stories260K model this only requires a 2MB download. phew	2023-08-13 21:22:44 +00:00
Andrej	0805cb2c31	tiny whitespace fix to try to eliminate scrollbar	2023-08-13 13:40:09 -07:00
Andrej	b2cce341e0	oops typo fix in readme	2023-08-13 13:39:12 -07:00
Andrej Karpathy	3e989e21f2	link to stories260K model	2023-08-13 20:38:05 +00:00
Andrej Karpathy	58075b5ac5	update API of sample.py to be better, small changes here	2023-08-13 20:31:32 +00:00
atamyrat	36b54321e5	bugfix: allocate +1 in tokens buffer for dummy whitespace	2023-08-13 23:23:32 +03:00
Andrej	1bcb2d18d6	Merge pull request #284 from karpathy/feature/customtokenizer multiquery support add	2023-08-13 12:38:06 -07:00
Andrej Karpathy	38bfac90a8	bigchange: add multiquery support in run.c. we can now train and inference multiquery models (where n_kv_heads < n_heads). this also means that we, in principle, support Llama 2 34B and 70B models, which are multiquery	2023-08-13 19:34:05 +00:00
Andrej	b28c1e26c5	Merge pull request #275 from icppWorld/webassembly-internet-computer Notable fork section for WebAssembly	2023-08-13 10:14:39 -07:00
Andrej	5295cbb821	Merge pull request #281 from lintian06/original_llama2 Update README.md for a new rust port.	2023-08-13 10:14:00 -07:00
Andrej	12dec61fbf	Merge pull request #282 from mihainadas/master-1 Fixes https://github.com/karpathy/llama2.c/issues/280	2023-08-13 10:13:08 -07:00
Oleksandr Nikitin	0e6213c6e0	Mention I can run the full 7B model	2023-08-13 20:02:34 +03:00
Oleksandr Nikitin	1d68a36d14	Add TypeScript port I've never been so happy to have missed that the JS port already exists :D also it was nice to discover that the JS can reach 80% of the single-threaded C speed (10 tokens/s for TinyStories-110M)	2023-08-13 19:10:07 +03:00
Mihai Nadăș	570789aa04	Fixes https://github.com/karpathy/llama2.c/issues/280 There was a small bug in tinystories.py, described here: https://github.com/karpathy/llama2.c/issues/280 This commit simply passes vocab_size to get_tokenizer_model_path to avoid silent crash when processing shards (in process_shard)	2023-08-13 17:49:10 +03:00
Tian Lin	27adb082f1	Update README.md	2023-08-13 21:58:14 +08:00
atamyrat	daa9fd9b8a	sort vocabulary for faster lookup with bsearch()	2023-08-13 15:02:11 +03:00
Andrej	8b472ded1f	Merge pull request #272 from karpathy/feature/customtokenizer Big Change: Custom Tokenizer training: add the ability to train custom tokenizers instead of using the pretrained Llama 2 tokenizer. This is useful in custom, narrow-domain LLMs because smaller vocab sizes make much smaller, faster, and potentially more capable models. For example, in tinystories a vocab size 4096 custom tokenizer compresses the input text sequences about as well as the Llama 2 tokenizer with vocab size 32000. The result is also "safer" because a badly trained model can't accidentally e.g. output some random chinese character and rapidly go "off the rails" in subsequent tokens.	2023-08-12 20:31:21 -07:00
Andrej Karpathy	9ff459b925	todo changes	2023-08-13 03:24:31 +00:00
Andrej Karpathy	1d14cb8dd8	add note about 4096 vs 32000 token size on tinystories	2023-08-13 03:19:35 +00:00
Andrej Karpathy	fe49eb222c	readme for custom tokenizers	2023-08-13 03:16:18 +00:00
Andrej Karpathy	9c3cfb46a3	make default be the llama2 tokenizer	2023-08-13 03:08:07 +00:00
Andrej Karpathy	00a61dc7f9	remove the tinyshakespeare dataset until i can bring it back later in a nicer form, otherwise right now we just have a ton of copy paste code here	2023-08-13 02:18:30 +00:00
Andrej Karpathy	f5fc0c245f	final piece: run.c support for new tokenizer, super ez	2023-08-13 02:12:13 +00:00
Andrej Karpathy	ea4cedc588	add ability to export custom tokenizer to .bin format for run.c file	2023-08-13 02:00:19 +00:00
Johannes Rudolph	d421a95b2b	optimize sample_topp by filtering out small value elements up front This works because we know that in worst case only 1 element will be selected and therefore the remaining (n-1) elements have to split the remaining (1-topp) probability. Probabilities smaller than that cannot be selected and can be filtered out up front.	2023-08-12 20:31:19 +02:00
Andrej Karpathy	b0cfa2458d	ok i can train and sample a model with a custom tokenizer	2023-08-11 16:47:29 +00:00
icpp	f96c7afb2d	Notable fork section for WebAssembly Added my repo `icpp-lmm` for running it on the Internet Computer	2023-08-11 10:11:32 -04:00
Andrej Karpathy	4c6f0af9ff	add the ability to train a custom sentencepiece tokenizer with a given vocab_size, and pretok with it. some more changes still needed to merge this branch, in train.py and ofc run.c. did this in a sadly bit ugly, but fully backwards compatible way. basically when we use custom tokenizer we create a whole new directory structure for that	2023-08-11 03:58:22 +00:00
Andrej Karpathy	c42641205f	turn off topp sampling by default because it is a bit too slow to be the default. it is likely that turning it on, e.g. -p 0.9 is midlly higher quality and safer samples, but this comes at a cost of too much performance in double digit percent sometimes, for it to be on by default i think...	2023-08-10 15:23:05 +00:00
Krishnaraj Bhat	46d7a6b6c6	Merge branch 'karpathy:master' into master	2023-08-10 11:06:19 +05:30
Krishnaraj Bhat	d45a36cdd2	Update readme for openmp on mac	2023-08-10 10:59:39 +05:30
atamyrat	c02865df30	prompt tokenizer improvements: utf8 support, add_dummy_prefix and byte_fallback options to match sentencepiece	2023-08-07 13:12:44 +03:00
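The message for commit d421a95b2b above describes filtering small probabilities before the top-p sort. A minimal Python sketch of that idea follows, assuming the cutoff is `(1 - topp) / (n - 1)`, which is how the commit message reads. The function names are illustrative and are not taken from run.c, and `coin` is assumed to be a uniform random number in [0, 1).

```python
def topp_candidates(probs, topp):
    """Indices that top-p sampling can still pick, sorted by descending probability."""
    n = len(probs)
    if n == 1:
        return [0]
    # Worst case: one token is selected, so the other n-1 tokens share at most
    # (1 - topp) of the mass. Anything below this cutoff cannot be selected.
    cutoff = (1.0 - topp) / (n - 1)
    kept = [i for i, p in enumerate(probs) if p >= cutoff]
    if not kept:
        # Defensive fallback for a very small topp: keep only the most likely token.
        kept = [max(range(n), key=lambda i: probs[i])]
    # Only the survivors are sorted, which is where the speedup comes from.
    kept.sort(key=lambda i: probs[i], reverse=True)
    return kept


def sample_topp(probs, topp, coin):
    """Sample an index from the smallest prefix whose cumulative probability exceeds topp."""
    kept = topp_candidates(probs, topp)
    cumulative = 0.0
    last = len(kept) - 1
    for pos, i in enumerate(kept):
        cumulative += probs[i]
        if cumulative > topp:
            last = pos
            break
    # Rescale the coin to the truncated mass and walk the CDF.
    r = coin * cumulative
    cdf = 0.0
    for i in kept[: last + 1]:
        cdf += probs[i]
        if r < cdf:
            return i
    return kept[last]  # guard against rounding error


if __name__ == "__main__":
    probs = [0.01, 0.3, 0.02, 0.6, 0.05, 0.01, 0.01]
    print(topp_candidates(probs, 0.92))
    print(sample_topp(probs, 0.92, 0.5))
```

With a large vocabulary most entries fall below the cutoff, so the sort touches only a small fraction of the tokens.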