ok this first version works but i don't think it is ready to merge, have to think on it more

ok this works but is super slow because we are doing all the work in fp32 still
small improvements to comments and warnings and increase header size during model export
2023-08-18 15:44:02 +00:00 · 2023-08-18 03:40:18 +00:00 · 2023-08-17 14:32:22 +00:00 · 2023-08-17 05:56:20 +00:00
12 changed files with 825 additions and 1128 deletions
@@ -55,14 +55,6 @@ test:
 testc:
 	pytest -k runc

-# run the C tests, without touching pytest / python
-# to increase verbosity level run e.g. as `make testcc VERBOSITY=1`
-VERBOSITY ?= 0
-.PHONY: testcc
-testcc:
-	$(CC) -DVERBOSITY=$(VERBOSITY) -O3 -o testc test.c -lm
-	./testc
-
 .PHONY: clean
 clean:
 	rm -f run
@@ -8,7 +8,7 @@ Train the Llama 2 LLM architecture in PyTorch then inference it with one simple

 As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B. Work on model quantization is currently ongoing.

-Please note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. Compared to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.
+Please note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. Compred to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.

 ## feel the magic

@@ -65,13 +65,13 @@ Quick note on sampling, the recommendation for ~best results is to sample with `
 ## Meta's Llama 2 models

 As the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 2 checkpoints by following the [Meta instructions](https://github.com/facebookresearch/llama). Once we have those checkpoints, we have to convert them into the llama2.c format.
-For this we need to install the python dependencies (`pip install -r requirements.txt`) and then use the `export.py` file, e.g. for 7B model:
+For this we need to install the python dependencies (`pip install -r requirements.txt`) and then use the `export_meta_llama_bin.py` file, e.g. for 7B model:

 ```bash
-python export.py llama2_7b.bin --meta-llama path/to/llama/model/7B
+python export_meta_llama_bin.py path/to/llama/model/7B llama2_7b.bin
 ```

-The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that, despite efforts, the 13B export currently doesn't work. I would not attempt to run anything above 7B right now for two reasons: first, 13B+ currently doesn't work because of integer overflow in pointer arithmetic, which is yet to be fixed, and second, even if it were fixed, this repo is doing float32 inference right now, so it would be fairly unusably slow. Once the export is done, we can run it:
+The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that despite efforts, the 13B export currently doesn't work for unknown reasons (accepting PRs for fix). We can run the model as normal:

 ```bash
 ./run llama2_7b.bin
@@ -83,22 +83,6 @@ This ran at about 4 tokens/s compiled with [OpenMP](#OpenMP) on 96 threads on my

 base models... ¯\\_(ツ)_/¯. Since we can inference the base model, it should be possible to also inference the chat model quite easily, and have a conversation with it. And if we can find a way to run 7B more efficiently, we can start adding LoRA to our training script, and going wild with finetunes all within the repo!

-You can also chat with the Llama Chat models. Export the chat model exactly as above:
-
-```bash
-python export.py llama2_7b_chat.bin --meta-llama /path/to/7B-chat
-```
-
-Then chat with it by specifying the chat mode using the `-m` flag, e.g.:
-
-```bash
-./run llama2_7b_chat.bin -m chat
-```
-
-## huggingface models
-
-We can load any huggingface models that use the Llama 2 architecture. See the script [export.py](export.py) and the `--hf` flag to export the model .bin file.
-
 ## models

 For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). The 110M took around 24 hours. I am hosting them on huggingface hub [tinyllamas](https://huggingface.co/karpathy/tinyllamas), both in the original PyTorch .pt, and also in the llama2.c format .bin:
@@ -175,7 +159,7 @@ python tinystories.py train_vocab --vocab_size=4096
 python tinystories.py pretokenize --vocab_size=4096
 ```

-The `train_vocab` stage will call the `sentencepiece` library to train the tokenizer, storing it in a new file `data/tok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size.
+The `train_vocab` stage will call the `train_vocab.sh` script, which calls the `sentencepiece` library to train the tokenizer, storing it in a new file `data/tok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size.

 A quick note of interest is that vocab size of 4096 trained specifically on tinystories creates integer sequences with about the same sequence length per example as the default Llama 2 tokenizer of 32000 tokens! This means that our custom, tailored tokenizer is a lot better adapted to our specific text, and can compress it very effectively. So our trained models are smaller and faster.

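
The Byte Pair Encoding description in the paragraph above (start from raw utf8 bytes, then repeatedly merge the most common adjacent pair) can be sketched in a few lines of Python. This is a toy illustration only: the helper names `most_common_pair`, `merge` and `train_bpe` are made up for this note, and it is not a description of what `sentencepiece` does internally.

```python
from collections import Counter


def most_common_pair(ids):
    """Return the most frequent adjacent pair of token ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Toy trainer: ids 0..255 are raw bytes, every id from 256 up is a learned merge."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        pair = most_common_pair(ids)
        if pair is None:
            break  # nothing left to merge
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


# hypothetical tiny corpus, two merges on top of the 256 byte tokens
ids, merges = train_bpe("aaabdaaabac", vocab_size=258)
print(len("aaabdaaabac".encode("utf-8")), "bytes ->", len(ids), "tokens")
```

Each merge shortens the sequence, which is the compression effect the README note about the 4096-token vocabulary relies on.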
@@ -219,7 +203,8 @@ You can also experiment with replacing `gcc` with `clang`.

 If compiling with gcc, try experimenting with `-funroll-all-loops`, see PR [#183](https://github.com/karpathy/llama2.c/pull/183)

-**OpenMP**. Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors.
+### OpenMP
+Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors.
 You'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on ubuntu). Then you can compile with `make runomp`, which does:

 ```bash
@@ -232,8 +217,7 @@ When you run inference make sure to use OpenMP flags to set the number of thread
 OMP_NUM_THREADS=4 ./run out/model.bin
 ```

-Depending on your system resources you may want to tweak these hyperparameters and use more threads. But more is not always better, usually this is a bit U shaped. In particular, if your CPU has SMT (multithreading), try setting the number of threads to the number of physical cores rather than logical cores. The performance difference can be large due to cache thrashing and communication overhead. The PyTorch documentation [CPU specific optimizations
-](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#cpu-specific-optimizations) has some good information that applies here too.
+Depending on your system resources you may want to tweak these hyperparameters and use more threads. But more is not always better, usually this is a bit U shaped.

 ## platforms

@@ -254,14 +238,6 @@ $ pytest

 This will currently invoke two tests inside `test_all.py`, which forward the model in both C and Python for 200 steps and check the output against a known good expected output. The tests currently run in only a few seconds, but will have to download and cache the stories260K models in a temporary `test` directory (only ~2MB download).

-There are also some tests in C, in the file [test.c](test.c). You can run these with `make testcc`, or to see more stuff printed:
-
-```
-make testcc VERBOSITY=1
-```
-
-Call for help: help add more tests.
-
 ## ack

 I trained the llama2.c storyteller models on a 4X A100 40GB box graciously provided by the excellent [Lambda labs](https://lambdalabs.com/service/gpu-cloud), thank you.
@@ -295,7 +271,6 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
  - [llama2.rs](https://github.com/leo-du/llama2.rs) by @[leo-du](https://github.com/leo-du): A Rust port of this project
  - [llama2-rs](https://github.com/danielgrittner/llama2-rs) by @[danielgrittner](https://github.com/danielgrittner): a Rust port of this project
  - [llama2.rs](https://github.com/lintian06/llama2.rs) by @[lintian06](https://github.com/lintian06): A Rust port of this project
-  - [pecca.rs](https://github.com/rahoua/pecca-rs) by @[rahoua](https://github.com/rahoua): A Rust port leveraging [ndarray](https://github.com/rust-ndarray/ndarray), supports BLAS.
 - Go
  - [go-llama2](https://github.com/tmc/go-llama2) by @[tmc](https://github.com/tmc): a Go port of this project
  - [llama2.go](https://github.com/nikolaydubina/llama2.go) by @[nikolaydubina](https://github.com/nikolaydubina): a Go port of this project
@@ -326,8 +301,6 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
  - [llama2.py](https://github.com/tairov/llama2.py) by @[tairov](https://github.com/tairov): a simple one file pure Python port of this project with zero dependencies
 - C#
  - [llama2.cs](https://github.com/trrahul/llama2.cs) by @[trrahul](https://github.com/trrahul): a C# port of this project
-- Dart
-  - [llama2.dart](https://github.com/yiminghan/llama2.dart) by @[yiminghan](https://github.com/yiminghan/llama2.dart): one-file dart port of this project, works with Flutter!
 - WebAssembly
  - [icpp-llm](https://github.com/icppWorld/icpp-llm): LLMs for the Internet Computer
 - [llama2.c - Llama 2 Everywhere](https://github.com/trholding/llama2.c) by @[trholding](https://github.com/trholding): Standalone, Bootable & Portable Binary Llama 2
@@ -335,12 +308,12 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg

 ## unsorted todos

-- add support in run.c of reading version 1+ files from export, later deprecate "version 0"
-- runq.c (int8 quantization) add
-- run.cu (CUDA) investigate and merge
-- add more tests inside [test.c](test.c)
-- add Engine class for use in sample.py that does efficient inference in PyTorch, e.g. KV cache keeping
 - make it easier to add a new dataset with not too much pain
+- should calculate freq_cis online in the script run.c instead of loading them
+- int4/8 quantization
+- export the model in a more sensible output format with a proper header, etc.
+- support Llama 2 7B Chat models and tune run.c to Chat UI/UX
+- llama2.cu investigate and merge
 - (LoRA) finetuning and export of Llama 2 models

 ## License
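
For context on the "proper header" todo above: the legacy .bin layout written by both export scripts in this diff starts with seven int32 values, and the sign of `vocab_size` doubles as the "classifier weights are present" flag. A minimal round-trip sketch using only the standard library is below. The function names are hypothetical, and the shape values are illustrative (only `vocab_size = 32000` and `max_seq_len = 2048` appear in the scripts).

```python
import io
import struct

# dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, max_seq_len
HEADER_FMT = "iiiiiii"
HEADER_KEYS = ("dim", "hidden_dim", "n_layers", "n_heads",
               "n_kv_heads", "vocab_size", "max_seq_len")


def write_legacy_header(f, params, shared_classifier):
    """Write the 7-int header; a negative vocab_size means a separate classifier follows."""
    values = [params[k] for k in HEADER_KEYS]
    if not shared_classifier:
        values[5] = -values[5]
    f.write(struct.pack(HEADER_FMT, *values))


def read_legacy_header(f):
    """Read the header back and decode the sign of vocab_size into a flag."""
    raw = f.read(struct.calcsize(HEADER_FMT))
    params = dict(zip(HEADER_KEYS, struct.unpack(HEADER_FMT, raw)))
    shared_classifier = params["vocab_size"] > 0
    params["vocab_size"] = abs(params["vocab_size"])
    return params, shared_classifier


# illustrative values only
example = dict(dim=4096, hidden_dim=11008, n_layers=32, n_heads=32,
               n_kv_heads=32, vocab_size=32000, max_seq_len=2048)
buf = io.BytesIO()
write_legacy_header(buf, example, shared_classifier=False)
buf.seek(0)
params, shared = read_legacy_header(buf)
print(params, shared)
```

Overloading the sign keeps the header at seven ints, but a reader has to know the convention, which is presumably what the todo wants to replace with an explicit field.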
@@ -1,470 +0,0 @@
-"""
-This script has functions and utilities for model export.
-Basically, we have a bunch of versions of the model, and we
-want to export them to .bin files to be read from and inferenced in C.
-
-Among the "input" versions of PyTorch files/models:
-- Official Llama 2 weights released by Meta
-- Huggingface weights available on the hub
-- llama2.c (this repo) trained models
-
-Among the "output" versions of .bin files:
-- v0: Legacy files of the original llama2.c repo (will eventually be DEPRECATED)
-- v1-vN: Improved .bin files with a proper header, cache alignment, etc.
-
-This script aspires to provide all of these conversions.
-"""
-import os
-import gzip
-import shutil
-import struct
-import argparse
-import json
-from pathlib import Path
-
-import numpy as np
-import torch
-from torch import nn
-
-from model import ModelArgs, Transformer
-
-# -----------------------------------------------------------------------------
-# common utilities
-
-def serialize_fp32(file, tensor):
-    """ writes one fp32 tensor to file that is open in wb mode """
-    d = tensor.detach().cpu().view(-1).to(torch.float32).numpy()
-    b = struct.pack(f'{len(d)}f', *d)
-    file.write(b)
-
-def serialize_int8(file, tensor):
-    """ writes one int8 tensor to file that is open in wb mode """
-    d = tensor.detach().cpu().view(-1).numpy().astype(np.int8)
-    b = struct.pack(f'{len(d)}b', *d)
-    file.write(b)
-
-def quantize_q80(w, group_size):
-    """
-    takes a tensor and returns the Q8_0 quantized version
-    i.e. symmetric quantization into int8, range [-127,127]
-    """
-    assert w.numel() % group_size == 0
-    ori_shape = w.shape
-    w = w.float() # convert to float32
-    w = w.reshape(-1, group_size)
-    # find the max in each group
-    wmax = torch.abs(w).max(dim=1).values
-    # calculate the scaling factor such that float = quant * scale
-    scale = wmax / 127.0
-    # scale into range [-127, 127]
-    quant = w / scale[:,None]
-    # round to nearest integer
-    int8val = torch.round(quant).to(torch.int8)
-    # dequantize by rescaling
-    fp32val = (int8val.float() * scale[:,None]).view(-1)
-    fp32valr = fp32val.reshape(-1, group_size)
-    # calculate the max error in each group
-    err = torch.abs(fp32valr - w).max(dim=1).values
-    # find the max error across all groups
-    maxerr = err.max().item()
-    return int8val, scale, maxerr
-
-# -----------------------------------------------------------------------------
-# legacy
-
-def legacy_export(model, filepath):
-    """ Original export of llama2.c bin files, i.e. version v0 """
-    out_file = open(filepath, 'wb')
-
-    # first write out the header
-    hidden_dim = model.layers[0].feed_forward.w1.weight.shape[0]
-    p = model.params
-    shared_classifier = torch.equal(model.tok_embeddings.weight, model.output.weight)
-    # legacy format uses negative/positive vocab size as a shared classifier flag
-    if not shared_classifier:
-        p.vocab_size = -p.vocab_size
-    n_kv_heads = p.n_heads if p.n_kv_heads is None else p.n_kv_heads
-    header = struct.pack('iiiiiii', p.dim, hidden_dim, p.n_layers, p.n_heads,
-                                    n_kv_heads, p.vocab_size, p.max_seq_len)
-    out_file.write(header)
-
-    # next write out the embedding weights
-    serialize_fp32(out_file, model.tok_embeddings.weight)
-
-    # now all the layers
-    # attention weights
-    for layer in model.layers:
-        serialize_fp32(out_file, layer.attention_norm.weight)
-    for layer in model.layers:
-        serialize_fp32(out_file, layer.attention.wq.weight)
-    for layer in model.layers:
-        serialize_fp32(out_file, layer.attention.wk.weight)
-    for layer in model.layers:
-        serialize_fp32(out_file, layer.attention.wv.weight)
-    for layer in model.layers:
-        serialize_fp32(out_file, layer.attention.wo.weight)
-    # ffn weights
-    for layer in model.layers:
-        serialize_fp32(out_file, layer.ffn_norm.weight)
-    for layer in model.layers:
-        serialize_fp32(out_file, layer.feed_forward.w1.weight)
-    for layer in model.layers:
-        serialize_fp32(out_file, layer.feed_forward.w2.weight)
-    for layer in model.layers:
-        serialize_fp32(out_file, layer.feed_forward.w3.weight)
-    # final rmsnorm
-    serialize_fp32(out_file, model.norm.weight)
-    # freqs_cis
-    serialize_fp32(out_file, model.freqs_cos[:p.max_seq_len])
-    serialize_fp32(out_file, model.freqs_sin[:p.max_seq_len])
-
-    # final classifier weights
-    if not shared_classifier:
-        serialize_fp32(out_file, model.output.weight)
-
-    # write to binary file
-    out_file.close()
-    print(f"wrote {filepath}")
-
-# -----------------------------------------------------------------------------
-# new version
-
-def version1_export(model, filepath):
-    """
-    Export the model weights in full float32 .bin file to be read from C.
-    This is same as legacy_export, but with a proper header.
-    """
-    version = 1
-
-    out_file = open(filepath, 'wb')
-    # first write out the header. the header will be 256 bytes
-    # 1) write magic, which will be uint32 of "ak42" in ASCII
-    out_file.write(struct.pack('I', 0x616b3432))
-    # 2) write version, which will be int
-    out_file.write(struct.pack('i', version))
-    # 3) write the params, which will be 7 ints
-    p = model.params
-    hidden_dim = model.layers[0].feed_forward.w1.weight.shape[0]
-    n_kv_heads = p.n_heads if p.n_kv_heads is None else p.n_kv_heads
-    header = struct.pack('iiiiiii', p.dim, hidden_dim, p.n_layers, p.n_heads,
-                                    n_kv_heads, p.vocab_size, p.max_seq_len)
-    out_file.write(header)
-    # 4) write some other flags
-    shared_classifier = torch.equal(model.tok_embeddings.weight, model.output.weight)
-    out_file.write(struct.pack('B', int(shared_classifier)))
-    pad = 256 - out_file.tell() # pad rest with zeros; tell returns current pos
-    assert pad >= 0
-    out_file.write(b'\0' * pad)
-
-    # now let's write out all the params
-    weights = [
-        *[layer.attention_norm.weight for layer in model.layers],
-        *[layer.ffn_norm.weight for layer in model.layers],
-        model.norm.weight,
-        model.tok_embeddings.weight,
-        *[layer.attention.wq.weight for layer in model.layers],
-        *[layer.attention.wk.weight for layer in model.layers],
-        *[layer.attention.wv.weight for layer in model.layers],
-        *[layer.attention.wo.weight for layer in model.layers],
-        *[layer.feed_forward.w1.weight for layer in model.layers],
-        *[layer.feed_forward.w2.weight for layer in model.layers],
-        *[layer.feed_forward.w3.weight for layer in model.layers],
-    ]
-    if not shared_classifier:
-        weights.append(model.output.weight)
-    for w in weights:
-        serialize_fp32(out_file, w)
-
-    # write to binary file
-    out_file.close()
-    print(f"wrote {filepath}")
-
-def version2_export(model, filepath, group_size=64):
-    """
-    Export the model weights in Q8_0 into .bin file to be read from C.
-    That is:
-    - quantize all weights to symmetric int8, in range [-127, 127]
-    - all other tensors (the rmsnorm params) are kept and exported in fp32
-    - quantization is done in groups of group_size to reduce the effects of any outliers
-    """
-    version = 2
-
-    # let's first do some validation for this export type
-    while model.params.dim % group_size != 0:
-        group_size //= 2
-        print(f"BACKOFF: reducing group size to {group_size} to fit hidden_dim")
-    weights = [
-        model.tok_embeddings.weight,
-        *[layer.attention.wq.weight for layer in model.layers],
-        *[layer.attention.wk.weight for layer in model.layers],
-        *[layer.attention.wv.weight for layer in model.layers],
-        *[layer.attention.wo.weight for layer in model.layers],
-        *[layer.feed_forward.w1.weight for layer in model.layers],
-        *[layer.feed_forward.w2.weight for layer in model.layers],
-        *[layer.feed_forward.w3.weight for layer in model.layers],
-    ]
-    shared_classifier = torch.equal(model.tok_embeddings.weight, model.output.weight)
-    if not shared_classifier:
-        weights.append(model.output.weight)
-    for i, w in enumerate(weights):
-        assert w.numel() % group_size == 0, f"weight {i} has numel {w.numel()}, not a multiple of group_size {group_size}"
-
-    # write
-    out_file = open(filepath, 'wb')
-    # first write out the header. the header will be 256 bytes
-    # 1) write magic, which will be uint32 of "ak42" in ASCII
-    out_file.write(struct.pack('I', 0x616b3432))
-    # 2) write version, which will be int
-    out_file.write(struct.pack('i', version))
-    # 3) write the params, which will be 7 ints
-    p = model.params
-    hidden_dim = model.layers[0].feed_forward.w1.weight.shape[0]
-    n_kv_heads = p.n_heads if p.n_kv_heads is None else p.n_kv_heads
-    header = struct.pack('iiiiiii', p.dim, hidden_dim, p.n_layers, p.n_heads,
-                                    n_kv_heads, p.vocab_size, p.max_seq_len)
-    out_file.write(header)
-    # 4) write some other flags
-    out_file.write(struct.pack('B', int(shared_classifier)))
-    out_file.write(struct.pack('i', group_size)) # group size used for quantization
-    pad = 256 - out_file.tell() # pad rest with zeros; tell returns current pos
-    assert pad >= 0
-    out_file.write(b'\0' * pad)
-    # now that the header is done, let's write out the model
-
-    # first let's write out all the params that we are keeping in fp32: the norms
-    for layer in model.layers: # attention norms
-        serialize_fp32(out_file, layer.attention_norm.weight)
-    for layer in model.layers: # MLP norms
-        serialize_fp32(out_file, layer.ffn_norm.weight)
-    serialize_fp32(out_file, model.norm.weight) # final pre-classifier norm
-
-    # now let's write out all the params that we are quantizing to Q8_0
-    # note we skip classifier weights, which are shared with the embedding
-    ew = []
-    scales = []
-    for i, w in enumerate(weights):
-        # quantize this weight
-        q, s, err = quantize_q80(w, group_size)
-        # save the int8 weights to file
-        serialize_int8(out_file, q) # save the tensor in int8
-        scales.append(s)  # we'll do all the scales after all the qs
-        # logging
-        ew.append((err, w.shape))
-        print(f"{i+1}/{len(weights)} quantized {tuple(w.shape)} to Q8_0 with max error {err}")
-
-    # save the scaling factors in fp32 here
-    # this is done to keep all the weights contiguous, making pointer arithmetic easier in C
-    for s in scales:
-        serialize_fp32(out_file, s)
-
-    # print the highest error across all weights, should be very small, e.g. O(~0.001)
-    ew.sort(reverse=True)
-    print(f"max quantization group error across all weights: {ew[0][0]}")
-
-    # write to binary file
-    out_file.close()
-    print(f"wrote {filepath}")
-
-
-# -----------------------------------------------------------------------------
-# Load / import functions
-
-def load_checkpoint(checkpoint):
-
-    # load the provided model checkpoint
-    checkpoint_dict = torch.load(checkpoint, map_location='cpu')
-    gptconf = ModelArgs(**checkpoint_dict['model_args'])
-    model = Transformer(gptconf)
-    state_dict = checkpoint_dict['model']
-    unwanted_prefix = '_orig_mod.'
-    for k,v in list(state_dict.items()):
-        if k.startswith(unwanted_prefix):
-            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
-    model.load_state_dict(state_dict, strict=False)
-    model.eval()
-    return model
-
-def load_meta_model(model_path):
-    params_path = os.path.join(model_path, 'params.json')
-    with open(params_path) as f:
-        params = json.load(f)
-        print(params)
-
-    model_paths = sorted(list(Path(model_path).glob('consolidated.*.pth')))
-    models = [torch.load(p, map_location='cpu') for p in model_paths]
-
-    def concat_weights(models):
-        state_dict = {}
-        for name in list(models[0]):
-            tensors = [model[name] for model in models]
-            if len(tensors) == 1 or len(tensors[0].shape) == 1:
-                state_dict[name] = tensors[0]
-                continue
-            is_axis_1 = (
-                name.startswith('tok_embeddings.')
-                or name.endswith('.attention.wo.weight')
-                or name.endswith('.feed_forward.w2.weight')
-            )
-            axis = 1 if is_axis_1 else 0
-            state_dict[name] = torch.cat(tensors, dim=axis)
-            for model in models:
-                del model[name]
-        return state_dict
-
-    state_dict = concat_weights(models)
-    del models
-
-    # set ModelArgs
-    config = ModelArgs()
-    config.dim = params["dim"]
-    config.n_layers = params["n_layers"]
-    config.n_heads = params["n_heads"]
-    config.n_kv_heads = params.get('n_kv_heads') or params['n_heads']
-    config.multiple_of = params["multiple_of"]
-    config.norm_eps = params["norm_eps"]
-
-    config.vocab_size = 32000
-    config.max_seq_len = 2048
-
-    # create a new Transformer object and set weights
-    model = Transformer(config)
-
-    model.tok_embeddings.weight = nn.Parameter(state_dict['tok_embeddings.weight'])
-    model.norm.weight = nn.Parameter(state_dict['norm.weight'])
-
-    for layer in model.layers:
-        i = layer.layer_id
-        layer.attention_norm.weight = nn.Parameter(state_dict[f'layers.{i}.attention_norm.weight'])
-        layer.attention.wq.weight = nn.Parameter(state_dict[f'layers.{i}.attention.wq.weight'])
-        layer.attention.wk.weight = nn.Parameter(state_dict[f'layers.{i}.attention.wk.weight'])
-        layer.attention.wv.weight = nn.Parameter(state_dict[f'layers.{i}.attention.wv.weight'])
-        layer.attention.wo.weight = nn.Parameter(state_dict[f'layers.{i}.attention.wo.weight'])
-        layer.ffn_norm.weight = nn.Parameter(state_dict[f'layers.{i}.ffn_norm.weight'])
-        layer.feed_forward.w1.weight = nn.Parameter(state_dict[f'layers.{i}.feed_forward.w1.weight'])
-        layer.feed_forward.w2.weight = nn.Parameter(state_dict[f'layers.{i}.feed_forward.w2.weight'])
-        layer.feed_forward.w3.weight = nn.Parameter(state_dict[f'layers.{i}.feed_forward.w3.weight'])
-
-    # final classifier
-    model.output.weight = nn.Parameter(state_dict['output.weight'])
-    model.eval()
-    return model
-
-def load_hf_model(model_path):
-
-    try:
-        from transformers import AutoModelForCausalLM
-    except ImportError:
-        print("Error: transformers package is required to load huggingface models")
-        print("Please run `pip install transformers` to install it")
-        return None
-
-    # load HF model
-    hf_model = AutoModelForCausalLM.from_pretrained(model_path)
-    hf_dict = hf_model.state_dict()
-
-    # convert LlamaConfig to ModelArgs
-    config = ModelArgs()
-    config.dim = hf_model.config.hidden_size
-    config.n_layers = hf_model.config.num_hidden_layers
-    config.n_heads = hf_model.config.num_attention_heads
-    config.n_kv_heads = hf_model.config.num_attention_heads
-    config.vocab_size = hf_model.config.vocab_size
-    config.hidden_dim = hf_model.config.intermediate_size
-    config.norm_eps = hf_model.config.rms_norm_eps
-    config.max_seq_len = hf_model.config.max_position_embeddings
-
-    # create a new Transformer object and set weights
-    model = Transformer(config)
-
-    model.tok_embeddings.weight = nn.Parameter(hf_dict['model.embed_tokens.weight'])
-    model.norm.weight = nn.Parameter(hf_dict['model.norm.weight'])
-
-    # huggingface permutes WQ and WK, this function reverses it
-    def permute_reverse(w, n_heads=config.n_heads, dim1=config.dim, dim2=config.dim):
-        return w.view(n_heads, 2, dim1 // n_heads // 2, dim2).transpose(1, 2).reshape(dim1, dim2)
-
-    for layer in model.layers:
-        i = layer.layer_id
-        layer.attention_norm.weight = nn.Parameter(hf_dict[f'model.layers.{i}.input_layernorm.weight'])
-        layer.attention.wq.weight = nn.Parameter(permute_reverse(hf_dict[f'model.layers.{i}.self_attn.q_proj.weight']))
-        layer.attention.wk.weight = nn.Parameter(permute_reverse(hf_dict[f'model.layers.{i}.self_attn.k_proj.weight']))
-        layer.attention.wv.weight = nn.Parameter(hf_dict[f'model.layers.{i}.self_attn.v_proj.weight'])
-        layer.attention.wo.weight = nn.Parameter(hf_dict[f'model.layers.{i}.self_attn.o_proj.weight'])
-        layer.ffn_norm.weight = nn.Parameter(hf_dict[f'model.layers.{i}.post_attention_layernorm.weight'])
-        layer.feed_forward.w1.weight = nn.Parameter(hf_dict[f'model.layers.{i}.mlp.gate_proj.weight'])
-        layer.feed_forward.w2.weight = nn.Parameter(hf_dict[f'model.layers.{i}.mlp.down_proj.weight'])
-        layer.feed_forward.w3.weight = nn.Parameter(hf_dict[f'model.layers.{i}.mlp.up_proj.weight'])
-
-    # final classifier
-    model.output.weight = nn.Parameter(hf_dict['lm_head.weight'])
-    model.eval()
-    return model
-
-
-# -----------------------------------------------------------------------------
-# API entrypoint
-
-def model_export(model, filepath, version):
-    if version == 0:
-        legacy_export(model, filepath)
-    elif version == 1:
-        version1_export(model, filepath)
-    elif version == 2:
-        version2_export(model, filepath)
-    else:
-        raise ValueError(f"unknown version {version}")
-
-def torchscript_export(model, filepath, zero_params=False, gzip_output=False):
-    """
-    (This was submitted via a PR earlier. Leaving it here, but "orphaned" for now)
-    Saves the model as a TorchScript.
-    The resulting file can be loaded in C++ code and then used for training or
-    inference with:
-        #include <torch/script.h>
-        torch::jit::Module module = torch::jit::load("model.pt")
-    Note that the serialized model includes the initial parameters and with the default
-    ModelArgs the file is 59M and gzips down to 55M. If you want to serialize/distribute
-    the model parameters separately you can zero out the parameters before saving it and
-    it will gzip down to 780K.
-    """
-
-    # If requested zero params before saving the model. This is useful in
-    # conjunction with gzip_output.
-    if zero_params:
-        for p in model.parameters():
-            p.detach().zero_()
-
-    torch.jit.save(torch.jit.script(model), filepath)
-
-    if gzip_output:
-        with open(filepath, "rb") as f_in:
-            with gzip.open(f"{filepath}.gz", "wb") as f_out:
-                shutil.copyfileobj(f_in, f_out)
-        os.unlink(filepath)
-
-# -----------------------------------------------------------------------------
-# CLI entrypoint
-
-if __name__ == "__main__":
-
-    parser = argparse.ArgumentParser()
-    parser.add_argument("filepath", type=str, help="the output filepath")
-    parser.add_argument("--version", default=0, type=int, help="the version to export with")
-    group = parser.add_mutually_exclusive_group(required=True)
-    group.add_argument("--checkpoint", type=str, help="model checkpoint, .pt file")
-    group.add_argument("--meta-llama", type=str, help="meta llama model path")
-    group.add_argument("--hf", type=str, help="huggingface model path")
-    args = parser.parse_args()
-
-    if args.checkpoint:
-        model = load_checkpoint(args.checkpoint)
-    elif args.meta_llama:
-        model = load_meta_model(args.meta_llama)
-    elif args.hf:
-        model = load_hf_model(args.hf)
-
-    if model is None:
-        parser.error("Can't load input model!")
-
-    # export
-    model_export(model, args.filepath, args.version)
@@ -0,0 +1,112 @@
+"""
+This script exports the Llama 2 weights in llama2c.bin format.
+"""
+import os
+import sys
+import struct
+from pathlib import Path
+import json
+
+import torch
+
+from model import precompute_freqs_cis
+
+
+def export(p, state_dict, filepath='model.bin'):
+    """export the model weights in fp32 into .bin file to be read from C"""
+    f = open(filepath, 'wb')
+
+    def serialize(key):
+        print(f"writing {key}...")
+        t = state_dict[key].contiguous().view(-1).type(torch.float32).numpy()
+        f.write(memoryview(t))
+        del state_dict[key]
+
+    # first write out the header
+    hidden_dim = state_dict['layers.0.feed_forward.w1.weight'].shape[0]
+    p['vocab_size'] = 32000
+    p['max_seq_len'] = 2048
+
+    n_kv_heads = p.get('n_kv_heads') or p['n_heads']
+    header = struct.pack(
+        'iiiiiii',
+        p['dim'], hidden_dim, p['n_layers'], p['n_heads'],
+        n_kv_heads, -p['vocab_size'], p['max_seq_len']
+    )
+    # NOTE: the negative vocab_size above signals that the classifier weights are present
+    # in the checkpoint and should be loaded.
+    f.write(header)
+
+    # next write out the embedding weights
+    print("writing tok_embeddings...")
+    serialize('tok_embeddings.weight')
+
+    # now all the layers
+    # attention weights
+    for i in range(p['n_layers']): serialize(f'layers.{i}.attention_norm.weight')
+    for i in range(p['n_layers']): serialize(f'layers.{i}.attention.wq.weight')
+    for i in range(p['n_layers']): serialize(f'layers.{i}.attention.wk.weight')
+    for i in range(p['n_layers']): serialize(f'layers.{i}.attention.wv.weight')
+    for i in range(p['n_layers']): serialize(f'layers.{i}.attention.wo.weight')
+    # ffn weights
+    for i in range(p['n_layers']): serialize(f'layers.{i}.ffn_norm.weight')
+    for i in range(p['n_layers']): serialize(f'layers.{i}.feed_forward.w1.weight')
+    for i in range(p['n_layers']): serialize(f'layers.{i}.feed_forward.w2.weight')
+    for i in range(p['n_layers']): serialize(f'layers.{i}.feed_forward.w3.weight')
+
+    # final rmsnorm
+    serialize('norm.weight')
+    # freqs_cos, freqs_sin
+    freqs_cos, freqs_sin = precompute_freqs_cis(p['dim'] // p['n_heads'], p['max_seq_len'] * 2)
+    state_dict['freqs_cos'] = freqs_cos[:p['max_seq_len']]
+    state_dict['freqs_sin'] = freqs_sin[:p['max_seq_len']]
+    serialize('freqs_cos')
+    serialize('freqs_sin')
+
+    # finally write the output weights
+    serialize('output.weight')
+
+    f.close()
+    print(f"wrote {filepath}")
+
+
+def concat_weights(models):
+    state_dict = {}
+    for name in list(models[0]):
+        tensors = [model[name] for model in models]
+        if len(tensors) == 1 or len(tensors[0].shape) == 1:
+            state_dict[name] = tensors[0]
+            continue
+        is_axis_1 = (
+            name.startswith('tok_embeddings.')
+            or name.endswith('.attention.wo.weight')
+            or name.endswith('.feed_forward.w2.weight')
+        )
+        axis = 1 if is_axis_1 else 0
+        state_dict[name] = torch.cat(tensors, dim=axis)
+        for model in models:
+            del model[name]
+    return state_dict
+
+
+def load_and_export(model_path, output_path):
+    params_path = os.path.join(model_path, 'params.json')
+    with open(params_path) as f:
+        params = json.load(f)
+        print(params)
+
+    model_paths = sorted(list(Path(model_path).glob('consolidated.*.pth')))
+    models = [torch.load(p, map_location='cpu') for p in model_paths]
+    state_dict = concat_weights(models)
+    del models
+    export(params, state_dict, output_path)
+
+
+if __name__ == '__main__':
+    if len(sys.argv) != 3:
+        print('[Llama model folder path] [output path]')
+        sys.exit(1)
+
+    model_path = sys.argv[1]
+    output_path = sys.argv[2]
+    load_and_export(model_path, output_path)
@@ -0,0 +1,113 @@
+"""
+This script exports the Llama 2 weights in llama2c.bin format.
+"""
+import os
+import sys
+import struct
+from pathlib import Path
+import json
+
+import torch
+
+from model import precompute_freqs_cis
+
+
+def export(p, state_dict, filepath='model.bin'):
+    """export the model weights in fp32 into .bin file to be read from C"""
+    f = open(filepath, 'wb')
+
+    def serialize(key):
+        print(f"writing {key}...")
+        t = state_dict[key].contiguous().view(-1).type(torch.float32).numpy()
+        f.write(memoryview(t))
+        del state_dict[key]
+
+    # first write out the header
+    hidden_dim = state_dict['model.layers.0.mlp.gate_proj.weight'].shape[0]
+    p['vocab_size'] = 32000
+    p['max_seq_len'] = 2048
+
+    n_kv_heads = p.get('n_kv_heads') or p['n_heads']
+    header = struct.pack(
+        'iiiiiii',
+        p['dim'], hidden_dim, p['n_layers'], p['n_heads'],
+        n_kv_heads, -p['vocab_size'], p['max_seq_len']
+    )
+    # NOTE: the negative vocab_size above signals that the classifier weights are present
+    # in the checkpoint and should be loaded.
+    f.write(header)
+
+    # next write out the embedding weights
+    print("writing tok_embeddings...")
+    serialize('model.embed_tokens.weight')
+
+    # now all the layers
+    # attention weights
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.input_layernorm.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.q_proj.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.k_proj.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.v_proj.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.o_proj.weight')
+    # ffn weights
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.post_attention_layernorm.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.mlp.gate_proj.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.mlp.down_proj.weight')
+    for i in range(p['n_layers']): serialize(f'model.layers.{i}.mlp.up_proj.weight')
+
+    # final rmsnorm
+    serialize('model.norm.weight')
+    # freqs_cos, freqs_sin
+    freqs_cos, freqs_sin = precompute_freqs_cis(p['dim'] // p['n_heads'], p['max_seq_len'] * 2)
+    state_dict['freqs_cos'] = freqs_cos[:p['max_seq_len']]
+    state_dict['freqs_sin'] = freqs_sin[:p['max_seq_len']]
+    # check if this requires additional conversion
+    serialize('freqs_cos')
+    serialize('freqs_sin')
+
+    # finally write the output weights
+    serialize('lm_head.weight')
+
+    f.close()
+    print(f"wrote {filepath}")
+
+
+def concat_weights(models):
+    state_dict = {}
+    for name in list(models[0]):
+        tensors = [model[name] for model in models]
+        if len(tensors) == 1 or len(tensors[0].shape) == 1:
+            state_dict[name] = tensors[0]
+            continue
+        is_axis_1 = (
+            name.startswith('model.embed_tokens.weight')
+            or name.endswith('.self_attn.o_proj.weight')
+            or name.endswith('.mlp.down_proj.weight')
+        )
+        axis = 1 if is_axis_1 else 0
+        state_dict[name] = torch.cat(tensors, dim=axis)
+        for model in models:
+            del model[name]
+    return state_dict
+
+
+def load_and_export(model_path, output_path):
+    params_path = os.path.join(model_path, 'params.json')
+    with open(params_path) as f:
+        params = json.load(f)
+        print(params)
+
+    model_paths = sorted(list(Path(model_path).glob('consolidated.*.pth')))
+    models = [torch.load(p, map_location='cpu') for p in model_paths]
+    state_dict = concat_weights(models)
+    del models
+    export(params, state_dict, output_path)
+
+
+if __name__ == '__main__':
+    if len(sys.argv) != 3:
+        print('[Llama model folder path] [output path]')
+        sys.exit(1)
+
+    model_path = sys.argv[1]
+    output_path = sys.argv[2]
+    load_and_export(model_path, output_path)
@@ -17,7 +17,6 @@ class ModelArgs:
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
    vocab_size: int = 32000
-    hidden_dim: Optional[int] = None
    multiple_of: int = 256  # MLP hidden layer size will be multiple of
    norm_eps: float = 1e-5
    max_seq_len: int = 2048
@@ -167,10 +166,8 @@ class Attention(nn.Module):
 class FeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, multiple_of: int, dropout: float):
        super().__init__()
-        if hidden_dim is None:
-            hidden_dim = 4 * dim
-            hidden_dim = int(2 * hidden_dim / 3)
-            hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
+        hidden_dim = int(2 * hidden_dim / 3)
+        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
@@ -189,7 +186,7 @@ class TransformerBlock(nn.Module):
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim,
-            hidden_dim=args.hidden_dim,
+            hidden_dim=4 * args.dim,
            multiple_of=args.multiple_of,
            dropout=args.dropout,
        )
@@ -341,3 +338,127 @@ class Transformer(nn.Module):
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
+
+    def export(self, filepath='model.bin'):
+        """export the model weights in Q8_0 into .bin file to be read from C"""
+        out_file = open(filepath, 'wb')
+
+        # find the largest group size that evenly divides dim, halving from the default
+        group_size = 64 # a good desired group size default
+        while self.params.dim % group_size != 0:
+            group_size //= 2
+        print(f"using group size {group_size} for quantization")
+
+        def serialize_fp32(t):
+            """ writes one fp32 tensor to file """
+            d = t.detach().cpu().view(-1).numpy().astype(np.float32)
+            b = struct.pack(f'{len(d)}f', *d)
+            out_file.write(b)
+
+        def serialize_int8(t):
+            """ writes one int8 tensor to file """
+            d = t.detach().cpu().view(-1).numpy().astype(np.int8)
+            b = struct.pack(f'{len(d)}b', *d)
+            out_file.write(b)
+
+        def quantize_q80(w):
+            """
+            takes a tensor and returns the Q8_0 quantized version
+            i.e. symmetric quantization into int8, range [-127,127]
+            """
+            assert w.numel() % group_size == 0
+            ori_shape = w.shape
+            w = w.float() # convert to float32
+            w = w.reshape(-1, group_size)
+            # find the max in each group
+            wmax = torch.abs(w).max(dim=1).values
+            # calculate the scaling factor such that float = quant * scale
+            scale = wmax / 127.0
+            # scale into range [-127, 127]
+            quant = w / scale[:,None]
+            # round to nearest integer
+            int8val = torch.round(quant).to(torch.int8)
+            # dequantize by rescaling
+            fp32val = (int8val.float() * scale[:,None]).view(-1)
+            fp32valr = fp32val.reshape(-1, group_size)
+            # calculate the max error in each group
+            err = torch.abs(fp32valr - w).max(dim=1).values
+            # find the max error across all groups
+            maxerr = err.max().item()
+            return int8val, scale, maxerr
+
+        # first write out the header. the header will be 256 bytes
+        nbytes = 0
+        # 1) write magic, which will be uint32 of "ak42" in ASCII
+        out_file.write(struct.pack('I', 0x616b3432))
+        nbytes += 4
+        # 2) write version, which will be int
+        out_file.write(struct.pack('i', 1))
+        nbytes += 4
+        # 3) write the params, which will be 7 ints
+        p = self.params
+        hidden_dim = self.layers[0].feed_forward.w1.weight.shape[0]
+        n_kv_heads = p.n_heads if p.n_kv_heads is None else p.n_kv_heads
+        header = struct.pack('iiiiiii', p.dim, hidden_dim, p.n_layers, p.n_heads,
+                                       n_kv_heads, p.vocab_size, p.max_seq_len)
+        out_file.write(header)
+        nbytes += 7*4
+        # 4) write some other flags
+        shared_classifier = 1 # we do share a classifier, write flag as a byte
+        out_file.write(struct.pack('B', shared_classifier))
+        nbytes += 1
+        out_file.write(struct.pack('i', group_size)) # group size used for quantization
+        nbytes += 4
+        pad = 256 - nbytes # pad the rest with zeros
+        assert pad >= 0
+        out_file.write(b'\0' * pad)
+        # now that the header is done, let's write out the model
+
+        # first let's write out all the params that we are keeping in fp32: the norms
+        for layer in self.layers: # attention norms
+            serialize_fp32(layer.attention_norm.weight)
+        for layer in self.layers: # MLP norms
+            serialize_fp32(layer.ffn_norm.weight)
+        serialize_fp32(self.norm.weight) # final pre-classifier norm
+
+        # now let's write out all the params that we are quantizing to Q8_0
+        # note we skip classifier weights, which are shared with the embedding
+        weights = [
+            self.tok_embeddings.weight,
+            *[layer.attention.wq.weight for layer in self.layers],
+            *[layer.attention.wk.weight for layer in self.layers],
+            *[layer.attention.wv.weight for layer in self.layers],
+            *[layer.attention.wo.weight for layer in self.layers],
+            *[layer.feed_forward.w1.weight for layer in self.layers],
+            *[layer.feed_forward.w2.weight for layer in self.layers],
+            *[layer.feed_forward.w3.weight for layer in self.layers],
+        ]
+
+        ew = []
+        scales = []
+        for i, w in enumerate(weights):
+            assert w.numel() % group_size == 0, f"weight {i} has numel {w.numel()}, not a multiple of group_size {group_size}"
+
+            # quantize this weight
+            q, s, err = quantize_q80(w)
+
+            # save to file
+            serialize_int8(q) # save the tensor in int8
+            scales.append(s)  # we'll do all the scales after all the qs
+
+            # logging
+            ew.append((err, w.shape))
+            print(f"{i+1}/{len(weights)} quantized {tuple(w.shape)} to Q8_0 with max error {err}")
+
+        # save the scaling factors in fp32 here
+        # this is done to keep all the weights contiguous, making pointer arithmetic easier in C
+        for s in scales:
+            serialize_fp32(s)
+
+        # print the highest error across all weights, should be very small, e.g. O(~0.001)
+        ew.sort(reverse=True)
+        print(f"max quantization group error across all weights: {ew[0][0]}")
+
+        # close the binary file
+        out_file.close()
+        print(f"wrote {filepath}")
@@ -52,7 +52,7 @@ if compile:
    model = torch.compile(model) # requires PyTorch 2.0 (optional)

 # load the tokenizer
-vocab_source = checkpoint_dict["config"].get("vocab_source", "llama2")
+vocab_source = checkpoint_dict.get("vocab_source", "llama2")
 vocab_size = gptconf.vocab_size
 if tokenizer:
    # a specific tokenizer is provided, use it
@@ -0,0 +1,66 @@
+#!/usr/bin/env python
+"""Saves the model as a TorchScript.
+
+Usage examples:
+    ./save_torchscript.py
+    ./save_torchscript.py --dim=300
+    ./save_torchscript.py --gzip_output=True --zero_params=True
+
+The resulting file can be loaded in C++ code and then used for training or
+inference with:
+    #include <torch/script.h>
+    torch::jit::Module module = torch::jit::load("model.pt")
+
+Note that the serialized model includes the initial parameters and with the default
+ModelArgs the file is 59M and gzips down to 55M. If you want to serialize/distribute
+the model parameters separately you can zero out the parameters before saving it and
+it will gzip down to 780K.
+"""
+import gzip
+import os
+import shutil
+from inspect import signature
+
+import torch
+
+from model import ModelArgs, Transformer
+
+# Model args config
+dim = 288
+n_layers = 6
+n_heads = 6
+n_kv_heads = n_heads
+multiple_of = 32
+max_seq_len = 256
+dropout = 0.0
+vocab_size = 32000
+norm_eps = 1e-5
+# Save config
+model_path = "model.pt"
+zero_params = False
+gzip_output = False
+# Allow config overrides
+exec(open("configurator.py").read())
+
+
+def main() -> None:
+    model_args = {k: globals()[k] for k in signature(ModelArgs).parameters}
+    model = Transformer(ModelArgs(**model_args))
+
+    # If requested zero params before saving the model. This is useful in
+    # conjunction with gzip_output.
+    if zero_params:
+        for p in model.parameters():
+            p.detach().zero_()
+
+    torch.jit.save(torch.jit.script(model), model_path)
+
+    if gzip_output:
+        with open(model_path, "rb") as f_in:
+            with gzip.open(f"{model_path}.gz", "wb") as f_out:
+                shutil.copyfileobj(f_in, f_out)
+        os.unlink(model_path)
+
+
+if __name__ == "__main__":
+    main()
@@ -1,81 +0,0 @@
-#define TESTING
-#include "run.c"
-
-void assert_eq(int a, int b) {
-    if (a != b) {
-        printf("Assertion failed: %d != %d\n", a, b);
-        exit(EXIT_FAILURE);
-    }
-}
-
-void test_prompt_encoding(Tokenizer* tokenizer, char* prompt, int* expected_tokens, int num_expected_tokens) {
-    // encode
-    int* prompt_tokens = (int*)malloc((strlen(prompt)+3) * sizeof(int));
-    int num_prompt_tokens = 0; // the total number of prompt tokens
-    encode(tokenizer, prompt, 1, 0, prompt_tokens, &num_prompt_tokens);
-
-    #if VERBOSITY == 1
-    // print maybe
-    printf("expected tokens:\n");
-    for (int i = 0; i < num_expected_tokens; i++) printf("%d ", expected_tokens[i]);
-    printf("\n");
-    printf("actual tokens:\n");
-    for (int i = 0; i < num_prompt_tokens; i++) printf("%d ", prompt_tokens[i]);
-    printf("\n");
-    #endif
-
-    // verify
-    assert_eq(num_prompt_tokens, num_expected_tokens);
-    for (int i = 0; i < num_prompt_tokens; i++) {
-        assert_eq(prompt_tokens[i], expected_tokens[i]);
-    }
-
-    #if VERBOSITY == 1
-    printf("OK\n");
-    printf("---\n");
-    #endif
-    free(prompt_tokens);
-}
-
-void test_prompt_encodings() {
-    // let's verify that the Tokenizer works as expected
-
-    char *tokenizer_path = "tokenizer.bin";
-    int vocab_size = 32000;
-    Tokenizer tokenizer;
-    build_tokenizer(&tokenizer, tokenizer_path, vocab_size);
-
-    // test 0 (test the empty string) (I added this as a simple case)
-    char *prompt0 = "";
-    int expected_tokens0[] = {1};
-    test_prompt_encoding(&tokenizer, prompt0, expected_tokens0, sizeof(expected_tokens0) / sizeof(int));
-
-    // the tests below are taken from the Meta Llama 2 repo example code
-    // https://github.com/facebookresearch/llama/blob/main/example_text_completion.py
-    // and the expected tokens come from me breaking in the debugger in Python
-
-    // test 1
-    char *prompt = "I believe the meaning of life is";
-    int expected_tokens[] = {1, 306, 4658, 278, 6593, 310, 2834, 338};
-    test_prompt_encoding(&tokenizer, prompt, expected_tokens, sizeof(expected_tokens) / sizeof(int));
-
-    // test 2
-    char* prompt2 = "Simply put, the theory of relativity states that ";
-    int expected_tokens2[] = {1, 3439, 17632, 1925, 29892, 278, 6368, 310, 14215, 537, 5922, 393, 29871};
-    test_prompt_encoding(&tokenizer, prompt2, expected_tokens2, sizeof(expected_tokens2) / sizeof(int));
-
-    // test 3
-    char* prompt3 = "A brief message congratulating the team on the launch:\n\n        Hi everyone,\n\n        I just ";
-    int expected_tokens3[] = {1, 319, 11473, 2643, 378, 629, 271, 18099, 278, 3815, 373, 278, 6826, 29901, 13, 13, 4706, 6324, 14332, 29892, 13, 13, 4706, 306, 925, 29871};
-    test_prompt_encoding(&tokenizer, prompt3, expected_tokens3, sizeof(expected_tokens3) / sizeof(int));
-
-    // test 4
-    char* prompt4 = "Translate English to French:\n\n        sea otter => loutre de mer\n        peppermint => menthe poivrée\n        plush girafe => girafe peluche\n        cheese =>";
-    int expected_tokens4[] = {1, 4103, 9632, 4223, 304, 5176, 29901, 13, 13, 4706, 7205, 4932, 357, 1149, 301, 449, 276, 316, 2778, 13, 4706, 1236, 407, 837, 524, 1149, 6042, 354, 772, 440, 29878, 1318, 13, 4706, 715, 1878, 330, 3055, 1725, 1149, 330, 3055, 1725, 4639, 28754, 13, 4706, 923, 968, 1149};
-    test_prompt_encoding(&tokenizer, prompt4, expected_tokens4, sizeof(expected_tokens4) / sizeof(int));
-}
-
-int main(int argc, char *argv[]) {
-    test_prompt_encodings();
-    printf("ALL OK\n");
-}
@@ -13,7 +13,6 @@ from functools import partial

 import numpy as np
 import requests
-import sentencepiece as spm
 import torch
 import torch.distributed as dist
 from tqdm import tqdm
@@ -98,21 +97,16 @@ def train_vocab(vocab_size):
                of.write(text + "\n")
    print(f"Size is: {os.path.getsize(tiny_file) / 1024 / 1024:.2f} MB")

-    # 2) train the sentencepiece model
-    print("Will now train the vocab...")
-    spm.SentencePieceTrainer.train(input=tiny_file,
-                                   model_prefix=prefix,
-                                   model_type="bpe",
-                                   vocab_size=vocab_size,
-                                   self_test_sample_size=0,
-                                   input_format="text",
-                                   character_coverage=1.0,
-                                   num_threads=os.cpu_count(),
-                                   split_digits=True,
-                                   allow_whitespace_only_pieces=True,
-                                   byte_fallback=True,
-                                   unk_surface=r" \342\201\207 ",
-                                   normalization_rule_name="identity")
+    # 2) run the train_vocab.sh script that trains the sentencepiece model
+    print("Will now train the vocab with:")
+    cmd = f"bash train_vocab.sh {tiny_file} {prefix} {vocab_size}"
+    print(cmd)
+    print("OK? [y/N] ")
+    dec = input()
+    if dec.lower() != "y":
+        print("Exiting...")
+        return
+    os.system(cmd)

    # 3) optional cleanup, ask the user if they'd like to delete tiny.txt
    dec = input(f"Delete the temporary file {tiny_file}? [y/N] ")
@@ -202,7 +196,6 @@ class PretokDataset(torch.utils.data.IterableDataset):
            shard_filenames = sorted(glob.glob(os.path.join(bin_dir, "*.bin")))
        # train/test split. let's use only shard 0 for test split, rest train
        shard_filenames = shard_filenames[1:] if self.split == "train" else shard_filenames[:1]
-        assert len(shard_filenames)>0, f"No bin files found in {bin_dir}"
        while True:
            rng.shuffle(shard_filenames)
            for shard in shard_filenames:
@@ -29,7 +29,6 @@ from torch.distributed import destroy_process_group, init_process_group
 from torch.nn.parallel import DistributedDataParallel as DDP

 from tinystories import Task
-from export import model_export

 # -----------------------------------------------------------------------------
 # I/O
@@ -271,7 +270,7 @@ while True:
                        "loss/val": losses["val"],
                        "lr": lr,
                        "mfu": running_mfu * 100,  # convert to percentage
-                    }, step = iter_num
+                    }
                )
            except Exception as e:
                print(f"logging to wandb failed: {e}")
@@ -288,7 +287,7 @@ while True:
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, "ckpt.pt"))
-                model_export(raw_model, os.path.join(out_dir, "model.bin"), version=0)
+                raw_model.export(os.path.join(out_dir, "model.bin"))
    if iter_num == 0 and eval_only:
        break
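The new `Transformer.export` quantizes weights to Q8_0 in groups, backing off from a group size of 64 until it divides `dim`. Below is a minimal sketch of that idea in plain Python (no torch), so the arithmetic is easy to follow. It is not the code from this diff: the helper names `pick_group_size` and `dequantize_q80` are made up for illustration, and the guard for an all-zero group is an assumption that the diff's version does not include.

```python
def pick_group_size(dim, desired=64):
    """Halve the desired group size until it evenly divides dim."""
    while dim % desired != 0:
        desired //= 2
    return desired


def quantize_q80(values, group_size):
    """Symmetric int8 quantization per group, so that float ~= quant * scale."""
    assert len(values) % group_size == 0
    quants, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        wmax = max(abs(v) for v in group)
        # Assumption (not in the diff): avoid dividing by zero for an all-zero group.
        scale = wmax / 127.0 if wmax > 0 else 1.0
        scales.append(scale)
        quants.extend(int(round(v / scale)) for v in group)
    return quants, scales


def dequantize_q80(quants, scales, group_size):
    """Rescale each int8 value by the scale of the group it belongs to."""
    return [q * scales[i // group_size] for i, q in enumerate(quants)]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    q, s = quantize_q80(weights, group_size=2)
    restored = dequantize_q80(q, s, group_size=2)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, s, max_err)
```

Since each value is rounded to the nearest integer multiple of its group's scale, the per-value error is bounded by half that scale, which is why the export logs a small maximum error for each tensor.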
Author	SHA1	Message	Date
Andrej Karpathy	039a9713c2	ok this first version works but i don't think is ready to merge, have to think on more	2023-08-18 15:44:02 +00:00
Andrej Karpathy	591f1353c7	ok this works but is super slow because we are doing all the work in fp32 still	2023-08-18 03:40:18 +00:00
Andrej Karpathy	e9cbe3e84f	small improvements to comments and warnings and increase header size during model export	2023-08-17 14:32:22 +00:00
Andrej Karpathy	5e2e5b28f4	re-write the model export to do int8 quantization in groups, with group size fallback, and also change the header to be much better	2023-08-17 05:56:20 +00:00
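The reworked header from these commits is a fixed 256 bytes: a uint32 magic, an int version, seven int params, a one-byte shared-classifier flag, an int group size, then zero padding. The sketch below packs and parses that layout with `struct` as a way to sanity-check a reader. It is an illustration, not the loader in run.c: `pack_header` and `parse_header` are hypothetical names, and the `=` format prefix assumes the file was written on a platform where native `int` is 4 bytes, since the export packs each field separately in native byte order.

```python
import struct

HEADER_SIZE = 256
# "=" means native byte order with standard sizes and no alignment padding:
# magic (I), version (i), 7 params (7i), shared_classifier (B), group_size (i)
HEADER_FORMAT = "=Ii7iBi"
MAGIC = 0x616B3432  # "ak42" in ASCII

PARAM_NAMES = (
    "dim", "hidden_dim", "n_layers", "n_heads",
    "n_kv_heads", "vocab_size", "max_seq_len",
    "shared_classifier", "group_size",
)


def pack_header(params, version=1):
    """Build a 256-byte header from a dict keyed by PARAM_NAMES."""
    body = struct.pack(HEADER_FORMAT, MAGIC, version,
                       *(params[name] for name in PARAM_NAMES))
    return body + b"\0" * (HEADER_SIZE - len(body))


def parse_header(buf):
    """Parse the first 256 bytes of an exported file into a dict."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("header is shorter than 256 bytes")
    fields = struct.unpack_from(HEADER_FORMAT, buf, 0)
    if fields[0] != MAGIC:
        raise ValueError("bad magic number")
    header = dict(zip(PARAM_NAMES, fields[2:]))
    header["version"] = fields[1]
    return header


if __name__ == "__main__":
    example = dict(dim=288, hidden_dim=768, n_layers=6, n_heads=6,
                   n_kv_heads=6, vocab_size=32000, max_seq_len=256,
                   shared_classifier=1, group_size=32)
    print(parse_header(pack_header(example)))
```

The fixed size leaves room to add fields later without moving the weights, which start at byte 256.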