86 Commits

Author SHA1 Message Date
Andrej 7325bab657 Merge pull request #365 from atamurad/patch-1
Update README.md - unclosed code block quotes
2023-08-26 20:11:04 -07:00
Atamurad Hezretkuliyev 37157bc0a3 Update README.md
Fixed unclosed code block quotes
2023-08-27 02:27:47 +03:00
Andrej Karpathy f4b8a81742 Merge branch 'master' of github.com:karpathy/llama2.c 2023-08-26 21:22:28 +00:00
Andrej Karpathy 91d57db925 add note on code llama being a bit wrong 2023-08-26 21:22:19 +00:00
Andrej f856539f41 Merge pull request #363 from byte-6174/patch-1
fix tinyllamas url
2023-08-26 14:13:20 -07:00
byte-6174 b5a0b65dbf fix tinyllamas url 2023-08-26 17:05:21 -04:00
Andrej 7b0017c6cd Merge pull request #362 from byte-6174/upmaster
freeing tokenizer in test.c
2023-08-26 14:03:31 -07:00
Andrej Karpathy 50832e3dff move script into the new docs folder 2023-08-26 21:02:23 +00:00
Andrej Karpathy 1386edfd90 add docs on stories260K 2023-08-26 20:52:49 +00:00
Aniket 32cecbfe4a freeing tokenizer in test.c 2023-08-26 16:35:50 -04:00
Andrej e47bacdc62 Merge pull request #355 from janimo/export-vocab-size
Export vocab size and Code Llama usage docs
2023-08-26 13:24:55 -07:00
Jani Monoses 604d3c59c0 Add Code Llama info 2023-08-26 22:36:09 +03:00
Jani Monoses 2c2b284988 Get vocab_size from token embeddings size 2023-08-26 22:35:55 +03:00
Andrej 49daf18f2f Merge pull request #343 from karpathy/feature/chat
Add interactive loop to enable nice chat with a Llama 2 Chat model
2023-08-25 08:00:11 -07:00
Andrej 4a7a62bd21 Merge branch 'master' into feature/chat 2023-08-25 07:58:33 -07:00
Andrej 5c6427e4d7 Merge pull request #352 from dmarcos/readmeTypo
Fix typo in README.md
2023-08-25 07:55:54 -07:00
Andrej cbc2488b82 Merge pull request #353 from photomz/master
Clearer WandB log step
2023-08-25 07:55:26 -07:00
Andrej Karpathy fbe324fc5a adjust things a bit 2023-08-25 14:54:05 +00:00
Markus Zhang 6def77d4ba Correct WandB log step 2023-08-25 17:12:29 +08:00
Diego Marcos Segura 19cfbeca71 Fix typo in README.md 2023-08-24 19:46:43 -07:00
Andrej d7cd98633d add todo item to add a PyTorch Engine 2023-08-24 09:04:52 -07:00
Andrej Karpathy 3d787b2463 ok getting closer, and manually verified correctness of the schema matching python. still some weirdness in the printing to chase down, and also have to tune the buffer lengths and make them sensible and such 2023-08-24 04:31:06 +00:00
Andrej Karpathy 40fb902cf0 fix chat format bug i think 2023-08-24 03:33:44 +00:00
Andrej Karpathy c7a26264a2 Merge branch 'master' of github.com:karpathy/llama2.c 2023-08-24 03:10:18 +00:00
Andrej Karpathy 446c1c0df3 Merge branch 'janimo-train-vocab-python' 2023-08-24 03:10:07 +00:00
Andrej Karpathy 096325b66c bring back num_threads 2023-08-24 03:09:55 +00:00
Andrej 90104db721 Merge pull request #348 from nehzata/clip_steps
Clip steps maximum value
2023-08-23 19:57:01 -07:00
Ali Nehzat 9bc72acab0 steps shouldn't exceed the model's seq_len either 2023-08-24 09:09:16 +10:00
Andrej Karpathy c5e0e7fce4 attempt at chat function, but it was 8AM and I didn't have coffee yet. Seems to work but it's probably subtly broken or too complex. version 1 only, lots of hard-coded non-sensical buffer sizes. Have to go to work now 2023-08-23 16:27:48 +00:00
Jani Monoses fe9b9f2f15 Train vocab in Python 2023-08-23 19:10:28 +03:00
Andrej Karpathy 7ac65cb2c2 make decode safer and fix issue with skipping bad byte tokens 2023-08-23 01:08:31 +00:00
Andrej Karpathy 4b3e66021a lol text 2023-08-23 00:26:47 +00:00
Andrej Karpathy d1eb18b8ec add BOS and EOS function to the Tokenizer as we start to converge closer to the Llama 2 code from Meta, and as we're about to add the Chat capability 2023-08-23 00:08:22 +00:00
Andrej Karpathy d26a499207 absorb our rng state into the Sampler. I feel that this is correct because it makes our use of entropy very explicit and localized, and the sampler is now well-contained without any global state. Code is increasingly more beautiful. 2023-08-22 03:22:56 +00:00
Andrej Karpathy ac6cf8d6e8 tweak todo list 2023-08-22 02:48:51 +00:00
Andrej Karpathy ad7a1ef525 clean up swiglu a little bit 2023-08-22 02:32:21 +00:00
Andrej Karpathy 0e362f735f and finallygit add run.c split off the generate function. alongside it will come a chat function. we are close 2023-08-22 02:22:36 +00:00
Andrej Karpathy d73b917d3b hide temperature and topp into the sampler, it's a little bit less flexible but a little bit more cleaner 2023-08-22 02:17:51 +00:00
Andrej Karpathy 379f083b85 make sorted vocab a buffer of Tokenizer 2023-08-22 01:56:51 +00:00
Andrej 5eaca535cd Merge pull request #335 from ozabluda/ozabluda-patch-5
Remove unneeded check of free(NULL)
2023-08-21 18:16:07 -07:00
Andrej Karpathy 83287ff254 fix steps=0 is max context 2023-08-22 01:15:00 +00:00
Oleg Zabluda c2834c8a1f Remove unneeded check of free(NULL)
Passing NULL to free() is totally allowed
2023-08-21 10:54:53 -07:00
Andrej ee95b1bf29 Merge pull request #315 from davidar/vocab_source
Fix vocab_source in sample.py
2023-08-21 08:26:28 -07:00
Andrej Karpathy d02e0c90d8 Merge branch 'rdentato-patch-check-params' 2023-08-21 15:17:37 +00:00
Andrej Karpathy 33d94f60a5 parameter validation cleanup 2023-08-21 15:17:14 +00:00
Remo Dentato 2d972f1763 Merge branch 'karpathy:master' into patch-check-params 2023-08-21 17:02:42 +02:00
Andrej 8a3ea7b433 Merge pull request #329 from atamurad/import_meta
Moved export_meta_llama_bin.py to new export.py
2023-08-21 07:34:32 -07:00
atamyrat 61c26d5392 Updated README to replace export_meta_llama_bin.py script with export.py 2023-08-21 14:24:01 +03:00
atamyrat 36a78af5e1 tested load_meta_model() in export.py, deleting old export_meta_llama_bin.py file 2023-08-21 14:19:56 +03:00
atamyrat de005474d3 Added load_meta_model() to export.py 2023-08-21 14:13:47 +03:00
rdentato 4444575c4e Added check of generation parameters. 2023-08-21 06:43:39 +00:00
Andrej Karpathy dd61b13e57 delete the save_torchscript export file, but copy its content to the new export.py for the future maybe 2023-08-21 05:09:06 +00:00
Andrej Karpathy ea44f53568 now that the export.py HF functionality is in master, we can delete this file, and update the readme 2023-08-21 04:58:19 +00:00
Andrej 801c68f5a1 Merge pull request #326 from atamurad/import_hf
Added huggingface model loader/importer to export.py
2023-08-20 21:53:17 -07:00
Andrej 74a68eeb35 Merge pull request #325 from HarryGifford/users/hegi/update-readme-threading
Update readme with suggestion on number of threads to use
2023-08-20 21:50:26 -07:00
Andrej Karpathy 288b3cec09 remove dagger in the eyeball 2023-08-21 04:47:49 +00:00
Andrej Karpathy 14275bd623 minor clean. i think a lot of chaos has been reduced for today. we shall now rest. 2023-08-21 04:43:24 +00:00
Andrej Karpathy 3868f732a4 and finally refactor the Sampler. things are starting to look a lot cleaner I think 2023-08-21 04:23:02 +00:00
Andrej Karpathy 8a377a1d31 refactor the Transformer (Config, Weights, RunState) into a single object, with build and free too 2023-08-21 03:55:12 +00:00
Andrej Karpathy ae2e4f8d88 name the tokenizer methods cleaner: encode and decode 2023-08-21 03:11:54 +00:00
atamyrat 0dd82158f6 removed transformers from requirements.txt, added error message 2023-08-21 06:07:29 +03:00
atamyrat 155475a523 Fix WQ and WK permutation in huggingface models 2023-08-21 05:16:11 +03:00
atamyrat d7704bdeaa mark ModelArgs.hidden_dim as optional and calculate as previously if not provided 2023-08-21 03:40:34 +03:00
atamyrat 09db52c69e Added huggingface model loader to export.py 2023-08-21 02:59:12 +03:00
Harry Gifford a72b3b0206 Update readme with suggestion on number of threads to use
Update the documentation to make suggestions on the number of threads. The performance difference can be very large. Also linked to the PyTorch docs which are relevant here.
2023-08-20 15:01:33 -07:00
Andrej Karpathy c74456f3f0 refactor step 1. the tokenizer, and all the other abstractions, are a total mess, refactoring things a bit 2023-08-20 18:18:23 +00:00
Andrej Karpathy 1e335a41cf remove freq_cis fields as they are not used anymore 2023-08-20 17:26:43 +00:00
Andrej Karpathy c0511de617 probindex should never have been part of RunState. i apologize for this failure of abstraction 2023-08-20 17:18:06 +00:00
Andrej 8c93c7a30e Merge pull request #322 from karpathy/feature/export
New model export (the code remains "dead" and legacy version is still the default behavior, so no breaking changes are introduced). The major benefit is a new export.py file, which we can use to centralize work on formatting: both imports and exports.
2023-08-20 10:08:32 -07:00
Andrej Karpathy 13dcee493a todos update 2023-08-20 17:02:22 +00:00
Andrej Karpathy f3db92a2dc use out_file.tell() instead of nbytes += arithmetic 2023-08-20 16:51:35 +00:00
Andrej Karpathy fa8dfd854e isolate read_checkpoint, because i'd like to now make it support both version 0 and version 1 2023-08-19 19:21:12 +00:00
Andrej Karpathy 4df5e2e939 make version 1 be the legacy export but with new header. version 2 will be Q8_0 export 2023-08-19 18:51:32 +00:00
Andrej Karpathy 4212bd6d43 oops fix double indent on quantize def 2023-08-19 18:34:49 +00:00
Andrej Karpathy 7f551dbfd7 new model export: versions 0 (legacy) and 1 2023-08-19 18:25:20 +00:00
Andrej 6c5d78fa41 Merge pull request #317 from yiminghan/yhan/old
Add a link to Dart port in README
2023-08-19 10:01:08 -07:00
Andrej db1a722816 Merge pull request #318 from rahoua/master
YARP - Yet Another Rust Port in README.md
2023-08-19 10:00:56 -07:00
Andrej d2a546c577 Merge pull request #319 from RahulSChand/warning
Give better error message in Tinystories data loader
2023-08-19 10:00:27 -07:00
rahulschand fbefeec1b1 add assert message to give better warning 2023-08-19 13:05:26 +05:30
rahoua 978c311b30 Add pecca-rs to README.md 2023-08-18 14:58:21 -07:00
YiMing Han 882e480bc0 update read me 2023-08-18 15:18:29 -04:00
YiMing Han d09ebbb32b Revert "working one"
This reverts commit 8607b11ea1.
2023-08-18 15:14:08 -04:00
YiMing Han bc7cb7d0e8 Revert "only dart"
This reverts commit 01df3731d6.
2023-08-18 15:13:59 -04:00
YiMing Han 01df3731d6 only dart 2023-08-18 15:09:24 -04:00
YiMing Han 8607b11ea1 working one 2023-08-18 15:07:41 -04:00
David A Roberts 52fe3653e5 Fix vocab_source in sample.py 2023-08-18 18:40:25 +10:00
15 changed files with 1258 additions and 715 deletions
+8
View File
@@ -55,6 +55,14 @@ test:
testc:
pytest -k runc
# run the C tests, without touching pytest / python
# to increase verbosity level run e.g. as `make testcc VERBOSITY=1`
VERBOSITY ?= 0
.PHONY: testcc
testcc:
$(CC) -DVERBOSITY=$(VERBOSITY) -O3 -o testc test.c -lm
./testc
.PHONY: clean
clean:
rm -f run
+56 -13
View File
@@ -8,7 +8,7 @@ Train the Llama 2 LLM architecture in PyTorch then inference it with one simple
As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B. Work on model quantization is currently ongoing.
Please note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. Compred to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.
Please note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. Compared to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.
## feel the magic
@@ -65,13 +65,13 @@ Quick note on sampling, the recommendation for ~best results is to sample with `
## Meta's Llama 2 models
As the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 2 checkpoints by following the [Meta instructions](https://github.com/facebookresearch/llama). Once we have those checkpoints, we have to convert them into the llama2.c format.
For this we need to install the python dependencies (`pip install -r requirements.txt`) and then use the `export_meta_llama_bin.py` file, e.g. for 7B model:
For this we need to install the python dependencies (`pip install -r requirements.txt`) and then use the `export.py` file, e.g. for 7B model:
```bash
python export_meta_llama_bin.py path/to/llama/model/7B llama2_7b.bin
python export.py llama2_7b.bin --meta-llama path/to/llama/model/7B
```
The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that despite efforts, the 13B export currently doesn't work for unknown reasons (accepting PRs for fix). We can run the model as normal:
The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that despite efforts. I would not attempt to run anything above 7B right now for two reasons: first, 13B+ currently doesn't work because of integer flow in pointer arithmetic, which is yet to be fixed, and second, even if it were fixed, this repo is doing float32 inference right now, so it would be fairly unusably slow. Once the export is done, we can run it:
```bash
./run llama2_7b.bin
@@ -83,6 +83,38 @@ This ran at about 4 tokens/s compiled with [OpenMP](#OpenMP) on 96 threads on my
base models... ¯\\_(ツ)_/¯. Since we can inference the base model, it should be possible to also inference the chat model quite easily, and have a conversation with it. And if we can find a way to run 7B more efficiently, we can start adding LoRA to our training script, and going wild with finetunes all within the repo!
You can also chat with the Llama Chat models. Export the chat model exactly as above:
```bash
python export.py llama2_7b_chat.bin --meta-llama /path/to/7B-chat
```
Then chat with it by specifying the chat mode using the `-m` flag, e.g.:
```bash
./run llama2_7b_chat.bin -m chat
```
You can also try Meta's Code Llama models even if support for them is incomplete. In particular, some hyperparameters changed (e.g. the constant in RoPE layer), so the inference is not exactly correct and a bit buggy right now. Looking into fixes. Make sure to build the tokenizer for the plain and instruct variants and pass it when doing inference.
```bash
python export.py codellama2_7b.bin --meta-llama /path/to/CodeLlama-7b
python tokenizer.py --tokenizer-model=/path/to/CodeLlama-7b/tokenizer.model
./run codellama2_7b.bin -z /path/to/CodeLlama-7b/tokenizer.bin
```
Chat with Code Llama Instruct:
```bash
python export.py codellama2_7b_instruct.bin --meta-llama /path/to/CodeLlama-7b-Instruct
python tokenizer.py --tokenizer-model=/path/to/CodeLlama-7b-Instruct/tokenizer.model
./run codellama2_7b_instruct.bin -m chat -z /path/to/CodeLlama-7b-Instruct/tokenizer.bin
```
## hugginface models
We can load any huggingface models that use the Llama 2 architecture. See the script [export.py](export.py) and the `--hf` flag to export the model .bin file.
## models
For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). The 110M took around 24 hours. I am hosting them on huggingface hub [tinyllamas](https://huggingface.co/karpathy/tinyllamas), both in the original PyTorch .pt, and also in the llama2.c format .bin:
@@ -159,7 +191,7 @@ python tinystories.py train_vocab --vocab_size=4096
python tinystories.py pretokenize --vocab_size=4096
```
The `train_vocab` stage will call the `train_vocab.sh` script, which calls the `sentencepiece` library to train the tokenizer, storing it in a new file `data/tok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size.
The `train_vocab` stage will call the `sentencepiece` library to train the tokenizer, storing it in a new file `data/tok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size.
A quick note of interest is that vocab size of 4096 trained specifically on tinystories creates integer sequences with about the same sequence length per example as the default Llama 2 tokenizer of 32000 tokens! This means that our custom, tailored tokenizer is a lot better adapted to our specific text, and can compress it very effectively. So our trained models are smaller and faster.
@@ -203,8 +235,7 @@ You can also experiment with replacing `gcc` with `clang`.
If compiling with gcc, try experimenting with `-funroll-all-loops`, see PR [#183](https://github.com/karpathy/llama2.c/pull/183)
### OpenMP
Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors.
**OpenMP**. Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors.
You'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on ubuntu). Then you can compile with `make runomp`, which does:
```bash
@@ -217,7 +248,8 @@ When you run inference make sure to use OpenMP flags to set the number of thread
OMP_NUM_THREADS=4 ./run out/model.bin
```
Depending on your system resources you may want to tweak these hyperparameters and use more threads. But more is not always better, usually this is a bit U shaped.
Depending on your system resources you may want to tweak these hyperparameters and use more threads. But more is not always better, usually this is a bit U shaped. In particular, if your CPU has SMT (multithreading), try setting the number of threads to the number of physical cores rather than logical cores. The performance difference can be large due to cache thrashing and communication overhead. The PyTorch documentation [CPU specific optimizations
](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#cpu-specific-optimizations) has some good information that applies here too.
## platforms
@@ -238,6 +270,14 @@ $ pytest
This will currently invoke two tests inside `test_all.py`, which forward the model in both C and Python for 200 steps and check the output against a known good expected output. The tests currently run in only a few seconds, but will have to download and cache the stories260K models in a temporary `test` directory (only ~2MB download).
There are also some tests in C, in the file [test.c](test.c). You can run these with `make testcc`, or to see more stuff printed:
```
make testcc VERBOSITY=1
```
Call for help: help add more tests.
## ack
I trained the llama2.c storyteller models on a 4X A100 40GB box graciously provided by the excellent [Lambda labs](https://lambdalabs.com/service/gpu-cloud), thank you.
@@ -271,6 +311,7 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
- [llama2.rs](https://github.com/leo-du/llama2.rs) by @[leo-du](https://github.com/leo-du): A Rust port of this project
- [llama2-rs](https://github.com/danielgrittner/llama2-rs) by @[danielgrittner](https://github.com/danielgrittner): a Rust port of this project
- [llama2.rs](https://github.com/lintian06/llama2.rs) by @[lintian06](https://github.com/lintian06): A Rust port of this project
- [pecca.rs](https://github.com/rahoua/pecca-rs) by @[rahoua](https://github.com/rahoua): A Rust port leveraging [ndarray](https://github.com/rust-ndarray/ndarray), supports BLAS.
- Go
- [go-llama2](https://github.com/tmc/go-llama2) by @[tmc](https://github.com/tmc): a Go port of this project
- [llama2.go](https://github.com/nikolaydubina/llama2.go) by @[nikolaydubina](https://github.com/nikolaydubina): a Go port of this project
@@ -301,6 +342,8 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
- [llama2.py](https://github.com/tairov/llama2.py) by @[tairov](https://github.com/tairov): a simple one file pure Python port of this project with zero dependencies
- C#
- [llama2.cs](https://github.com/trrahul/llama2.cs) by @[trrahul](https://github.com/trrahul): a C# port of this project
- Dart
- [llama2.dart](https://github.com/yiminghan/llama2.dart) by @[yiminghan](https://github.com/yiminghan/llama2.dart): one-file dart port of this project, works with Flutter!
- WebAssembly
- [icpp-llm](https://github.com/icppWorld/icpp-llm): LLMs for the Internet Computer
- [llama2.c - Llama 2 Everywhere](https://github.com/trholding/llama2.c) by @[trholding](https://github.com/trholding): Standalone, Bootable & Portable Binary Llama 2
@@ -308,12 +351,12 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
## unsorted todos
- add support in run.c of reading version 1+ files from export, later deprecate "version 0"
- runq.c (int8 quantization) add
- run.cu (CUDA) investigate and merge
- add more tests inside [test.c](test.c)
- add Engine class for use in sample.py that does efficient inference in PyTorch, e.g. KV cache keeping
- make it easier to add a new dataset with not too much pain
- should calculate freq_cis online in the script run.c instead of loading them
- int4/8 quantization
- export the model in a more sensible output format with a proper header, etc.
- support Llama 2 7B Chat models and tune run.c to Chat UI/UX
- llama2.cu investigate and merge
- (LoRA) finetuning and export of Llama 2 models
## License
+58
View File
@@ -0,0 +1,58 @@
# stories260K
[Stories260K huggginface link](https://huggingface.co/karpathy/tinyllamas)
The 260K model is a tiny model used for testing, and was trained as follows:
```
python train.py \
--out_dir="outmini" \
--batch_size=128 \
--max_seq_len=512 \
--gradient_accumulation_steps=1 \
--vocab_source="custom" \
--vocab_size=512 \
--dim=64 \
--n_layers=5 \
--n_heads=8 \
--n_kv_heads=4 \
--multiple_of=4 \
--learning_rate=1e-3 \
--dropout=0.05 \
--weight_decay=0.01 \
--max_iters=100000 \
--beta2=0.99 \
--warmup_iters=1000 \
--eval_interval=2000 \
--eval_iters=100 \
--compile=True
```
You'll notice that `n_kv_heads` is 4 while `n_heads` is 8, so two heads at a time share their key,value projections, i.e. this model is 2X multiquery. You'll also notice that we're using a custom tokenizer with 512 tokens. The model trained for ~10 minutes (?) on my A100 and achieves validation loss of 1.2968.
Sampling this model at temperature 0.0 (i.e. deterministic greedy argmax sampling) gives:
```
$ ./run stories260K/stories260K.bin -z stories260K/tok512.bin -t 0.0
Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, she saw a big, red ball. She wanted to play with it, but it was too high.
Lily's mom said, "Lily, let's go to the park." Lily was sad and didn't know what to do. She said, "I want to play with your ball, but I can't find it."
Lily was sad and didn't know what to do. She said, "I'm sorry, Lily. I didn't know what to do."
Lily didn't want to help her mom, so she said, "I'm sorry, mom. I didn't know what to do." Her mom said, "Don't worry, Lily. We can help you.
```
You can reproduce the same in Python by running `sample.py`:
```
$ python sample.py --checkpoint=stories260K/stories260K.pt --tokenizer=stories260K/tok512.model --temperature=0.0 --max_new_tokens=257
```
I hardcoded max tokens to be 257 manually because the `sample.py` script doesn't currently terminate on the special BOS token like the run.c script does. Sampling at 1.0 with topp of 0.9 gives a bit more reasonable samples:
```
$ ./run stories260K/stories260K.bin -z stories260K/tok512.bin -t 1.0 -p 0.9 -s 133742
Once upon a time, there was a little boy named Timmy. Timmy loved to play with his toys and eat sandwiches. One day, Timmy's mom told him it was time to rest for a while. Timmy's friend Billy came over and took him a down.
Timmy's mom saw that Timmy was sad, but Timmy said, "I didn't understand what is it! We need to find some leafs." Timmy thought about it and took a deep breath on a spoon. He hoped it was important to be kind and continued to find its image next time.
After they finished getting, Timmy's dad came up to his house and promised to help Timmy.
```
Hey you can't expect too much from a 260K parameter model. I'm even mildly shocked we get this far :D
+99
View File
@@ -0,0 +1,99 @@
# training llama tokenizer
How does Meta train their sentencepiece tokenizer? You can print the config as follows:
```python
import sentencepiece.sentencepiece_model_pb2
mp = sentencepiece.sentencepiece_model_pb2.ModelProto()
mp.ParseFromString(open("tokenizer.model", "rb").read())
print(mp.trainer_spec)
print(mp.normalizer_spec)
```
this gives:
```
trainer_spec {
input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
model_type: BPE
vocab_size: 32000
self_test_sample_size: 0
input_format: "text"
character_coverage: 0.9999499917030334
input_sentence_size: 200000000
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
num_threads: 80
num_sub_iterations: 2
max_sentence_length: 4192
shuffle_input_sentence: true
max_sentencepiece_length: 16
split_by_unicode_script: true
split_by_whitespace: true
split_by_number: true
treat_whitespace_as_suffix: false
split_digits: true
allow_whitespace_only_pieces: true
vocabulary_output_piece_score: true
hard_vocab_limit: true
use_all_vocab: false
byte_fallback: true
required_chars: ""
unk_id: 0
bos_id: 1
eos_id: 2
pad_id: -1
unk_surface: " \342\201\207 "
unk_piece: "<unk>"
bos_piece: "<s>"
eos_piece: "</s>"
pad_piece: "<pad>"
train_extremely_large_corpus: false
enable_differential_privacy: false
differential_privacy_noise_level: 0.0
differential_privacy_clipping_threshold: 0
}
normalizer_spec {
name: "identity"
precompiled_charsmap: ""
add_dummy_prefix: true
remove_extra_whitespaces: false
normalization_rule_tsv: ""
}
```
We can use the sentencepiece spm_train to train the same models, but optionally smaller. Here are their [options docs](https://github.com/google/sentencepiece/blob/master/doc/options.md) we can refer to. It's not much but it helps.
We'll depart on one setting, I recommend changing `character_coverage` -> 1.0. We also want to make sure to note the following important settings that come up in the paper and are not necessarily the default sentencepiece settings:
```
--split-digits = true
--allow_whitespace_only_pieces = true
--byte_fallback = true
--normalization_rule_name = identity
```
With this in mind we can train a sentencepiece vocab in what I believe is probably the same to how Meta trained theirs as:
```
spm_train --input="$input" \
--model_prefix="$model_prefix" \
--model_type=bpe \
--vocab_size="$vocab_size" \
--self_test_sample_size=0 \
--input_format="text" \
--character_coverage=1.0 \
--num_threads="$(nproc)" \
--split_digits=true \
--allow_whitespace_only_pieces=true \
--byte_fallback=true \
--unk_surface=" \342\201\207 " \
--normalization_rule_name=identity \
```
Where $input is the input file, $model_prefix is the output path prefix, vocab_size is the desired vocab, and we're by default taking over the CPU resources of the machine.
Lastly note that sentencepiece is weird and expects "sentences" delimited by newlines as the input. You can't just put in a massive block of text. And they have a hyperparameter that constols the maximum size of a "sentence". Fwiw I really dislike this design choice around a weird concept of a "sentence". It should just be block of text with no assumptions. But here we are.
Look into the file `tinystories.py` where we train the vocab in the same way, but using Python bindings instead.
+471
View File
@@ -0,0 +1,471 @@
"""
This script has functions and utilties for model export.
Basically, we have a bunch of versions of the model, and we
want to export them to .bin files to be read from and inferenced in C.
Among the "input" versions of PyTorch files/models:
- Official Llama 2 weights released by Meta
- Huggingface weights available on the hub
- llama2.c (this repo) trained models
Among the "output" versions of .bin files:
- v0: Legacy files of the original llama2.c repo (will eventually be DEPRECATED)
- v1-vN: Improved .bin files with a proper header, cache alignment, etc.
This script aspires to provide all of these conversions.
"""
import os
import gzip
import shutil
import struct
import argparse
import json
from pathlib import Path
import numpy as np
import torch
from torch import nn
from model import ModelArgs, Transformer
# -----------------------------------------------------------------------------
# common utilities
def serialize_fp32(file, tensor):
""" writes one fp32 tensor to file that is open in wb mode """
d = tensor.detach().cpu().view(-1).to(torch.float32).numpy()
b = struct.pack(f'{len(d)}f', *d)
file.write(b)
def serialize_int8(file, tensor):
""" writes one int8 tensor to file that is open in wb mode """
d = tensor.detach().cpu().view(-1).numpy().astype(np.int8)
b = struct.pack(f'{len(d)}b', *d)
file.write(b)
def quantize_q80(w, group_size):
"""
takes a tensor and returns the Q8_0 quantized version
i.e. symmetric quantization into int8, range [-127,127]
"""
assert w.numel() % group_size == 0
ori_shape = w.shape
w = w.float() # convert to float32
w = w.reshape(-1, group_size)
# find the max in each group
wmax = torch.abs(w).max(dim=1).values
# calculate the scaling factor such that float = quant * scale
scale = wmax / 127.0
# scale into range [-127, 127]
quant = w / scale[:,None]
# round to nearest integer
int8val = torch.round(quant).to(torch.int8)
# dequantize by rescaling
fp32val = (int8val.float() * scale[:,None]).view(-1)
fp32valr = fp32val.reshape(-1, group_size)
# calculate the max error in each group
err = torch.abs(fp32valr - w).max(dim=1).values
# find the max error across all groups
maxerr = err.max().item()
return int8val, scale, maxerr
# -----------------------------------------------------------------------------
# legacy
def legacy_export(model, filepath):
""" Original export of llama2.c bin files, i.e. version v0 """
out_file = open(filepath, 'wb')
# first write out the header
hidden_dim = model.layers[0].feed_forward.w1.weight.shape[0]
p = model.params
shared_classifier = torch.equal(model.tok_embeddings.weight, model.output.weight)
# legacy format uses negative/positive vocab size as a shared classifier flag
if not shared_classifier:
p.vocab_size = -p.vocab_size
n_kv_heads = p.n_heads if p.n_kv_heads is None else p.n_kv_heads
header = struct.pack('iiiiiii', p.dim, hidden_dim, p.n_layers, p.n_heads,
n_kv_heads, p.vocab_size, p.max_seq_len)
out_file.write(header)
# next write out the embedding weights
serialize_fp32(out_file, model.tok_embeddings.weight)
# now all the layers
# attention weights
for layer in model.layers:
serialize_fp32(out_file, layer.attention_norm.weight)
for layer in model.layers:
serialize_fp32(out_file, layer.attention.wq.weight)
for layer in model.layers:
serialize_fp32(out_file, layer.attention.wk.weight)
for layer in model.layers:
serialize_fp32(out_file, layer.attention.wv.weight)
for layer in model.layers:
serialize_fp32(out_file, layer.attention.wo.weight)
# ffn weights
for layer in model.layers:
serialize_fp32(out_file, layer.ffn_norm.weight)
for layer in model.layers:
serialize_fp32(out_file, layer.feed_forward.w1.weight)
for layer in model.layers:
serialize_fp32(out_file, layer.feed_forward.w2.weight)
for layer in model.layers:
serialize_fp32(out_file, layer.feed_forward.w3.weight)
# final rmsnorm
serialize_fp32(out_file, model.norm.weight)
# freqs_cis
serialize_fp32(out_file, model.freqs_cos[:p.max_seq_len])
serialize_fp32(out_file, model.freqs_sin[:p.max_seq_len])
# final classifier weights
if not shared_classifier:
serialize_fp32(out_file, model.output.weight)
# write to binary file
out_file.close()
print(f"wrote {filepath}")
# -----------------------------------------------------------------------------
# new version
def version1_export(model, filepath):
"""
Export the model weights in full float32 .bin file to be read from C.
This is same as legacy_export, but with a proper header.
"""
version = 1
out_file = open(filepath, 'wb')
# first write out the header. the header will be 256 bytes
# 1) write magic, which will be uint32 of "ak42" in ASCII
out_file.write(struct.pack('I', 0x616b3432))
# 2) write version, which will be int
out_file.write(struct.pack('i', version))
# 3) write the params, which will be 7 ints
p = model.params
hidden_dim = model.layers[0].feed_forward.w1.weight.shape[0]
n_kv_heads = p.n_heads if p.n_kv_heads is None else p.n_kv_heads
header = struct.pack('iiiiiii', p.dim, hidden_dim, p.n_layers, p.n_heads,
n_kv_heads, p.vocab_size, p.max_seq_len)
out_file.write(header)
# 4) write some other flags
shared_classifier = torch.equal(model.tok_embeddings.weight, model.output.weight)
out_file.write(struct.pack('B', int(shared_classifier)))
pad = 256 - out_file.tell() # pad rest with zeros; tell returns current pos
assert pad >= 0
out_file.write(b'\0' * pad)
# now let's write out all the params
weights = [
*[layer.attention_norm.weight for layer in model.layers],
*[layer.ffn_norm.weight for layer in model.layers],
model.norm.weight,
model.tok_embeddings.weight,
*[layer.attention.wq.weight for layer in model.layers],
*[layer.attention.wk.weight for layer in model.layers],
*[layer.attention.wv.weight for layer in model.layers],
*[layer.attention.wo.weight for layer in model.layers],
*[layer.feed_forward.w1.weight for layer in model.layers],
*[layer.feed_forward.w2.weight for layer in model.layers],
*[layer.feed_forward.w3.weight for layer in model.layers],
]
if not shared_classifier:
weights.append(model.output.weight)
for w in weights:
serialize_fp32(out_file, w)
# write to binary file
out_file.close()
print(f"wrote {filepath}")
def version2_export(model, filepath, group_size=64):
"""
Export the model weights in Q8_0 into .bin file to be read from C.
That is:
- quantize all weights to symmetric int8, in range [-127, 127]
- all other tensors (the rmsnorm params) are kept and exported in fp32
- quantization is done in groups of group_size to reduce the effects of any outliers
"""
version = 2
# let's first do some validation for this export type
while model.params.dim % group_size != 0:
group_size //= 2
print(f"BACKOFF: reducing group size to {group_size} to fit hidden_dim")
weights = [
model.tok_embeddings.weight,
*[layer.attention.wq.weight for layer in model.layers],
*[layer.attention.wk.weight for layer in model.layers],
*[layer.attention.wv.weight for layer in model.layers],
*[layer.attention.wo.weight for layer in model.layers],
*[layer.feed_forward.w1.weight for layer in model.layers],
*[layer.feed_forward.w2.weight for layer in model.layers],
*[layer.feed_forward.w3.weight for layer in model.layers],
]
shared_classifier = torch.equal(model.tok_embeddings.weight, model.output.weight)
if not shared_classifier:
weights.append(model.output.weight)
for w in weights:
assert w.numel() % group_size == 0, f"weight {i} has numel {w.numel()}, not a multiple of group_size {group_size}"
# write
out_file = open(filepath, 'wb')
# first write out the header. the header will be 256 bytes
# 1) write magic, which will be uint32 of "ak42" in ASCII
out_file.write(struct.pack('I', 0x616b3432))
# 2) write version, which will be int
out_file.write(struct.pack('i', version))
# 3) write the params, which will be 7 ints
p = model.params
hidden_dim = model.layers[0].feed_forward.w1.weight.shape[0]
n_kv_heads = p.n_heads if p.n_kv_heads is None else p.n_kv_heads
header = struct.pack('iiiiiii', p.dim, hidden_dim, p.n_layers, p.n_heads,
n_kv_heads, p.vocab_size, p.max_seq_len)
out_file.write(header)
# 4) write some other flags
out_file.write(struct.pack('B', int(shared_classifier)))
out_file.write(struct.pack('i', group_size)) # group size used for quantization
pad = 256 - out_file.tell() # pad rest with zeros; tell returns current pos
assert pad >= 0
out_file.write(b'\0' * pad)
# now that the header is done, let's write out the model
# first let's write out all the params that we are keeping in fp32: the norms
for layer in model.layers: # attention norms
serialize_fp32(out_file, layer.attention_norm.weight)
for layer in model.layers: # MLP norms
serialize_fp32(out_file, layer.ffn_norm.weight)
serialize_fp32(out_file, model.norm.weight) # final pre-classifier norm
# now let's write out all the params that we are quantizing to Q8_0
# note we skip classifier weights, which are shared with the embedding
ew = []
scales = []
for i, w in enumerate(weights):
# quantize this weight
q, s, err = quantize_q80(w, group_size)
# save the int8 weights to file
serialize_int8(out_file, q) # save the tensor in int8
scales.append(s) # we'll do all the scales after all the qs
# logging
ew.append((err, w.shape))
print(f"{i+1}/{len(weights)} quantized {tuple(w.shape)} to Q8_0 with max error {err}")
# save the scaling factors in fp32 here
# this is done to keep all the weights contiquous, making pointer arithmetic easier in C
for s in scales:
serialize_fp32(out_file, s)
# print the highest error across all weights, should be very small, e.g. O(~0.001)
ew.sort(reverse=True)
print(f"max quantization group error across all weights: {ew[0][0]}")
# write to binary file
out_file.close()
print(f"wrote {filepath}")
# -----------------------------------------------------------------------------
# Load / import functions
def load_checkpoint(checkpoint):
# load the provided model checkpoint
checkpoint_dict = torch.load(checkpoint, map_location='cpu')
gptconf = ModelArgs(**checkpoint_dict['model_args'])
model = Transformer(gptconf)
state_dict = checkpoint_dict['model']
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
if k.startswith(unwanted_prefix):
state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict, strict=False)
model.eval()
return model
def load_meta_model(model_path):
params_path = os.path.join(model_path, 'params.json')
with open(params_path) as f:
params = json.load(f)
print(params)
model_paths = sorted(list(Path(model_path).glob('consolidated.*.pth')))
models = [torch.load(p, map_location='cpu') for p in model_paths]
def concat_weights(models):
state_dict = {}
for name in list(models[0]):
tensors = [model[name] for model in models]
if len(tensors) == 1 or len(tensors[0].shape) == 1:
state_dict[name] = tensors[0]
continue
is_axis_1 = (
name.startswith('tok_embeddings.')
or name.endswith('.attention.wo.weight')
or name.endswith('.feed_forward.w2.weight')
)
axis = 1 if is_axis_1 else 0
state_dict[name] = torch.cat(tensors, dim=axis)
for model in models:
del model[name]
return state_dict
state_dict = concat_weights(models)
del models
# set ModelArgs
config = ModelArgs()
config.dim = params["dim"]
config.n_layers = params["n_layers"]
config.n_heads = params["n_heads"]
config.n_kv_heads = params.get('n_kv_heads') or params['n_heads']
config.multiple_of = params["multiple_of"]
config.norm_eps = params["norm_eps"]
config.vocab_size = state_dict['tok_embeddings.weight'].shape[0]
config.max_seq_len = 2048
# create a new Transformer object and set weights
model = Transformer(config)
model.tok_embeddings.weight = nn.Parameter(state_dict['tok_embeddings.weight'])
model.norm.weight = nn.Parameter(state_dict['norm.weight'])
for layer in model.layers:
i = layer.layer_id
layer.attention_norm.weight = nn.Parameter(state_dict[f'layers.{i}.attention_norm.weight'])
layer.attention.wq.weight = nn.Parameter(state_dict[f'layers.{i}.attention.wq.weight'])
layer.attention.wk.weight = nn.Parameter(state_dict[f'layers.{i}.attention.wk.weight'])
layer.attention.wv.weight = nn.Parameter(state_dict[f'layers.{i}.attention.wv.weight'])
layer.attention.wo.weight = nn.Parameter(state_dict[f'layers.{i}.attention.wo.weight'])
layer.ffn_norm.weight = nn.Parameter(state_dict[f'layers.{i}.ffn_norm.weight'])
layer.feed_forward.w1.weight = nn.Parameter(state_dict[f'layers.{i}.feed_forward.w1.weight'])
layer.feed_forward.w2.weight = nn.Parameter(state_dict[f'layers.{i}.feed_forward.w2.weight'])
layer.feed_forward.w3.weight = nn.Parameter(state_dict[f'layers.{i}.feed_forward.w3.weight'])
# final classifier
model.output.weight = nn.Parameter(state_dict['output.weight'])
model.eval()
return model
def load_hf_model(model_path):
try:
from transformers import AutoModelForCausalLM
except ImportError:
print("Error: transformers package is required to load huggingface models")
print("Please run `pip install transformers` to install it")
return None
# load HF model
hf_model = AutoModelForCausalLM.from_pretrained(model_path)
hf_dict = hf_model.state_dict()
# convert LlamaConfig to ModelArgs
config = ModelArgs()
config.dim = hf_model.config.hidden_size
config.n_layers = hf_model.config.num_hidden_layers
config.n_heads = hf_model.config.num_attention_heads
config.n_kv_heads = hf_model.config.num_attention_heads
config.vocab_size = hf_model.config.vocab_size
config.hidden_dim = hf_model.config.intermediate_size
config.norm_eps = hf_model.config.rms_norm_eps
config.max_seq_len = hf_model.config.max_position_embeddings
# create a new Transformer object and set weights
model = Transformer(config)
model.tok_embeddings.weight = nn.Parameter(hf_dict['model.embed_tokens.weight'])
model.norm.weight = nn.Parameter(hf_dict['model.norm.weight'])
# huggingface permutes WQ and WK, this function reverses it
def permute_reverse(w, n_heads=config.n_heads, dim1=config.dim, dim2=config.dim):
return w.view(n_heads, 2, dim1 // n_heads // 2, dim2).transpose(1, 2).reshape(dim1, dim2)
for layer in model.layers:
i = layer.layer_id
layer.attention_norm.weight = nn.Parameter(hf_dict[f'model.layers.{i}.input_layernorm.weight'])
layer.attention.wq.weight = nn.Parameter(permute_reverse(hf_dict[f'model.layers.{i}.self_attn.q_proj.weight']))
layer.attention.wk.weight = nn.Parameter(permute_reverse(hf_dict[f'model.layers.{i}.self_attn.k_proj.weight']))
layer.attention.wv.weight = nn.Parameter(hf_dict[f'model.layers.{i}.self_attn.v_proj.weight'])
layer.attention.wo.weight = nn.Parameter(hf_dict[f'model.layers.{i}.self_attn.o_proj.weight'])
layer.ffn_norm.weight = nn.Parameter(hf_dict[f'model.layers.{i}.post_attention_layernorm.weight'])
layer.feed_forward.w1.weight = nn.Parameter(hf_dict[f'model.layers.{i}.mlp.gate_proj.weight'])
layer.feed_forward.w2.weight = nn.Parameter(hf_dict[f'model.layers.{i}.mlp.down_proj.weight'])
layer.feed_forward.w3.weight = nn.Parameter(hf_dict[f'model.layers.{i}.mlp.up_proj.weight'])
# final classifier
model.output.weight = nn.Parameter(hf_dict['lm_head.weight'])
model.eval()
return model
# -----------------------------------------------------------------------------
# API entrypoint
def model_export(model, filepath, version):
if version == 0:
legacy_export(model, filepath)
elif version == 1:
version1_export(model, filepath)
elif version == 2:
version2_export(model, filepath)
else:
raise ValueError(f"unknown version {version}")
def torchscript_export(model, filepath, zero_params=False, gzip_output=False):
"""
(This was submitted via a PR earlier. Leaving it here, but "orphaned" for now)
Saves the model as a TorchScript.
The resulting file can be loaded in C++ code and then used for training or
inference with:
#include <torch/script.h>
torch::jit::Module module = torch::jit::load("model.pt")
Note that the serialized model includes the initial parameters and with the default
ModelArgs the file is 59M and gzips down to 55M. If you want to serialize/distribute
the model parameters separately you can zero out the parameters before saving it and
it will gzip down to 780K.
"""
# If requested zero params before saving the model. This is useful in
# conjunction with gzip_output.
if zero_params:
for p in model.parameters():
p.detach().zero_()
torch.jit.save(torch.jit.script(model), filepath)
if gzip_output:
with open(filepath, "rb") as f_in:
with gzip.open(f"{filepath}.gz", "wb") as f_out:
shutil.copyfileobj(f_in, f_out)
os.unlink(filepath)
# -----------------------------------------------------------------------------
# CLI entrypoint
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("filepath", type=str, help="the output filepath")
parser.add_argument("--version", default=0, type=int, help="the version to export with")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--checkpoint", type=str, help="model checkpoint, .pt file")
group.add_argument("--meta-llama", type=str, help="meta llama model path")
group.add_argument("--hf", type=str, help="huggingface model path")
args = parser.parse_args()
if args.checkpoint:
model = load_checkpoint(args.checkpoint)
elif args.meta_llama:
model = load_meta_model(args.meta_llama)
elif args.hf:
model = load_hf_model(args.hf)
if model is None:
parser.error("Can't load input model!")
# export
model_export(model, args.filepath, args.version)
-112
View File
@@ -1,112 +0,0 @@
"""
This script exports the Llama 2 weights in llama2c.bin format.
"""
import os
import sys
import struct
from pathlib import Path
import json
import torch
from model import precompute_freqs_cis
def export(p, state_dict, filepath='model.bin'):
"""export the model weights in fp32 into .bin file to be read from C"""
f = open(filepath, 'wb')
def serialize(key):
print(f"writing {key}...")
t = state_dict[key].contiguous().view(-1).type(torch.float32).numpy()
f.write(memoryview(t))
del state_dict[key]
# first write out the header
hidden_dim = state_dict['layers.0.feed_forward.w1.weight'].shape[0]
p['vocab_size'] = 32000
p['max_seq_len'] = 2048
n_kv_heads = p.get('n_kv_heads') or p['n_heads']
header = struct.pack(
'iiiiiii',
p['dim'], hidden_dim, p['n_layers'], p['n_heads'],
n_kv_heads, -p['vocab_size'], p['max_seq_len']
)
# NOTE ABOVE: -ve vocab_size is indicating that the classifier weights are present
# in the checkpoint and should be loaded.
f.write(header)
# next write out the embedding weights
print("writing tok_embeddings...")
serialize('tok_embeddings.weight')
# now all the layers
# attention weights
for i in range(p['n_layers']): serialize(f'layers.{i}.attention_norm.weight')
for i in range(p['n_layers']): serialize(f'layers.{i}.attention.wq.weight')
for i in range(p['n_layers']): serialize(f'layers.{i}.attention.wk.weight')
for i in range(p['n_layers']): serialize(f'layers.{i}.attention.wv.weight')
for i in range(p['n_layers']): serialize(f'layers.{i}.attention.wo.weight')
# ffn weights
for i in range(p['n_layers']): serialize(f'layers.{i}.ffn_norm.weight')
for i in range(p['n_layers']): serialize(f'layers.{i}.feed_forward.w1.weight')
for i in range(p['n_layers']): serialize(f'layers.{i}.feed_forward.w2.weight')
for i in range(p['n_layers']): serialize(f'layers.{i}.feed_forward.w3.weight')
# final rmsnorm
serialize('norm.weight')
# freqs_cos, freqs_sin
freqs_cos, freqs_sin = precompute_freqs_cis(p['dim'] // p['n_heads'], p['max_seq_len'] * 2)
state_dict['freqs_cos'] = freqs_cos[:p['max_seq_len']]
state_dict['freqs_sin'] = freqs_sin[:p['max_seq_len']]
serialize('freqs_cos')
serialize('freqs_sin')
# finally write the output weights
serialize('output.weight')
f.close()
print(f"wrote {filepath}")
def concat_weights(models):
state_dict = {}
for name in list(models[0]):
tensors = [model[name] for model in models]
if len(tensors) == 1 or len(tensors[0].shape) == 1:
state_dict[name] = tensors[0]
continue
is_axis_1 = (
name.startswith('tok_embeddings.')
or name.endswith('.attention.wo.weight')
or name.endswith('.feed_forward.w2.weight')
)
axis = 1 if is_axis_1 else 0
state_dict[name] = torch.cat(tensors, dim=axis)
for model in models:
del model[name]
return state_dict
def load_and_export(model_path, output_path):
params_path = os.path.join(model_path, 'params.json')
with open(params_path) as f:
params = json.load(f)
print(params)
model_paths = sorted(list(Path(model_path).glob('consolidated.*.pth')))
models = [torch.load(p, map_location='cpu') for p in model_paths]
state_dict = concat_weights(models)
del models
export(params, state_dict, output_path)
if __name__ == '__main__':
if len(sys.argv) == 1:
print('[Llama model folder path] [output path]')
exit()
model_path = sys.argv[1]
output_path = sys.argv[2]
load_and_export(model_path, output_path)
-113
View File
@@ -1,113 +0,0 @@
"""
This script exports the Llama 2 weights in llama2c.bin format.
"""
import os
import sys
import struct
from pathlib import Path
import json
import torch
from model import precompute_freqs_cis
def export(p, state_dict, filepath='model.bin'):
"""export the model weights in fp32 into .bin file to be read from C"""
f = open(filepath, 'wb')
def serialize(key):
print(f"writing {key}...")
t = state_dict[key].contiguous().view(-1).type(torch.float32).numpy()
f.write(memoryview(t))
del state_dict[key]
# first write out the header
hidden_dim = state_dict['model.layers.0.mlp.gate_proj.weight'].shape[0]
p['vocab_size'] = 32000
p['max_seq_len'] = 2048
n_kv_heads = p.get('n_kv_heads') or p['n_heads']
header = struct.pack(
'iiiiiii',
p['dim'], hidden_dim, p['n_layers'], p['n_heads'],
n_kv_heads, -p['vocab_size'], p['max_seq_len']
)
# NOTE ABOVE: -ve vocab_size is indicating that the classifier weights are present
# in the checkpoint and should be loaded.
f.write(header)
# next write out the embedding weights
print("writing tok_embeddings...")
serialize('model.embed_tokens.weight')
# now all the layers
# attention weights
for i in range(p['n_layers']): serialize(f'model.layers.{i}.input_layernorm.weight')
for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.q_proj.weight')
for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.k_proj.weight')
for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.v_proj.weight')
for i in range(p['n_layers']): serialize(f'model.layers.{i}.self_attn.o_proj.weight')
# ffn weights
for i in range(p['n_layers']): serialize(f'model.layers.{i}.post_attention_layernorm.weight')
for i in range(p['n_layers']): serialize(f'model.layers.{i}.mlp.gate_proj.weight')
for i in range(p['n_layers']): serialize(f'model.layers.{i}.mlp.down_proj.weight')
for i in range(p['n_layers']): serialize(f'model.layers.{i}.mlp.up_proj.weight')
# final rmsnorm
serialize('model.norm.weight')
# freqs_cos, freqs_sin
freqs_cos, freqs_sin = precompute_freqs_cis(p['dim'] // p['n_heads'], p['max_seq_len'] * 2)
state_dict['freqs_cos'] = freqs_cos[:p['max_seq_len']]
state_dict['freqs_sin'] = freqs_sin[:p['max_seq_len']]
# check if this requires addtional conversion
serialize('freqs_cos')
serialize('freqs_sin')
# finally write the output weights
serialize('lm_head.weight')
f.close()
print(f"wrote {filepath}")
def concat_weights(models):
state_dict = {}
for name in list(models[0]):
tensors = [model[name] for model in models]
if len(tensors) == 1 or len(tensors[0].shape) == 1:
state_dict[name] = tensors[0]
continue
is_axis_1 = (
name.startswith('model.embed_tokens.weight')
or name.endswith('.self_attn.o_proj.weight')
or name.endswith('.mlp.down_proj.weight')
)
axis = 1 if is_axis_1 else 0
state_dict[name] = torch.cat(tensors, dim=axis)
for model in models:
del model[name]
return state_dict
def load_and_export(model_path, output_path):
params_path = os.path.join(model_path, 'params.json')
with open(params_path) as f:
params = json.load(f)
print(params)
model_paths = sorted(list(Path(model_path).glob('consolidated.*.pth')))
models = [torch.load(p, map_location='cpu') for p in model_paths]
state_dict = concat_weights(models)
del models
export(params, state_dict, output_path)
if __name__ == '__main__':
if len(sys.argv) == 1:
print('[Llama model folder path] [output path]')
exit()
model_path = sys.argv[1]
output_path = sys.argv[2]
load_and_export(model_path, output_path)
+6 -55
View File
@@ -17,6 +17,7 @@ class ModelArgs:
n_heads: int = 32
n_kv_heads: Optional[int] = None
vocab_size: int = 32000
hidden_dim: Optional[int] = None
multiple_of: int = 256 # MLP hidden layer size will be multiple of
norm_eps: float = 1e-5
max_seq_len: int = 2048
@@ -166,8 +167,10 @@ class Attention(nn.Module):
class FeedForward(nn.Module):
def __init__(self, dim: int, hidden_dim: int, multiple_of: int, dropout: float):
super().__init__()
hidden_dim = int(2 * hidden_dim / 3)
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
if hidden_dim is None:
hidden_dim = 4 * dim
hidden_dim = int(2 * hidden_dim / 3)
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
self.w1 = nn.Linear(dim, hidden_dim, bias=False)
self.w2 = nn.Linear(hidden_dim, dim, bias=False)
self.w3 = nn.Linear(dim, hidden_dim, bias=False)
@@ -186,7 +189,7 @@ class TransformerBlock(nn.Module):
self.attention = Attention(args)
self.feed_forward = FeedForward(
dim=args.dim,
hidden_dim=4 * args.dim,
hidden_dim=args.hidden_dim,
multiple_of=args.multiple_of,
dropout=args.dropout,
)
@@ -338,55 +341,3 @@ class Transformer(nn.Module):
idx = torch.cat((idx, idx_next), dim=1)
return idx
def export(self, filepath='model.bin'):
"""export the model weights in fp32 into .bin file to be read from C"""
f = open(filepath, 'wb')
def serialize(t):
d = t.detach().cpu().view(-1).numpy().astype(np.float32)
b = struct.pack(f'{len(d)}f', *d)
f.write(b)
# first write out the header
hidden_dim = self.layers[0].feed_forward.w1.weight.shape[0]
p = self.params
n_kv_heads = p.n_heads if p.n_kv_heads is None else p.n_kv_heads
header = struct.pack('iiiiiii', p.dim, hidden_dim, p.n_layers, p.n_heads,
n_kv_heads, p.vocab_size, p.max_seq_len)
f.write(header)
# next write out the embedding weights
serialize(self.tok_embeddings.weight)
# now all the layers
# attention weights
for layer in self.layers:
serialize(layer.attention_norm.weight)
for layer in self.layers:
serialize(layer.attention.wq.weight)
for layer in self.layers:
serialize(layer.attention.wk.weight)
for layer in self.layers:
serialize(layer.attention.wv.weight)
for layer in self.layers:
serialize(layer.attention.wo.weight)
# ffn weights
for layer in self.layers:
serialize(layer.ffn_norm.weight)
for layer in self.layers:
serialize(layer.feed_forward.w1.weight)
for layer in self.layers:
serialize(layer.feed_forward.w2.weight)
for layer in self.layers:
serialize(layer.feed_forward.w3.weight)
# final rmsnorm
serialize(self.norm.weight)
# note: no need to write final classifier weights due to weight sharing
# freqs_cis
serialize(self.freqs_cos[:p.max_seq_len])
serialize(self.freqs_sin[:p.max_seq_len])
# write to binary file
f.close()
print(f"wrote {filepath}")
+455 -217
View File
@@ -14,7 +14,7 @@
#include <sys/mman.h>
#endif
// ----------------------------------------------------------------------------
// Transformer and RunState structs, and related memory management
// Transformer model
typedef struct {
int dim; // transformer dimension
@@ -43,18 +43,10 @@ typedef struct {
float* w3; // (layer, hidden_dim, dim)
// final rmsnorm
float* rms_final_weight; // (dim,)
// freq_cis for RoPE relatively positional embeddings (not used anymore)
float* freq_cis_real; // (seq_len, head_size/2)
float* freq_cis_imag; // (seq_len, head_size/2)
// (optional) classifier weights for the logits, on the last layer
float* wcls;
} TransformerWeights;
typedef struct {
float prob;
int index;
} ProbIndex; // struct used when sorting probabilities during top-p sampling
typedef struct {
// current wave of activations
float *x; // activation at current time stamp (dim,)
@@ -67,12 +59,21 @@ typedef struct {
float *v; // value (dim,)
float *att; // buffer for scores/attention values (n_heads, seq_len)
float *logits; // output logits
ProbIndex *probindex; // buffer used in top-p sampling
// kv cache
float* key_cache; // (layer, seq_len, dim)
float* value_cache; // (layer, seq_len, dim)
} RunState;
typedef struct {
Config config; // the hyperparameters of the architecture (the blueprint)
TransformerWeights weights; // the weights of the model
RunState state; // buffers for the "wave" of activations in the forward pass
// some more state needed to properly clean up the memory mapping (sigh)
int fd; // file descriptor for memory mapping
float* data; // memory mapped data pointer
ssize_t file_size; // size of the checkpoint file in bytes
} Transformer;
void malloc_run_state(RunState* s, Config* p) {
// we calloc instead of malloc to keep valgrind happy
int kv_dim = (p->dim * p->n_kv_heads) / p->n_heads;
@@ -86,13 +87,12 @@ void malloc_run_state(RunState* s, Config* p) {
s->v = calloc(kv_dim, sizeof(float));
s->att = calloc(p->n_heads * p->seq_len, sizeof(float));
s->logits = calloc(p->vocab_size, sizeof(float));
s->probindex = calloc(p->vocab_size, sizeof(ProbIndex));
s->key_cache = calloc(p->n_layers * p->seq_len * kv_dim, sizeof(float));
s->value_cache = calloc(p->n_layers * p->seq_len * kv_dim, sizeof(float));
// ensure all mallocs went fine
if (!s->x || !s->xb || !s->xb2 || !s->hb || !s->hb2 || !s->q
|| !s->k || !s->v || !s->att || !s->logits || !s->key_cache
|| !s->value_cache || !s->probindex) {
|| !s->value_cache) {
fprintf(stderr, "malloc failed!\n");
exit(EXIT_FAILURE);
}
@@ -109,15 +109,11 @@ void free_run_state(RunState* s) {
free(s->v);
free(s->att);
free(s->logits);
free(s->probindex);
free(s->key_cache);
free(s->value_cache);
}
// ----------------------------------------------------------------------------
// initialization: read from checkpoint
void checkpoint_init_weights(TransformerWeights *w, Config* p, float* ptr, int shared_weights) {
void memory_map_weights(TransformerWeights *w, Config* p, float* ptr, int shared_weights) {
int head_size = p->dim / p->n_heads;
w->token_embedding_table = ptr;
ptr += p->vocab_size * p->dim;
@@ -141,15 +137,50 @@ void checkpoint_init_weights(TransformerWeights *w, Config* p, float* ptr, int s
ptr += p->n_layers * p->dim * p->hidden_dim;
w->rms_final_weight = ptr;
ptr += p->dim;
w->freq_cis_real = ptr;
ptr += p->seq_len * head_size / 2;
w->freq_cis_imag = ptr;
ptr += p->seq_len * head_size / 2;
ptr += p->seq_len * head_size / 2; // skip what used to be freq_cis_real (for RoPE)
ptr += p->seq_len * head_size / 2; // skip what used to be freq_cis_imag (for RoPE)
w->wcls = shared_weights ? w->token_embedding_table : ptr;
}
void read_checkpoint(char* checkpoint, Config* config, TransformerWeights* weights,
int* fd, float** data, ssize_t* file_size) {
FILE *file = fopen(checkpoint, "rb");
if (!file) { fprintf(stderr, "Couldn't open file %s\n", checkpoint); exit(EXIT_FAILURE); }
// read in the config header
if (fread(config, sizeof(Config), 1, file) != 1) { exit(EXIT_FAILURE); }
// negative vocab size is hacky way of signaling unshared weights. bit yikes.
int shared_weights = config->vocab_size > 0 ? 1 : 0;
config->vocab_size = abs(config->vocab_size);
// figure out the file size
fseek(file, 0, SEEK_END); // move file pointer to end of file
*file_size = ftell(file); // get the file size, in bytes
fclose(file);
// memory map the Transformer weights into the data pointer
*fd = open(checkpoint, O_RDONLY); // open in read only mode
if (*fd == -1) { fprintf(stderr, "open failed!\n"); exit(EXIT_FAILURE); }
*data = mmap(NULL, *file_size, PROT_READ, MAP_PRIVATE, *fd, 0);
if (*data == MAP_FAILED) { fprintf(stderr, "mmap failed!\n"); exit(EXIT_FAILURE); }
float* weights_ptr = *data + sizeof(Config)/sizeof(float);
memory_map_weights(weights, config, weights_ptr, shared_weights);
}
void build_transformer(Transformer *t, char* checkpoint_path) {
// read in the Config and the Weights from the checkpoint
read_checkpoint(checkpoint_path, &t->config, &t->weights, &t->fd, &t->data, &t->file_size);
// allocate the RunState buffers
malloc_run_state(&t->state, &t->config);
}
void free_transformer(Transformer* t) {
// close the memory mapping
if (t->data != MAP_FAILED) { munmap(t->data, t->file_size); }
if (t->fd != -1) { close(t->fd); }
// free the RunState buffers
free_run_state(&t->state);
}
// ----------------------------------------------------------------------------
// neural net blocks
// neural net blocks; the dynamics of the Transformer
void rmsnorm(float* o, float* x, float* weight, int size) {
// calculate sum of squares
@@ -200,9 +231,12 @@ void matmul(float* xout, float* x, float* w, int n, int d) {
}
}
void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights* w) {
float* forward(Transformer* transformer, int token, int pos) {
// a few convenience variables
Config* p = &transformer->config;
TransformerWeights* w = &transformer->weights;
RunState* s = &transformer->state;
float *x = s->x;
int dim = p->dim;
int kv_dim = (p->dim * p->n_kv_heads) / p->n_heads;
@@ -211,7 +245,7 @@ void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights*
int head_size = dim / p->n_heads;
// copy the token embedding into x
float* content_row = &(w->token_embedding_table[token * dim]);
float* content_row = w->token_embedding_table + token * dim;
memcpy(x, content_row, dim*sizeof(*x));
// forward all the layers
@@ -305,14 +339,14 @@ void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights*
matmul(s->hb, s->xb, w->w1 + l*dim*hidden_dim, dim, hidden_dim);
matmul(s->hb2, s->xb, w->w3 + l*dim*hidden_dim, dim, hidden_dim);
// F.silu; silu(x)=x*σ(x),where σ(x) is the logistic sigmoid
// SwiGLU non-linearity
for (int i = 0; i < hidden_dim; i++) {
s->hb[i] = s->hb[i] * (1.0f / (1.0f + expf(-s->hb[i])));
}
// elementwise multiply with w3(x)
for (int i = 0; i < hidden_dim; i++) {
s->hb[i] = s->hb[i] * s->hb2[i];
float val = s->hb[i];
// silu(x)=x*σ(x), where σ(x) is the logistic sigmoid
val *= (1.0f / (1.0f + expf(-val)));
// elementwise multiply with w3(x)
val *= s->hb2[i];
s->hb[i] = val;
}
// final matmul to get the output of the ffn
@@ -329,20 +363,90 @@ void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights*
// classifier into logits
matmul(s->logits, x, w->wcls, p->dim, p->vocab_size);
return s->logits;
}
// ----------------------------------------------------------------------------
// byte pair encoding (BPE) tokenizer, encodes strings into tokens so we can prompt
// The Byte Pair Encoding (BPE) Tokenizer that translates strings <-> tokens
typedef struct {
char *str;
int id;
} TokenIndex;
typedef struct {
char** vocab;
float* vocab_scores;
TokenIndex *sorted_vocab;
int vocab_size;
unsigned int max_token_length;
unsigned char byte_pieces[512]; // stores all single-byte strings
} Tokenizer;
int compare_tokens(const void *a, const void *b) {
return strcmp(((TokenIndex*)a)->str, ((TokenIndex*)b)->str);
}
void build_tokenizer(Tokenizer* t, char* tokenizer_path, int vocab_size) {
// i should have written the vocab_size into the tokenizer file... sigh
t->vocab_size = vocab_size;
// malloc space to hold the scores and the strings
t->vocab = (char**)malloc(vocab_size * sizeof(char*));
t->vocab_scores = (float*)malloc(vocab_size * sizeof(float));
t->sorted_vocab = NULL; // initialized lazily
for (int i = 0; i < 256; i++) {
t->byte_pieces[i * 2] = (unsigned char)i;
t->byte_pieces[i * 2 + 1] = '\0';
}
// read in the file
FILE *file = fopen(tokenizer_path, "rb");
if (!file) { fprintf(stderr, "couldn't load %s\n", tokenizer_path); exit(EXIT_FAILURE); }
if (fread(&t->max_token_length, sizeof(int), 1, file) != 1) { fprintf(stderr, "failed read\n"); exit(EXIT_FAILURE); }
int len;
for (int i = 0; i < vocab_size; i++) {
if (fread(t->vocab_scores + i, sizeof(float), 1, file) != 1) { fprintf(stderr, "failed read\n"); exit(EXIT_FAILURE);}
if (fread(&len, sizeof(int), 1, file) != 1) { fprintf(stderr, "failed read\n"); exit(EXIT_FAILURE); }
t->vocab[i] = (char *)malloc(len + 1);
if (fread(t->vocab[i], len, 1, file) != 1) { fprintf(stderr, "failed read\n"); exit(EXIT_FAILURE); }
t->vocab[i][len] = '\0'; // add the string terminating token
}
fclose(file);
}
void free_tokenizer(Tokenizer* t) {
for (int i = 0; i < t->vocab_size; i++) { free(t->vocab[i]); }
free(t->vocab);
free(t->vocab_scores);
free(t->sorted_vocab);
}
char* decode(Tokenizer* t, int prev_token, int token) {
char *piece = t->vocab[token];
// following BOS (1) token, sentencepiece decoder strips any leading whitespace (see PR #89)
if (prev_token == 1 && piece[0] == ' ') { piece++; }
// careful, some tokens designate raw bytes, and look like e.g. '<0x01>'
// parse this and convert and return the actual byte
unsigned char byte_val;
if (sscanf(piece, "<0x%02hhX>", &byte_val) == 1) {
piece = (char*)t->byte_pieces + byte_val * 2;
}
return piece;
}
void safe_printf(char *piece) {
// piece might be a raw byte token, and we only want to print printable chars or whitespace
// because some of the other bytes can be various control codes, backspace, etc.
if (piece == NULL) { return; }
if (piece[0] == '\0') { return; }
if (piece[1] == '\0') {
unsigned char byte_val = piece[0];
if (!(isprint(byte_val) || isspace(byte_val))) {
return; // bad byte, don't print it
}
}
printf("%s", piece);
}
int str_lookup(char *str, TokenIndex *sorted_vocab, int vocab_size) {
// efficiently find the perfect match for str in vocab, return its index or -1 if not found
TokenIndex tok = { .str = str }; // acts as the key to search for
@@ -350,23 +454,40 @@ int str_lookup(char *str, TokenIndex *sorted_vocab, int vocab_size) {
return res != NULL ? res->id : -1;
}
void bpe_encode(char *text, char **vocab, float *vocab_scores, int vocab_size, unsigned int max_token_length, int *tokens, int *n_tokens) {
void encode(Tokenizer* t, char *text, int8_t bos, int8_t eos, int *tokens, int *n_tokens) {
// encode the string text (input) into an upper-bound preallocated tokens[] array
// bos != 0 means prepend the BOS token (=1), eos != 0 means append the EOS token (=2)
if (text == NULL) { fprintf(stderr, "cannot encode NULL text\n"); exit(EXIT_FAILURE); }
// sort vocabulary
TokenIndex *sorted_vocab = malloc(vocab_size * sizeof(TokenIndex));
for (int i = 0; i < vocab_size; i++) {
sorted_vocab[i].str = vocab[i];
sorted_vocab[i].id = i;
if (t->sorted_vocab == NULL) {
// lazily malloc and sort the vocabulary
t->sorted_vocab = malloc(t->vocab_size * sizeof(TokenIndex));
for (int i = 0; i < t->vocab_size; i++) {
t->sorted_vocab[i].str = t->vocab[i];
t->sorted_vocab[i].id = i;
}
qsort(t->sorted_vocab, t->vocab_size, sizeof(TokenIndex), compare_tokens);
}
qsort(sorted_vocab, vocab_size, sizeof(TokenIndex), compare_tokens);
// create a temporary buffer that will store merge candidates of always two consecutive tokens
char* str_buffer = malloc((max_token_length*2 +1 +2) * sizeof(char)); // *2 for concat, +1 for null terminator +2 for UTF8 (in case max_token_lenght is 1)
// *2 for concat, +1 for null terminator +2 for UTF8 (in case max_token_length is 1)
char* str_buffer = malloc((t->max_token_length*2 +1 +2) * sizeof(char));
size_t str_len = 0;
// start at 0 tokens
*n_tokens = 0;
// add optional BOS (=1) token, if desired
if (bos) tokens[(*n_tokens)++] = 1;
// add_dummy_prefix is true by default
tokens[0] = str_lookup(" ", sorted_vocab, vocab_size);
*n_tokens = 1; // the number of tokens
// so prepend a dummy prefix token to the input string, but only if text != ""
// TODO: pretty sure this isn't correct in the general case but I don't have the
// energy to read more of the sentencepiece code to figure out what it's doing
if (text[0] != '\0') {
int dummy_prefix = str_lookup(" ", t->sorted_vocab, t->vocab_size);
tokens[(*n_tokens)++] = dummy_prefix;
}
// Okay UTF-8 time. This will get messy. Here is the reference from Wikipedia:
// Code point ↔ UTF-8 conversion
@@ -401,7 +522,7 @@ void bpe_encode(char *text, char **vocab, float *vocab_scores, int vocab_size, u
}
// ok c+1 is not a continuation byte, so we've read in a full codepoint
int id = str_lookup(str_buffer, sorted_vocab, vocab_size);
int id = str_lookup(str_buffer, t->sorted_vocab, t->vocab_size);
if (id != -1) {
// we found this codepoint in vocab, add it as a token
@@ -425,11 +546,11 @@ void bpe_encode(char *text, char **vocab, float *vocab_scores, int vocab_size, u
for (int i=0; i < (*n_tokens-1); i++) {
// check if we can merge the pair (tokens[i], tokens[i+1])
sprintf(str_buffer, "%s%s", vocab[tokens[i]], vocab[tokens[i+1]]);
int id = str_lookup(str_buffer, sorted_vocab, vocab_size);
if (id != -1 && vocab_scores[id] > best_score) {
sprintf(str_buffer, "%s%s", t->vocab[tokens[i]], t->vocab[tokens[i+1]]);
int id = str_lookup(str_buffer, t->sorted_vocab, t->vocab_size);
if (id != -1 && t->vocab_scores[id] > best_score) {
// this merge pair exists in vocab! record its score and position
best_score = vocab_scores[id];
best_score = t->vocab_scores[id];
best_id = id;
best_idx = i;
}
@@ -448,36 +569,30 @@ void bpe_encode(char *text, char **vocab, float *vocab_scores, int vocab_size, u
(*n_tokens)--; // token length decreased
}
// add optional EOS (=2) token, if desired
if (eos) tokens[(*n_tokens)++] = 2;
free(str_buffer);
free(sorted_vocab);
}
// ----------------------------------------------------------------------------
// utilities: time / rng
long time_in_ms() {
// return time in milliseconds, for benchmarking the model speed
struct timespec time;
clock_gettime(CLOCK_REALTIME, &time);
return time.tv_sec * 1000 + time.tv_nsec / 1000000;
}
unsigned long long rng_seed;
unsigned int random_u32() {
// xorshift rng: https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
rng_seed ^= rng_seed >> 12;
rng_seed ^= rng_seed << 25;
rng_seed ^= rng_seed >> 27;
return (rng_seed * 0x2545F4914F6CDD1Dull) >> 32;
}
float random_f32() { // random float32 in [0,1)
return (random_u32() >> 8) / 16777216.0f;
}
// ----------------------------------------------------------------------------
// The Sampler, which takes logits and returns a sampled token
// sampling can be done in a few ways: greedy argmax, sampling, top-p sampling
int argmax(float* probabilities, int n) {
typedef struct {
float prob;
int index;
} ProbIndex; // struct used when sorting probabilities during top-p sampling
typedef struct {
int vocab_size;
ProbIndex* probindex; // buffer used in top-p sampling
float temperature;
float topp;
unsigned long long rng_state;
} Sampler;
int sample_argmax(float* probabilities, int n) {
// return the index that has the highest probability
int max_i = 0;
float max_p = probabilities[0];
@@ -490,13 +605,13 @@ int argmax(float* probabilities, int n) {
return max_i;
}
int sample(float* probabilities, int n) {
int sample_mult(float* probabilities, int n, float coin) {
// sample index from probabilities (they must sum to 1!)
float r = random_f32();
// coin is a random number in [0, 1), usually from random_f32()
float cdf = 0.0f;
for (int i = 0; i < n; i++) {
cdf += probabilities[i];
if (r < cdf) {
if (coin < cdf) {
return i;
}
}
@@ -511,10 +626,11 @@ int compare(const void* a, const void* b) {
return 0;
}
int sample_topp(float* probabilities, int n, float topp, ProbIndex* probindex) {
int sample_topp(float* probabilities, int n, float topp, ProbIndex* probindex, float coin) {
// top-p sampling (or "nucleus sampling") samples from the smallest set of
// tokens that exceed probability topp. This way we never sample tokens that
// have very low probabilities and are less likely to go "off the rails".
// coin is a random number in [0, 1), usually from random_f32()
int n0 = 0;
// quicksort indices in descending order of probabilities
@@ -542,7 +658,7 @@ int sample_topp(float* probabilities, int n, float topp, ProbIndex* probindex) {
}
// sample from the truncated list
float r = random_f32() * cumulative_prob;
float r = coin * cumulative_prob;
float cdf = 0.0f;
for (int i = 0; i <= last_idx; i++) {
cdf += probindex[i].prob;
@@ -553,167 +669,107 @@ int sample_topp(float* probabilities, int n, float topp, ProbIndex* probindex) {
return probindex[last_idx].index; // in case of rounding errors
}
// ----------------------------------------------------------------------------
// int main
void error_usage() {
fprintf(stderr, "Usage: run <checkpoint> [options]\n");
fprintf(stderr, "Example: run model.bin -n 256 -i \"Once upon a time\"\n");
fprintf(stderr, "Options:\n");
fprintf(stderr, " -t <float> temperature, default 1.0\n");
fprintf(stderr, " -p <float> p value in top-p (nucleus) sampling. default 0.9\n");
fprintf(stderr, " -s <int> random seed, default time(NULL)\n");
fprintf(stderr, " -n <int> number of steps to run for, default 256. 0 = max_seq_len\n");
fprintf(stderr, " -i <string> input prompt\n");
fprintf(stderr, " -z <string> optional path to custom tokenizer\n");
exit(EXIT_FAILURE);
void build_sampler(Sampler* sampler, int vocab_size, float temperature, float topp, unsigned long long rng_seed) {
sampler->vocab_size = vocab_size;
sampler->temperature = temperature;
sampler->topp = topp;
sampler->rng_state = rng_seed;
// buffer only used with nucleus sampling; may not need but it's ~small
sampler->probindex = malloc(sampler->vocab_size * sizeof(ProbIndex));
}
int main(int argc, char *argv[]) {
void free_sampler(Sampler* sampler) {
free(sampler->probindex);
}
// default inits
char *checkpoint = NULL; // e.g. out/model.bin
char *tokenizer = "tokenizer.bin";
float temperature = 1.0f; // 0.0 = greedy deterministic. 1.0 = original. don't set higher
float topp = 0.9f; // top-p in nucleus sampling. 1.0 = off. 0.9 works well, but slower
rng_seed = 0; // seed rng with time by default
int steps = 256; // number of steps to run for
char *prompt = NULL; // prompt string
unsigned int random_u32(unsigned long long *state) {
// xorshift rng: https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
*state ^= *state >> 12;
*state ^= *state << 25;
*state ^= *state >> 27;
return (*state * 0x2545F4914F6CDD1Dull) >> 32;
}
float random_f32(unsigned long long *state) { // random float32 in [0,1)
return (random_u32(state) >> 8) / 16777216.0f;
}
// poor man's C argparse so we can override the defaults above from the command line
if (argc >= 2) { checkpoint = argv[1]; } else { error_usage(); }
for (int i = 2; i < argc; i+=2) {
// do some basic validation
if (i + 1 >= argc) { error_usage(); } // must have arg after flag
if (argv[i][0] != '-') { error_usage(); } // must start with dash
if (strlen(argv[i]) != 2) { error_usage(); } // must be -x (one dash, one letter)
// read in the args
if (argv[i][1] == 't') { temperature = atof(argv[i + 1]); }
else if (argv[i][1] == 'p') { topp = atof(argv[i + 1]); }
else if (argv[i][1] == 's') { rng_seed = atoi(argv[i + 1]); }
else if (argv[i][1] == 'n') { steps = atoi(argv[i + 1]); }
else if (argv[i][1] == 'i') { prompt = argv[i + 1]; }
else if (argv[i][1] == 'z') { tokenizer = argv[i + 1]; }
else { error_usage(); }
}
if(rng_seed == 0) { rng_seed = (unsigned int)time(NULL);}
// read in the model.bin file
Config config;
TransformerWeights weights;
int fd = 0; // file descriptor for memory mapping
float* data = NULL; // memory mapped data pointer
ssize_t file_size; // size of the checkpoint file in bytes
{
FILE *file = fopen(checkpoint, "rb");
if (!file) { fprintf(stderr, "Couldn't open file %s\n", checkpoint); return 1; }
// read in the config header
if (fread(&config, sizeof(Config), 1, file) != 1) { return 1; }
// negative vocab size is hacky way of signaling unshared weights. bit yikes.
int shared_weights = config.vocab_size > 0 ? 1 : 0;
config.vocab_size = abs(config.vocab_size);
// figure out the file size
fseek(file, 0, SEEK_END); // move file pointer to end of file
file_size = ftell(file); // get the file size, in bytes
fclose(file);
// memory map the Transformer weights into the data pointer
fd = open(checkpoint, O_RDONLY); // open in read only mode
if (fd == -1) { fprintf(stderr, "open failed!\n"); return 1; }
data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (data == MAP_FAILED) { fprintf(stderr, "mmap failed!\n"); return 1; }
float* weights_ptr = data + sizeof(Config)/sizeof(float);
checkpoint_init_weights(&weights, &config, weights_ptr, shared_weights);
}
// right now we cannot run for more than config.seq_len steps
if (steps <= 0 || steps > config.seq_len) { steps = config.seq_len; }
// read in the tokenizer .bin file
char** vocab = (char**)malloc(config.vocab_size * sizeof(char*));
float* vocab_scores = (float*)malloc(config.vocab_size * sizeof(float));
unsigned int max_token_length;
{
FILE *file = fopen(tokenizer, "rb");
if (!file) { fprintf(stderr, "couldn't load %s\n", tokenizer); return 1; }
if (fread(&max_token_length, sizeof(int), 1, file) != 1) { fprintf(stderr, "failed read\n"); return 1; }
int len;
for (int i = 0; i < config.vocab_size; i++) {
if (fread(vocab_scores + i, sizeof(float), 1, file) != 1) { fprintf(stderr, "failed read\n"); return 1;}
if (fread(&len, sizeof(int), 1, file) != 1) { fprintf(stderr, "failed read\n"); return 1; }
vocab[i] = (char *)malloc(len + 1);
if (fread(vocab[i], len, 1, file) != 1) { fprintf(stderr, "failed read\n"); return 1; }
vocab[i][len] = '\0'; // add the string terminating token
int sample(Sampler* sampler, float* logits) {
// sample the token given the logits and some hyperparameters
int next;
if (sampler->temperature == 0.0f) {
// greedy argmax sampling: take the token with the highest probability
next = sample_argmax(logits, sampler->vocab_size);
} else {
// apply the temperature to the logits
for (int q=0; q<sampler->vocab_size; q++) { logits[q] /= sampler->temperature; }
// apply softmax to the logits to get the probabilities for next token
softmax(logits, sampler->vocab_size);
// flip a (float) coin (this is our source of entropy for sampling)
float coin = random_f32(&sampler->rng_state);
// we sample from this distribution to get the next token
if (sampler->topp <= 0 || sampler->topp >= 1) {
// simply sample from the predicted probability distribution
next = sample_mult(logits, sampler->vocab_size, coin);
} else {
// top-p (nucleus) sampling, clamping the least likely tokens to zero
next = sample_topp(logits, sampler->vocab_size, sampler->topp, sampler->probindex, coin);
}
fclose(file);
}
return next;
}
// create and init the application RunState
RunState state;
malloc_run_state(&state, &config);
// ----------------------------------------------------------------------------
// utilities: time
// process the prompt, if any
int *prompt_tokens = NULL;
long time_in_ms() {
// return time in milliseconds, for benchmarking the model speed
struct timespec time;
clock_gettime(CLOCK_REALTIME, &time);
return time.tv_sec * 1000 + time.tv_nsec / 1000000;
}
// ----------------------------------------------------------------------------
// generation loop
void generate(Transformer *transformer, Tokenizer *tokenizer, Sampler *sampler, char *prompt, int steps) {
char *empty_prompt = "";
if (prompt == NULL) { prompt = empty_prompt; }
// encode the (string) prompt into tokens sequence
int num_prompt_tokens = 0;
if (prompt != NULL) {
prompt_tokens = (int*)malloc((strlen(prompt)+1) * sizeof(int));
bpe_encode(prompt, vocab, vocab_scores, config.vocab_size, max_token_length, prompt_tokens, &num_prompt_tokens);
int* prompt_tokens = (int*)malloc((strlen(prompt)+3) * sizeof(int)); // +3 for '\0', ?BOS, ?EOS
encode(tokenizer, prompt, 1, 0, prompt_tokens, &num_prompt_tokens);
if (num_prompt_tokens < 1) {
fprintf(stderr, "something is wrong, expected at least 1 prompt token\n");
exit(EXIT_FAILURE);
}
// start the main loop
long start = 0; // used to time our code, only initialized after first iteration
int next; // will store the next token in the sequence
int token = 1; // init with token 1 (=BOS), as done in Llama-2 sentencepiece tokenizer
int token = prompt_tokens[0]; // kick off with the first token in the prompt
int pos = 0; // position in the sequence
while (pos < steps) {
// forward the transformer to get logits for the next token
transformer(token, pos, &config, &state, &weights);
float* logits = forward(transformer, token, pos);
// advance the state state machine
if(pos < num_prompt_tokens) {
if (pos < num_prompt_tokens - 1) {
// if we are still processing the input prompt, force the next prompt token
next = prompt_tokens[pos];
next = prompt_tokens[pos + 1];
} else {
// sample the next token
if (temperature == 0.0f) {
// greedy argmax sampling: take the token with the highest probability
next = argmax(state.logits, config.vocab_size);
} else {
// apply the temperature to the logits
for (int q=0; q<config.vocab_size; q++) { state.logits[q] /= temperature; }
// apply softmax to the logits to get the probabilities for next token
softmax(state.logits, config.vocab_size);
// we sample from this distribution to get the next token
if (topp <= 0 || topp >= 1) {
// simply sample from the predicted probability distribution
next = sample(state.logits, config.vocab_size);
} else {
// top-p (nucleus) sampling, clamping the least likely tokens to zero
next = sample_topp(state.logits, config.vocab_size, topp, state.probindex);
}
}
// otherwise sample the next token from the logits
next = sample(sampler, logits);
}
pos++;
// data-dependent terminating condition: the BOS (1) token delimits sequences
// data-dependent terminating condition: the BOS (=1) token delimits sequences
if (next == 1) { break; }
// following BOS (1) token, sentencepiece decoder strips any leading whitespace (see PR #89)
char *token_str = (token == 1 && vocab[next][0] == ' ') ? vocab[next]+1 : vocab[next];
// careful, some tokens designate raw bytes, and look like e.g. '<0x01>'
unsigned char byte_val;
if (sscanf(token_str, "<0x%02hhX>", &byte_val) == 1) {
// ok this token is a raw byte token, carefuly to only print printable chars or whitespace
// some of the other bytes can be various control codes, backspace, etc. => skip
if (isprint(byte_val) || isspace(byte_val)) {
char byte_piece[2];
byte_piece[0] = byte_val;
byte_piece[1] = '\0';
printf("%s", byte_piece);
}
} else {
printf("%s", token_str);
}
// print the token as string, decode it with the Tokenizer object
char* piece = decode(tokenizer, token, next);
safe_printf(piece); // same as printf("%s", piece), but skips "unsafe" bytes
fflush(stdout);
token = next;
@@ -728,13 +784,195 @@ int main(int argc, char *argv[]) {
fprintf(stderr, "achieved tok/s: %f\n", (pos-1) / (double)(end-start)*1000);
}
free(prompt_tokens);
}
void read_stdin(const char* guide, char* buffer, size_t bufsize) {
// read a line from stdin, up to but not including \n
printf("%s", guide);
if (fgets(buffer, bufsize, stdin) != NULL) {
size_t len = strlen(buffer);
if (len > 0 && buffer[len - 1] == '\n') {
buffer[len - 1] = '\0'; // strip newline
}
}
}
// ----------------------------------------------------------------------------
// chat loop
// I manually inspected the tokens for a few chat conversations compared to
// python reference and that seemed ok, but this was not thoroughly tested and
// is not safely implemented, it's more a proof of concept atm.
void chat(Transformer *transformer, Tokenizer *tokenizer, Sampler *sampler,
char *cli_user_prompt, char *cli_system_prompt, int steps) {
// buffers for reading the system prompt and user prompt from stdin
// you'll notice they are soomewhat haphazardly and unsafely set atm
char system_prompt[512];
char user_prompt[512];
char rendered_prompt[1152];
int num_prompt_tokens = 0;
int* prompt_tokens = (int*)malloc(1152 * sizeof(int));
int user_idx;
// start the main loop
int8_t user_turn = 1; // user starts
int next; // will store the next token in the sequence
int token; // stores the current token to feed into the transformer
int prev_token;
int pos = 0; // position in the sequence
while (pos < steps) {
// when it is the user's turn to contribute tokens to the dialog...
if (user_turn) {
// get the (optional) system prompt at position 0
if (pos == 0) {
// at position 0, the user can also contribute a system prompt
if (cli_system_prompt == NULL) {
// system prompt was not passed in, attempt to get it from stdin
read_stdin("Enter system prompt (optional): ", system_prompt, sizeof(system_prompt));
} else {
// system prompt was passed in, use it
strcpy(system_prompt, cli_system_prompt);
}
}
// get the user prompt
if (pos == 0 && cli_user_prompt != NULL) {
// user prompt for position 0 was passed in, use it
strcpy(user_prompt, cli_user_prompt);
} else {
// otherwise get user prompt from stdin
read_stdin("User: ", user_prompt, sizeof(user_prompt));
}
// render user/system prompts into the Llama 2 Chat schema
if (pos == 0 && system_prompt[0] != '\0') {
char system_template[] = "[INST] <<SYS>>\n%s\n<</SYS>>\n\n%s [/INST]";
sprintf(rendered_prompt, system_template, system_prompt, user_prompt);
} else {
char user_template[] = "[INST] %s [/INST]";
sprintf(rendered_prompt, user_template, user_prompt);
}
// encode the rendered prompt into tokens
encode(tokenizer, rendered_prompt, 1, 0, prompt_tokens, &num_prompt_tokens);
user_idx = 0; // reset the user index
user_turn = 0;
printf("Assistant: ");
}
// determine the token to pass into the transformer next
if (user_idx < num_prompt_tokens) {
// if we are still processing the input prompt, force the next prompt token
token = prompt_tokens[user_idx++];
} else {
// otherwise use the next token sampled from previous turn
token = next;
}
// EOS (=2) token ends the Assistant turn
if (token == 2) { user_turn = 1; }
// forward the transformer to get logits for the next token
float* logits = forward(transformer, token, pos);
next = sample(sampler, logits);
pos++;
if (user_idx >= num_prompt_tokens && next != 2) {
// the Assistant is responding, so print its output
char* piece = decode(tokenizer, token, next);
safe_printf(piece); // same as printf("%s", piece), but skips "unsafe" bytes
fflush(stdout);
}
if (next == 2) { printf("\n"); }
}
printf("\n");
free(prompt_tokens);
}
// ----------------------------------------------------------------------------
// CLI, include only if not testing
#ifndef TESTING
void error_usage() {
fprintf(stderr, "Usage: run <checkpoint> [options]\n");
fprintf(stderr, "Example: run model.bin -n 256 -i \"Once upon a time\"\n");
fprintf(stderr, "Options:\n");
fprintf(stderr, " -t <float> temperature in [0,inf], default 1.0\n");
fprintf(stderr, " -p <float> p value in top-p (nucleus) sampling in [0,1] default 0.9\n");
fprintf(stderr, " -s <int> random seed, default time(NULL)\n");
fprintf(stderr, " -n <int> number of steps to run for, default 256. 0 = max_seq_len\n");
fprintf(stderr, " -i <string> input prompt\n");
fprintf(stderr, " -z <string> optional path to custom tokenizer\n");
fprintf(stderr, " -m <string> mode: generate|chat, default: generate\n");
fprintf(stderr, " -y <string> (optional) system prompt in chat mode\n");
exit(EXIT_FAILURE);
}
int main(int argc, char *argv[]) {
// default parameters
char *checkpoint_path = NULL; // e.g. out/model.bin
char *tokenizer_path = "tokenizer.bin";
float temperature = 1.0f; // 0.0 = greedy deterministic. 1.0 = original. don't set higher
float topp = 0.9f; // top-p in nucleus sampling. 1.0 = off. 0.9 works well, but slower
int steps = 256; // number of steps to run for
char *prompt = NULL; // prompt string
unsigned long long rng_seed = 0; // seed rng with time by default
char *mode = "generate"; // generate|chat
char *system_prompt = NULL; // the (optional) system prompt to use in chat mode
// poor man's C argparse so we can override the defaults above from the command line
if (argc >= 2) { checkpoint_path = argv[1]; } else { error_usage(); }
for (int i = 2; i < argc; i+=2) {
// do some basic validation
if (i + 1 >= argc) { error_usage(); } // must have arg after flag
if (argv[i][0] != '-') { error_usage(); } // must start with dash
if (strlen(argv[i]) != 2) { error_usage(); } // must be -x (one dash, one letter)
// read in the args
if (argv[i][1] == 't') { temperature = atof(argv[i + 1]); }
else if (argv[i][1] == 'p') { topp = atof(argv[i + 1]); }
else if (argv[i][1] == 's') { rng_seed = atoi(argv[i + 1]); }
else if (argv[i][1] == 'n') { steps = atoi(argv[i + 1]); }
else if (argv[i][1] == 'i') { prompt = argv[i + 1]; }
else if (argv[i][1] == 'z') { tokenizer_path = argv[i + 1]; }
else if (argv[i][1] == 'm') { mode = argv[i + 1]; }
else if (argv[i][1] == 'y') { system_prompt = argv[i + 1]; }
else { error_usage(); }
}
// parameter validation/overrides
if (rng_seed <= 0) rng_seed = (unsigned int)time(NULL);
if (temperature < 0.0) temperature = 0.0;
if (topp < 0.0 || 1.0 < topp) topp = 0.9;
if (steps < 0) steps = 0;
// build the Transformer via the model .bin file
Transformer transformer;
build_transformer(&transformer, checkpoint_path);
if (steps == 0 || steps > transformer.config.seq_len) steps = transformer.config.seq_len; // ovrerride to ~max length
// build the Tokenizer via the tokenizer .bin file
Tokenizer tokenizer;
build_tokenizer(&tokenizer, tokenizer_path, transformer.config.vocab_size);
// build the Sampler
Sampler sampler;
build_sampler(&sampler, transformer.config.vocab_size, temperature, topp, rng_seed);
// run!
if (strcmp(mode, "generate") == 0) {
generate(&transformer, &tokenizer, &sampler, prompt, steps);
} else if (strcmp(mode, "chat") == 0) {
chat(&transformer, &tokenizer, &sampler, prompt, system_prompt, steps);
} else {
fprintf(stderr, "unknown mode: %s\n", mode);
error_usage();
}
// memory and file handles cleanup
free_run_state(&state);
for (int i = 0; i < config.vocab_size; i++) { free(vocab[i]); }
free(vocab);
free(vocab_scores);
if (prompt_tokens != NULL) free(prompt_tokens);
if (data != MAP_FAILED) munmap(data, file_size);
if (fd != -1) close(fd);
free_sampler(&sampler);
free_tokenizer(&tokenizer);
free_transformer(&transformer);
return 0;
}
#endif
+1 -1
View File
@@ -52,7 +52,7 @@ if compile:
model = torch.compile(model) # requires PyTorch 2.0 (optional)
# load the tokenizer
vocab_source = checkpoint_dict.get("vocab_source", "llama2")
vocab_source = checkpoint_dict["config"].get("vocab_source", "llama2")
vocab_size = gptconf.vocab_size
if tokenizer:
# a specific tokenizer is provided, use it
-66
View File
@@ -1,66 +0,0 @@
#!/usr/bin/env python
"""Saves the model as a TorchScript.
Usage examples:
./save_torchscript.py
./save_torchscript.py --dim=300
./save_torchscript.py --gzip_output=True --zero_params=True
The resulting file can be loaded in C++ code and then used for training or
inference with:
#include <torch/script.h>
torch::jit::Module module = torch::jit::load("model.pt")
Note that the serialized model includes the initial parameters and with the default
ModelArgs the file is 59M and gzips down to 55M. If you want to serialize/distribute
the model parameters separately you can zero out the parameters before saving it and
it will gzip down to 780K.
"""
import gzip
import os
import shutil
from inspect import signature
import torch
from model import ModelArgs, Transformer
# Model args config
dim = 288
n_layers = 6
n_heads = 6
n_kv_heads = n_heads
multiple_of = 32
max_seq_len = 256
dropout = 0.0
vocab_size = 32000
norm_eps = 1e-5
# Save config
model_path = "model.pt"
zero_params = False
gzip_output = False
# Allow config overrides
exec(open("configurator.py").read())
def main() -> None:
model_args = {k: globals()[k] for k in signature(ModelArgs).parameters}
model = Transformer(ModelArgs(**model_args))
# If requested zero params before saving the model. This is useful in
# conjunction with gzip_output.
if zero_params:
for p in model.parameters():
p.detach().zero_()
torch.jit.save(torch.jit.script(model), model_path)
if gzip_output:
with open(model_path, "rb") as f_in:
with gzip.open(f"{model_path}.gz", "wb") as f_out:
shutil.copyfileobj(f_in, f_out)
os.unlink(model_path)
if __name__ == "__main__":
main()
+84
View File
@@ -0,0 +1,84 @@
#define TESTING
#include "run.c"
void assert_eq(int a, int b) {
if (a != b) {
printf("Assertion failed: %d != %d\n", a, b);
exit(EXIT_FAILURE);
}
}
void test_prompt_encoding(Tokenizer* tokenizer, char* prompt, int* expected_tokens, int num_expected_tokens) {
// encode
int* prompt_tokens = (int*)malloc((strlen(prompt)+3) * sizeof(int));
int num_prompt_tokens = 0; // the total number of prompt tokens
encode(tokenizer, prompt, 1, 0, prompt_tokens, &num_prompt_tokens);
#if VERBOSITY == 1
// print maybe
printf("expected tokens:\n");
for (int i = 0; i < num_expected_tokens; i++) printf("%d ", expected_tokens[i]);
printf("\n");
printf("actual tokens:\n");
for (int i = 0; i < num_prompt_tokens; i++) printf("%d ", prompt_tokens[i]);
printf("\n");
#endif
// verify
assert_eq(num_prompt_tokens, num_expected_tokens);
for (int i = 0; i < num_prompt_tokens; i++) {
assert_eq(prompt_tokens[i], expected_tokens[i]);
}
#if VERBOSITY == 1
printf("OK\n");
printf("---\n");
#endif
free(prompt_tokens);
}
void test_prompt_encodings() {
// let's verify that the Tokenizer works as expected
char *tokenizer_path = "tokenizer.bin";
int vocab_size = 32000;
Tokenizer tokenizer;
build_tokenizer(&tokenizer, tokenizer_path, vocab_size);
// test 0 (test the empty string) (I added this as a simple case)
char *prompt0 = "";
int expected_tokens0[] = {1};
test_prompt_encoding(&tokenizer, prompt0, expected_tokens0, sizeof(expected_tokens0) / sizeof(int));
// the tests below are taken from the Meta Llama 2 repo example code
// https://github.com/facebookresearch/llama/blob/main/example_text_completion.py
// and the expected tokens come from me breaking in the debugger in Python
// test 1
char *prompt = "I believe the meaning of life is";
int expected_tokens[] = {1, 306, 4658, 278, 6593, 310, 2834, 338};
test_prompt_encoding(&tokenizer, prompt, expected_tokens, sizeof(expected_tokens) / sizeof(int));
// test 2
char* prompt2 = "Simply put, the theory of relativity states that ";
int expected_tokens2[] = {1, 3439, 17632, 1925, 29892, 278, 6368, 310, 14215, 537, 5922, 393, 29871};
test_prompt_encoding(&tokenizer, prompt2, expected_tokens2, sizeof(expected_tokens2) / sizeof(int));
// test 3
char* prompt3 = "A brief message congratulating the team on the launch:\n\n Hi everyone,\n\n I just ";
int expected_tokens3[] = {1, 319, 11473, 2643, 378, 629, 271, 18099, 278, 3815, 373, 278, 6826, 29901, 13, 13, 4706, 6324, 14332, 29892, 13, 13, 4706, 306, 925, 29871};
test_prompt_encoding(&tokenizer, prompt3, expected_tokens3, sizeof(expected_tokens3) / sizeof(int));
// test 4
char* prompt4 = "Translate English to French:\n\n sea otter => loutre de mer\n peppermint => menthe poivrée\n plush girafe => girafe peluche\n cheese =>";
int expected_tokens4[] = {1, 4103, 9632, 4223, 304, 5176, 29901, 13, 13, 4706, 7205, 4932, 357, 1149, 301, 449, 276, 316, 2778, 13, 4706, 1236, 407, 837, 524, 1149, 6042, 354, 772, 440, 29878, 1318, 13, 4706, 715, 1878, 330, 3055, 1725, 1149, 330, 3055, 1725, 4639, 28754, 13, 4706, 923, 968, 1149};
test_prompt_encoding(&tokenizer, prompt4, expected_tokens4, sizeof(expected_tokens4) / sizeof(int));
// memory and file handles cleanup
free_tokenizer(&tokenizer);
}
int main(int argc, char *argv[]) {
test_prompt_encodings();
printf("ALL OK\n");
}
+17 -10
View File
@@ -13,6 +13,7 @@ from functools import partial
import numpy as np
import requests
import sentencepiece as spm
import torch
import torch.distributed as dist
from tqdm import tqdm
@@ -97,16 +98,21 @@ def train_vocab(vocab_size):
of.write(text + "\n")
print(f"Size is: {os.path.getsize(tiny_file) / 1024 / 1024:.2f} MB")
# 2) run the train_vocab.sh script that trains the sentencepiece model
print("Will now train the vocab with:")
cmd = f"bash train_vocab.sh {tiny_file} {prefix} {vocab_size}"
print(cmd)
print("OK? [y/N] ")
dec = input()
if dec.lower() != "y":
print("Exiting...")
return
os.system(cmd)
# 2) train the sentencepiece model
print("Will now train the vocab...")
spm.SentencePieceTrainer.train(input=tiny_file,
model_prefix=prefix,
model_type="bpe",
vocab_size=vocab_size,
self_test_sample_size=0,
input_format="text",
character_coverage=1.0,
num_threads=os.cpu_count(),
split_digits=True,
allow_whitespace_only_pieces=True,
byte_fallback=True,
unk_surface=r" \342\201\207 ",
normalization_rule_name="identity")
# 3) optional cleanup, ask the user if they'd like to delete tiny.txt
dec = input(f"Delete the temporary file {tiny_file}? [y/N] ")
@@ -196,6 +202,7 @@ class PretokDataset(torch.utils.data.IterableDataset):
shard_filenames = sorted(glob.glob(os.path.join(bin_dir, "*.bin")))
# train/test split. let's use only shard 0 for test split, rest train
shard_filenames = shard_filenames[1:] if self.split == "train" else shard_filenames[:1]
assert len(shard_filenames)>0, f"No bin files found in {bin_dir}"
while True:
rng.shuffle(shard_filenames)
for shard in shard_filenames:
+3 -2
View File
@@ -29,6 +29,7 @@ from torch.distributed import destroy_process_group, init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from tinystories import Task
from export import model_export
# -----------------------------------------------------------------------------
# I/O
@@ -270,7 +271,7 @@ while True:
"loss/val": losses["val"],
"lr": lr,
"mfu": running_mfu * 100, # convert to percentage
}
}, step = iter_num
)
except Exception as e:
print(f"logging to wandb failed: {e}")
@@ -287,7 +288,7 @@ while True:
}
print(f"saving checkpoint to {out_dir}")
torch.save(checkpoint, os.path.join(out_dir, "ckpt.pt"))
raw_model.export(os.path.join(out_dir, "model.bin"))
model_export(raw_model, os.path.join(out_dir, "model.bin"), version=0)
if iter_num == 0 and eval_only:
break
-126
View File
@@ -1,126 +0,0 @@
#!/bin/bash
# Trains a sentencepiece tokenizer model on a bunch of given data, my best
# effort attempt to replicate how Meta trained their Llama 2 tokenizer.
# usage: $ train_vocab.sh <input> <model_prefix> <vocab_size>
# example:
# ./train_vocab.sh tiny.txt tokenizer_tiny 1024
# requirements:
# install https://github.com/google/sentencepiece
# check if the correct number of arguments are provided
if [ $# -ne 3 ]; then
echo "Usage: $0 <input> <model_prefix> <vocab_size>"
exit 1
fi
# assign command-line arguments to variables
input=$1
model_prefix=$2
vocab_size=$3
# check if input file exists
if [ ! -f "$input" ]; then
echo "Usage: $0 <input> <model_prefix> <vocab_size>"
echo "input '$input' not found."
exit 1
fi
# check if vocab_size is a positive integer
if ! [[ "$vocab_size" =~ ^[0-9]+$ ]] || [ "$vocab_size" -lt 1 ]; then
echo "Usage: $0 <input> <model_prefix> <vocab_size>"
echo "vocab_size size must be a positive integer."
exit 1
fi
# Print the processed inputs
echo "Input: $input"
echo "Model Prefix: $model_prefix"
echo "Vocabulary Size: $vocab_size"
# train a sentencepiece tokenizer model
# Llama 2 config can be printed as follows:
# import sentencepiece.sentencepiece_model_pb2
# mp = sentencepiece.sentencepiece_model_pb2.ModelProto()
# mp.ParseFromString(open("tokenizer.model", "rb").read())
# print(mp.trainer_spec)
# print(mp.normalizer_spec)
# this gives:
# trainer_spec {
# input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
# model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
# model_type: BPE
# vocab_size: 32000
# self_test_sample_size: 0
# input_format: "text"
# character_coverage: 0.9999499917030334
# input_sentence_size: 200000000
# seed_sentencepiece_size: 1000000
# shrinking_factor: 0.75
# num_threads: 80
# num_sub_iterations: 2
# max_sentence_length: 4192
# shuffle_input_sentence: true
# max_sentencepiece_length: 16
# split_by_unicode_script: true
# split_by_whitespace: true
# split_by_number: true
# treat_whitespace_as_suffix: false
# split_digits: true
# allow_whitespace_only_pieces: true
# vocabulary_output_piece_score: true
# hard_vocab_limit: true
# use_all_vocab: false
# byte_fallback: true
# required_chars: ""
# unk_id: 0
# bos_id: 1
# eos_id: 2
# pad_id: -1
# unk_surface: " \342\201\207 "
# unk_piece: "<unk>"
# bos_piece: "<s>"
# eos_piece: "</s>"
# pad_piece: "<pad>"
# train_extremely_large_corpus: false
# enable_differential_privacy: false
# differential_privacy_noise_level: 0.0
# differential_privacy_clipping_threshold: 0
# }
# normalizer_spec {
# name: "identity"
# precompiled_charsmap: ""
# add_dummy_prefix: true
# remove_extra_whitespaces: false
# normalization_rule_tsv: ""
# }
# let's now use spm_train to train this exact model
# options docs: https://github.com/google/sentencepiece/blob/master/doc/options.md
# we'll depart on a few settings:
# character_coverage -> 1.0
# other important notes:
# --split-digits = true, per the paper
# --allow_whitespace_only_pieces is true, default in spm is false
# --byte_fallback is true, default in spm is false
# --normalization_rule_name is identity, default in spm is nmt_nfkc
spm_train --input="$input" \
--model_prefix="$model_prefix" \
--model_type=bpe \
--vocab_size="$vocab_size" \
--self_test_sample_size=0 \
--input_format="text" \
--character_coverage=1.0 \
--num_threads="$(nproc)" \
--split_digits=true \
--allow_whitespace_only_pieces=true \
--byte_fallback=true \
--unk_surface=" \342\201\207 " \
--normalization_rule_name=identity \