add the 110m model, as it finished training

Andrej Karpathy
2023-07-25 15:00:57 +00:00
parent 05ee4cbf38
commit 94730f1766
+10 -6
@@ -39,6 +39,10 @@ This still runs at interactive rates and samples more coherent and diverse stori
*Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals. She poured some tea into a tiny teapot and put it on top of the teapot. Suddenly, her little brother Max came into the room and wanted to join the tea party too. Lily didn't want to share her tea and she told Max to go away. Max started to cry and Lily felt bad. She decided to yield her tea party to Max and they both shared the teapot. But then, something unexpected happened. The teapot started to shake and wiggle. Lily and Max were scared and didn't know what to do. Suddenly, the teapot started to fly towards the ceiling and landed on the top of the bed. Lily and Max were amazed and they hugged each other. They realized that sharing was much more fun than being selfish. From that day on, they always shared their tea parties and toys.*
**Update 2**: The 110M param model is also available now, see [models](#models).
## Meta's Llama 2 models
As the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. First you'll have to export these weights in the llama2.c format. Git clone the main repo from Meta, and cp the `export_meta_llama_bin.py` file (in the root directory of this project) over, and run it:
@@ -66,13 +70,13 @@ base models... ¯\\_(ツ)_/¯. Since we can inference the base model, it should
For the sake of examples of smaller, from-scratch models, I trained multiple models on TinyStories and catalogue them here:
| model | dim | n_layers | n_heads | max context length | parameters | download
| --- | --- | --- | --- | --- | --- | --- |
| OG | 288 | 6 | 6 | 256 | 15M | [model.bin](https://karpathy.ai/llama2c/model.bin) |
| 44M| 512 | 8 | 8 | 1024 | 44M | [model44m.bin](https://karpathy.ai/llama2c/model44m.bin) |
| 120M| 768 | 12 | 12 | 1024 | 120M | training... |
| model | dim | n_layers | n_heads | max context length | parameters | val loss | download
| --- | --- | --- | --- | --- | --- | --- | --- |
| OG | 288 | 6 | 6 | 256 | 15M | | [model.bin](https://karpathy.ai/llama2c/model.bin) |
| 44M| 512 | 8 | 8 | 1024 | 44M | | [model44m.bin](https://karpathy.ai/llama2c/model44m.bin) |
| 110M| 768 | 12 | 12 | 1024 | 110M | 0.7601 | [model110m.bin](https://karpathy.ai/llama2c/model110m.bin) |
You'll notice that the 120M model is roughly equivalent to GPT-1 in size. Alternatively, this is also the smallest model in the GPT-2 series (`GPT-2 small`), except the max context length is only 1024 instead of 2048. The only notable changes from the GPT-1/2 architecture are that Llama uses RoPE relative positional embeddings instead of absolute/learned positional embeddings, a bit more fancy SwiGLU non-linearity in the MLP, RMSNorm instead of LayerNorm, bias=False on all Linear layers, and is optionally multiquery (but this is not yet supported in llama2.c).
You'll notice that the 110M model is equivalent to GPT-1 in size. Alternatively, this is also the smallest model in the GPT-2 series (`GPT-2 small`), except the max context length is only 1024 instead of 2048. The only notable changes from the GPT-1/2 architecture are that Llama uses RoPE relative positional embeddings instead of absolute/learned positional embeddings, a bit more fancy SwiGLU non-linearity in the MLP, RMSNorm instead of LayerNorm, bias=False on all Linear layers, and is optionally multiquery (but this is not yet supported in llama2.c).
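
As a rough sanity check on the "parameters" column, the count can be estimated from `dim` and `n_layers` alone. This is only a sketch under assumptions that the table does not state: a 32,000-token vocabulary, a token embedding shared with the output classifier, an MLP hidden size of 2/3 of `4 * dim` rounded up to a multiple of 32, and no biases. The function names are made up for illustration.

```python
def ffn_hidden_dim(dim: int, multiple_of: int = 32) -> int:
    """SwiGLU hidden size: 2/3 of 4*dim, rounded up to a multiple (assumed 32)."""
    hidden = int(2 * (4 * dim) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def estimate_params(dim: int, n_layers: int, vocab_size: int = 32000) -> int:
    """Estimate the parameter count of a Llama-style model with no biases.

    Assumes the token embedding is shared with the output classifier, so it
    is counted once. RoPE contributes no learned parameters.
    """
    hidden = ffn_hidden_dim(dim)
    attention = 4 * dim * dim      # wq, wk, wv, wo
    mlp = 3 * dim * hidden         # w1, w2, w3 of the SwiGLU MLP
    norms = 2 * dim                # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embedding = vocab_size * dim   # assumed shared with the classifier
    final_norm = dim
    return embedding + n_layers * per_layer + final_norm


if __name__ == "__main__":
    for name, dim, n_layers in [("OG", 288, 6), ("110M", 768, 12)]:
        total = estimate_params(dim, n_layers)
        print(f"{name}: {total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the sketch gives 15,191,712 parameters for the `OG` row and 109,529,856 for the `110M` row, in line with the 15M and 110M labels. Different vocabulary or weight-sharing choices would shift the totals.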
## howto