From 1d14cb8dd8884eefa3f15d06263ec4ab95a4b703 Mon Sep 17 00:00:00 2001 From: Andrej Karpathy Date: Sun, 13 Aug 2023 03:19:35 +0000 Subject: [PATCH] add note about 4096 vs 32000 token size on tinystories --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 95fd98a..331bb7a 100644 --- a/README.md +++ b/README.md @@ -163,6 +163,8 @@ python tinystories.py pretokenize --vocab_size=4096 The `train_vocab` stage will call the `train_vocab.sh` script, which calls the `sentencepiece` library to train the tokenizer, storing it in a new file `data/tok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size. +A quick note of interest is that vocab size of 4096 trained specifically on tinystories creates integer sequences with about the same sequence length per example as the default Llama 2 tokenizer of 32000 tokens! This means that our custom, tailored tokenizer is a lot better adapted to our specific text, and can compress it very effectively. So our trained models are smaller and faster. + Now that we have pretokenized the dataset with our custom tokenizer, we can train the model. The training script `train.py` doesn't care about the exact tokens, it only cares about the vocabulary size so it can correctly initialize the model. So when training your model, make sure to pass in ```