Merge branch 'master' into feature/chat

Andrej
2023-08-25 07:58:33 -07:00
committed by GitHub
4 changed files with 21 additions and 14 deletions
+3 -2
@@ -8,7 +8,7 @@ Train the Llama 2 LLM architecture in PyTorch then inference it with one simple
As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B. Work on model quantization is currently ongoing.
-Please note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. Compred to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.
+Please note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https://github.com/karpathy/nanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https://github.com/ggerganov/llama.cpp) for inspiring this project. Compared to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.
## feel the magic
@@ -175,7 +175,7 @@ python tinystories.py train_vocab --vocab_size=4096
python tinystories.py pretokenize --vocab_size=4096
```
-The `train_vocab` stage will call the `train_vocab.sh` script, which calls the `sentencepiece` library to train the tokenizer, storing it in a new file `data/tok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size.
+The `train_vocab` stage will call the `sentencepiece` library to train the tokenizer, storing it in a new file `data/tok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size.
A quick note of interest is that a vocab size of 4096 trained specifically on tinystories creates integer sequences with about the same sequence length per example as the default Llama 2 tokenizer of 32000 tokens! This means that our custom, tailored tokenizer is a lot better adapted to our specific text, and can compress it very effectively. So our trained models are smaller and faster.
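Note on the BPE description above: the merge loop it describes can be sketched in a few lines of Python. This is a toy illustration only, not what `sentencepiece` does internally. The function names (`most_common_pair`, `merge_pair`, `train_bpe`) and the choice to number new tokens from 256 upward are assumptions made for this example.

```python
from collections import Counter


def most_common_pair(ids):
    """Return the most frequent adjacent pair of token ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Start from raw UTF-8 bytes and greedily merge the most common pair."""
    ids = list(text.encode("utf-8"))  # byte values 0..255 are the base tokens
    merges = {}
    for step in range(num_merges):
        pair = most_common_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # assumption: new tokens are numbered after the bytes
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


if __name__ == "__main__":
    ids, merges = train_bpe("aaabdaaabac", num_merges=3)
    print(len(ids), merges)
```

On the 11-byte string `aaabdaaabac`, three merges leave 5 tokens. The same effect, at scale, is why a vocabulary trained on the target text yields shorter sequences.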
@@ -339,6 +339,7 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
- runq.c (int8 quantization) add
- run.cu (CUDA) investigate and merge
- add more tests inside [test.c](test.c)
- add Engine class for use in sample.py that does efficient inference in PyTorch, e.g. KV cache keeping
- make it easier to add a new dataset with not too much pain
- (LoRA) finetuning and export of Llama 2 models
+1 -1
@@ -949,7 +949,7 @@ int main(int argc, char *argv[]) {
// build the Transformer via the model .bin file
Transformer transformer;
build_transformer(&transformer, checkpoint_path);
-if (steps == 0) steps = transformer.config.seq_len; // override to ~max length
+if (steps == 0 || steps > transformer.config.seq_len) steps = transformer.config.seq_len; // override to ~max length
// build the Tokenizer via the tokenizer .bin file
Tokenizer tokenizer;
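Note on the hunk above: the new condition treats `0` as "use the maximum" and also caps any larger request at `seq_len`. A minimal Python sketch of the same rule follows; `resolve_steps` is a hypothetical helper name that does not exist in the repo, and the sketch mirrors only the condition shown in the diff.

```python
def resolve_steps(steps: int, seq_len: int) -> int:
    """Mirror the run.c condition: 0 or anything above seq_len becomes seq_len."""
    if steps == 0 or steps > seq_len:
        return seq_len
    return steps


if __name__ == "__main__":
    for requested in (0, 64, 1000):
        print(requested, "->", resolve_steps(requested, seq_len=256))
```

Capping here, before generation starts, presumably keeps the loop from running past the positions the model was configured for.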
+16 -10
@@ -13,6 +13,7 @@ from functools import partial
import numpy as np
import requests
+import sentencepiece as spm
import torch
import torch.distributed as dist
from tqdm import tqdm
@@ -97,16 +98,21 @@ def train_vocab(vocab_size):
of.write(text + "\n")
print(f"Size is: {os.path.getsize(tiny_file) / 1024 / 1024:.2f} MB")
-# 2) run the train_vocab.sh script that trains the sentencepiece model
-print("Will now train the vocab with:")
-cmd = f"bash train_vocab.sh {tiny_file} {prefix} {vocab_size}"
-print(cmd)
-print("OK? [y/N] ")
-dec = input()
-if dec.lower() != "y":
-    print("Exiting...")
-    return
-os.system(cmd)
+# 2) train the sentencepiece model
+print("Will now train the vocab...")
+spm.SentencePieceTrainer.train(input=tiny_file,
+                               model_prefix=prefix,
+                               model_type="bpe",
+                               vocab_size=vocab_size,
+                               self_test_sample_size=0,
+                               input_format="text",
+                               character_coverage=1.0,
+                               num_threads=os.cpu_count(),
+                               split_digits=True,
+                               allow_whitespace_only_pieces=True,
+                               byte_fallback=True,
+                               unk_surface=r" \342\201\207 ",
+                               normalization_rule_name="identity")
# 3) optional cleanup, ask the user if they'd like to delete tiny.txt
dec = input(f"Delete the temporary file {tiny_file}? [y/N] ")
+1 -1
@@ -271,7 +271,7 @@ while True:
"loss/val": losses["val"],
"lr": lr,
"mfu": running_mfu * 100, # convert to percentage
-}
+}, step = iter_num
)
except Exception as e:
print(f"logging to wandb failed: {e}")