llama2.c

Author	SHA1	Message	Date
Andrej Karpathy	4c6f0af9ff	Add the ability to train a custom sentencepiece tokenizer with a given vocab_size and to pretokenize with it. Some more changes are still needed in train.py and, of course, run.c before this branch can be merged. This is done in a somewhat ugly but fully backwards-compatible way: when a custom tokenizer is used, a whole new directory structure is created for it.	2023-08-11 03:58:22 +00:00
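
The backwards-compatible layout described in this commit message could look roughly like the sketch below. This is only an illustration, not the code from the commit: the function name `pretok_dir`, the `data` root, the `pretok` and `tok{vocab_size}` directory names, and the convention that `vocab_size == 0` selects the default tokenizer are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical data root; the real repository layout may differ.
DATA_DIR = Path("data")


def pretok_dir(vocab_size: int, data_dir: Path = DATA_DIR) -> Path:
    """Pick the directory that holds pretokenized shards.

    Assumed convention (not confirmed by the commit): vocab_size == 0
    means "use the default tokenizer" and keeps the existing location,
    so old workflows are untouched. Any positive vocab_size gets its
    own directory, so custom tokenizers never overwrite default data.
    """
    if vocab_size < 0:
        raise ValueError("vocab_size must be non-negative")
    if vocab_size == 0:
        # Legacy path: unchanged for users of the default tokenizer.
        return data_dir / "pretok"
    # Custom tokenizer: a separate tree keyed by vocabulary size.
    return data_dir / f"tok{vocab_size}"


if __name__ == "__main__":
    print(pretok_dir(0))
    print(pretok_dir(4096))
```

Keying the directory on the vocabulary size keeps the default path stable while letting several custom tokenizers coexist side by side.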