8 Commits

Author SHA1 Message Date
Andrej Karpathy 096325b66c bring back num_threads 2023-08-24 03:09:55 +00:00
Jani Monoses fe9b9f2f15 Train vocab in Python 2023-08-23 19:10:28 +03:00
rahulschand fbefeec1b1 add assert message to give better warning 2023-08-19 13:05:26 +05:30
Mihai Nadăș 570789aa04 Fixes https://github.com/karpathy/llama2.c/issues/280
There was a small bug in tinystories.py, described here: https://github.com/karpathy/llama2.c/issues/280

This commit simply passes vocab_size to get_tokenizer_model_path to avoid a silent crash when processing shards (in process_shard).
2023-08-13 17:49:10 +03:00
Andrej Karpathy b0cfa2458d ok i can train and sample a model with a custom tokenizer 2023-08-11 16:47:29 +00:00
Andrej Karpathy 4c6f0af9ff add the ability to train a custom sentencepiece tokenizer with a given vocab_size, and pretok with it. some more changes still needed to merge this branch, in train.py and ofc run.c. did this in a sadly a bit ugly, but fully backwards compatible way. basically when we use a custom tokenizer we create a whole new directory structure for that 2023-08-11 03:58:22 +00:00
Milos Cubrilo af3f3a7b31 Speed up tinystories pretokenize command 2023-07-29 03:08:33 +02:00
Andrej Karpathy 5b161abb9a somewhere ~20 hours later 2023-07-23 05:23:45 +00:00