Commit Graph

294 Commits

Author SHA1 Message Date
Andrej 8b472ded1f Merge pull request #272 from karpathy/feature/customtokenizer
Big Change: Custom Tokenizer training: add the ability to train custom tokenizers instead of using the pretrained Llama 2 tokenizer. This is useful in custom, narrow-domain LLMs because smaller vocab sizes make much smaller, faster, and potentially more capable models. For example, in tinystories a vocab size 4096 custom tokenizer compresses the input text sequences about as well as the Llama 2 tokenizer with vocab size 32000. The result is also "safer" because a badly trained model can't accidentally e.g. output some random chinese character and rapidly go "off the rails" in subsequent tokens.
2023-08-12 20:31:21 -07:00
Andrej Karpathy 9ff459b925 todo changes 2023-08-13 03:24:31 +00:00
Andrej Karpathy 1d14cb8dd8 add note about 4096 vs 32000 token size on tinystories 2023-08-13 03:19:35 +00:00
Andrej Karpathy fe49eb222c readme for custom tokenizers 2023-08-13 03:16:18 +00:00
Andrej Karpathy 9c3cfb46a3 make default be the llama2 tokenizer 2023-08-13 03:08:07 +00:00
Andrej Karpathy 00a61dc7f9 remove the tinyshakespeare dataset until i can bring it back later in a nicer form, otherwise right now we just have a ton of copy paste code here 2023-08-13 02:18:30 +00:00
Andrej Karpathy f5fc0c245f final piece: run.c support for new tokenizer, super ez 2023-08-13 02:12:13 +00:00
Andrej Karpathy ea4cedc588 add ability to export custom tokenizer to .bin format for run.c file 2023-08-13 02:00:19 +00:00
Andrej Karpathy b0cfa2458d ok i can train and sample a model with a custom tokenizer 2023-08-11 16:47:29 +00:00
Andrej Karpathy 4c6f0af9ff add the ability to train a custom sentencepiece tokenizer with a given vocab_size, and pretok with it. some more changes still needed to merge this branch, in train.py and ofc run.c. did this in a sadly bit ugly, but fully backwards compatible way. basically when we use custom tokenizer we create a whole new directory structure for that 2023-08-11 03:58:22 +00:00
Andrej Karpathy c42641205f turn off topp sampling by default because it is a bit too slow to be the default. it is likely that turning it on, e.g. -p 0.9 is mildly higher quality and safer samples, but this comes at a cost of too much performance in double digit percent sometimes, for it to be on by default i think... 2023-08-10 15:23:05 +00:00
Andrej Karpathy 3f69c6cdc4 change the default to use runfast, which imo works just fine 2023-08-10 05:06:49 +00:00
Andrej 5f8068fd43 Merge pull request #260 from madroidmaq/master
Add Jupyter notebook for easier feel the magic
2023-08-09 22:03:36 -07:00
Andrej f60285ee78 Merge pull request #264 from trrahul/master
Added C# port information in readme
2023-08-09 22:00:23 -07:00
Andrej 04121d1b85 Merge pull request #256 from rdentato/patch-rng-seed
Patch rng seed
2023-08-09 21:56:07 -07:00
Rahul TR 256e7f885b Added C# port information in readme 2023-08-09 17:59:47 +05:30
Andrej Karpathy e36e3fb50d Merge branch 'master' of github.com:karpathy/llama2.c 2023-08-09 02:08:37 +00:00
Andrej Karpathy 96873b0274 refine todos section make more concrete and sort 2023-08-09 02:08:33 +00:00
madroid 9713609023 Add Colab GUI: select model/temperature/prompt/etc 2023-08-08 20:29:53 +08:00
madroid 27c5fc76b1 Add Google Colab button 2023-08-08 01:50:19 +08:00
madroid 57ca3c0401 Add run.ipynb for easier feel the magic 2023-08-08 01:32:51 +08:00
rdentato ff6a2f0a7a Reset the #include <omp.h> 2023-08-07 07:28:03 +00:00
rdentato e49c16caa5 Changed how rng_seed is handled. Now 0 is treated as time(NULL). 2023-08-07 06:51:57 +00:00
Remo Dentato 2e5fad83da Merge branch 'karpathy:master' into master 2023-08-07 07:57:42 +02:00
Andrej 3c3b19b14c Merge pull request #242 from tairov/llama2-py
Add a link to simple one file pure Python port
2023-08-06 19:51:30 -07:00
Andrej f4f4cae4cb Merge pull request #241 from danielgrittner/master
add a Rust port
2023-08-06 19:51:13 -07:00
Andrej 09de2cc4ca Merge pull request #250 from npinto/master-1
FIX: model.generate(); forward() only returns logits now.
2023-08-06 18:43:01 -07:00
Nicolas Pinto 98b515e44d FIX: model.generate()
This patch fixes a simple bug in `generate()` due to model's `forward()` only returning logits and not losses since `f2e34e6b0ac55accd6ba930a04c6f683f5158b29`.
2023-08-06 14:48:47 -07:00
rdentato 999b1bf776 Added conditional include of the OpenMP header. 2023-08-06 21:07:09 +00:00
Aydyn Tairov 2297d158e3 Fix link to a github profile 2023-08-06 21:47:05 +01:00
Daniel Grittner 512f039d5d Merge branch 'master' into master 2023-08-06 19:55:43 +02:00
Aydyn Tairov 6734eaeff5 Rebase changes to master 2023-08-06 18:47:05 +01:00
Aydyn Tairov 7178facb75 Rebase changes to master 2023-08-06 18:45:47 +01:00
Andrej Karpathy a7a3aa09b8 Merge branch 'master' of github.com:karpathy/llama2.c 2023-08-06 16:33:36 +00:00
Andrej Karpathy 79791f39b4 let's start respecting the BOS token. Don't print it explicitly, and terminate sequence if it appears. This makes sense especially after the recent addition of prompting. Also be careful with timings and making sure they come out right if we exit early in this data-dependent manner 2023-08-06 16:33:23 +00:00
Andrej Karpathy 4e8a3e8d5d fix style issue space with stderr printing 2023-08-06 15:51:58 +00:00
Andrej 7af81ded7e Merge pull request #244 from madroidmaq/master
Update README.md: format notable forks
2023-08-06 08:43:24 -07:00
Andrej a25958fd45 Merge pull request #245 from rdentato/patch-stderr
Errors and info on stderr
2023-08-06 08:42:09 -07:00
Madroid Ma 1f53735d12 Merge branch 'karpathy:master' into master 2023-08-06 18:18:36 +08:00
rdentato 9cfb7efb85 Changed all the printf() for error/info messages so that they print on stderr. 2023-08-06 09:53:02 +00:00
madroid baefaaaf76 Update README.md: add notable forks author's link 2023-08-06 17:42:31 +08:00
Daniel Grittner fcb4cdef8b add a Rust port 2023-08-06 10:44:48 +02:00
Andrej Karpathy 623894f5da fix bug, have to use raw_model not model to access the loss 2023-08-06 07:55:46 +00:00
Andrej Karpathy 65b0846637 error on seed=0 2023-08-06 07:31:21 +00:00
Andrej Karpathy 8931d5092e add nucleus sampling. it costs lines of code, but i think this is the default best way to sample, so it is important to have 2023-08-06 07:22:39 +00:00
madroid 8c1f1b280f Update README.md: format notable forks 2023-08-06 14:23:57 +08:00
Andrej Karpathy 49e3ff6d08 update makefile to use correct arg call after our argparse update 2023-08-05 23:11:11 +00:00
Andrej Karpathy a1037d79ee turned on trimTrailingWhitespace in my vscode sorry about that 2023-08-05 22:46:35 +00:00
Andrej a2962b9a0c Merge pull request #195 from clebert/prompt-tokens-size
Adjust `malloc` size for `prompt_tokens`
2023-08-05 15:29:33 -07:00
Andrej bdf3a6c22c Merge pull request #167 from mzcu/pretokenize-speedup
Speed up tinystories pretokenize command
2023-08-05 15:14:51 -07:00