Merge branch 'master' into better-rng
@@ -0,0 +1,96 @@
name: Continuous Integration

on:
  push:
    branches:
      - master
    paths: ['.github/workflows/**', '**/Makefile', '**/*.c', '**/*.h']
  pull_request:
    types: [opened, synchronize, reopened]
    paths: ['**/Makefile', '**/*.c', '**/*.h']

env:
  BRANCH_NAME: ${{ github.head_ref || github.ref_name }}

jobs:
  # check basic builds to avoid breaking changes
  ubuntu-focal-make:
    runs-on: ubuntu-20.04

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v3

      - name: Dependencies
        id: depends
        run: |
          sudo apt-get update
          sudo apt-get install build-essential -y

      - name: Build
        id: make_build
        run: |
          make

      - name: Build runfast
        id: make_build_runfast
        run: |
          make runfast

  macOS-latest-make:
    runs-on: macos-latest

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v3

      - name: Dependencies
        id: depends
        continue-on-error: true
        run: |
          brew update

      - name: Build
        id: make_build
        run: |
          make

      - name: Build runfast
        id: make_build_runfast
        run: |
          make runfast

      - name: Build clang
        id: make_build_clang
        run: |
          make run CC=clang

  windows-latest-make:
    runs-on: windows-latest

    strategy:
      matrix:
        arch:
          - amd64
          - amd64_x86
          - amd64_arm64

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v3

      - name: Setup MSBuild
        uses: microsoft/setup-msbuild@v1

      - name: Setup MSVC ${{ matrix.arch }}
        uses: ilammy/msvc-dev-cmd@v1
        with:
          arch: ${{ matrix.arch }}

      - name: Build ${{ matrix.arch }}
        id: build_msvc
        run: |
          .\build_msvc.bat
@@ -36,6 +36,15 @@ runomp: run.c
win64:
	x86_64-w64-mingw32-gcc-win32 -Ofast -D_WIN32 -o run.exe -I. run.c win.c

# compiles with gnu11 standard flags for amazon linux, coreos, etc. compatibility
.PHONY: rungnu
rungnu:
	$(CC) -Ofast -std=gnu11 -o run run.c -lm

.PHONY: runompgnu
runompgnu:
	$(CC) -Ofast -fopenmp -std=gnu11 run.c -lm -o run

.PHONY: clean
clean:
	rm -f run

@@ -34,11 +34,20 @@ This still runs at interactive rates and samples more coherent and diverse stori
|
||||
|
||||
> Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals. She poured some tea into a tiny teapot and put it on top of the teapot. Suddenly, her little brother Max came into the room and wanted to join the tea party too. Lily didn't want to share her tea and she told Max to go away. Max started to cry and Lily felt bad. She decided to yield her tea party to Max and they both shared the teapot. But then, something unexpected happened. The teapot started to shake and wiggle. Lily and Max were scared and didn't know what to do. Suddenly, the teapot started to fly towards the ceiling and landed on the top of the bed. Lily and Max were amazed and they hugged each other. They realized that sharing was much more fun than being selfish. From that day on, they always shared their tea parties and toys.
|
||||
|
||||
You can also prompt the model with a prefix (sadly, because this is currently done via positional arguments, you also have to specify temperature 1.0 and 256 steps, before you enter the prompt):
|
||||
|
||||
```bash
|
||||
./run stories42M.bin 1.0 256 "One day, Lily met a Shoggoth"
|
||||
```
|
||||
|
||||
> One day, Lily met a Shoggoth. He was very shy, but was also very generous. Lily said “Hello Shoggy! Can I be your friend?” Shoggy was happy to have a friend and said “Yes, let’s explore the universe together!” So they set off on a journey to explore the universe. As they travelled, Shoggy was happy to explain to Lily about all the wonderful things in the universe. At the end of the day, Lily and Shoggy had gathered lots of wonderful things from the universe, and they both felt very proud. They promised to explore the universe as one big pair and to never stop being generous to each other.
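To illustrate why the prompt has to come last, here is a minimal sketch of positional argument handling in C. The `CliArgs` struct and the `parse_cli_args` function are hypothetical names made up for this example, not part of `run.c`. The defaults (temperature 0.9, 256 steps) mirror the ones in `main`, and parsing the temperature with `atof` is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical container for the positional arguments (not a type from run.c).
typedef struct {
    const char *checkpoint;  // required, e.g. stories42M.bin
    float temperature;       // optional, defaults to 0.9
    int steps;               // optional, defaults to 256
    const char *prompt;      // optional, NULL means "no prompt"
} CliArgs;

// Fills `out` from argv. Returns 0 on success, 1 if the checkpoint is missing.
int parse_cli_args(int argc, char *argv[], CliArgs *out) {
    assert(out != NULL);
    out->checkpoint = NULL;
    out->temperature = 0.9f;
    out->steps = 256;
    out->prompt = NULL;

    if (argc < 2) { return 1; }  // the checkpoint is the only required argument
    out->checkpoint = argv[1];
    // Each later argument is only reachable if all earlier ones are present,
    // which is why a prompt forces you to spell out temperature and steps.
    if (argc >= 3) { out->temperature = (float)atof(argv[2]); }
    if (argc >= 4) { out->steps = atoi(argv[3]); }
    if (argc >= 5) { out->prompt = argv[4]; }
    return 0;
}
```

Because the arguments are matched purely by position, there is no way to skip temperature or steps and still reach the prompt slot.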
|
||||
|
||||
There is also an even better 110M param model available, see [models](#models).
|
||||
|
||||
## Meta's Llama 2 models
|
||||
|
||||
As the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 2 checkpoints by following the [Meta instructions](https://github.com/facebookresearch/llama). Once we have those checkpoints, we have to convert them into the llama2.c format. For this we use the `export_meta_llama_bin.py` file, e.g. for 7B model:
|
||||
As the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 2 checkpoints by following the [Meta instructions](https://github.com/facebookresearch/llama). Once we have those checkpoints, we have to convert them into the llama2.c format.
|
||||
For this we need to install the python dependencies (`pip install -r requirements.txt`) and then use the `export_meta_llama_bin.py` file, e.g. for 7B model:
|
||||
|
||||
```bash
|
||||
python export_meta_llama_bin.py path/to/llama/model/7B llama2_7b.bin
|
||||
@@ -50,7 +59,7 @@ The export will take ~10 minutes or so and generate a 26GB file (the weights of
|
||||
./run llama2_7b.bin
|
||||
```
|
||||
|
||||
This ran at about 4 tokens/s compiled with OpenMP on 96 threads on my CPU Linux box in the cloud. (On my MacBook Air M1, currently it's closer to 30 seconds per token if you just build with `make runfast`.) Example output:
|
||||
This ran at about 4 tokens/s compiled with [OpenMP](#OpenMP) on 96 threads on my CPU Linux box in the cloud. (On my MacBook Air M1, currently it's closer to 30 seconds per token if you just build with `make runfast`.) Example output:
|
||||
|
||||
> The purpose of this document is to highlight the state-of-the-art of CoO generation technologies, both recent developments and those in commercial use. The focus is on the technologies with the highest merit to become the dominating processes of the future and therefore to be technologies of interest to S&T ... R&D. As such, CoO generation technologies developed in Russia, Japan and Europe are described in some depth. The document starts with an introduction to cobalt oxides as complex products and a short view on cobalt as an essential material. The document continues with the discussion of the available CoO generation processes with respect to energy and capital consumption as well as to environmental damage.
|
||||
|
||||
@@ -119,46 +128,44 @@ $ pytest
|
||||
|
||||
## performance
|
||||
|
||||
*(NOTE: this guide is not great because I personally spend a lot of my time in Python land and don't have an amazing understanding of a lot of these features and flags. If someone does and is willing to help document and briefly describe some of these and their tradeoffs, I'd welcome a PR)*
|
||||
|
||||
There are many ways to potentially speed up this code depending on your system. Here we document a few together with a high-level guide on what they do. Here's again the default way to compile, but using -O3:
|
||||
There are many ways to potentially speed up this code depending on your system. Have a look at the [Makefile](Makefile), which contains a lot of notes. The `make run` command currently uses the `-O3` optimization by default, i.e.:
|
||||
|
||||
```bash
|
||||
gcc -O3 -o run run.c -lm
|
||||
```
|
||||
|
||||
-O3 includes optimizations that are expensive in terms of compile time and memory usage. Including vectorization, loop unrolling, and predicting branches. Here's a few more to try.
|
||||
`-O3` includes optimizations that are expensive in terms of compile time and memory usage, including vectorization, loop unrolling, and branch prediction.
|
||||
|
||||
`-Ofast` Run additional optimizations which may break compliance with the C/IEEE specifications, in addition to `-O3`. See [the GCC docs](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) for more information.
|
||||
For much better performance, try compiling with `make runfast`. This turns on the `-Ofast` flag, which enables everything in `-O3` plus additional optimizations that may break compliance with the C/IEEE specifications. See [the GCC docs](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) for more information.
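One concrete thing that can change under `-Ofast` is the order of floating point additions: in GCC, `-Ofast` turns on `-ffast-math`, which allows the compiler to reassociate sums. The sketch below (function names are made up for this example) shows that the two orderings are not equivalent in IEEE float arithmetic, so results can differ slightly between a strict build and a fast build.

```c
#include <assert.h>

// Sum three floats left to right: (a + b) + c.
float sum_left(float a, float b, float c) {
    volatile float t = a + b;  // volatile pins the intermediate to float precision
    return t + c;
}

// Sum three floats right to left: a + (b + c).
float sum_right(float a, float b, float c) {
    volatile float t = b + c;
    return a + t;
}

// Demonstrates that the grouping matters: 1.0f is far below the rounding
// step of 1e20f, so it is lost when it is added to -1e20f first.
void check_reassociation_demo(void) {
    assert(sum_left(1e20f, -1e20f, 1.0f) == 1.0f);
    assert(sum_right(1e20f, -1e20f, 1.0f) == 0.0f);
}
```

For inference this kind of difference is usually harmless, but it is the reason outputs are not guaranteed to match bit for bit across optimization levels.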
|
||||
|
||||
`-march=native` Compile the program to use the architecture of the machine you're compiling on rather than a more generic CPU. This may enable additional optimizations and hardware-specific tuning such as improved vector instructions/width.
|
||||
Try `-march=native` to compile the program to use the architecture of the machine you're compiling on rather than a more generic CPU. This may enable additional optimizations and hardware-specific tuning such as improved vector instructions/width.
|
||||
|
||||
The fastest throughput I saw so far on my MacBook Air (M1) is with:
|
||||
|
||||
```bash
|
||||
gcc -Ofast -o run run.c -lm
|
||||
```
|
||||
The fastest throughput I saw so far on my MacBook Air (M1) is with `make runfast`.
|
||||
|
||||
You can also experiment with replacing `gcc` with `clang`.
|
||||
|
||||
**OpenMP** Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention. You can compile e.g. like so:
|
||||
### OpenMP
|
||||
Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors.
|
||||
You'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on Ubuntu). Note that I was not able to get improvements from OpenMP on my MacBook. Once these are installed, you can compile with `make runomp`, which does:
|
||||
|
||||
```bash
|
||||
clang -Ofast -fopenmp -march=native run.c -lm -o run
|
||||
```
|
||||
|
||||
You can try swapping clang/gcc, and may try to leave out -march=native. However, when you run inference make sure to use OpenMP flags to set the number of threads, e.g.:
|
||||
When you run inference, make sure to set the number of threads through the OpenMP environment variables, e.g.:
|
||||
|
||||
```bash
|
||||
OMP_NUM_THREADS=4 ./run out/model.bin
|
||||
```
|
||||
|
||||
Depending on your system resources you may want to tweak these hyperparameters. (TODO: I am not intimately familiar with OpenMP and its configuration, if someone would like to flesh out this section I would welcome a PR).
|
||||
Depending on your system resources you may want to tweak this setting and use more threads. More is not always better, though; the effect is usually a bit U-shaped.
|
||||
|
||||
## platforms
|
||||
|
||||
On **Windows**, use `build_msvc.bat` in a Visual Studio Command Prompt to build with MSVC, or use `make win64` to build the Windows target with the MinGW compiler toolchain from Linux or Windows. The MSVC build automatically uses OpenMP with the maximum number of threads appropriate for your CPU unless you set the `OMP_NUM_THREADS` environment variable.
|
||||
|
||||
On **CentOS 7** and **Amazon Linux 2018**, use the `rungnu` Makefile target (`make rungnu`), or `make runompgnu` to build with OpenMP.
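If you want to check how many threads an OpenMP build will actually use on a given platform, a small helper along these lines can be handy. This is a sketch and not part of the project: `max_worker_threads` is a hypothetical name, and it relies on the standard `_OPENMP` macro so that it also compiles in builds without OpenMP.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of threads a parallel region would use by default.
// In an OpenMP build this reflects OMP_NUM_THREADS when it is set;
// in a build without OpenMP it is always 1.
int max_worker_threads(void) {
#ifdef _OPENMP
    int n = omp_get_max_threads();
    assert(n >= 1);
    return n;
#else
    return 1;
#endif
}
```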
|
||||
|
||||
## ack
|
||||
|
||||
I trained the llama2.c storyteller models on a 4X A100 40GB box graciously provided by the excellent [Lambda labs](https://lambdalabs.com/service/gpu-cloud), thank you.
|
||||
@@ -189,13 +196,19 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
|
||||
|
||||
- [llama2.rs](https://github.com/gaxler/llama2.rs) by @gaxler: a Rust port of this project
|
||||
- [go-llama2](https://github.com/tmc/go-llama2) by @tmc: a Go port of this project
|
||||
- [llama2.go](https://github.com/nikolaydubina/llama2.go) by @nikolaydubina: a Go port of this project
|
||||
- [llama2.go](https://github.com/haormj/llama2.go) by @haormj: a Go port of this project
|
||||
- [llama2.go](https://github.com/saracen/llama2.go) by @saracen: a Go port of this project
|
||||
- [llama2.c-android](https://github.com/Manuel030/llama2.c-android) by @Manuel030: adds Android binaries of this project
|
||||
- [llama2.cpp](https://github.com/leloykun/llama2.cpp) by @leloykun: a C++ port of this project
|
||||
|
||||
## unsorted todos
|
||||
|
||||
- why is there a leading space in C sampling code when we `./run`?
|
||||
- support Llama 2 Chat models, and tune run.c to Chat UI/UX
|
||||
- support Llama 2 7B Chat model and tune run.c to Chat UI/UX
|
||||
- speed up 7B Llama 2 models sufficiently to work at interactive rates on Apple Silicon MacBooks
|
||||
- possibly include emscripten / web backend (as seen in @gg PR)
|
||||
- currently the project only runs in fp32, want to explore more reduced precision inference.
|
||||
- currently the project only runs in fp32, how easy would it be to support different precisions?
|
||||
- look into quantization and what would be involved
|
||||
- todo multiquery support? doesn't seem as useful for smaller models that run on CPU (?)
|
||||
- todo support inferencing beyond max_seq_len steps, have to think through the kv cache
|
||||
- why is MFU so low (~10%) on my A100 40GB for training?
|
||||
|
||||
@@ -193,6 +193,7 @@ void softmax(float* x, int size) {
|
||||
|
||||
void matmul(float* xout, float* x, float* w, int n, int d) {
|
||||
// W (d,n) @ x (n,) -> xout (d,)
|
||||
// by far the most amount of time is spent inside this little function
|
||||
int i;
|
||||
#pragma omp parallel for private(i)
|
||||
for (i = 0; i < d; i++) {
|
||||
@@ -205,7 +206,7 @@ void matmul(float* xout, float* x, float* w, int n, int d) {
|
||||
}
|
||||
|
||||
void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights* w) {
|
||||
|
||||
|
||||
// a few convenience variables
|
||||
float *x = s->x;
|
||||
int dim = p->dim;
|
||||
@@ -222,7 +223,7 @@ void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights*
|
||||
|
||||
// forward all the layers
|
||||
for(int l = 0; l < p->n_layers; l++) {
|
||||
|
||||
|
||||
// attention rmsnorm
|
||||
rmsnorm(s->xb, x, w->rms_att_weight + l*dim, dim);
|
||||
|
||||
@@ -316,7 +317,7 @@ void transformer(int token, int pos, Config* p, RunState* s, TransformerWeights*
|
||||
for (int i = 0; i < hidden_dim; i++) {
|
||||
s->hb[i] = s->hb[i] * (1.0f / (1.0f + expf(-s->hb[i])));
|
||||
}
|
||||
|
||||
|
||||
// elementwise multiply with w3(x)
|
||||
for (int i = 0; i < hidden_dim; i++) {
|
||||
s->hb[i] = s->hb[i] * s->hb2[i];
|
||||
@@ -347,6 +348,10 @@ unsigned int random_u32() {
|
||||
float random_f32() {
|
||||
return (random_u32() >> 8) / 16777216.0f;
|
||||
}
|
||||
|
||||
// ----------------------------------------------------------------------------
|
||||
// functions to sample the next token from the transformer's predicted distribution
|
||||
|
||||
int sample(float* probabilities, int n) {
|
||||
// sample index from probabilities, they must sum to 1
|
||||
float r = random_f32();
|
||||
@@ -372,20 +377,76 @@ int argmax(float* v, int n) {
|
||||
}
|
||||
return max_i;
|
||||
}
|
||||
// ----------------------------------------------------------------------------
|
||||
// byte pair encoding (BPE) tokenizer, encodes strings into tokens so we can prompt
|
||||
|
||||
int str_lookup(char *str, char **vocab, int vocab_size) {
|
||||
// find the first perfect match for str in vocab, return its index or -1 if not found
|
||||
for (int i = 0; i < vocab_size; i++) {
|
||||
if (strcmp(str, vocab[i]) == 0) {
|
||||
return i;
|
||||
}
|
||||
}
|
||||
return -1;
|
||||
}
|
||||
|
||||
void bpe_encode(char *text, char **vocab, float *vocab_scores, int vocab_size, unsigned int max_token_length, int *tokens, int *n_tokens) {
|
||||
|
||||
// a temporary buffer to merge two consecutive tokens
|
||||
char* str_buffer = malloc((max_token_length*2+1) * sizeof(char)); // *2 for concat, +1 for null terminator
|
||||
|
||||
// first encode every individual byte in the input string
|
||||
*n_tokens = 0; // the number of tokens
|
||||
for (char *c = text; *c != '\0'; c++) {
|
||||
sprintf(str_buffer, "%c", *c);
|
||||
int id = str_lookup(str_buffer, vocab, vocab_size);
|
||||
if (id == -1) { printf("not good\n"); exit(1);}
|
||||
tokens[*n_tokens] = id;
|
||||
(*n_tokens)++;
|
||||
}
|
||||
|
||||
// merge the best consecutive pair each iteration, according to the scores in vocab_scores
|
||||
while (1) {
|
||||
float best_score = -1e10;
|
||||
int best_id = -1;
|
||||
int best_idx = -1;
|
||||
|
||||
for (int i=0; i < (*n_tokens-1); i++) {
|
||||
// check if we can merge the pair (tokens[i], tokens[i+1])
|
||||
sprintf(str_buffer, "%s%s", vocab[tokens[i]], vocab[tokens[i+1]]);
|
||||
int id = str_lookup(str_buffer, vocab, vocab_size);
|
||||
if (id != -1 && vocab_scores[id] > best_score) {
|
||||
// this merge pair exists in vocab! record its score and position
|
||||
best_score = vocab_scores[id];
|
||||
best_id = id;
|
||||
best_idx = i;
|
||||
}
|
||||
}
|
||||
|
||||
if (best_idx == -1) {
|
||||
break; // we couldn't find any more pairs to merge, so we're done
|
||||
}
|
||||
|
||||
// merge the consecutive pair (best_idx, best_idx+1) into new token best_id
|
||||
tokens[best_idx] = best_id;
|
||||
// delete token at position best_idx+1, shift the entire sequence back 1
|
||||
for (int i = best_idx+1; i < (*n_tokens-1); i++) {
|
||||
tokens[i] = tokens[i+1];
|
||||
}
|
||||
(*n_tokens)--; // token length decreased
|
||||
}
|
||||
|
||||
free(str_buffer);
|
||||
}
|
||||
|
||||
// ----------------------------------------------------------------------------
|
||||
|
||||
// utilities
|
||||
long time_in_ms() {
|
||||
#if defined _WIN32
|
||||
// windows specific way to get time
|
||||
return GetTickCount();
|
||||
#else
|
||||
// linux specific way to get time
|
||||
struct timespec time;
|
||||
clock_gettime(CLOCK_REALTIME, &time);
|
||||
return time.tv_sec * 1000 + time.tv_nsec / 1000000;
|
||||
#endif
|
||||
}
|
||||
// ----------------------------------------------------------------------------
|
||||
|
||||
int main(int argc, char *argv[]) {
|
||||
|
||||
@@ -393,9 +454,11 @@ int main(int argc, char *argv[]) {
|
||||
char *checkpoint = NULL; // e.g. out/model.bin
|
||||
float temperature = 0.9f; // e.g. 1.0, or 0.0
|
||||
int steps = 256; // max number of steps to run for, 0: use seq_len
|
||||
char *prompt = NULL; // prompt string
|
||||
|
||||
// 'checkpoint' is necessary arg
|
||||
if (argc < 2) {
|
||||
printf("Usage: %s <checkpoint_file> [temperature] [steps]\n", argv[0]);
|
||||
printf("Usage: %s <checkpoint_file> [temperature] [steps] [prompt]\n", argv[0]);
|
||||
return 1;
|
||||
}
|
||||
if (argc >= 2) {
|
||||
@@ -408,6 +471,9 @@ int main(int argc, char *argv[]) {
|
||||
if (argc >= 4) {
|
||||
steps = atoi(argv[3]);
|
||||
}
|
||||
if (argc >= 5) {
|
||||
prompt = argv[4];
|
||||
}
|
||||
|
||||
// seed rng with time. if you want deterministic behavior use temperature 0.0
|
||||
rng_seed = (unsigned int)time(NULL);
|
||||
@@ -415,17 +481,14 @@ int main(int argc, char *argv[]) {
|
||||
// read in the model.bin file
|
||||
Config config;
|
||||
TransformerWeights weights;
|
||||
int fd = 0;
|
||||
float* data = NULL;
|
||||
long file_size;
|
||||
int fd = 0; // file descriptor for memory mapping
|
||||
float* data = NULL; // memory mapped data pointer
|
||||
long file_size; // size of the checkpoint file in bytes
|
||||
{
|
||||
FILE *file = fopen(checkpoint, "rb");
|
||||
if (!file) {
|
||||
printf("Unable to open the checkpoint file %s!\n", checkpoint);
|
||||
return 1;
|
||||
}
|
||||
if (!file) { printf("Couldn't open file %s\n", checkpoint); return 1; }
|
||||
// read in the config header
|
||||
if(fread(&config, sizeof(Config), 1, file) != 1) { return 1; }
|
||||
if (fread(&config, sizeof(Config), 1, file) != 1) { return 1; }
|
||||
// negative vocab size is hacky way of signaling unshared weights. bit yikes.
|
||||
int shared_weights = config.vocab_size > 0 ? 1 : 0;
|
||||
config.vocab_size = abs(config.vocab_size);
|
||||
@@ -446,18 +509,18 @@ int main(int argc, char *argv[]) {
|
||||
|
||||
// read in the tokenizer.bin file
|
||||
char** vocab = (char**)malloc(config.vocab_size * sizeof(char*));
|
||||
float* vocab_scores = (float*)malloc(config.vocab_size * sizeof(float));
|
||||
unsigned int max_token_length;
|
||||
{
|
||||
FILE *file = fopen("tokenizer.bin", "rb");
|
||||
if (!file) {
|
||||
printf("Unable to open the tokenizer file tokenizer.bin! Run "
|
||||
"python tokenizer.py to convert tokenizer.model -> tokenizer.bin\n");
|
||||
return 1;
|
||||
}
|
||||
if (!file) { printf("couldn't load tokenizer.bin\n"); return 1; }
|
||||
if (fread(&max_token_length, sizeof(int), 1, file) != 1) { printf("failed read\n"); return 1; }
|
||||
int len;
|
||||
for (int i = 0; i < config.vocab_size; i++) {
|
||||
if(fread(&len, sizeof(int), 1, file) != 1) { return 1; }
|
||||
if (fread(vocab_scores + i, sizeof(float), 1, file) != 1) { printf("failed read\n"); return 1;}
|
||||
if (fread(&len, sizeof(int), 1, file) != 1) { printf("failed read\n"); return 1; }
|
||||
vocab[i] = (char *)malloc(len + 1);
|
||||
if(fread(vocab[i], len, 1, file) != 1) { return 1; }
|
||||
if (fread(vocab[i], len, 1, file) != 1) { printf("failed read\n"); return 1; }
|
||||
vocab[i][len] = '\0'; // add the string terminating token
|
||||
}
|
||||
fclose(file);
|
||||
@@ -466,46 +529,66 @@ int main(int argc, char *argv[]) {
|
||||
// create and init the application RunState
|
||||
RunState state;
|
||||
malloc_run_state(&state, &config);
|
||||
|
||||
// the current position we are in
|
||||
long start = time_in_ms();
|
||||
int next;
|
||||
int token = 1; // 1 = BOS token in Llama-2 sentencepiece
|
||||
int pos = 0;
|
||||
printf("<s>\n"); // explicit print the initial BOS token (=1), stylistically symmetric
|
||||
|
||||
// process the prompt, if any
|
||||
int *prompt_tokens = NULL;
|
||||
int num_prompt_tokens = 0;
|
||||
if (prompt != NULL) {
|
||||
prompt_tokens = (int*)malloc(config.seq_len * sizeof(int));
|
||||
bpe_encode(prompt, vocab, vocab_scores, config.vocab_size, max_token_length, prompt_tokens, &num_prompt_tokens);
|
||||
}
|
||||
|
||||
// start the main loop
|
||||
long start = 0; // used to time our code, only initialized after first iteration
|
||||
int next; // will store the next token in the sequence
|
||||
int token = 1; // init with token 1 (=BOS), as done in Llama-2 sentencepiece tokenizer
|
||||
int pos = 0; // position in the sequence
|
||||
printf("<s>\n"); // explicit print the initial BOS token for stylistic symmetry reasons
|
||||
while (pos < steps) {
|
||||
|
||||
// forward the transformer to get logits for the next token
|
||||
transformer(token, pos, &config, &state, &weights);
|
||||
|
||||
// sample the next token
|
||||
if(temperature == 0.0f) {
|
||||
// greedy argmax sampling
|
||||
next = argmax(state.logits, config.vocab_size);
|
||||
if(pos < num_prompt_tokens) {
|
||||
// if we are still processing the input prompt, force the next prompt token
|
||||
next = prompt_tokens[pos];
|
||||
} else {
|
||||
// apply the temperature to the logits
|
||||
for (int q=0; q<config.vocab_size; q++) { state.logits[q] /= temperature; }
|
||||
// apply softmax to the logits to get the probabilities for next token
|
||||
softmax(state.logits, config.vocab_size);
|
||||
// we now want to sample from this distribution to get the next token
|
||||
next = sample(state.logits, config.vocab_size);
|
||||
// sample the next token
|
||||
if (temperature == 0.0f) {
|
||||
// greedy argmax sampling: take the token with the highest probability
|
||||
next = argmax(state.logits, config.vocab_size);
|
||||
} else {
|
||||
// apply the temperature to the logits
|
||||
for (int q=0; q<config.vocab_size; q++) { state.logits[q] /= temperature; }
|
||||
// apply softmax to the logits to get the probabilities for next token
|
||||
softmax(state.logits, config.vocab_size);
|
||||
// we sample from this distribution to get the next token
|
||||
next = sample(state.logits, config.vocab_size);
|
||||
}
|
||||
}
|
||||
printf("%s", vocab[next]);
|
||||
|
||||
// following BOS token (1), sentencepiece decoder strips any leading whitespace (see PR #89)
|
||||
char *token_str = (token == 1 && vocab[next][0] == ' ') ? vocab[next]+1 : vocab[next];
|
||||
printf("%s", token_str);
|
||||
fflush(stdout);
|
||||
|
||||
// advance forward
|
||||
token = next;
|
||||
pos++;
|
||||
// init our timer here because the first iteration is slow due to memmap
|
||||
if (start == 0) { start = time_in_ms(); }
|
||||
}
|
||||
|
||||
// report achieved tok/s
|
||||
long end = time_in_ms();
|
||||
printf("\nachieved tok/s: %f\n", steps / (double)(end-start)*1000);
|
||||
printf("\nachieved tok/s: %f\n", (steps-1) / (double)(end-start)*1000);
|
||||
|
||||
// memory and file handles cleanup
|
||||
free_run_state(&state);
|
||||
for (int i = 0; i < config.vocab_size; i++) { free(vocab[i]); }
|
||||
free(vocab);
|
||||
free(vocab_scores);
|
||||
if (prompt_tokens != NULL) free(prompt_tokens);
|
||||
if (data != MAP_FAILED) munmap(data, file_size);
|
||||
if (fd != -1) close(fd);
|
||||
return 0;
|
||||
|
||||
Binary file not shown.
@@ -3,6 +3,7 @@
|
||||
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
|
||||
|
||||
import os
|
||||
import struct
|
||||
from logging import getLogger
|
||||
from typing import List
|
||||
|
||||
@@ -39,26 +40,35 @@ class Tokenizer:
|
||||
return self.sp_model.decode(t)
|
||||
|
||||
def export(self):
|
||||
tokens = []
|
||||
|
||||
# get all the tokens (postprocessed) and their scores as floats
|
||||
tokens, scores = [], []
|
||||
for i in range(self.n_words):
|
||||
|
||||
# decode the token and light postprocessing
|
||||
t = self.sp_model.id_to_piece(i)
|
||||
s = self.sp_model.get_score(i)
|
||||
if i == self.bos_id:
|
||||
t = '\n<s>\n'
|
||||
elif i == self.eos_id:
|
||||
t = '\n</s>\n'
|
||||
elif len(t) == 6 and t.startswith('<0x') and t.endswith('>'):
|
||||
t = chr(int(t[3:5], 16)) # e.g. make '<0x01>' into '\x01'
|
||||
t = t.replace('▁', ' ') # sentencepiece uses this as the whitespace
|
||||
t = t.replace('▁', ' ') # sentencepiece uses this character as whitespace
|
||||
b = t.encode('utf-8') # bytes of this token, utf-8 encoded
|
||||
|
||||
tokens.append(t)
|
||||
tokens.append(b)
|
||||
scores.append(s)
|
||||
|
||||
# record the max token length
|
||||
max_token_length = max(len(t) for t in tokens)
|
||||
|
||||
# write to a binary file
|
||||
with open(TOKENIZER_BIN, 'wb') as f:
|
||||
for token in tokens:
|
||||
bytes = token.encode('utf-8')
|
||||
f.write((len(bytes)).to_bytes(4, 'little')) # write length of bytes
|
||||
f.write(bytes) # write token bytes
|
||||
f.write(struct.pack("I", max_token_length))
|
||||
for bytes, score in zip(tokens, scores):
|
||||
f.write(struct.pack("fI", score, len(bytes)))
|
||||
f.write(bytes)
|
||||
|
||||
if __name__ == "__main__":
|
||||
t = Tokenizer()
|
||||
|
||||
@@ -142,7 +142,7 @@ model_args = dict(
|
||||
vocab_size=32000,
|
||||
multiple_of=multiple_of,
|
||||
max_seq_len=max_seq_len,
|
||||
#dropout=dropout,
|
||||
dropout=dropout,
|
||||
) # start with model_args from command line
|
||||
if init_from == "scratch":
|
||||
# init a new model from scratch
|
||||
@@ -179,7 +179,7 @@ scaler = torch.cuda.amp.GradScaler(enabled=(dtype == "float16"))
|
||||
|
||||
# optimizer
|
||||
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
|
||||
if init_from == "resume":
|
||||
if init_from == "resume" and "optimizer" in checkpoint:
|
||||
optimizer.load_state_dict(checkpoint["optimizer"])
|
||||
checkpoint = None # free up memory
|
||||
|
||||
|
||||
@@ -176,3 +176,11 @@ int munlock(const void *addr, size_t len)
|
||||
|
||||
return -1;
|
||||
}
|
||||
|
||||
// Portable clock_gettime function for Windows
|
||||
int clock_gettime(int clk_id, struct timespec *tp) {
|
||||
DWORD ticks = GetTickCount();
|
||||
tp->tv_sec = ticks / 1000;
|
||||
tp->tv_nsec = (ticks % 1000) * 1000000;
|
||||
return 0;
|
||||
}
|
||||
|
||||
@@ -3,6 +3,7 @@
|
||||
|
||||
#define WIN32_LEAN_AND_MEAN // Exclude rarely-used stuff from Windows headers
|
||||
#include <windows.h>
|
||||
#include <time.h>
|
||||
|
||||
|
||||
// Below code is originally from mman-win32
|
||||
@@ -12,9 +13,9 @@
|
||||
* mman-win32
|
||||
*/
|
||||
|
||||
#ifndef _WIN32_WINNT // Allow use of features specific to Windows XP or later.
|
||||
#define _WIN32_WINNT 0x0501 // Change this to the appropriate value to target other versions of Windows.
|
||||
#endif
|
||||
#ifndef _WIN32_WINNT // Allow use of features specific to Windows XP or later.
|
||||
#define _WIN32_WINNT 0x0501 // Change this to the appropriate value to target other versions of Windows.
|
||||
#endif
|
||||
|
||||
/* All the headers include this file. */
|
||||
#ifndef _MSC_VER
|
||||
@@ -47,12 +48,16 @@ extern "C" {
|
||||
#define MS_SYNC 2
|
||||
#define MS_INVALIDATE 4
|
||||
|
||||
/* Flags for portable clock_gettime call. */
|
||||
#define CLOCK_REALTIME 0
|
||||
|
||||
void* mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);
|
||||
int munmap(void *addr, size_t len);
|
||||
int mprotect(void *addr, size_t len, int prot);
|
||||
int msync(void *addr, size_t len, int flags);
|
||||
int mlock(const void *addr, size_t len);
|
||||
int munlock(const void *addr, size_t len);
|
||||
int clock_gettime(int clk_id, struct timespec *tp);
|
||||
|
||||
#ifdef __cplusplus
|
||||
};
|
||||
|
||||