llama3.c fork

This fork is currently meant to be compiled using the Intel MKL:

$ gcc -DUSE_MKL -ggdb3 -Ofast -march=native -I"${MKLROOT}/include" -L"${MKLROOT}/lib/intel64" -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -lmvec run.c

It is for doing some R&D on inferencing. Interestingly, with FP32 on a Ryzen 7 5800X and a 340-token test prompt, it runs prompt processing about 15% faster than llama.cpp:

llama.cpp performance:

pp tok/s: 46.14
tg tok/s: 1.29

This fork's performance:

pp tok/s: 53.70
tg tok/s: 1.33
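The -DUSE_MKL build presumably routes the heavy matrix work through MKL's BLAS. As a rough illustration (not the fork's exact code; the function name is made up, and the real dispatch, including whether prompt tokens are batched into a GEMM call, may differ), a run.c-style matmul could be handed to cblas_sgemv like this:

#include <mkl.h>

// Illustrative sketch only: run.c-style matmul computes xout = W @ x,
// where W is (d, n) stored row-major, x is (n,), and xout is (d,).
void matmul_mkl(float *xout, float *x, float *w, int n, int d) {
    // xout = 1.0f * W * x + 0.0f * xout
    cblas_sgemv(CblasRowMajor, CblasNoTrans, d, n, 1.0f, w, n, x, 1, 0.0f, xout, 1);
}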

llama3.c - A faithful clone of Karpathy's llama2.c but fully functional with LLaMA 3 8B base and instruct models.

See Andrej Karpathy's llama2.c repo for the real deal, built for the Llama 2 architecture, and for the many other cool models he has built.

Cute Llama

Have you ever wanted to inference a baby Llama 3 model in pure C? No? Well, now you can!

Run LLaMA 3 8B models with one simple 700-line C file (run.c).

The current code inferences models in both fp32 and int8 (see below).

Please note that this repo is a modification of Andrej Karpathy's llama2.c, with the hard-coded tokenization changed to work with the modified-tiktoken tokenizer used by the Meta LLaMA 3 family of models.

getting started

First, navigate to the folder where you keep your projects and clone this repository to this folder:

git clone https://github.com/jameswdelancey/llama3.c.git

Then, open the repository folder:

cd llama3.c

which model do i download? base or instruct

  • If you do not know, go with the instruct model. It will work with both the single-shot "generate" mode and the "chat" mode of llama3.c.
  • The "chat" mode of llama3.c only supports the instruct model and will not work properly with the base model. You can try it for fun and learning at your own risk :).

Download LLaMA 3 8B base and/or instruct. The Hugging Face site works; you'll need to sign up and get approved. Specifically, download the original directory.

https://huggingface.co/meta-llama/Meta-Llama-3-8B
https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

When downloading these models, I did have to rename original_params.json to params.json for export.py to work.

mv /d/llama3-8b-instruct/original_params.json /d/llama3-8b-instruct/params.json

compile and run the C code:

gcc -Ofast run.c -o run
./run.exe "llama3_8b_instruct.bin" -z "tokenizer_llama3.bin" -m chat
./run.exe "llama3_8b_instruct.bin" -z "../dev/tokenizer_llama3.bin" -i "Once upon a time"

high performance

  • -fopenmp If you have this library, you can run the model much faster. I'm running an Intel i3 14th gen and get 1.9 tok/s with OpenMP (a sketch of the parallelized matmul follows the compile command below).
  • -march=native This is required for gcc or clang to use SIMD intrinsics and will speed up your runs.
  • win.c This is optional unless you're on Windows. I'm compiling with MINGW64 and it works well.
  • gcc or clang both work well, and I get very close results between the two.
$ gcc -Ofast -fopenmp -march=native run.c win.c -o run
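For reference, the -fopenmp speedup comes from parallelizing matmul's outer loop over output rows, as in upstream llama2.c (run.c's exact loop may differ slightly). A minimal sketch:

// Sketch of the OpenMP-parallelized matmul from llama2.c-style runtimes:
// each thread owns a disjoint range of output rows, so no locking is needed.
void matmul(float *xout, float *x, float *w, int n, int d) {
    // W (d, n) @ x (n,) -> xout (d,)
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}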

base model single shot

This still runs at interactive rates and samples more coherent and diverse stories:

./run "llama3_8b_base.bin" -z "../dev/tokenizer_llama3.bin" -n 50 -i "Once upon a time"

Once upon a time, a girl named Lily who had grown up in a world where people weren't allowed to live happily ever after. One day, Lily decided to take a chance and do something that was a little bit crazy, something that she had

You can also prompt the model with a prefix or a number of additional command line arguments, e.g. to sample at temperature 0.8 for 256 steps and with a prompt:

./run "llama3_8b_base.bin" -z "../dev/tokenizer_llama3.bin" -t 0.8 -n 256 -i "One day, Lily met a Shoggoth"

One day, Lily met a Shoggoth named Lurking. She had never seen a Shoggoth before, but she was immediately impressed by its size and ferocity. They struck up a conversation, and Lurking told her about the Shoggoths' origins and their place in the universe. Lily was fascinated by what Lurking told her, and she learned a lot about the Shoggoths that day. She also made a new friend in Lurking.
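What -t does: the logits are divided by the temperature before the softmax, so lower values sharpen the distribution and higher values flatten it. A minimal sketch of that sampling step (illustrative only; the function name is made up and run.c's actual sampler is more involved):

#include <math.h>
#include <stdlib.h>

// Illustrative sketch of temperature sampling: scale logits, softmax,
// then draw from the resulting distribution. Not run.c's exact sampler.
int sample_with_temperature(float *logits, int vocab_size, float temperature) {
    // lower temperature sharpens the distribution, higher flattens it
    for (int i = 0; i < vocab_size; i++) {
        logits[i] /= temperature;
    }
    // softmax with max subtraction for numerical stability
    float max = logits[0];
    for (int i = 1; i < vocab_size; i++) {
        if (logits[i] > max) max = logits[i];
    }
    float sum = 0.0f;
    for (int i = 0; i < vocab_size; i++) {
        logits[i] = expf(logits[i] - max);
        sum += logits[i];
    }
    // inverse-CDF draw from the categorical distribution
    float r = ((float)rand() / (float)RAND_MAX) * sum;
    float cdf = 0.0f;
    for (int i = 0; i < vocab_size; i++) {
        cdf += logits[i];
        if (r < cdf) return i;
    }
    return vocab_size - 1; // guard against floating point rounding
}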

instruct model, chat

You're very likely to get results that you like better from the instruct model. You must use it with the chat mode as the token setup (preamble) is totally different. Here's an example:

./run "llama3_8b_instruct.bin" -z "../dev/tokenizer_llama3.bin" -m chat

Enter system prompt (optional): You are a professional software programmer with 20 years experience and an IQ of 200 and you love helping people.

User: Give me a terse hello world in Rust

Assistant: A delightful request!

Here is a terse "Hello, World!" program in Rust:

fn main() {
    println!("Hello, World!");
}

This program defines a main function that calls the println! macro to print the string "Hello, World!" followed by a newline character.

That's it! You can run this program using the rustc compiler and then executing the resulting executable file.

User:
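About that "totally different" preamble: the Llama 3 instruct format wraps every turn in special header tokens. As a rough illustration (the helper below is hypothetical; run.c assembles the equivalent sequence from token IDs, so the mechanics differ), one system-plus-user turn looks like this before tokenization:

#include <stdio.h>

// Hypothetical helper, for illustration only: the Llama 3 instruct chat
// template that chat mode has to reproduce for one system + user turn.
// The model generates the assistant reply and stops at <|eot_id|>.
void render_llama3_prompt(char *out, size_t cap,
                          const char *system_prompt, const char *user_msg) {
    snprintf(out, cap,
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n%s<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n%s<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n",
        system_prompt, user_msg);
}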

Meta's Llama 3 models

Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 3 checkpoints by following the Meta instructions. Once we have those checkpoints, we have to convert them into the llama3.c format. For this we need to install the python dependencies (pip install -r requirements.txt) and then use the export.py file, e.g. for the 8B model:

python export.py llama3_8b_base.bin --meta-llama ../llama3-8b-base/

The export will take ~10 minutes or so and generate a 31GB file (the weights of the 8B model in float32) called llama3_8b_base.bin in the current directory. Once the export is done, we can run it:

./run "llama3_8b_base.bin" -z "../dev/tokenizer_llama3.bin"

This ran at about 2 tokens/s compiled with OpenMP on 8 threads on my Intel i3 14th gen. Example output:

./run "llama3_8b_base.bin" -z "../dev/tokenizer_llama3.bin"

Question: What is the second derivative of 2p**3 - 4p**2 - 12p? Answer: 12p - 8

base models... ¯\_(ツ)_/¯. Since we can inference the base model, it should be possible to also inference the instruct model quite easily and have a conversation with it. And if we can find a way to run the 8B model more efficiently, we can start adding LoRA to our training script and go wild with finetunes, all within the repo!

You can also chat with the Llama Chat models. Export the chat model exactly as above:

python export.py llama3_8b_instruct.bin --meta-llama ../llama3-8b-instruct/

Then chat with it by specifying the chat mode using the -m flag, e.g.:

./run "llama3_8b_instruct.bin" -z "../dev/tokenizer_llama3.bin" -m chat

Int8 Quantization

Compile the quantized version of the runtime:

gcc -Ofast -fopenmp -march=native runq.c win.c -o runq

Export a quantized version of the model; it is about 8GB instead of 31GB:

python export.py llama3_8b_instruct_q80.bin --meta-llama ../llama3-8b-instruct/ --version 2

The export will take ~10 minutes or so. Once the export is done, we can run it:

./runq "llama3_8b_instruct_q80.bin" -z "../dev/tokenizer_llama3.bin"

This ran at about 4 tokens/s compiled with OpenMP on 8 threads on my Intel i3 14th gen.
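For context, the int8 path follows the llama2.c runq.c scheme: weights are quantized in groups, with one float scale per group, so the largest value in each group maps to ±127. A minimal sketch (the group size and details below are illustrative, not necessarily this repo's exact values):

#include <math.h>
#include <stdint.h>

#define GS 64  // illustrative group size; the repo's value may differ

// Sketch of Q8_0-style group quantization: each group of GS floats shares one
// scale chosen so the largest magnitude maps to 127; dequantization is q * scale.
void quantize_q80(const float *x, int n, int8_t *q, float *scales) {
    for (int g = 0; g < n / GS; g++) {
        // find the largest absolute value in this group
        float wmax = 0.0f;
        for (int i = 0; i < GS; i++) {
            float v = fabsf(x[g * GS + i]);
            if (v > wmax) wmax = v;
        }
        float scale = wmax / 127.0f;
        scales[g] = scale;
        for (int i = 0; i < GS; i++) {
            float qf = (scale != 0.0f) ? x[g * GS + i] / scale : 0.0f;
            q[g * GS + i] = (int8_t)roundf(qf);
        }
    }
}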

License

MIT
