You like makemore? You like llm.c? You love parallelisms!

parallelisms is an educational library that implements SOTA LLM parallelism techniques in pure C. It aims for simple, spelled-out implementations for maximum clarity. Currently supports:
- Data parallel [code]
- Fully-sharded data parallel (FSDP) [code]
- Tensor parallel [code]
- Pipeline parallel [code]
- Advanced 3D Parallelism: combining FSDP, tensor, and pipeline parallelism [code]
While everything runs on CPU using MPI, the key ideas illustrated here are exactly how SOTA LLMs (like Llama 3) are trained.
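To give a flavor of what these implementations boil down to: plain data parallelism, for example, is essentially one all-reduce over the gradients after each backward pass, so every replica takes the same optimizer step. Here is a minimal sketch of that step (the function and buffer names are illustrative, not the repo's actual API):

```c
// Sketch: the core of data parallelism. Each rank computes gradients on its
// own slice of the batch, then the gradients are averaged across all ranks.
#include <mpi.h>

void allreduce_grads(float *grads, int n) {
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    // Sum gradients from all ranks in place, then divide to get the mean.
    MPI_Allreduce(MPI_IN_PLACE, grads, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < n; i++) {
        grads[i] /= (float)world_size;
    }
}
```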
The only dependency is OpenMPI, which can be downloaded here (but any MPI implementation will work). Once you've installed OpenMPI, you can compile and run any of the training scripts, for example train_fsdp.c:
mpicc -Ofast train_fsdp.c && mpirun -n 4 a.out
Here -n specifies the number of FSDP shards. If you don't have enough cores for the desired parallelism level, you can tell OpenMPI to oversubscribe the cores. For example, here is how I run 3D parallelism on my 8-core MacBook Air:
mpicc -Ofast train_3d.c && mpirun -n 24 --map-by=:oversubscribe a.out --tp 4 --dp 2
This tells OpenMPI to launch 24 processes; the training script uses tensor parallelism of degree 4, (fully sharded) data parallelism of degree 2, and pipeline parallelism of degree 3 (this last one is always fixed), so 4 × 2 × 3 = 24.
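Under the hood, combining the three axes comes down to splitting the world communicator into tensor-, data-, and pipeline-parallel groups. Below is a minimal sketch of such a split using MPI_Comm_split; the rank layout (tensor axis fastest, then data, then pipeline) and the function name are assumptions for illustration, not necessarily how train_3d.c organizes its ranks:

```c
// Sketch: carve MPI_COMM_WORLD into tp/dp/pp groups. Ranks are assumed to be
// laid out as rank = pp_idx * (dp * tp) + dp_idx * tp + tp_idx.
#include <mpi.h>

void make_groups(int tp, int dp, MPI_Comm *tp_comm, MPI_Comm *dp_comm, MPI_Comm *pp_comm) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int tp_idx = rank % tp;          // position within a tensor-parallel group
    int dp_idx = (rank / tp) % dp;   // position within a data-parallel group
    int pp_idx = rank / (tp * dp);   // position within a pipeline-parallel group
    // Ranks that agree on everything except tp_idx form one tensor-parallel group, etc.
    MPI_Comm_split(MPI_COMM_WORLD, rank / tp,              tp_idx, tp_comm);
    MPI_Comm_split(MPI_COMM_WORLD, pp_idx * tp + tp_idx,   dp_idx, dp_comm);
    MPI_Comm_split(MPI_COMM_WORLD, rank % (tp * dp),       pp_idx, pp_comm);
}
```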
The training scripts will all train a character-level language model on a names dataset (similar to makemore). At the end of training, 10 new names are generated, for example:
Final validation loss: 2.343800
. . . . . . . . . . . . a n s i --> .
. . . . . . . . f r e s s i n i --> .
. . . . . . . . . . m a y d r a --> .
. . . . . . . . . . m e r g i n --> .
. . . . . . . . p l a y o n n g --> .
. . . . . . . . . . j e n i c z --> .
. . . . . . . . . . . . e a i s --> .
. . . . . . . . . . . m e r m y --> .
. . . . . . . . . m i l e s e n --> .
. . . . . . . . . . s h a b e r --> .
All randomness is seeded and the models are initialized identically across all training scripts, so the generations should all be identical in theory. In practice, however, the trained models diverge slightly and produce slightly different generations. This is likely due to the non-associativity of floating point operations, combined with MPI messages arriving in different orders.
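A quick way to convince yourself of this: floating point addition is not associative, so summing the same values in a different order gives slightly different results. A self-contained demo (not code from the repo):

```c
// Demonstrates that changing the order of floating point additions changes
// the result, which is why reductions over MPI can differ run to run.
#include <stdio.h>

int main(void) {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float left  = (a + b) + c;  // 1.0: the large terms cancel first
    float right = a + (b + c);  // 0.0: c is lost to rounding when added to b
    printf("(a + b) + c = %f\n", left);
    printf("a + (b + c) = %f\n", right);
    return 0;
}
```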
The most interesting files to look at are the train_*.c files. train.c contains the reference implementation for single-threaded training, while the rest implement the individual parallelism methods. Finally, train_3d.c brings everything together to implement 3D parallel training. The individual parallelism implementations are modular enough that 3D parallelism "falls out".
The rest of the code implements an MLP with hand-crafted forward and backward passes, plus a data loader.
- The forward/backward functions have the exact same signature. This makes calling the backward pass convenient, as it's just a mirrored version of the forward pass (see the sketch after this list).
- C only supports structs and functions (no classes), so "methods" are implemented by prefixing functions with the name of the struct they "belong" to and passing the struct as self.
- Design tradeoffs were made to favor clarity over performance. For example, each script creates the full model in memory before sharding it, to ensure initialization is identical across scripts. This is wasteful in practice and requires much more memory than necessary.
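Here is a small sketch of what the first two conventions look like in practice, using a hypothetical Linear layer (the names and fields are illustrative, not copied from the repo):

```c
#include <string.h>

// Hypothetical layer struct; field names are illustrative, not from the repo.
typedef struct {
    float *weight;       // (out, in) row-major weights
    float *weight_grad;  // accumulated weight gradients, same shape as weight
    float *saved_x;      // input cached by forward, needed by backward (an assumption)
    int in, out;
} Linear;

// "Method" convention: function name prefixed by the struct, struct passed as self.
// Forward computes y = W x and caches the input for the backward pass.
void linear_forward(Linear *self, float *x, float *y) {
    self->saved_x = x;
    for (int o = 0; o < self->out; o++) {
        y[o] = 0.0f;
        for (int i = 0; i < self->in; i++)
            y[o] += self->weight[o * self->in + i] * x[i];
    }
}

// Backward mirrors the forward signature exactly: (self, input-side buffer,
// output-side buffer). dx receives dL/dx, dy supplies dL/dy.
void linear_backward(Linear *self, float *dx, float *dy) {
    memset(dx, 0, self->in * sizeof(float));
    for (int o = 0; o < self->out; o++) {
        for (int i = 0; i < self->in; i++) {
            dx[i] += self->weight[o * self->in + i] * dy[o];
            self->weight_grad[o * self->in + i] += self->saved_x[i] * dy[o];
        }
    }
}
```

Because forward and backward share the same (self, input-side buffer, output-side buffer) shape, a training loop can walk the layers forward and then simply walk them in reverse with mirrored arguments.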