This guide explains how to set everything up on Windows to run the new Meta Llama2 70B model on your local computer, with no WebUI or WSL required.
- A CUDA-capable computer (NVIDIA graphics card).
- NVIDIA RTX 3070 or higher recommended (I'm using this one, and it works right on the edge).
- 8GB of VRAM (determined by the graphics card).
- At least 12GB of RAM.
- Some command-line skills & patience.
Skip this step if CUDA is already installed. This toolkit is necessary to harness the full potential of your computer; trying to run Llama2 on the CPU barely works. The full installation guide can be found in this CUDA Guide. However, here is a summary of the process:
- Check the compatibility of your NVIDIA graphics card with CUDA.
- Update the drivers for your NVIDIA graphics card.
- Download the CUDA Toolkit installer from the NVIDIA official website.
- Run the CUDA Toolkit installer.
- Make sure the environment variables are set (specifically PATH).
- Restart your computer.
Once it is installed on your computer, verify the installation by running nvcc --version in PowerShell. You should see some output like this:
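nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Cuda compilation tools, release 12.2, V12.2.91
(The release and copyright lines above are only a representative example; the exact values will match whatever CUDA version you installed. The important part is that nvcc is found and reports a version.)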
Ensure you have Python and pip already installed. Then install all the necessary libraries from the terminal:
# Base Dependencies
pip install transformers torch pyyaml
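As a quick sanity check (this snippet is an optional addition, not part of the original setup), you can verify that your PyTorch build has CUDA support and can see your GPU. If CUDA shows as False, you probably have a CPU-only build and should install the CUDA-enabled wheel from pytorch.org:

# Optional check: confirm PyTorch can see the GPU
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # should print True
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # e.g. "NVIDIA GeForce RTX 3070" with ~8 GiB of VRAM
    print(props.name, round(props.total_memory / 2**30, 1), "GiB VRAM")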
Llama2 is rarely run directly at full precision on consumer hardware, so we also need to load the model with 4-bit quantization. For this we must use bitsandbytes; however, currently (v0.41.0) it only has CUDA support on Linux, so on Windows we need to install a precompiled wheel. To do so, follow these steps:
- Check your CUDA version with nvcc --version
- Download your wheel from this repository (thanks to jllllll). Note: you must use version 0.39.1 or later for it to work with 4-bit optimization.
- Install the library
# Go to your wheel directory
cd (path-to-download)
# Replace with your selected wheel
pip install bitsandbytes-0.41.1-py3-none-win_amd64.whl
# Check it has been successfully installed
pip show bitsandbytes
# Check that it has been compiled with CUDA support. Some versions can fail; it is
# only necessary that the warning "This version has not been compiled with CUDA" does NOT pop up
# (even if it crashes a few lines afterwards, there is no problem)
python -m bitsandbytes
Finally, with bitsandbytes installed, we will also add the accelerate library to optimize the model:
pip install accelerate
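If you want one more check besides pip show, this optional snippet (not part of the original guide) simply confirms that both libraries import cleanly and reports their versions:

# Optional check: both libraries should import without errors
import accelerate
import bitsandbytes as bnb

print("bitsandbytes:", bnb.__version__)  # should be >= 0.39.1 for 4-bit loading
print("accelerate:", accelerate.__version__)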
- First, we need to create an account on the Hugging Face page and get our access token to load the model on the computer. You can follow this guide, but it is as simple as going to Settings > Access Tokens > New Token > Write.
- Using the same email as on Hugging Face, we must request access to the model from Meta at AI.meta.com.
- Go to Hugging Face, log into your account and select one of the three Llama2 open-source models. Then request access to them in this link. When your access has been granted (1-2h) you'll receive an email and the site will update to be fully enabled. Then you can look at all the models in HuggingFace meta-llama models.
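Once access has been granted, you can optionally check from Python that your token can actually reach the gated repository. This sketch uses the huggingface_hub library (installed alongside transformers); the model id shown is just an example, and the call raises an error if access was not granted:

# Optional check: the token should be able to read the gated meta-llama repo
from huggingface_hub import HfApi

api = HfApi(token="<YOUR_HUGGING_FACE_TOKEN>")
api.model_info("meta-llama/Llama-2-7b-chat-hf")  # raises if the token lacks access
print("Access OK")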
You now have everything you need to run the Llama2 model on your GPU. You can test your installation using the scripts included in this repository:
# Get the repository and fill the data
git clone https://github.com/SamthinkGit/llama2-for-windows.git
cd .\llama2-for-windows
python.exe .\setup.py
# Llama2 only-terminal mode
python.exe .\llama2.py
# Llama2 Web based Interface
pip install zmq streamlit
python.exe .\llama2-web.py
# Don't forget to give a star to this repo if it worked for u :D
If you want to build your own code for Llama2 or for other purposes, continue with the guide.
We will now write our first code to make Llama2 talk and answer some questions. We won't go into much more detail since this is not an LLM course. The code is adapted from the Haystack-Llama2-Guide by anakin87.
- First, create a new config.yaml file where we will write some information about the model. The <BATCH_SIZE> value is plugged into PyTorch's CUDA allocator setting (max_split_size_mb) and determines how your GPU memory gets split up. I use a value of 25 and it works fine, but it can be 100 or more if your GPU can handle it. The model to load is listed in the yaml; I recommend the 7b one for the minimum computer requirements. The 13b model needs around 12GB of VRAM to work well, and the 70b around 24GB. Remember to fill <YOUR_HUGGING_FACE_TOKEN> with the token obtained a few steps ago.
# -> powershell
New-Item config.yaml
notepad.exe config.yaml
# -> config.yaml
# Possible llama models:
# Llama-2-7b-hf
# Llama-2-7b-chat-hf
# Llama-2-13b-hf
# Llama-2-13b-chat-hf
# Llama-2-70b-hf
# Llama-2-70b-chat-hf
general:
  logging_level: WARNING
  pytorch_cuda_config: max_split_size_mb:<BATCH_SIZE>

model:
  token: <YOUR_HUGGING_FACE_TOKEN>
  id: meta-llama/Llama-2-7b-chat-hf
- Create a Python file model.py with your IDE or by using notepad.exe in some directory. Then initialize the model with the settings written in the yaml:
import logging
import torch
import os
import yaml
from transformers import AutoModelForCausalLM, AutoTokenizer
# Open config file
with open("config.yaml", 'r') as stream:
config = yaml.safe_load(stream)
# Some logs for the errors
logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=config['general']['logging_level'])
# Initializing
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = config['general']['pytorch_cuda_config']
torch.cuda.empty_cache() # Clean the cache, recommended for low-memory GPUs
# Obtaining some variables
hf_token = config['model']['token']
model_id = config['model']['id']
- Finally, we just need to build the model and write a query:
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, use_auth_token=hf_token)
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_token)
while True:
    torch.cuda.empty_cache()
    input_text = input("Query: ")
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids, max_length=100)  # You can modify the length of the answer
    print(tokenizer.decode(outputs[0]))
python.exe .\model.py
Loading checkpoint shards: 100%|█████████████████████████████████████| 2/2 [00:15<00:00, 7.52s/it]
Question: Explain briefly how to play Dark Souls I
[AI] <s> Explain briefly how to play Dark Souls I and II in multiplayer.
Hopefully this will help you get started with playing Dark Souls in multiplayer.
Dark Souls is a challenging action RPG with a unique sense of...
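As the sample output shows, a plain text prompt tends to make the -chat-hf checkpoints keep writing instead of answering. These models were fine-tuned on Llama2's [INST] instruction format, so a small variation of the query loop (same model and tokenizer as in model.py; max_new_tokens here is just an illustrative value) usually gives more focused answers:

# Variation of the query loop using Llama2's chat/instruction format
while True:
    torch.cuda.empty_cache()
    question = input("Query: ")
    prompt = f"[INST] {question} [/INST]"  # wrapper expected by the -chat-hf models
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids, max_new_tokens=200)  # limits only the length of the answer
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))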
We will now use streamlit and zmq to build a small server so our model can talk with the user. We could do it all in one script, but it is better to split it into two parallel processes, relieving the GPU a bit from the load of serving the web interface.
- Add the new socket parameters into the config.yaml file:
# Select a free port; in this case 12443 is fine
socket:
  REQ: tcp://localhost:12443
  REP: tcp://*:12443
- Create 2 new files, model-back.py and model-front.py. In the back end we will copy the content of model.py and modify the code to accept queries from a socket. In the front end we will simply ask the user for a query and print the generation on the screen.
#model-back.py
import zmq
# Same imports...
# Initialize all variables (step 2 model.py)...
# Build the model
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, use_auth_token=hf_token)
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_token)
# Function to generate an output from a query
def get_answer(input_text):
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids, max_length=100)
    return tokenizer.decode(outputs[0])
# Building Socket
context = zmq.Context()
socket = context.socket(zmq.REP)
print(f"Wating for connection...")
socket.bind(config['socket']['REP'])
print(f"[SUCCESS] Connection Stablished")
# New loop for receiving queries
while True:
    input_text = socket.recv_string()
    print("Generating Output...")
    output = get_answer(input_text)
    print("Generation Finished")
    torch.cuda.empty_cache()
    socket.send_string(output)
# model-front.py
import streamlit as st
import zmq
import yaml
# Open yaml
with open("config.yaml", 'r') as stream:
config = yaml.safe_load(stream)
# Build socket
context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect(config['socket']['REQ'])
# Build interface
st.title('[ Talk with Meta-Llama2 ]')
user_input = st.text_input("Input:")
# Write the query
if st.button("Send"):
    socket.send_string(user_input)
    st.write(socket.recv_string())
- Now you only need to launch these 2 scripts to have Llama2 fully working!
python.exe model-back.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.75s/it]
Waiting for connection...
[SUCCESS] Connection Established
...
# In other terminal
streamlit run model-front.py
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://192.168.1.128:8501