Important links, tools, blogs, and concepts I learnt during EMLO-4.0.
- gitpod.io
- github code spaces
- jarvislabs
- runpod
- paperspace
- AWS spot instances
- Use NVIDIA time slicing for low-end GPUs
- Use NVIDIA MIG for high-end GPUs
- In AWS, resources are region-specific, and high-end GPUs are available only in US regions
Instances
- `T3a` -> 'a' means AMD
- `M8g` -> 'g' means Graviton (ARM) processor; these instances are charged less
- `M7i` -> 'i' means Intel
- `t3.micro`, `t3.nano` are free tier
- `p4de.24xlarge` - for training ~70-billion-parameter models -> allocated only on request
- EBS I/O is costly for heavy input/output operations, e.g., writing a model
- `p4de.large` instance
- L4 is the successor of the T4 GPUs
- Minimum instance: `t3a.medium` -> 2 vCPU -> 4 GB memory -> enough for MNIST training
- Highest instance: `p5e.48xlarge` -> 8x H200 GPUs (the biggest you can get) with ~1 TB of RAM
- Use Docker images with CUDA pre-installed
- If you don't know which instance to use, go for T3 instances first; note that T3 instances have traffic limits, e.g., 5 Gbps network performance and capped EBS bandwidth
- Accelerated computing -> fully connected layers work well on `Trn1`; these accelerators are not GPUs. Good cost optimization can be achieved with them for inference purposes
- Spot instances -> ~10% of the cost; use them with persistent storage and make sure you cancel the spot request after usage
Networking
- Internet access, anything inbound or outbound, goes through the VPC; only allow the ports you actually need
- HTTPS: 443, HTTP: 80
- Within the same VPC we can use a private IP to connect to another private IP
- Spot fleet request -> use a load balancer
AWS and local VS Code connection
Backup and storage
- EBS snapshot -> enable backup
- EBS snapshots help to reduce the volume storage used
S3
- S3 Standard
- S3 Glacier - for long-term storage; the cost is low as long as you don't retrieve the data
AWS configure
- AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION need to be provided
- `aws s3 ls` - test command to verify the connection
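The same connectivity check can be done from Python with boto3, which reads the environment variables above automatically. A minimal sketch, assuming the credentials are already exported:

```python
# Minimal sketch: verify AWS credentials from Python with boto3.
# Assumes AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION are set
# in the environment; boto3 picks them up automatically.
import boto3

def check_connection() -> None:
    s3 = boto3.client("s3")
    # Equivalent of `aws s3 ls`: list the buckets visible to these credentials.
    response = s3.list_buckets()
    for bucket in response["Buckets"]:
        print(bucket["Name"])

if __name__ == "__main__":
    check_connection()
```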
Create an AMI
# Configure with your access key and secret
aws configure
# From the instance you are on: fetch this instance's ID and register the AMI in your private AMI list
aws ec2 create-image \
--instance-id $(curl -s http://169.254.169.254/latest/meta-data/instance-id) \
--name "Session-09-ami-Nov-19-1" \
--description "AMI created programmatically from this instance" \
--no-reboot
# You will get an AMI ID like ami-0af5900df6f0bfaf4
Get instance information
echo "Checking instance information..."
# Check if we're on EC2
# TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type
Using AWS EC2 for CI/CD without GitHub-hosted tools
Efficient AWS Instance Usage
- If requests are very sparse (e.g., ~10 requests per day), go with Lambda; if you need a GPU, take a small one - the smallest is a T4 with 16 GB
AWS Lambda
- Event-driven execution -> the output should be sent somewhere, e.g., to S3 (which costs one more unit)
- Stateless -> S3, DynamoDB, or RDS (a traditional database) can be used to save state
- Scaling is done by AWS itself -> Lambda gives 1 million requests free every month
- Lambda downside - no GPUs -> some other providers offer serverless GPUs with a Lambda-like framework
- If the application is stateless, we can use Lambda
- If requests do not hit continuously, Lambda is a good fit
- To decide between EC2 and Lambda, run the exact same requests on an EC2 instance and on Lambda and compare performance and cost
- Monitoring/alerting - SNS in AWS is quick to set up
- Limits: up to 10 GB of RAM in Lambda, and a 15-minute limit for a single request
- With no requests, the Lambda goes to a cold state; the first request after that is a cold start and slow, while subsequent requests have good latency. The cold start on the first request is a drawback.
- CPU models work well; if cold starts are an issue, don't use AWS Lambda
- AWS Lambda cost after the first free 1 million requests (x86) is charged per million requests, so the cost is very low
- A larger Docker image takes longer to start in AWS Lambda, so keep the image small, download the model from S3, and use it in Lambda (see the sketch after this list)
- Load balancing is handled by AWS Lambda itself, which is a big advantage over EC2
- SemLock (used by multiprocessing Pool/Queue) is not available in AWS Lambda, but multiprocessing with Process and Pipe is possible: https://ruan.dev/blog/2019/02/19/parallel-processing-on-aws-lambda-with-python-using-multiprocessing
- Lambda from an image -> 1. from ECR you can select the image to be deployed, or 2. use AWS CDK - manage everything (start, stop, ...) with code; it is like Terraform managing infrastructure -> Infrastructure as Code
- The Lambda Web Adapter is needed in the Dockerfile -> AWS Lambda has its own event format, and the adapter translates your ordinary HTTP requests into that format. Examples: https://github.com/awslabs/aws-lambda-web-adapter/tree/main/examples/fastapi
- One Lambda can be connected to another via "Add trigger" in the Lambda console
- AWS Lambda keeps functions warm only for a limited time -> CloudWatch can be used to see Lambda logs -> CloudWatch / Log groups / <lambda function name>, or the Monitoring tab
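A hedged sketch of the "keep the image small, pull the model from S3 at cold start" idea above: a FastAPI app that would sit behind the Lambda Web Adapter. The bucket and key names are hypothetical placeholders.

```python
# Sketch only: FastAPI app meant to run behind the AWS Lambda Web Adapter.
# Bucket and object key names below are hypothetical placeholders.
import boto3
import torch
from fastapi import FastAPI

app = FastAPI()
MODEL_PATH = "/tmp/model.pt"   # /tmp is the only writable path inside Lambda
_model = None                  # cached across warm invocations

def get_model():
    global _model
    if _model is None:
        # Download once per container (cold start), reuse while the container is warm.
        boto3.client("s3").download_file("my-model-bucket", "models/model.pt", MODEL_PATH)
        _model = torch.jit.load(MODEL_PATH, map_location="cpu")
    return _model

@app.post("/predict")
def predict(features: list[float]):
    model = get_model()
    with torch.no_grad():
        output = model(torch.tensor(features).unsqueeze(0))
    return {"prediction": output.tolist()}
```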
AWS CDK
- It is like Terraform, managing infrastructure (start, stop, ...) with code -> Infrastructure as Code
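A minimal CDK sketch of the Infrastructure-as-Code idea, assuming `aws-cdk-lib` v2 and a local Dockerfile; the stack and function names are illustrative only:

```python
# Sketch: define a container-based Lambda with AWS CDK (aws-cdk-lib v2).
# Stack/function names and the Dockerfile directory are illustrative only.
from aws_cdk import App, Duration, Stack
from aws_cdk import aws_lambda as _lambda
from constructs import Construct

class InferenceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        _lambda.DockerImageFunction(
            self,
            "InferenceFn",
            code=_lambda.DockerImageCode.from_image_asset("."),  # builds the local Dockerfile
            memory_size=4096,
            timeout=Duration.minutes(15),
        )

app = App()
InferenceStack(app, "inference-stack")
app.synth()  # `cdk deploy` creates the resources, `cdk destroy` removes them
```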
AWS API Gateway
- Has a timeout of 29 seconds, while Lambda itself allows a timeout of up to 15 minutes
Info
- Good practice -> create a dedicated user and attach policies to it; otherwise you may sometimes end up creating more resources than intended
- Use CloudFormation to check what resources were created, and manually destroy all resources afterwards
TODO
- In CML, why did only CUDA 11.2 come up and not 12.1? Which tool supports triggering EC2 spot instances?
- `ctop` -> container utilization
- `tmux` -> sessions in the terminal
- `htop`
- `nvtop` -> GPU and CPU utilization
- `realpath .` / `realpath temp` -> gives the absolute path
- `rsync` -> for parallel processing while copying
- `gpustat -cp` -> for checking GPU utilization
- `ps -ef | grep python`
- `vim` -> can be used for faster debugging on a remote instance
- `ssh instanceurl.com python server.py` -> EC2 login / run the server remotely
- `nvitop`
- `pip install .` -> for installing packages -> needs a pyproject.toml file
- You either get 1, 2, 4, or 8 GPUs
- Inference -> model size + ~10%
- Model training -> ~2.5x the model size
- 70-billion-parameter model at full floating point:
  70,000,000,000 (parameters) * 32 (bits per float) / 8 (bits-to-bytes conversion) / 1024 / 1024 / 1024 ≈ 260 GB
- /1024/1024/1024 -> bytes-to-GB conversion
- Half precision - use 16 bits instead of 32
- llama3.1-1B model - best budget model: 1,000,000,000 (parameters) * 16 (bits per float) / 8 (bits-to-bytes) / 1024 / 1024 / 1024 ≈ 1.9 GB
- To calculate the size needed for training, multiply the inference size by ~2.5; also use a GPU size calculator, since batch size and prompt length matter too (see the sketch below)
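The same arithmetic as a small helper; a sketch only, using the rules of thumb from these notes (the 10% inference overhead and the 2.5x training multiplier):

```python
# Rough GPU memory estimates from the rules of thumb in these notes.
def model_memory_gb(num_params: float, bits_per_param: int = 32) -> float:
    """Raw weight memory in GiB: params * bits / 8 (bits->bytes) / 1024**3 (bytes->GiB)."""
    return num_params * bits_per_param / 8 / 1024**3

def inference_memory_gb(num_params: float, bits_per_param: int = 16) -> float:
    # Weights plus ~10% overhead, per these notes.
    return model_memory_gb(num_params, bits_per_param) * 1.1

def training_memory_gb(num_params: float, bits_per_param: int = 16) -> float:
    # Roughly 2.5x the weight memory (gradients, optimizer state, activations).
    return model_memory_gb(num_params, bits_per_param) * 2.5

print(f"70B @ fp32: {model_memory_gb(70e9, 32):.0f} GiB")  # ~261 GiB
print(f"1B  @ fp16: {model_memory_gb(1e9, 16):.2f} GiB")   # ~1.86 GiB
```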
- To connect Docker with the host GPU you might need the commands below
- For `docker run` -> `--gpus all`
- For Docker Compose -> see the Docker GPU docs
Test
- Running the below inside the container should return `True`:
- `import torch; torch.cuda.is_available()`
Model serving (as of Dec 2024)
- litServe, vLLM, FastAPI, TorchServe, Ollama
- Use vLLM to serve a large number of users, with batched serving
- LitServe and vLLM are good for LLM serving as they have additional caching and optimization mechanisms
- vLLM and Ollama are not that customizable
- TorchServe is written in Java
- TorchServe-to-LitServe migration is possible; LitServe is very new
- In LitServe, the per-GPU model setup is called only once
- It is a good habit to watch GPU utilization when a model is deployed, but there is no built-in monitoring function for it (see the LitServe sketch below)
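A minimal LitServe sketch of the point above, with a stand-in linear layer instead of a real model: `setup` runs once per device, while `predict` runs per request.

```python
# Minimal LitServe sketch: setup() is called once per device, predict() per request.
import litserve as ls
import torch

class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # Called once per GPU/CPU worker; load or build the model here.
        self.model = torch.nn.Linear(4, 2).to(device).eval()
        self.device = device

    def decode_request(self, request):
        return torch.tensor(request["input"], device=self.device).unsqueeze(0)

    def predict(self, x):
        with torch.no_grad():
            return self.model(x)

    def encode_response(self, output):
        return {"output": output.squeeze(0).tolist()}

if __name__ == "__main__":
    server = ls.LitServer(SimpleLitAPI(), accelerator="auto")
    server.run(port=8000)
```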
Optimizations
- If the batch size is a multiple of 2 (or better, a power of 2), it will be faster, e.g., 1526 is faster than 1527
- For higher GPU usage, increase the batch size
- While doing batch processing, if the CPU has 8 cores and the batch size is 64, there will be more context switching
- What is the difference between workers and threads? (todo)
- How to use threads with async? (todo)
- pqdm for jobs; it handles them concurrently (see the sketch after this list)
- If 64 requests are sent from the client to the server and the server can't handle them in parallel, they are processed sequentially
- Question from session 09: why does the API server have lower throughput than the baseline model?
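A small pqdm sketch for the point above; the `fetch` function just simulates an I/O-bound job, and the names are illustrative:

```python
# Sketch: run I/O-bound jobs concurrently with pqdm (thread-based backend).
import time
from pqdm.threads import pqdm  # pqdm.processes exists for CPU-bound work

def fetch(item_id: int) -> str:
    time.sleep(0.5)            # stand-in for an I/O call, e.g. an S3 download
    return f"done-{item_id}"

if __name__ == "__main__":
    results = pqdm(range(16), fetch, n_jobs=8)  # 8 concurrent workers with a progress bar
    print(results)
```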
fastapi
- ASGI -> completely async functions
torchserve
- Heavily used in production
- FastAPI is faster than Flask
concurrency
- A request comes in -> a file-read I/O operation, e.g., getting a file from S3; while that I/O is happening another request can be handled, so instead of the CPU sitting idle other work gets done - that is concurrency (see the sketch below)
- Types -> parallel, concurrent, concurrent and parallel
- In theory -> CPU-bound task vs. I/O-bound task (todo)
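A tiny asyncio/FastAPI sketch of that idea: while one request awaits I/O, the event loop serves another. The sleep stands in for a real S3 read.

```python
# Sketch: concurrency with async I/O - while one request awaits "S3", others proceed.
import asyncio
from fastapi import FastAPI

app = FastAPI()

async def read_from_s3(key: str) -> str:
    await asyncio.sleep(1.0)   # stand-in for a real S3/network read
    return f"contents of {key}"

@app.get("/file/{key}")
async def get_file(key: str):
    # The event loop handles other requests while this await is pending,
    # so the CPU is not idle during the I/O wait.
    data = await read_from_s3(key)
    return {"data": data}
```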
torchserve
- Model swapping, i.e., two versions running; stop one and start the new model again
- TorchServe + Prometheus + Grafana; TorchServe can register multiple models
- TorchServe can serve CPU-based deployments too; it can even run without a model
- Model packaging, deploy config, monitoring
https://platform.openai.com/tokenizer
- Measure tokens -> byte-pair-encoding tokenization; Google uses word-by-word tokenization
- In Python, code is compiled only at runtime, which makes it quite slow (it is a scripting language), so PyTorch came up with TorchScript. The CUDA kernels are compiled on first use, which makes the first inference slow, so do a warm-up.
torchscript
- TorchScript -> does its own compilation, which makes the code faster -> also no need for the model class; just the .pt file is enough and it can be used in C++, in the browser, and on Android; it is for inference only
- TorchScript traced models are stored as .pt -> no class or instance creation is needed for the model; other PyTorch models (state dicts) are stored as .pth
- TorchScript -> saves all the modules used in .forward() into the .pt file; only custom layers can't be saved -> 10 to 20% faster (see the sketch below)
- Refer to Multi GPU Training in PyTorch Lightning
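A small TorchScript sketch of those points: trace a model, save a self-contained `.pt`, and reload it without the original class definition.

```python
# Sketch: trace a model to TorchScript, save a standalone .pt, reload without the class.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def forward(self, x):
        return torch.relu(x).sum(dim=-1)

model = TinyNet().eval()
example = torch.randn(1, 8)

traced = torch.jit.trace(model, example)   # records the ops run in forward()
traced.save("tiny_net.pt")                 # everything needed lives in this file

restored = torch.jit.load("tiny_net.pt")   # no TinyNet class definition required
with torch.no_grad():
    print(restored(example))
```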
Other conversions
- ONNX -> a representation of the model with its weights in a single file, run with ONNX Runtime
- FastAPI -> nn.Transformer -> has optimizations for PyTorch that support all kinds of GPUs (todo: check if nn.Transformer is related to FastAPI)
- TensorRT -> heavily optimized -> the TensorRT Model Optimizer takes care of it
Integrations
- Hydra
- Optuna
- LoRA
- PEFT
- Comet, MLflow
- Lightning-Gradio integration
Multi GPU Training
- PyTorch multi-GPU training:
- DDP -> gradients are averaged out by the master node
- In PyTorch Lightning -> strategy="ddp" -> e.g., a 100-million-parameter model -> one copy is created per GPU; with num_nodes, e.g., 8 copies run across 2 nodes' GPUs, 1 master node and the other nodes -> each copy does the forward pass on its GPU/node, gradients are computed and the master node averages/consolidates them - cons: a very large model cannot be trained this way. DDP -> only for training.
- FSDP -> splits the model into 8 parts for 8 GPUs; it can train and then consolidate. FSDP -> only for training. https://pytorch.org/tutorials/_images/fsdp_workflow.png A sharding method is used -> check what it is (see the sketch below)
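A hedged PyTorch Lightning sketch showing where the DDP/FSDP choice is made; the LightningModule and DataModule are assumed to be defined elsewhere.

```python
# Sketch: choosing the multi-GPU strategy in PyTorch Lightning (2.x API).
# `MyLightningModule` and `my_datamodule` are assumed to exist elsewhere.
import lightning as L

trainer = L.Trainer(
    accelerator="gpu",
    devices=8,          # one model copy per GPU under DDP
    num_nodes=2,        # total GPUs = devices * num_nodes
    strategy="ddp",     # use "fsdp" to shard the model when it doesn't fit on one GPU
    precision="16-mixed",
)
# trainer.fit(MyLightningModule(), datamodule=my_datamodule)
```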
- Self-hosted runner - commands from the runner are run on AWS EC2
- GitHub-hosted runner
- Auto start and auto stop an EC2 spot instance using a custom AMI that we provide
Gradio
- For 3D, 2D, text, or anything - (backend Python, frontend Svelte)
- Flagging - for flagging false images or anomalies
- share=True -> share from one laptop to another
- Lightning-Gradio integration
- gr.Model3D -> 3D rendering
- Live inferencing -> WebRTC component in Gradio
- Gradio -> SimpleCSVLogger(); it locks the log file and writes through the CSV logger to avoid race conditions
- FastAPI is faster than Flask
- WSGI -> synchronous worker copies
- uvicorn, WSGI, nginx
- In a plain API we can't do batching; only in LitServe can we do batching
- CORS: if one domain calls another domain, it is blocked unless you add app.add_middleware with the allowed origins (e.g., origins = ["*"]); when you hit a CORS error, add the above (see the sketch after this list)
- FastAPI + Jinja templates
- In FastAPI, /docs lists all endpoints and /redoc gives alternative documentation
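A FastAPI CORS sketch for the note above; the origins list is illustrative, and specific domains should replace "*" in production.

```python
# Sketch: allow cross-origin calls to a FastAPI app with CORSMiddleware.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

origins = ["*"]  # illustrative; list specific domains in production

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get("/ping")
def ping():
    return {"status": "ok"}
```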
Others
- pyodide
- fasthtml
- FLUX.1-dev
- HF is becoming popular: https://huggingface.co/spaces/enzostvs/zero-gpu-spaces
- All Hugging Face models deployed on the free tier run on CPU
- End-to-end pipeline to deploy a model -> after training, just create a TorchScript model and deploy it
- Major models - https://aiworld.eu/embed/model/model/treemap
- Photon by Luma - converts an image to a video
- https://x.com/fofrAI
- FLUX - best model - 24 GB RAM needed
- replica.co
Large models / transformers
- internimage - good for segmentation
- eva - for segmentation
- depth_pro - Apple's model - good for depth estimation (better than MiDaS)
- briaai/RMBG-2.0 - background removal
- Diffusion meaning - starts from a random image and keeps refining it, hence "diffusion"
- Prompting reference - https://stability.ai/learning-hub/stable-diffusion-3-5-prompt-guide
- Stable Diffusion - prompting with "IMG_302.CR2" creates a photorealistic image because the model was trained thinking that .CR2 is a DSLR image format
| S.NO | Purpose | Package Name |
|---|---|---|
| 1 | Argument/config management | Hydra |
| 2 | Logging | aim, Comet, MLflow |
| 3 | Data versioning | DVC with cloud storage (GCS) |
| 4 | Markdown table generation | tabulate |
| 5 | Code formatter | Black |
| 6 | Google Drive data download | gdown |
| 7 | GitHub Actions commenting | CML |
| 8 | Unit testing (Python) | pytest |
| 9 | Test coverage reporting | Coverage |
| 10 | AI code assistance | Cursor |
| 11 | Hyperparameter optimization | Optuna |
| 12 | Multi-run parallelism | joblib |
| 13 | Run GitHub Actions locally | act |
| 14 | GPU requirement calculator | GPU_poor |
| 15 | Reduce model size of FC layers | torch.compile |
| 16 | Quantization | torchao |
| 17 | AWS CLI | awscli |
| 18 | Env variables / project root | rootutils |
| 19 | Sets the project root | .project-root |
| 20 | Serving machine learning models (generic) | litServe |
| 21 | Serving machine learning models (LLM specific) | vLLM |
| 22 | Serving machine learning models (generic) | FastAPI |
| 23 | Serving machine learning models (generic) | TorchServe |
| 24 | Serving machine learning models (LLM specific) | Ollama |
| 25 | Load testing | Locust |
| 26 | Handle concurrency | pqdm |
| 27 | LLM optimization | PEFT |
| 28 | LLM optimization | TRL |
| 29 | LLM optimization | LoRA |
| 30 | AWS alternative to GitHub | CodeCommit |
| 31 | AWS alternative to GitHub Actions | CodePipeline |
| 32 | LoRA for ViT | timm-vit-lora |
| 33 | Testing APIs | Postman |
| 34 | Quick ML demo with UI | Gradio |
| 35 | Quick ML demo with UI (Python in the browser) | Pyodide |
| 36 | Quick ML demo with UI | FastHTML |
| 37 | Quick ML demo with UI | Streamlit |
| 38 | Optimized model format | TorchScript |
| 39 | Representation of the model with weights in a single file | ONNX Runtime |
| 40 | Web server | uvicorn |
| 41 | Web server interface (Python, sync) | WSGI |
| 42 | Web server interface (Python, async) | ASGI |
| 43 | Web server / front-end reverse proxy | nginx |
| 44 | Background task processing | Celery (Python) |
| 45 | Stable Diffusion UI | Next.js SD3 |