Pull requests: hamelsmu/llama-inference
Pull requests list
CTranslate2: support multiple GPUs (if run with mpirun) and Flash Attention 2.
#17
opened Apr 11, 2024 by
ivanbaldo
OpenAI compatible servers benchmark based on the anyscale and exllama benchmarks.
#16
opened Apr 8, 2024 by
ivanbaldo
Add new candle-vllm Dockerfile with instructions to benchmark it.
#13
opened Feb 8, 2024 by
ivanbaldo
Add ctranslate/Dockerfile with instructions to use it.
#10
opened Jan 19, 2024 by
ivanbaldo
hf/bench.py: need to specify bfloat16 otherwise it consumes twice as much memory in A10.
#9
opened Jan 17, 2024 by
ivanbaldo
Add a Dockerfile for the /hf benchmarks with instructions to build and run them.
#7
opened Dec 11, 2023 by
ivanbaldo
Add support for all Hugging Face Chat, Text models + OpenAI, Claude2, Cohere, Palm, Replicate models
#3
opened Aug 14, 2023 by
ishaan-jaff