Testing Code Generation using alt-models (Local LLMs)
Three code-generation tests so far - Fibonacci function, stock market chart, and an agent-chat diagram - plus a group-chat termination test. More to come.
Prompt for LLM: Write Python code to calculate the 14th Fibonacci number.
Code should generate: 233 or 377
Note: F(n) for n = 14 is 377 (using F(0) = 0, F(1) = 1). If we count F(0) as the first number in the sequence, the 14th number is F(13) = 233, and n = 14 is actually the 15th Fibonacci number. Therefore, some LLMs' code will generate 233 and some 377; we accept both because the question is ambiguous.
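For reference, here is a minimal sketch of the kind of solution we accept (assuming the standard F(0) = 0, F(1) = 1 indexing, so this version prints 377):

```python
# Iterative Fibonacci with F(0) = 0 and F(1) = 1, so F(14) = 377.
def fibonacci(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fibonacci(14))  # 377; a model that counts F(0) as the "1st" number would print 233
```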
See the results folder for code outputs.
See what ChatGPT 3.5 generated (which output 377).
Key | Meaning |
---|---|
✅ | Valid code and formatting |
❌ | Invalid code, formatting, or didn't understand task |
Model | Run 1 | Run 2 | Notes |
---|---|---|---|
CodeLlama 7b Python | ❌ | ❌ | |
CodeLlama 13b Python | ❌ | ❌ | |
CodeLlama 34b Instruct | ✅ | ✅ | |
CodeLlama 34b Python | ❌ | ❌ | |
DeepSeek Coder 6.7b | ✅ | ✅ | |
Llama2 7b Chat | ✅ | ❌ | |
Llama2 13b Chat | ✅ | ✅ | |
Mistral 7b 0.2 Instruct | ✅ | ✅ | |
Mixtral 8x7b Q4 | ✅ | ✅ | |
Mixtral 8x7b Q5 | ✅ | ✅ | |
Neural Chat 7b Chat | ✅ | ✅ | |
Nexus Raven | ❌ | ✅ | |
OpenHermes 7b Mistral | ✅ | ✅ | |
Orca2 13b | ✅ | ❌ | |
Phi-2 | ✅ | ❌ | |
Phind-CodeLlama34b | ✅ | ✅ | |
Qwen 14b | ✅ | ✅ | |
Solar 10.7b Instruct | ❌ | ✅ | |
StarCoder2 3b | |||
StarCoder2 7b | |||
StarCoder2 15b | |||
Yi-34b Chat | ✅ | ✅ |
Prompt for LLM: Today is {today}. Write Python code to plot TSLA's and META's stock prices YTD.
Code should generate: an image for each stock, or a single image including both stocks. Attempting to display it inline (e.g. in a Jupyter notebook) is a bonus but doesn't affect whether it's correct or not.
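For context, here is a minimal sketch of an answer that would pass, assuming the yfinance and matplotlib packages are used (the prompt doesn't mandate any particular library, and the output file name is illustrative):

```python
import datetime

import matplotlib.pyplot as plt
import yfinance as yf

today = datetime.date.today()
start = datetime.date(today.year, 1, 1)  # year-to-date window

# Download daily prices for both tickers and keep the closing prices
data = yf.download(["TSLA", "META"], start=start, end=today)["Close"]

# One chart with both stocks on it (one image per stock is also acceptable)
data.plot(title=f"TSLA and META closing prices, YTD to {today}")
plt.ylabel("Price (USD)")
plt.savefig("tsla_meta_ytd.png")  # displaying it inline instead is a bonus
```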
See the results folder for code outputs.
See what ChatGPT 3.5 generated (which was correct).
Key | Meaning |
---|---|
✅ | Valid code and formatting |
🔶 | Almost correct |
❌ | Invalid code, formatting, or didn't understand task |
Model | Run 1 | Run 2 | Notes |
---|---|---|---|
CodeLlama 7b Python | ❌ | ❌ | |
CodeLlama 13b Python | 🔶 | ❌ | Just missing "import pandas as pd" |
CodeLlama 34b Instruct | ✅ | ✅ | |
CodeLlama 34b Python | ❌ | ❌ | Expected a CSV file |
DeepSeek Coder 6.7b | 🔶 | ✅ | Out of date library usage |
Llama2 7b Chat | ❌ | 🔶 | Close, showed price change instead of price |
Llama2 13b Chat | ❌ | ❌ | |
Mistral 7b 0.2 Instruct | 🔶 | 🔶 | Minor string formatting error, used "FB" instead of "META" as ticker symbol |
Mixtral 8x7b Q4 | ✅ | ❌ | |
Mixtral 8x7b Q5 | ✅ | 🔶 | Not quite right on second run |
Neural Chat 7b Chat | ❌ | ❌ | |
Nexus Raven | ❌ | ❌ | First run produced no code; second produced a blank chart |
OpenHermes 7b Mistral | ❌ | ❌ | |
Orca2 13b | ❌ | ❌ | No code |
Phi-2 | ❌ | ❌ | Went off track |
Phind-CodeLlama34b | ✅ | ❌ | |
Qwen 14b | 🔶 | ❌ | Incorrect timeframe |
Solar 10.7b Instruct | ❌ | ❌ | |
StarCoder2 3b | |||
StarCoder2 7b | |||
StarCoder2 15b | |||
Yi-34b Chat | ❌ | ❌ | Used an old, retired library |
Prompt for LLM: Draw two agents chatting with each other with an example dialog. Don't add plt.show().
Code should generate: Python code that creates a diagram with two agents on it (e.g. two circles with speech bubbles)
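For context, here is a minimal sketch of the kind of figure the prompt is after, using matplotlib shapes and annotations (the layout, colours, and dialog text are illustrative, and there is no plt.show()):

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

fig, ax = plt.subplots(figsize=(8, 4))

# Two agents drawn as circles with name labels
ax.add_patch(Circle((2, 2), 0.5, color="skyblue"))
ax.add_patch(Circle((8, 2), 0.5, color="lightgreen"))
ax.text(2, 1.2, "Agent A", ha="center")
ax.text(8, 1.2, "Agent B", ha="center")

# Speech bubbles with an example dialog
ax.annotate("Hi! Can you summarise this report?", xy=(2.4, 2.4), xytext=(3.5, 3.5),
            ha="center", bbox=dict(boxstyle="round", fc="white"),
            arrowprops=dict(arrowstyle="-"))
ax.annotate("Sure, give me a moment.", xy=(7.6, 2.4), xytext=(6.5, 3.2),
            ha="center", bbox=dict(boxstyle="round", fc="white"),
            arrowprops=dict(arrowstyle="-"))

ax.set_xlim(0, 10)
ax.set_ylim(0, 5)
ax.axis("off")
fig.savefig("agents_chatting.png")  # per the prompt, no plt.show()
```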
See the results folder for code outputs.
See what OpenAI's API generated.
Key | Meaning |
---|---|
✅ | Valid code and formatting |
🔶 | Almost correct |
❌ | Invalid code, formatting, or didn't understand task |
Model | Run 1 | Run 2 | Notes |
---|---|---|---|
CodeLlama 7b Python | ❌ | ❌ | |
CodeLlama 13b Python | ❌ | ❌ | |
CodeLlama 34b Instruct | ❌ | 🔶 | Coded an animation with speech bubbles; had a bug in the function parameters which, once fixed, produced animated text |
CodeLlama 34b Python | ❌ | ❌ | |
DeepSeek Coder 6.7b | ❌ | ❌ | Produced a chat conversation |
Llama2 7b Chat | ❌ | ❌ | Produced a chat conversation |
Llama2 13b Chat | ❌ | ❌ | Produced a chat conversation |
Mistral 7b 0.2 Instruct | ❌ | ❌ | Produced a chat conversation |
Mixtral 8x7b Q4 | ❌ | ❌ | Produced a chat conversation |
Mixtral 8x7b Q5 | ❌ | ❌ | Produced a chat conversation |
Neural Chat 7b Chat | ❌ | ❌ | Tried to create a visual animation but failed, ending up producing a chat conversation |
Nexus Raven | ❌ | ❌ | |
OpenHermes 7b Mistral | ❌ | ❌ | Produced a chat conversation |
Orca2 13b | ❌ | ❌ | |
Phi-2 | ❌ | ❌ | |
Phind-CodeLlama34b | 🔶 | 🔶 | Valid visual generation of two agents' text chat (no imagery, just text on a figure) |
Qwen 14b | ❌ | ❌ | Produced a chat conversation |
Solar 10.7b Instruct | ❌ | ❌ | Produced a chat conversation |
StarCoder2 3b | |||
StarCoder2 7b | |||
StarCoder2 15b | |||
Yi-34b Chat | 🔶 | 🔶 | Close to a valid drawing, outdated libraries |
Background: Tests an LLM's ability to incorporate a termination word into its response.
Scenario: A Group Chat with a Story_writer and a Product_manager. The Story_writer is to write some story ideas and the Product_manager is to review them and terminate when satisfied by outputting a specific word (e.g. "TERMINATE", "BAZINGA", etc.).
Story_writer's system message: An ideas person, loves coming up with ideas for kids books.
Product_manager's system message: Great in evaluating story ideas from your writers and determining whether they would be unique and interesting for kids. Reply with suggested improvements if they aren't good enough, otherwise reply '{termination_word}' at the end when you're satisfied there's one good story idea.
Prompt for the chat manager: Come up with 3 story ideas for Grade 3 kids.
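For context, here is a minimal sketch of the test harness, assuming AutoGen's 0.2-style pyautogen API; the local endpoint details, max_round, and agent wiring are illustrative assumptions, not the exact code used:

```python
import autogen

# Placeholder config for a local LLM endpoint (model name and URL are assumptions)
llm_config = {
    "config_list": [
        {"model": "local-model", "base_url": "http://localhost:8000/v1", "api_key": "NotRequired"}
    ]
}

termination_word = "TERMINATE"  # swapped for ACBDEGFHIKJL, AUTOGENDONE, etc. on other runs

story_writer = autogen.AssistantAgent(
    name="Story_writer",
    system_message="An ideas person, loves coming up with ideas for kids books.",
    llm_config=llm_config,
)

product_manager = autogen.AssistantAgent(
    name="Product_manager",
    system_message=(
        "Great in evaluating story ideas from your writers and determining whether "
        "they would be unique and interesting for kids. Reply with suggested "
        "improvements if they aren't good enough, otherwise reply "
        f"'{termination_word}' at the end when you're satisfied there's one good story idea."
    ),
    llm_config=llm_config,
)

groupchat = autogen.GroupChat(
    agents=[story_writer, product_manager], messages=[], max_round=10
)
manager = autogen.GroupChatManager(
    groupchat=groupchat,
    llm_config=llm_config,
    # End the chat when the termination word appears in a reply
    is_termination_msg=lambda msg: termination_word in (msg.get("content") or ""),
)

user_proxy = autogen.UserProxyAgent(
    name="user", human_input_mode="NEVER", code_execution_config=False
)
user_proxy.initiate_chat(manager, message="Come up with 3 story ideas for Grade 3 kids.")
```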
See the results folder for code outputs.
Note 1: TERMINATE is the standard used by AutoGen.
Note 2: Some LLMs included the terminating word but the quality of the full response was not perfect.
Key | Meaning |
---|---|
✅ | Included termination word correctly |
❌ | Performed the task but didn't output the termination word |
👎 | Didn't understand/participate in task |
There were two runs for each word.
Model | TERMINATE | ACBDEGFHIKJL | AUTOGENDONE | BAZINGA! | DONESKI | Notes |
---|---|---|---|---|---|---|
CodeLlama 7b Python | 👎 👎 | 👎 👎 | 👎 👎 | 👎 👎 | 👎 👎 | |
CodeLlama 13b Python | 👎 👎 | 👎 👎 | 👎 👎 | 👎 👎 | 👎 👎 | |
CodeLlama 34b Instruct | ✅ ✅ | ❌ ✅ | ❌ ✅ | ✅ ✅ | ✅ ❌ | |
CodeLlama 34b Python | 👎 👎 | 👎 👎 | 👎 👎 | 👎 👎 | 👎 👎 | |
DeepSeek Coder 6.7b | ❌ 👎 | 👎 👎 | 👎 👎 | 👎 👎 | 👎 👎 | |
Llama2 7b Chat | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | |
Llama2 13b Chat | ❌ ✅ | ❌ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | |
Mistral 7b 0.2 Instruct | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | |
Mixtral 8x7b Q4 | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | |
Mixtral 8x7b Q5 | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | |
Neural Chat 7b Chat | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | |
Nexus Raven | 👎 👎 | 👎 👎 | 👎 👎 | 👎 👎 | 👎 👎 | Tried to call a Python function to create the stories |
OpenHermes 7b Mistral | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | |
Orca2 13b | ✅ ✅ | ❌ ✅ | ❌ ✅ | ❌ ✅ | ✅ ✅ | |
Phi-2 | 👎 👎 | 👎 ❌ | ❌ ❌ | ❌ ❌ | ❌ 👎 | |
Phind-CodeLlama34b | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | |
Qwen 14b | ❌ ❌ | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | |
Solar 10.7b Instruct | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | |
StarCoder2 3b | ||||||
StarCoder2 7b | ||||||
StarCoder2 15b | ||||||
Yi-34b Chat | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ | ✅ ✅ |