This is a Spice.ai data and AI app.

Prerequisites:

- Spice.ai CLI installed
- OpenAI API key
- Hugging Face API token (optional, for the LLaMA model)
- curl and jq for API calls
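One way to provide these keys is via environment variables, which Spice can resolve through its secret store (a sketch; the names below match the `secrets:` references used later in this guide, and the values are placeholders):

```bash
# Placeholder values; replace with your own keys.
export SPICE_OPENAI_API_KEY="sk-..."       # referenced as ${ secrets:SPICE_OPENAI_API_KEY }
export SPICE_HUGGINGFACE_API_KEY="hf_..."  # referenced as ${ secrets:SPICE_HUGGINGFACE_API_KEY }
```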
To learn more about Spice.ai, take a look at the following resources:
- Spice.ai - learn about Spice.ai features, data, and API.
- Get started with Spice.ai - try out the API and make basic queries.
- Connect with us on Discord - your feedback is appreciated!
To deploy this app to the Spice.ai Cloud Platform:

- Fork the repository https://github.com/jeadie/evals into your GitHub org.
- Log into the Spice.ai Cloud Platform and create a new app called evals. The app will start empty.
- Connect the app to your repository:
  - Go to the App Settings tab and select Connect Repository.
  - If the repository is not yet linked, follow the prompts to authenticate and link it.
- Set the app to Public:
  - Navigate to the app's settings and toggle the visibility to public.
- Redeploy the app:
  - Click Redeploy to load the datasets and configurations from the repository.
- Check the datasets in the Spice.ai Cloud:
  - Verify that the datasets are correctly loaded and accessible.
- Test public access:
  - Log in with a different account to confirm the app is accessible to external users.
To run the app locally:

- Initialize a new local Spice app:

```bash
mkdir demo
cd demo
spice init
```
- Log in to Spice.ai Cloud:

```bash
spice login
```
- Get the spicepod from Spicerack: navigate to spicerack.org, search for evals, click on /evals, click on Use this app, and copy the spice connect command. Paste it into the terminal:
```bash
spice connect <username>/evals
```
The `spicepod.yml` should be updated to:
```yaml
version: v1beta1
kind: Spicepod
name: demo
dependencies:
  - Jeadie/evals
```
- Add a model to the spicepod:

```yaml
models:
  - name: gpt-4o
    from: openai:gpt-4o
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
```
- Start Spice:

```bash
spice run
```
- Run an eval:

```bash
curl -XPOST "http://localhost:8090/v1/evals/taxes" \
  -H "Content-Type: application/json" \
  -d '{ "model": "gpt-4o" }' | jq
```
- Explore incorrect results:

```bash
spice sql
```

```sql
SELECT input, output, actual
FROM eval.results
WHERE value = 0.0
LIMIT 5;
```
- Track the outputs of all AI model calls:

```yaml
runtime:
  task_history:
    captured_output: truncated
```
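With captured output enabled, every model call is recorded in the `runtime.task_history` table. A quick way to inspect what was captured (a sketch; the column names are the ones used by the view defined in the next step):

```sql
-- Peek at recent AI completions and their captured outputs.
SELECT task, input, captured_output
FROM runtime.task_history
WHERE task = 'ai_completion'
LIMIT 5;
```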
- Define new views and an evaluation:

```yaml
views:
  - name: user_queries
    sql: |
      SELECT
        json_get_json(input, 'messages') AS input,
        json_get_str((captured_output -> 0), 'content') AS ideal
      FROM runtime.task_history
      WHERE task = 'ai_completion'
  - name: latest_eval_runs
    sql: |
      SELECT model, MAX(created_at) AS latest_run
      FROM eval.runs
      GROUP BY model
  - name: model_stats
    sql: |
      SELECT
        r.model,
        COUNT(*) AS total_queries,
        SUM(CASE WHEN res.value = 1.0 THEN 1 ELSE 0 END) AS correct_answers,
        AVG(res.value) AS accuracy
      FROM eval.runs r
      JOIN latest_eval_runs lr ON r.model = lr.model AND r.created_at = lr.latest_run
      JOIN eval.results res ON res.run_id = r.id
      GROUP BY r.model

evals:
  - name: mimic-user-queries
    description: |
      Evaluates how well a model can copy the exact answers already returned
      to a user. Useful for testing if a smaller/cheaper model is sufficient.
    dataset: user_queries
    scorers:
      - match
```
- Add a smaller model to the spicepod:

```yaml
models:
  - name: llama3
    from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
    params:
      hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }
  - name: gpt-4o # Keep the previous model.
    from: openai:gpt-4o
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
```
- Verify models are loaded:

```bash
spice models
```

You should see both models listed:

```
NAME    FROM                                                          STATUS
gpt-4o  openai:gpt-4o                                                 ready
llama3  huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct  ready
```
- Restart the Spice app:

```bash
spice run
```
- Test the new model in a chat session, or run another eval:

```bash
spice chat
```
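You can also exercise a specific model directly through the runtime's OpenAI-compatible HTTP endpoint (a sketch; it assumes the same local runtime and port used for the eval calls above):

```bash
# Ask the local llama3 model a question via the chat completions API.
curl -XPOST "http://localhost:8090/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello! What can you do?"}]
  }' | jq
```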
- Run evaluations on both models:

```bash
# Run eval with GPT-4o
curl -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o"}' | jq

# Run eval with LLaMA
curl -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3"}' | jq
```
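Since the two requests differ only in the model name, a small loop over the models works just as well (a convenience sketch using the same endpoint):

```bash
for model in gpt-4o llama3; do
  echo "Evaluating $model..."
  curl -s -XPOST "http://localhost:8090/v1/evals/mimic-user-queries" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\"}" | jq
done
```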
- Compare model performance:

```bash
spice sql
```

```sql
SELECT
  model,
  total_queries,
  correct_answers,
  ROUND(accuracy * 100, 2) AS accuracy_percentage
FROM model_stats
ORDER BY accuracy_percentage DESC;
```
This query will show:
- Total number of queries processed
- Number of correct answers
- Accuracy as a percentage
You can use these metrics to decide if the smaller model provides acceptable performance for your use case.
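To see where the models actually disagree, you can also pull individual examples from each model's latest run (a sketch; it reuses the eval.results columns queried earlier and the latest_eval_runs view):

```sql
-- Side-by-side inputs and outputs for examples a model got wrong.
SELECT r.model, res.input, res.output, res.actual
FROM eval.results res
JOIN eval.runs r ON res.run_id = r.id
JOIN latest_eval_runs lr ON r.model = lr.model AND r.created_at = lr.latest_run
WHERE res.value = 0.0
LIMIT 10;
```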
The complete `spicepod.yml`, for reference:
```yaml
version: v1beta1
kind: Spicepod
name: demo

dependencies:
  - Jeadie/evals

runtime:
  task_history:
    captured_output: truncated

views:
  - name: user_queries
    sql: |
      SELECT
        json_get_json(input, 'messages') AS input,
        json_get_str((captured_output -> 0), 'content') AS ideal
      FROM runtime.task_history
      WHERE task = 'ai_completion'
  - name: latest_eval_runs
    sql: |
      SELECT model, MAX(created_at) AS latest_run
      FROM eval.runs
      GROUP BY model
  - name: model_stats
    sql: |
      SELECT
        r.model,
        COUNT(*) AS total_queries,
        SUM(CASE WHEN res.value = 1.0 THEN 1 ELSE 0 END) AS correct_answers,
        AVG(res.value) AS accuracy
      FROM eval.runs r
      JOIN latest_eval_runs lr ON r.model = lr.model AND r.created_at = lr.latest_run
      JOIN eval.results res ON res.run_id = r.id
      GROUP BY r.model

evals:
  - name: mimic-user-queries
    description: |
      Evaluates how well a model can copy the exact answers already returned
      to a user. Useful for testing if a smaller/cheaper model is sufficient.
    dataset: user_queries
    scorers:
      - match

models:
  - name: gpt-4o
    from: openai:gpt-4o
    params:
      openai_api_key: ${ secrets:SPICE_OPENAI_API_KEY }
  - name: llama3
    from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
    params:
      hf_token: ${ secrets:SPICE_HUGGINGFACE_API_KEY }
```