[sharktank] Split Perplexity CI #452

Open: wants to merge 7 commits into main
@@ -4,7 +4,7 @@
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

name: CI - Perplexity
name: CI - Perplexity IREE

on:
workflow_dispatch:
@@ -21,9 +21,9 @@ concurrency:
cancel-in-progress: true

jobs:
test_perplexity_vmfb:
test_perplexity_iree:
timeout-minutes: 1000
name: "IREE/vmfb"
name: "Perplexity IREE"
strategy:
matrix:
version: [3.11]
@@ -71,51 +71,5 @@ jobs:
iree-base-compiler \
iree-base-runtime \
"numpy<2.0"
- name: Run perplexity test with vmfb
run: pytest -n 8 -v -s sharktank/tests/evaluate/perplexity_vmfb_test.py --longrun --iree-device='hip://7' --iree-hip-target=gfx942 --iree-hal-target-backends=rocm --llama3-8b-f16-model-path=/data/llama3.1/8b/llama8b_f16.irpa --llama3-8b-tokenizer-path=/data/llama3.1/8b/tokenizer_config.json

test_perplexity_torch:
timeout-minutes: 1000
name: "Torch/eager mode"
strategy:
matrix:
version: [3.11]
runs-on: [llama-mi300x-3]
fail-fast: false
runs-on: ${{matrix.runs-on}}
defaults:
run:
shell: bash
env:
PIP_CACHE_DIR: "${{ github.workspace }}/.pip-cache"
SHARK_PLATFORM_REPO_ROOT: ${{ github.workspace }}
steps:
- name: "Setting up Python"
id: setup_python
uses: actions/setup-python@v3
with:
python-version: ${{matrix.version}}

- name: "Checkout Code"
uses: actions/checkout@v3

- name: Cache Pip Packages
uses: actions/cache@v4
id: cache-pip
with:
path: ${{ env.PIP_CACHE_DIR }}
key: pip-${{ steps.setup_python.outputs.python-version }}-${{ hashFiles('*requirements.txt') }}

- name: Install sharktank deps
run: |
python -m pip install --no-compile --upgrade pip
# Note: We install in three steps in order to satisfy requirements
# from non default locations first. Installing the PyTorch CPU
# wheels saves multiple minutes and a lot of bandwidth on runner setup.
pip install --no-compile -r pytorch-cpu-requirements.txt
pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps \
-e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine"
pip install --no-compile -r requirements.txt -r sharktank/requirements-tests.txt -e sharktank/

- name: Run perplexity test in eager mode
run: pytest -n 8 -v -s sharktank/tests/evaluate/perplexity_torch_test.py --longrun --llama3-8b-f16-model-path=/data/llama3.1/8b/llama8b_f16.irpa --llama3-8b-tokenizer-path=/data/llama3.1/8b/tokenizer_config.json
- name: Run perplexity test with IREE
run: pytest -n 8 -v -s sharktank/tests/evaluate/perplexity_iree_test.py --longrun --iree-device='hip://7' --iree-hip-target=gfx942 --iree-hal-target-backends=rocm --llama3-8b-f16-model-path=/data/llama3.1/8b/llama8b_f16.irpa --llama3-8b-tokenizer-path=/data/llama3.1/8b/tokenizer_config.json
68 changes: 68 additions & 0 deletions .github/workflows/ci_eval_torch.yaml
@@ -0,0 +1,68 @@
# Copyright 2024 Advanced Micro Devices, Inc.
#
# Licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

name: CI - Perplexity Torch

on:
workflow_dispatch:
schedule:
# Weekdays nightly at 07:00 UTC = 23:00 PST / 00:00 PDT.
- cron: "0 7 * * 1-5"

concurrency:
# A PR number if a pull request and otherwise the commit hash. This cancels
# queued and in-progress runs for the same PR (presubmit) or commit
# (postsubmit). The workflow name is prepended to avoid conflicts between
# different workflows.
group: ${{ github.workflow }}-${{ github.event.number || github.sha }}
cancel-in-progress: true

jobs:
test_perplexity_torch:
timeout-minutes: 1000
name: "Perplexity Torch"
strategy:
matrix:
version: [3.11]
runs-on: [llama-mi300x-3]
fail-fast: false
runs-on: ${{matrix.runs-on}}
defaults:
run:
shell: bash
env:
PIP_CACHE_DIR: "${{ github.workspace }}/.pip-cache"
SHARK_PLATFORM_REPO_ROOT: ${{ github.workspace }}
steps:
- name: "Setting up Python"
id: setup_python
uses: actions/setup-python@v3
with:
python-version: ${{matrix.version}}

- name: "Checkout Code"
uses: actions/checkout@v3

- name: Cache Pip Packages
uses: actions/cache@v4
id: cache-pip
with:
path: ${{ env.PIP_CACHE_DIR }}
key: pip-${{ steps.setup_python.outputs.python-version }}-${{ hashFiles('*requirements.txt') }}

- name: Install sharktank deps
run: |
python -m pip install --no-compile --upgrade pip
# Note: We install in three steps in order to satisfy requirements
# from non default locations first. Installing the PyTorch CPU
# wheels saves multiple minutes and a lot of bandwidth on runner setup.
pip install --no-compile -r pytorch-cpu-requirements.txt
pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps \
-e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine"
pip install --no-compile -r requirements.txt -r sharktank/requirements-tests.txt -e sharktank/
Comment on lines +56 to +65
Member:
As this workflow runs nightly, we could also switch from an explicit full project build with the latest deps of all packages to a nightly release build: https://github.com/nod-ai/SHARK-Platform/blob/main/docs/nightly_releases.md#quickstart---sharktank. For a number of these workflows I think we should be testing with both the stable versions of dependencies (iree-base-compiler, iree-base-runtime, iree-turbine) and the latest nightly / source versions of each.
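
A minimal sketch of the stable-deps variant (package names as published on PyPI; the exact pins, and whether to pull sharktank itself from the nightly wheels in the linked quickstart, are assumptions):

```bash
# Illustrative only: stable IREE dependencies from PyPI instead of a full
# source build, with sharktank still installed from this checkout.
pip install iree-base-compiler iree-base-runtime iree-turbine
pip install -r requirements.txt -r sharktank/requirements-tests.txt -e sharktank/
```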

Collaborator Author:
I feel this should be a separate PR where all CIs are switched to the latest nightly release build. If not, I can update it here.

As for testing both stable and nightly versions of IREE, this CI might not be the right candidate. The 8B fp16 run currently takes about 3 hours, and we plan to add more models, quantizations, and decomposed/non-decomposed variants, stretching a single set of models to more than 12 hours.


- name: Run perplexity test with Torch
run: pytest -n 8 -v -s sharktank/tests/evaluate/perplexity_torch_test.py --longrun --llama3-8b-f16-model-path=/data/llama3.1/8b/llama8b_f16.irpa --llama3-8b-tokenizer-path=/data/llama3.1/8b/tokenizer_config.json
Comment on lines +67 to +68
Member:
> Split Perplexity CI nightly workflow to Torch and IREE to be able to fetch/read their status separately, providing more clarity on which repo regressed.

Can you clarify what you mean by "which repo regressed"? We should generally only be testing things we control here.

What about the logs at https://github.com/nod-ai/SHARK-Platform/actions/runs/11659531084 isn't clear?

I want to trend towards a smaller number of workflow files, not a larger one. I'm already confused enough by the list at https://github.com/nod-ai/SHARK-Platform/actions and https://github.com/nod-ai/SHARK-Platform/tree/main/.github/workflows. We have a mix of workflows defined by subproject (e.g. sharktank, shortfin, tuner), model (e.g. llama, sdxl), or by test category (e.g. perplexity, eval). There is quite a bit of overlap there, and as long as it isn't obvious which workflow a given test belongs in, people will just add a new workflow uniquely suited to that purpose. I've been refactoring workflows lately as part of rolling out packaging, and we have a substantial amount of copy/paste and eventual drift between workflows that I'm having to navigate.

Collaborator Author:
I agree that there is a better way to categorize the workflows.

The sole reason to split ci_eval.yaml is to have two workflow badges in sharktank/README.md, one for Torch and one for IREE. That way, when IREE fails we know it's an IREE regression and not a sharktank one.

I understand the logs are clear, but for our devs, workflow badges offer an easier way to stay informed about CI regressions, if any, without having to dig through the workflow logs. Like you said, there are quite a few of them.

Let me know if there is an alternate way to get two workflow badges, one per job, from a single workflow YAML without splitting it. I couldn't find one.

Member:
Thanks. I'll respond in more detail tomorrow. Re-requested review to keep it in my queue.

Member:
Ah, so https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/evaluate/perplexity_torch.py stays in native PyTorch to test our model implementations, while https://github.com/nod-ai/SHARK-Platform/blob/main/sharktank/sharktank/evaluate/perplexity_vmfb.py exports to IREE then compiles and runs, right?

> The sole reason to split ci_eval.yaml is to have two workflow badges in sharktank/README.md, one for Torch and one for IREE. That way, when IREE fails we know it's an IREE regression and not a sharktank one.

I'm not sure I buy this argument. Taken to the extreme, we could have one workflow per test, so we know exactly which test failed based on the workflow badges.

The possible technical reasons to split the workflow are:

- finer control over workflow triggers, e.g. run IREE eval on every commit and PyTorch eval nightly, or use workflow_dispatch to run one job but not the other (see the trigger sketch below)
- monitoring via workflow badges

For the monitoring side, regressions should be rare enough and addressed quickly enough that a bit of clicking through to see which sub-job failed is pretty reasonable in my opinion.

As these are long running jobs, I do actually like splitting for the other reason.
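
For illustration, split triggers could look something like the sketch below (the cadence shown is hypothetical, not what this PR configures):

```yaml
# ci_eval_iree.yaml (hypothetical cadence): run on every push to main
on:
  push:
    branches: [main]
  workflow_dispatch:
---
# ci_eval_torch.yaml: nightly only
on:
  schedule:
    - cron: "0 7 * * 1-5"
  workflow_dispatch:
```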

Collaborator Author:
That's correct about the 2 perplexity scripts.

There are a lot of fast-moving pieces across IREE, codegen, and sharktank, which is why we feel the need for separate badges to track regressions faster. As you may have noticed, yesterday's nightly broke due to an iree-turbine regression.

2 changes: 1 addition & 1 deletion sharktank/README.md
@@ -12,7 +12,7 @@ tooling.

## Project Status

[![CI - Perplexity](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci_eval.yaml/badge.svg?branch=main&event=schedule)](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci_eval.yaml)
[![CI - Perplexity Torch](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci_eval_torch.yaml/badge.svg?branch=main&event=schedule)](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci_eval_torch.yaml) [![CI - Perplexity IREE](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci_eval_iree.yaml/badge.svg?branch=main&event=schedule)](https://github.com/nod-ai/SHARK-Platform/actions/workflows/ci_eval_iree.yaml)

## Examples

17 changes: 16 additions & 1 deletion sharktank/sharktank/evaluate/README.md
@@ -9,16 +9,31 @@ pip install -r sharktank/requirements-tests.txt

### Perplexity

The perplexity score measures the ability of a language model to predict the next token in a sequence. A lower score indicates that the model has higher certainty in its predictions. Perplexity acts as an intrinsic evaluation metric of model quality, independent of any downstream task.

In SHARK-Platform, we use perplexity to track code regressions and quality loss across quantized models (with FP16 as baseline). We use 100 prompts from the Wikitext-2 test set and calculate the mean perplexities shown below. These numbers are neither comparable between models with different tokenizers nor with other projects due to varying implementations.
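
Concretely, the perplexity reported here follows the standard definition, the exponential of the mean negative log-likelihood over the evaluated tokens (the exact prompt masking and averaging in the evaluation scripts may differ in detail):

```math
\mathrm{PPL}(x_{1 \dots N}) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right)\right)
```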

Test perplexity for Llama3.1 8B & 405B (FP16 & FP8) models:

```bash
pytest sharktank/tests/evaluate/perplexity_test.py --longrun
```

Get perplexity for a new model:
Calculate the perplexity for a new model:

```bash
python -m sharktank.evaluate.perplexity \
--gguf-file=llama3_70b_f16.gguf \
--tokenizer-config-json=tokenizer_config.json
```

### LLaMA 3.1 Scoreboard

| CPU | GPU |
|:---------------|:-----------|
| AMD EPYC 9554 | MI300X |


|Models |Model size (GB) |Torch |IREE |
|:--------|:---------------|:----------|:----------|
|8B f16 |16.07 |14.930181 |14.991893 |
@@ -8,7 +8,7 @@
import pytest
import json

from sharktank.evaluate import perplexity_vmfb
from sharktank.evaluate import perplexity_iree

longrun = pytest.mark.skipif("not config.getoption('longrun')")

@@ -35,7 +35,7 @@ def test_llama3_8B_f16_decomposed(self):
model_name = "llama3_8B_f16_decomposed_vmfb"
baseline_perplexity = self.baseline_perplexity[model_name]

current_perplexity = perplexity_vmfb.main(
current_perplexity = perplexity_iree.main(
[
f"--irpa-file={self.llama3_8b_f16_model}",
f"--tokenizer-config-json={self.llama3_8b_tokenizer}",
@@ -70,7 +70,7 @@ def test_llama3_8B_f16(self):
model_name = "llama3_8B_f16_vmfb"
baseline_perplexity = self.baseline_perplexity[model_name]

current_perplexity = perplexity_vmfb.main(
current_perplexity = perplexity_iree.main(
[
f"--irpa-file={self.llama3_8b_f16_model}",
f"--tokenizer-config-json={self.llama3_8b_tokenizer}",
@@ -105,7 +105,7 @@ def test_llama3_8B_fp8_decomposed(self):
model_name = "llama3_8B_fp8_decomposed_vmfb"
baseline_perplexity = self.baseline_perplexity[model_name]

current_perplexity = perplexity_vmfb.main(
current_perplexity = perplexity_iree.main(
[
f"--irpa-file={self.llama3_8b_fp8_model}",
f"--tokenizer-config-json={self.llama3_8b_tokenizer}",
@@ -140,7 +140,7 @@ def test_llama3_8B_fp8(self):
model_name = "llama3_8B_fp8_vmfb"
baseline_perplexity = self.baseline_perplexity[model_name]

current_perplexity = perplexity_vmfb.main(
current_perplexity = perplexity_iree.main(
[
f"--irpa-file={self.llama3_8b_fp8_model}",
f"--tokenizer-config-json={self.llama3_8b_tokenizer}",
@@ -175,7 +175,7 @@ def test_llama3_405B_f16_decomposed(self):
model_name = "llama3_405B_f16_decomposed_vmfb"
baseline_perplexity = self.baseline_perplexity[model_name]

current_perplexity = perplexity_vmfb.main(
current_perplexity = perplexity_iree.main(
[
f"--irpa-file={self.llama3_405b_f16_model}",
f"--tokenizer-config-json={self.llama3_405b_tokenizer}",
@@ -210,7 +210,7 @@ def test_llama3_405B_f16(self):
model_name = "llama3_405B_f16_vmfb"
baseline_perplexity = self.baseline_perplexity[model_name]

current_perplexity = perplexity_vmfb.main(
current_perplexity = perplexity_iree.main(
[
f"--irpa-file={self.llama3_405b_f16_model}",
f"--tokenizer-config-json={self.llama3_405b_tokenizer}",
@@ -245,7 +245,7 @@ def test_llama3_405B_fp8_decomposed(self):
model_name = "llama3_405B_fp8_decomposed_vmfb"
baseline_perplexity = self.baseline_perplexity[model_name]

current_perplexity = perplexity_vmfb.main(
current_perplexity = perplexity_iree.main(
[
f"--irpa-file={self.llama3_405b_fp8_model}",
f"--tokenizer-config-json={self.llama3_405b_tokenizer}",
@@ -280,7 +280,7 @@ def test_llama3_405B_fp8(self):
model_name = "llama3_405B_fp8_vmfb"
baseline_perplexity = self.baseline_perplexity[model_name]

current_perplexity = perplexity_vmfb.main(
current_perplexity = perplexity_iree.main(
[
f"--irpa-file={self.llama3_405b_fp8_model}",
f"--tokenizer-config-json={self.llama3_405b_tokenizer}",