
Improvements to Benchmark Scripts and Config Generation Workflow #13

Merged
merged 9 commits into from
May 20, 2024

Conversation

fabianlim
Contributor

@fabianlim fabianlim commented May 17, 2024

Improvements to benchmarks

From now on, all benchmarks must be run in a tox environment, for package version hygiene.

tox -e run_benches
  • note: since this PR is not yet merged, FHT_BRANCH=accel-pr must be set when running the command above

In addition

  • the tox command above accepts the environment variables DRY_RUN, NO_DATA_PROCESSING, and NO_OVERWRITE; see scripts/run_benchmarks.sh
  • run_benchmarks.sh clears RESULT_DIR if it already exists, to avoid contamination with old results. To protect against accidental overwrites, always run with NO_OVERWRITE=true.
  • run_benchmarks.sh now produces two CSVs:
    • raw_summary.csv: the original file previously called summary.csv; it contains the raw results
    • benchmarks.csv: a processed version of raw_summary.csv that contains only the columns that differ across runs, for easier viewing
    • TODO: consider providing one more file for the columns that are the same
  • run_benchmarks.sh also runs pip freeze inside the tox environment to produce a requirements.txt file
    • we should check in this file as well
    • TODO: does tox have a version lock file?
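The benchmarks.csv post-processing described above can be sketched roughly as follows. This is a minimal illustration of the "keep only the columns that differ" idea, not the actual logic in scripts/run_benchmarks.sh; the function names are assumptions, only the two file names match the PR.

```python
import csv

def columns_that_differ(rows):
    """Return the column names whose values are not identical across all rows."""
    if not rows:
        return []
    return [col for col in rows[0] if len({row[col] for row in rows}) > 1]

def write_processed(raw_path, out_path):
    # Read the raw results (e.g. raw_summary.csv) and keep only the
    # columns that vary between benchmark runs (e.g. benchmarks.csv).
    with open(raw_path, newline="") as f:
        rows = list(csv.DictReader(f))
    keep = columns_that_differ(rows)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=keep)
        writer.writeheader()
        for row in rows:
            writer.writerow({col: row[col] for col in keep})
```

Columns that are constant across every run (the TODO above) would be the complement of `columns_that_differ`.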

Script for Producing CSV report

After running a few benchmarks, we can gather all the results into a single CSV report.

  • this can be done incrementally, even while the benches are still running
  • it works with multiple benchmark directories; just specify them one after another
# do the following in the repo directory
# activate the tox environment
source .tox/run-benches/bin/activate

# run the display-bench-results.py on a directory with benchmark results 
# - say "benchmark_outputs"
PYTHONPATH= python scripts/benchmarks/display-bench-results.py benchmark_outputs

This will produce output like the following, and the resulting .csv report can then be read with pandas.read_csv:

***************** Report Created ******************
Total lines: '48'
Number columns included: '20'
Number columns excluded: '20'
Excluding number of exceptions caught: '0'
Written report to 'results.csv'
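Loading the report back for analysis is then a one-liner. A small sketch (the helper name is an assumption; only the default file name matches the log above):

```python
import pandas as pd

def load_report(path="results.csv"):
    """Load the CSV report produced by display-bench-results.py.

    Each row is one benchmark run; the columns are whatever the
    report script decided to include.
    """
    return pd.read_csv(path)
```

From there, standard pandas operations (sorting, grouping by scenario, plotting) apply directly.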

Improvements to Generate Configs

We added a new tox -e verify-configs environment to ensure that the configs are correctly generated.

  • this is now enabled as a workflow
  • the workflow runs tox -e gen-configs and tests the output against the files that were checked in; if they differ, the build fails
  • this ensures that the sample-configurations are always up-to-date with the plugin configs
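The verification amounts to a regenerate-and-diff check. A minimal sketch of the idea, under the assumption that the freshly generated configs land in one path and the checked-in copies in another (the function names are hypothetical; the real check lives in the tox/workflow configuration):

```python
import filecmp
import sys

def configs_match(generated_path: str, checked_in_path: str) -> bool:
    """Return True if a freshly generated config file is byte-identical
    to the checked-in copy (shallow=False compares contents, not stat)."""
    return filecmp.cmp(generated_path, checked_in_path, shallow=False)

def verify_or_fail(generated_path: str, checked_in_path: str) -> None:
    # Fail the build (non-zero exit) when the generated configs have drifted
    # from what was checked in.
    if not configs_match(generated_path, checked_in_path):
        sys.exit("generated configs differ from checked-in configs")
```

Exiting non-zero on a mismatch is what makes the CI workflow fail the build.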

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@fabianlim fabianlim changed the base branch from main to dev May 17, 2024 06:26
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@fabianlim fabianlim requested a review from achew010 May 18, 2024 08:31
@fabianlim fabianlim merged commit 1c790ed into dev May 20, 2024
2 checks passed
@fabianlim fabianlim deleted the gen-configs branch May 20, 2024 10:17
fabianlim added a commit that referenced this pull request May 27, 2024
…or GPTQ-LoRA (#20)

* Add GitHub Workflow for Linting , Formatting and Test. Activate Workflow for Framework (#7)

* add lint workflow

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add pylintrc, update .tox fix files

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* activate test and minor fix

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* lint benchmarks.py and add workflow to dev

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* Improvements to Benchmark Scripts and Config Generation Workflow (#13)

* fix benches and add verify configs

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* update readme and add workflow

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* add packaging dep

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* update torch dep in framework and run-benches

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* take host env in run-benches

* add display bench results script

* rename summary.csv to raw_summary.csv and update run_benchmarks.sh

* export environment variables in shell command

* dump out pip requirements for repro, and add default FHT_branch

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* Added support for running official HF baseline FSDP-QLoRA benchmark (#16)

* new baseline scenario

* rename variables

* added warning when plugin allows SFTTrainer to handle PEFT on single device

* Fix FSDP when performing GPTQ-LoRA with Triton V2  (#15)

* wrap in parameters and torch view to correct dtype

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* refactor to apply patch only on FSDP and simplify

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* Provide Memory Benchmarking Feature to Benchmarking Code (#14)

* add gpu memory logging support

* made improvements to GPU reference and result collation

* Renamed memory logging argument to reflect its readings as reserved memory using nvidia-smi and changed aggregation function in result collation

* variable renames

* manual linting

* added memory logging functionality via HFTrainer

* added support to benchmark memory using HFTrainer and updated README with explanation of the 2 memory benchmarking options

* addressed changes requested in PR #14

* fix bug and simplify gpu logs aggregation logic

* fixes to calculation of HFTrainer Mem Logging values

* fix calculations

* more fixes

* fix to ignore including stage inside max calculation of alloc memory

* more comments and README updates

* added fix to keyerror due to empty output dict from OOM

* manual linting

* added benchmark results to refs

* remove unnecessary columns in results gathering

* made changes to results gathering

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Co-authored-by: achew010 <[email protected]>