Adding cost script to create a costs summary #26

Open
wants to merge 4 commits into
base: main
Changes from 2 commits
Binary file added .DS_Store
Binary file not shown.
1 change: 1 addition & 0 deletions Dockerfile
@@ -16,6 +16,7 @@ ADD scripts/enable_api.sh /opt/scripts/enable_api.sh
ADD scripts/estimate_billing.py /opt/scripts/estimate_billing.py
ADD scripts/persist_artifacts.py /opt/scripts/persist_artifacts.py
ADD scripts/costs_json_to_csv.py /opt/scripts/costs_json_to_csv.py
ADD scripts/cost_script.py /opt/scripts/cost_script.py

# GMS setup/run
ADD gms/resources.sh /opt/gms/resources.sh
10 changes: 10 additions & 0 deletions scripts/README.md
@@ -110,6 +110,16 @@ This functionality is also wrapped into estimate\_billing.py under the
I'd still run these separately just to have both, but if you're only
after the CSV this may be more convenient.

# cost\_script.py

This script runs on the costs TSV produced by costs\_json\_to\_csv.py.
Member commented:
Suggested edit:
`Takes the output of costs_json_to_csv.py and collapses tasks that have been rerun (due to failure or preemption) and those that have been split into shards, giving one cost for the entire task.`

Member Author commented:
So far I have only collapsed tasks that were split into shards, not the ones recorded as 'retry'.
So, given rows like these:
doBqsr.bqsr_shard-14_retry1
doBqsr.bqsr_shard-14
doBqsr.bqsr_shard-13_retry1
doBqsr.bqsr_shard-13
I collapsed them into:
doBqsr.bqsr
and
doBqsr.bqsr_retry1

That said, if we want to collapse all 4 of those into doBqsr.bqsr I can change the code to do that.

I'll also edit my comment with what you mentioned and make it clearer!

Member commented:
Nah, I think upon further reflection, keeping retries separate is probably the right move here, so yeah, you got it right. Nice work!
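
The shard-only collapse agreed on in this thread can be sketched as follows. `collapse_shards` is a hypothetical helper name, not code from the PR, and the sketch assumes shard suffixes always have the form `_shard-<digits>`.

```python
import re

# Hypothetical helper for illustration; the PR does this inline in cost_script.py.
# Assumption: shard suffixes always look like "_shard-<digits>".
SHARD_SUFFIX = re.compile(r"_shard-\d+")


def collapse_shards(call_name: str) -> str:
    """Remove '_shard-N' from a call name, leaving any '_retryN' suffix intact."""
    return SHARD_SUFFIX.sub("", call_name)


# The four call names quoted in the comment above.
rows = [
    "doBqsr.bqsr_shard-14_retry1",
    "doBqsr.bqsr_shard-14",
    "doBqsr.bqsr_shard-13_retry1",
    "doBqsr.bqsr_shard-13",
]

# Deduplicate the collapsed names to see which groups remain.
collapsed = sorted({collapse_shards(name) for name in rows})
print(collapsed)  # ['doBqsr.bqsr', 'doBqsr.bqsr_retry1']
```

Because only the shard suffix is stripped, first attempts and retries end up in two separate groups, which matches the behavior the reviewers settled on.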


It collapses tasks that were split into shards and adds up the costs for the same step; retries are kept as separate rows.
The summary is written to a CSV named costs\_report\_final.csv in the current directory.

Use as follows:

python3 /opt/scripts/cost_script.py costs.tsv

# Troubleshooting scripts

53 changes: 53 additions & 0 deletions scripts/cost_script.py
@@ -0,0 +1,53 @@
#!/usr/bin/env python3

"""
Converts costs TSV file to summary costs CSV file (costs_report_final.csv)

Usage: cost_script.py [costs tsv file]
"""
#Import modules
import sys
import pandas as pd
import regex as re

file = sys.argv[1]

#initialize list called table where we'll store all the values from tsv
table = []
with open(file) as f:
    for line in f:
        #strip the trailing newline so it does not end up in the last column, then split by tab
        fields = line.rstrip('\n').split('\t')
        table.append(fields)

#delete the '_shard-N' suffix from the call name so all shards of a task share one name. \d+ matches any number of digits, so 1-, 2- and 3-digit shards are all handled.
#a '_retryN' suffix is left in place, so retries are summed separately from first attempts.
for i in table:
    if "shard" in i[0]:
        i[0] = re.sub(r'_shard-\d+', '', i[0])


#convert list of lists to pandas dataframe using first list item as header. Grab specific columns we want. Drop the first row because it's just the list of column names
table_df = pd.DataFrame(table, columns=table[0])
table_df = table_df[["callName","totalCost","cpuCost","memoryCost","diskCost"]]
table_df=table_df.drop([0])

#convert all numerical values from strings to floats
table_df = table_df.astype({'totalCost':'float','cpuCost':'float','memoryCost':'float','diskCost':'float'})

#sum all rows with same callname
table_df_sum = table_df.groupby("callName").sum()

#sort by descending order of total cost
table_df_sum=table_df_sum.sort_values(by=['totalCost'], ascending=False)

#save to csv
table_df_sum.to_csv('costs_report_final.csv', index=True)
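
For comparison, the same summarization can be sketched with only the standard library. This is an illustrative alternative, not the PR's implementation: `summarize` is a hypothetical function name, the sample rows and their cost values are made up, and only the column names are taken from the script above.

```python
import csv
import io
import re
from collections import defaultdict

# Column names taken from cost_script.py; everything else here is illustrative.
COST_COLUMNS = ["totalCost", "cpuCost", "memoryCost", "diskCost"]


def summarize(tsv_text):
    """Sum cost columns per call name after stripping '_shard-N' suffixes."""
    totals = defaultdict(lambda: dict.fromkeys(COST_COLUMNS, 0.0))
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Collapse shards; a '_retryN' suffix stays, so retries form their own group.
        name = re.sub(r"_shard-\d+", "", row["callName"])
        for col in COST_COLUMNS:
            totals[name][col] += float(row[col])
    # Most expensive call first, as in the script's final sort.
    return sorted(totals.items(), key=lambda item: item[1]["totalCost"], reverse=True)


# Made-up sample data for demonstration only.
sample = (
    "callName\ttotalCost\tcpuCost\tmemoryCost\tdiskCost\n"
    "doBqsr.bqsr_shard-0\t0.5\t0.25\t0.125\t0.125\n"
    "doBqsr.bqsr_shard-1\t0.5\t0.25\t0.125\t0.125\n"
    "doBqsr.bqsr_shard-1_retry1\t0.25\t0.125\t0.0625\t0.0625\n"
    "align.bwa\t2.0\t1.0\t0.5\t0.5\n"
)

summary = summarize(sample)
for name, costs in summary:
    print(name, costs["totalCost"])
```

The two shard rows are merged into a single `doBqsr.bqsr` entry, while the retry row remains its own entry.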
3 changes: 3 additions & 0 deletions scripts/requirements.txt
@@ -1,3 +1,6 @@
numpy
pandas
regex
cwl_utils
miniwdl == 1.2.1
