Adding cost script to create a costs summary #26
Open: ksinghal28 wants to merge 4 commits into wustl-oncology:main from ksinghal28:main
007cd36 KS-added cost script and changed dockerfile and requirements file to … (ksinghal28)
88f6e6e Updating README for cost_script.py addition (ksinghal28)
8fa79c9 Updating to make regex command cleaner (ksinghal28)
1e1d152 Updating README verbiage to be clearer (ksinghal28)
@@ -0,0 +1,53 @@
#!/usr/bin/env python3

"""
Converts costs TSV file to summary costs TSV file

Usage: cost_script.py [costs tsv file]
"""
#Import modules
import sys
import pandas as pd
import regex as re

file=sys.argv[1]

#initialize list called table where we'll store all the values from tsv
table = []
with open(file) as f:
    for line in f:
        L = line.split('\t') #split by tab
        table.append(L)

#delete anything that resembles 'shard' followed by a number. If there's a 3-digit shard at some point, add a re.sub(\d\d\d) line before the re.sub(\d\d) line.
#have to go in descending order of numerical digits because otherwise will delete shard# and be left with name-of-task# which won't get deleted.
for i in table:

    if "shard" in i[0]:
        if "retry" in i[0]:
            # print("retry",i[0])
            i[0] = re.sub('_shard-\d\d','',i[0])
            i[0] = re.sub('_shard-\d','',i[0])
            # print(i[0])
        else:
            # print("no retry",i[0])
            i[0] = re.sub('_shard-\d\d','',i[0])
            i[0] = re.sub('_shard-\d','',i[0])
            # print(i[0])


#convert list of lists to pandas dataframe using first list item as header. Grab specific columns we want. Drop the first row because it's just the list of column names
table_df = pd.DataFrame(table, columns=table[0])
table_df = table_df[["callName","totalCost","cpuCost","memoryCost","diskCost"]]
table_df=table_df.drop([0])

#convert all numerical values from strings to floats
table_df = table_df.astype({'totalCost':'float','cpuCost':'float','memoryCost':'float','diskCost':'float'})

#sum all rows with same callname
table_df_sum = table_df.groupby("callName").sum()

#sort by descending order of total cost
table_df_sum=table_df_sum.sort_values(by=['totalCost'], ascending=False)

#save to csv
table_df_sum.to_csv('costs_report_final.csv', index=True)
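
The comment in the script above notes that a three-digit shard number would need an extra `re.sub` line. One possible alternative, offered only as a sketch and not as what the PR does, is a single pattern with `\d+`, which matches any number of digits in one pass. The names `SHARD_SUFFIX` and `strip_shard` are hypothetical, and the sketch uses the standard-library `re` module rather than the `regex` package imported in the script.

```python
import re

# Hypothetical name: "_shard-" followed by one or more digits.
# A raw string avoids Python treating "\d" as a string escape.
SHARD_SUFFIX = re.compile(r"_shard-\d+")


def strip_shard(call_name: str) -> str:
    """Remove the shard suffix and leave any retry suffix in place."""
    return SHARD_SUFFIX.sub("", call_name)


print(strip_shard("doBqsr.bqsr_shard-14_retry1"))  # doBqsr.bqsr_retry1
print(strip_shard("doBqsr.bqsr_shard-123"))        # doBqsr.bqsr
```

Because `\d+` is greedy, the ordering concern described in the script's comment (longer digit runs first) does not arise in this variant.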
@@ -1,3 +1,6 @@
numpy
pandas
regex
cwl_utils
miniwdl == 1.2.1
suggested edit:
`Takes the output of costs_json_to_csv.py and collapses tasks that have been rerun (due to failure or preemption) and those that have been split into shards, giving one cost for the entire task.`
As of now, I have only collapsed tasks that were split into shards, not ones that were recorded as 'retry'.
So, if there were rows such as:
doBqsr.bqsr_shard-14_retry1
doBqsr.bqsr_shard-14
doBqsr.bqsr_shard-13_retry1
doBqsr.bqsr_shard-13
I collapsed them into:
doBqsr.bqsr
and
doBqsr.bqsr_retry1
That said, if we want to collapse all 4 of those into doBqsr.bqsr, I can change the code to do that.
I'll also edit my comment with what you mentioned and make it clearer!
Nah, I think upon further reflection, keeping retries separate is probably the right move here, so yeah, you got it right. Nice work!
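
The behavior settled on in this thread (collapse shards, keep retries as their own rows) can be illustrated with a small sketch. It is not the PR's code: it uses plain dictionaries from the standard library instead of pandas, the helper name `summarize` is hypothetical, and the cost values are made up purely for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern name: "_shard-" followed by one or more digits.
SHARD_SUFFIX = re.compile(r"_shard-\d+")


def summarize(rows):
    """Sum totalCost per call name after dropping shard suffixes.

    rows: iterable of (callName, totalCost) pairs.
    Retry suffixes are left alone, so retries stay in separate rows.
    """
    totals = defaultdict(float)
    for call_name, total_cost in rows:
        totals[SHARD_SUFFIX.sub("", call_name)] += total_cost
    # Most expensive first, mirroring the descending sort in the script.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up costs for the four rows mentioned in the thread.
rows = [
    ("doBqsr.bqsr_shard-14_retry1", 0.25),
    ("doBqsr.bqsr_shard-14", 0.5),
    ("doBqsr.bqsr_shard-13_retry1", 0.25),
    ("doBqsr.bqsr_shard-13", 0.5),
]
print(summarize(rows))  # [('doBqsr.bqsr', 1.0), ('doBqsr.bqsr_retry1', 0.5)]
```

The four input rows end up as two output rows, one for `doBqsr.bqsr` and one for `doBqsr.bqsr_retry1`, which matches the grouping described in the comments above.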