Dev #52
Conversation
create a new intermediate dump of 7,444 HeLa runs
- KNN dumps val and test data with the "args.model_key" specified in "config.yaml"
- update color palette for "unknown" models
- make performance_plots.py more robust
- training configs are created and saved on the fly (avoids separate model configs; collects all in one)

R methods are fixed, no customization so far. To customize, one would probably need to generate separate notebooks for each method.
- based on Lazar et al. (2016): values below a quantile -> MNAR, select from there
- the quantile is defined based on the overall fraction of missing values
- mix MCAR and MNAR
- format and clean up code in script
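The quantile-based mixing of MNAR and MCAR described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation; the function name and defaults are hypothetical, and the 25% MNAR share mentioned later in this PR is used as the default.

```python
import numpy as np

def mix_mnar_mcar(intensities, frac_missing=0.25, frac_mnar=0.25, rng=None):
    """Simulate missing values: a share of frac_mnar of the removed values
    is MNAR (drawn from below an intensity quantile, cf. Lazar et al. 2016),
    the rest is MCAR (drawn completely at random).

    Hypothetical sketch -- names and defaults are illustrative only.
    """
    if rng is None:
        rng = np.random.default_rng(42)
    x = np.asarray(intensities, dtype=float)
    n_total = int(frac_missing * x.size)
    n_mnar = int(frac_mnar * n_total)
    n_mcar = n_total - n_mnar
    # the quantile threshold is set by the overall fraction of missing values
    threshold = np.quantile(x, frac_missing)
    low = np.flatnonzero(x < threshold)  # candidates for MNAR
    mnar_idx = rng.choice(low, size=min(n_mnar, low.size), replace=False)
    remaining = np.setdiff1d(np.arange(x.size), mnar_idx)
    mcar_idx = rng.choice(remaining, size=n_mcar, replace=False)
    mask = np.zeros(x.size, dtype=bool)
    mask[mnar_idx] = True
    mask[mcar_idx] = True
    x_missing = x.copy()
    x_missing[mask] = np.nan
    return x_missing, mask
```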
- fix refactoring error -> select correct data
- only test CF, DAE and VAE functionally - select configs in example folder...
- both scripts (notebooks) - and library code
- msImpute - trKNN (from source) Add to workflow check.
- start grouping output for an easier overview (than only alphabetical)
- update deprecated functionality in pandas -> some scripts might have further deprecation warnings
- igraph installation via conda on the fly otherwise fails on Windows: https://stackoverflow.com/a/71711600/9684872
- reversed decoy sequence matches should be removed (it's only a few)
- grouping of plots was not reflected in Snakemake workflow
- aim: specify a long runtime for R jobs with a high max - run long-running jobs in parallel on one big node
- log file paths for submitted jobs added (should be unique) - -V: forward the current environment to the submitted job
- precursors from reversed protein sequences are removed from the evidence table - adapt code to use local information (yaml files)
- Colab uses pandas 2 and PyTorch 2 - the datetime_is_numeric parameter was removed from describe, see https://pandas.pydata.org/docs/whatsnew/v2.0.0.html
- DataFrame.append is deprecated (removed in pandas 2.0).
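The deprecated `append` calls can be replaced with `pd.concat`. A minimal sketch (the frames are illustrative, not from the repo):

```python
import pandas as pd

df1 = pd.DataFrame({"intensity": [1.0, 2.0]})
df2 = pd.DataFrame({"intensity": [3.0]})

# pandas < 2.0: df1.append(df2)  -- removed in pandas 2.0
combined = pd.concat([df1, df2], ignore_index=True)
```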
In case a tool, e.g. the Torque scheduler, creates log files, these can be requested per task (job): in the run_snakemake_cluster bash script, this is done using the -e and -o options.
- submit required parameters using the -v option, e.g.

  qsub run_snakemake_cluster.sh \
      -N snakemake_exp0 \
      -v configfile=path_to/config.yaml,prefix=exp0
- rename also protein groups and precursors (evidence) dumps - drop entries from reversed sequences in evidence files
- increase robustness of notebook by ignoring all-NA methods (here: IMPSEQ) - to consider: should 01_1_train_NAGuideR.ipynb throw an error if all predictions are NA?
- function loading and filtering data - add IDs making it possible to map precursor (evidence) IDs, peptide IDs and protein group IDs to each other. In each file the id column is always "id" (e.g. the "id" column of proteinGroups.txt corresponds to the Protein Groups IDs column in the other two files)
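The ID convention above can be made explicit by renaming each table's generic "id" column on load, so joins between tables are unambiguous. A hypothetical minimal sketch (real MaxQuant tables have many more columns, and the exact column names may differ):

```python
import pandas as pd

# hypothetical minimal tables: each table's own key column is called "id";
# other tables reference it under a descriptive name
protein_groups = pd.DataFrame({"id": [0, 1], "Fasta headers": ["P1", "P2"]})
peptides = pd.DataFrame({"id": [10, 11, 12], "Protein group IDs": [0, 0, 1]})

# rename the generic "id" column so the join key is unambiguous
protein_groups = protein_groups.rename(columns={"id": "Protein group IDs"})
merged = peptides.merge(protein_groups, on="Protein group IDs", how="left")
```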
- tbc: see what works. Next: merge with the version where parameters for Python-based models can be set in config.yaml
Filter reversed -> parts for collecting data will be factored out
🚧 prepare cluster execution - default: CPU execution, not accelerated (e.g. GPU) - job script for torque cluster - logs with notebook outputs
⬆️ remove constraints on pandas and pytorch -> faster setup on Google Colab - fewer constraints on versions
-> https://github.com/RasmussenLab/hela_qc_mnt_data commit link: RasmussenLab/hela_qc_mnt_data@f88586b - make minor adaptations needed due to deletions
🔥 move hela data collection code to new repo: https://github.com/RasmussenLab/hela_qc_mnt_data
- create individual logs for nb execution -> separate files on local execution -> documentation of how long training step took
- config dict has to be copied, otherwise the value None is not dumped as null:
  Before: column_names: "None"
  Now: column_names: null
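The None-vs-null behaviour can be illustrated with PyYAML: a Python None is dumped as YAML null, so copying the config dict before mutating it preserves the original None values in the dump. A minimal sketch with a hypothetical config:

```python
import copy
import yaml

# hypothetical config fragment
config = {"column_names": None}

# copy before any downstream mutation so the original None survives;
# PyYAML renders None as YAML null
dumped = yaml.dump(copy.deepcopy(config))
# dumped contains the line: column_names: null
```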
- with 50 samples, one or two features have fewer than 4 intensities in the training data split -> move the validation data for these features to the training split
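The reassignment described above can be sketched for long-format splits. This is an illustrative sketch only; the column names and frame layout are assumptions, not the repo's actual data model:

```python
import pandas as pd

def move_sparse_features_to_train(train, val, min_obs=4):
    """For features with fewer than min_obs intensities in the training
    split, move their validation rows back into training.

    Hypothetical sketch: expects long-format frames with a 'feature' column.
    """
    counts = train["feature"].value_counts()
    sparse = counts[counts < min_obs].index
    to_move = val["feature"].isin(sparse)
    train = pd.concat([train, val[to_move]], ignore_index=True)
    val = val[~to_move].reset_index(drop=True)
    return train, val
```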
- new dataset balancing GSIMP runtime against SEQKNN's need for a minimum number of features - run each method one by one (avoids race conditions when installing; only a problem on first-time setup)
- is GSIMP fast enough (227 -> ~1h)? - probably test GSIMP here once, then remove it from the "fast testing" workflow
remove warnings thrown by papermill
- update defaults to results from small grid search (smallest of top 3)
document also qsub command and update submission script
(add more models) - needs to be completed and cleaned up
- rather "bigger" batches with more training steps - update Fig. 2 plots generation to 25MNAR
Methods:
- added GSimp
- reduced the dimensionality of the example data in the GitHub Action so GSimp finishes (~1h) -> does not scale
- added the MNAR algorithm of MSIMPUTE

Data:
- ensure that training data has at least 4 samples (MSIMPUTE includes that check)
- formatted and updated workflow configs and declarations (v1 & v2); added script for command creation
- Figure 2: add custom selection of models to aggregate the best 5 models across several datasets (custom plotting for paper) - rotate performance labels - add NA if a model did not run (here: error or not finished within 24h)
- for large pep and evi, the top five are already the correct set
- for subselected models the colors were not reselected
- based on seaborn example of _ColorPalette
improve readability
- tables for Supp. Data - update plots (fontsize, support)
- use a share of 25% MNAR in the removed data - use a share of 25% MNAR in the comparison - update figures for publication (names, labels, fontsize, etc.)
- dump config
- 🐛 remove metadata fpath from train_X.yaml - also run the KNN comparison with workflow v2 with a share of 25% MNAR