📝 update READMEs and add some hints to config files
Henry committed May 30, 2024
1 parent 89d554b commit f78f47b
Showing 4 changed files with 193 additions and 113 deletions.
111 changes: 71 additions & 40 deletions README.md
@@ -1,24 +1,33 @@
# PIMMS
[![Read the Docs](https://img.shields.io/readthedocs/pimms)](https://readthedocs.org/projects/pimms/) [![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/RasmussenLab/pimms/ci.yaml)](https://github.com/RasmussenLab/pimms/actions) [![Documentation Status](https://readthedocs.org/projects/pimms/badge/?version=latest)](https://pimms.readthedocs.io/en/latest/?badge=latest)


PIMMS stands for Proteomics Imputation Modeling Mass Spectrometry
and is a homage to our dear British friends
who have been missing as part of the EU for far too long already
(Pimms is a British summer drink).

The publication has been accepted in Nature Communications
and the pre-print is available [on biorxiv](https://doi.org/10.1101/2023.01.12.523792).

> `PIMMS` was called `vaep` during development.
> Until the refactoring is complete, the imported package will be `vaep`.

We provide functionality as a Python package, an executable workflow, or simply in notebooks.

For any questions, please [open an issue](https://github.com/RasmussenLab/pimms/issues) or contact me directly.

## Getting started

The models can be used with the scikit-learn interface in the spirit of other scikit-learn imputers. You can try this using our tutorial in colab:

[![open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RasmussenLab/pimms/blob/HEAD/project/04_1_train_pimms_models.ipynb)

The PIMMS models use the scikit-learn interface and can be executed on the entire data or with a validation split to check the training process. In our experiments overfitting wasn't a big issue, but it's easy to check.
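To illustrate the fit/transform pattern (a minimal sketch using scikit-learn's `KNNImputer` as a stand-in, not the PIMMS models themselves; see the Colab tutorial for their exact classes):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer  # stand-in for a PIMMS imputer

# toy intensity table with missing values
df = pd.DataFrame(
    {"prot_A": [10.2, np.nan, 11.1], "prot_B": [9.8, 9.5, np.nan]},
    index=["sample_1", "sample_2", "sample_3"],
)

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), index=df.index, columns=df.columns)
print(imputed)  # missing values filled in, same shape and labels
```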

## Install Python package

For interactive use of the models provided in PIMMS, you can use our
[python package `pimms-learn`](https://pypi.org/project/pimms-learn/).
@@ -28,7 +37,7 @@
The interface is similar to scikit-learn.

```bash
pip install pimms-learn
```
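Since the PyPI distribution and the import name differ (see the note above), a quick sanity check after installing (a sketch using the standard library's metadata lookup):

```python
from importlib.metadata import version

import vaep  # installed from PyPI as pimms-learn, imported as vaep

print(version("pimms-learn"))  # distribution version that was installed
```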

Then you can use the models on a pandas DataFrame with missing values. You can try this in the tutorial on Colab by uploading your data:
[![open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RasmussenLab/pimms/blob/HEAD/project/04_1_train_pimms_models.ipynb)

## Notebooks as scripts using papermill
@@ -37,13 +46,15 @@
If you want to run a model on your prepared data, you can run notebooks prefixed
`01_`, i.e. [`project/01_*.ipynb`](https://github.com/RasmussenLab/pimms/tree/HEAD/project) after cloning the repository. Using jupytext, Python percent-format script versions of the notebooks are also saved.

```bash
# navigate to your desired folder
git clone https://github.com/RasmussenLab/pimms.git # get all notebooks
cd pimms/project # use the project folder as working directory
# pip install pimms-learn papermill # if not already installed
papermill 01_0_split_data.ipynb --help-notebook
papermill 01_1_train_vae.ipynb --help-notebook
```

> :warning: Mistyped argument names won't throw an error when using papermill, but a warning is printed on the console.

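Parameters can also be passed via papermill's Python API (a sketch; the parameter name `FN_INTENSITIES` is taken from the workflow config shown further below and may not match every notebook):

```python
import papermill as pm

# execute the data-split notebook with an overridden input file
pm.execute_notebook(
    "01_0_split_data.ipynb",
    "01_0_split_data_out.ipynb",
    parameters={"FN_INTENSITIES": "data/my_intensities.csv"},
)
```
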
## PIMMS comparison workflow and differential analysis workflow

@@ -55,15 +66,51 @@
It is built on top of

- the [Snakefile_v2.smk](https://github.com/RasmussenLab/pimms/blob/HEAD/project/workflow/Snakefile_v2.smk) (v2 of the imputation workflow), specified in one configuration file
- the [Snakefile_ald_comparison](https://github.com/RasmussenLab/pimms/blob/HEAD/project/workflow/Snakefile_ald_comparison.smk) workflow for differential analysis

The associated notebooks are indexed with `01_*` for the comparison workflow and `10_*` for the differential analysis workflow. The `project` folder can be copied separately to any location if the package is installed; it is a standalone folder. Its main folders are:

```bash
# project folder:
project
│ README.md # see description of notebooks and hints on execution in project folder
|---config # configuration files for experiments ("workflows")
|---data # data for experiments
|---runs # results of experiments
|---src # source code or binaries for some R packages
|---tutorials # some tutorials for libraries used in the project
|---workflow # snakemake workflows
```

To re-execute the entire workflow locally, have a look at the [configuration files](https://github.com/RasmussenLab/pimms/tree/HEAD/project/config/alzheimer_study) for the published Alzheimer workflow:

- [`config/alzheimer_study/config.yaml`](https://github.com/RasmussenLab/pimms/blob/HEAD/project/config/alzheimer_study/config.yaml)
- [`config/alzheimer_study/comparison.yaml`](https://github.com/RasmussenLab/pimms/blob/HEAD/project/config/alzheimer_study/comparison.yaml)

To execute that workflow, follow the Setup instructions below and run the following command in the project folder:

```bash
# being in the project folder
snakemake -s workflow/Snakefile_v2.smk --configfile config/alzheimer_study/config.yaml -p -c1 -n # one core/process, dry-run
snakemake -s workflow/Snakefile_v2.smk --configfile config/alzheimer_study/config.yaml -p -c2 # two cores/process, execute
# after imputation workflow, execute the comparison workflow
snakemake -s workflow/Snakefile_ald_comparison.smk --configfile config/alzheimer_study/comparison.yaml -p -c1
# If you want to build the website locally: https://www.rasmussenlab.org/pimms/
pip install '.[docs]'
pimms-setup-imputation-comparison -f project/runs/alzheimer_study/
pimms-add-diff-comp -f project/runs/alzheimer_study/ -sf_cp project/runs/alzheimer_study/diff_analysis/AD
cd project/runs/alzheimer_study/
sphinx-build -n --keep-going -b html ./ ./_build/
# open ./_build/index.html
```

## Setup workflow and development environment

### Setup comparison workflow

The core functionality is available as standalone software on PyPI under the name `pimms-learn`. However, running the entire snakemake workflow is enabled using
conda (or mamba) and pip to set up an analysis environment. For a detailed description of setting up
conda (or mamba), see the [instructions on setting up a virtual environment](https://github.com/RasmussenLab/pimms/blob/HEAD/docs/venv_setup.md).

Download the repository:

```bash
git clone https://github.com/RasmussenLab/pimms.git
```

@@ -80,14 +127,14 @@
```bash
mamba env create -n pimms -f environment.yml # faster, less than 5 mins
```

If you are on a Mac with an M1/M2 chip, or otherwise have issues using your accelerator (e.g. GPUs): install the pytorch dependencies first, then the rest of the environment:

### Install pytorch first (M-chips)

Check how to install pytorch for your system [here](https://pytorch.org/get-started).

- select the version compatible with your CUDA version if you have an NVIDIA GPU, or the build for Mac M-chips.

```bash
conda create -n vaep python=3.9 pip
conda activate vaep
# Follow instructions on https://pytorch.org/get-started
# conda env update -f environment.yml -n vaep # should now install the rest.
```

@@ -101,29 +148,17 @@
```bash
papermill 04_1_train_pimms_models.ipynb 04_1_train_pimms_models_test.ipynb # sec
python 04_1_train_pimms_models.py # just execute the code
```
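After installation you can verify which accelerator PyTorch sees (a quick check; the MPS query assumes PyTorch >= 1.12):

```python
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())         # NVIDIA GPUs
print("MPS available:", torch.backends.mps.is_available())  # Apple M-chips
```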

### Entire development installation


```bash
conda create -n pimms_dev -c pytorch -c nvidia -c fastai -c bioconda -c plotly -c conda-forge --file requirements.txt --file requirements_R.txt --file requirements_dev.txt
pip install -e . # other pip dependencies missing
snakemake --configfile config/single_dev_dataset/example/config.yaml -F -n
```

### Let Snakemake handle installation

If you only want to execute the workflow, you can use snakemake to build the environments for you:

> Only the v1 imputation Snakefile workflow supports this at the moment.

```bash
snakemake -p -c1 --configfile config/single_dev_dataset/example/config.yaml --use-conda -n # dry-run
snakemake -p -c1 --configfile config/single_dev_dataset/example/config.yaml --use-conda # execute with one core
```


### Troubleshooting

@@ -133,16 +168,16 @@
Troubleshoot your R installation by opening JupyterLab:

```bash
jupyter lab # open 01_1_train_NAGuideR.ipynb
```

## Run example

Change to the [`project` folder](./project) and see its [README](project/README.md).
You can subselect models by editing the [`config.yaml`](https://github.com/RasmussenLab/pimms/tree/HEAD/project/config/single_dev_dataset/proteinGroups_N50) file.

```bash
conda activate pimms # activate virtual environment
cd project # go to project folder
pwd # you should be in ./pimms/project
snakemake -c1 -p -n # dry-run demo workflow, potentially add --use-conda
snakemake -c1 -p
```

@@ -234,7 +269,3 @@
From the brief description in the table, the exact procedure is not always clear.
| MSIMPUTE_MNAR | msImpute | BIOCONDUCTOR | | Missing not at random algorithm using low rank approximation
| ~~grr~~ | DreamAI | - | Fails to install | Ridge regression
| ~~GMS~~ | GMSimpute | tar file | Fails on Windows | Lasso model


10 changes: 10 additions & 0 deletions project/config/alzheimer_study/README.md
@@ -0,0 +1,10 @@
# Alzheimer study configuration

For [`workflow/Snakefile_v2.smk`](https://github.com/RasmussenLab/pimms/blob/HEAD/project/workflow/Snakefile_v2.smk):

- [`config.yaml`](config.yaml)
- see comments in config for explanations.

For [`workflow/Snakefile_ald_comparison.smk`](https://github.com/RasmussenLab/pimms/blob/HEAD/project/workflow/Snakefile_ald_comparison.smk):

- [`comparison.yaml`](comparison.yaml)
148 changes: 76 additions & 72 deletions project/config/alzheimer_study/config.yaml
@@ -1,75 +1,79 @@
# config for Snakefile_v2.smk
config_split: runs/alzheimer_study/split.yaml # ! will be built by the workflow
config_train: runs/alzheimer_study/train_{model}.yaml # ! will be built by the workflow
folder_experiment: runs/alzheimer_study # folder to save the results
fn_rawfile_metadata: https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv # metadata file
cuda: False # use GPU?
file_format: csv # intermediate file formats
split_data: # for 01_0_split_data.ipynb -> check parameters
  FN_INTENSITIES: https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/proteome.csv
  sample_completeness: 0.5
  feat_prevalence: 0.25
  column_names:
    - protein groups
  index_col: 0
  meta_cat_col: _collection site
  meta_date_col: null # null if no date column, translated to None in Python
  frac_mnar: 0.25
  frac_non_train: 0.1
models:
  - Median: # name used for model with this configuration
      model: Median # model used
  - CF:
      model: CF # notebook: 01_1_train_{model}.ipynb will be 01_1_train_CF.ipynb
      latent_dim: 50
      batch_size: 1024
      epochs_max: 100
      sample_idx_position: 0
      cuda: False
      save_pred_real_na: True
  - DAE:
      model: DAE
      latent_dim: 10
      batch_size: 64
      epochs_max: 300
      hidden_layers: "64"
      sample_idx_position: 0
      cuda: False
      save_pred_real_na: True
  - VAE:
      model: VAE
      latent_dim: 10
      batch_size: 64
      epochs_max: 300
      hidden_layers: "64"
      sample_idx_position: 0
      cuda: False
      save_pred_real_na: True
  - KNN:
      model: KNN
      neighbors: 3
      file_format: csv
  - KNN5:
      model: KNN
      neighbors: 5
      file_format: csv
NAGuideR_methods:
  - BPCA
  - COLMEDIAN
  - IMPSEQ
  - IMPSEQROB
  - IRM
  - KNN_IMPUTE
  - LLS
  # - MICE-CART > 1h20min on GitHub small runner
  # - MICE-NORM ~ 1h on GitHub small runner
  - MINDET
  - MINIMUM
  - MINPROB
  - MLE
  - MSIMPUTE
  - MSIMPUTE_MNAR
  - PI
  - QRILC
  - RF
  - ROWMEDIAN
  # - SEQKNN # Error in x[od, ismiss, drop = FALSE]: subscript out of bounds
  - SVDMETHOD
  - TRKNN
  - ZERO
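
As the config comments note, YAML `null` is loaded as Python `None`; a minimal check (assuming the workflow reads the file with a standard YAML loader such as PyYAML):

```python
import yaml  # PyYAML

# YAML `null` becomes Python None when the config is loaded
cfg = yaml.safe_load("meta_date_col: null")
assert cfg["meta_date_col"] is None
print(cfg)  # {'meta_date_col': None}
```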