✨ test Snakemake workflow with more recent Python versions (#66)
* ✨ test Snakemake workflow with more recent Python versions

* ✨ remove snakemake upper limit

* 🐛 bump plac version

related to: snakemake/snakemake#2276

* 📌 update actions, make artefacts unique

- check if local windows error with pandas can be reproduced in action (corr)

* 🎨 dump counts for histograms

- add for simulated missing values
- remove duplication
- 🐛 do not omit last bin
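A sketch of the off-by-one this fixes, with hypothetical values — `numpy.histogram` treats the final bin as closed on both sides, so counts equal to the last edge are kept, while a hand-rolled loop over half-open intervals silently drops them:

```python
import numpy as np

values = np.array([0.5, 1.5, 2.5, 3.0])
counts, edges = np.histogram(values, bins=3, range=(0, 3))
# bins are [0, 1), [1, 2), [2, 3] -- the last bin includes its right
# edge, so the maximum value 3.0 is counted and nothing is omitted
assert counts.sum() == len(values)
```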

* 🐛 ensure that some values are set to NA

If all values are higher than the default threshold, the assertion on L17 is not met. Make sure some NAs (missing values) are set.
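A minimal sketch of the idea, assuming a pandas Series of intensities; the function name `simulate_mnar` and the fallback of masking the smallest values are illustrative, not the workflow's actual code:

```python
import numpy as np
import pandas as pd

def simulate_mnar(intensities, threshold, min_n_missing=1):
    """Mask values below a threshold as NA; if none fall below it,
    force the smallest values to NA so downstream assertions hold."""
    masked = intensities.mask(intensities < threshold)
    if masked.isna().sum() < min_n_missing:
        # all values were above the threshold -> mask the smallest ones
        masked.loc[masked.nsmallest(min_n_missing).index] = np.nan
    return masked

s = pd.Series([25.0, 26.0, 27.0], index=list("abc"))
out = simulate_mnar(s, threshold=20.0)
assert out.isna().any()  # at least one simulated missing value
```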

* 🎨 write out corr and prepare for pandas 2.0

- see if this works also with pandas 1.5.3

* 🔧 test relaxing pandas restriction

* 🐛 drop batches with one sample for training DAE and VAE

- for creating the latent representation, now a new DataLoader is needed.
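The single-sample-batch problem can be sketched without torch; in PyTorch this is what `DataLoader(..., drop_last=True)` achieves for training, while the loader used to build the latent representation keeps every sample:

```python
def batch_sizes(n_samples, batch_size, drop_last):
    """Sizes of the batches a DataLoader-style loader would yield."""
    n_full, rest = divmod(n_samples, batch_size)
    sizes = [batch_size] * n_full
    if rest and not drop_last:
        sizes.append(rest)
    return sizes

# a trailing batch of size 1 breaks e.g. BatchNorm in training mode,
# so it is dropped for training -- but kept when creating the latent
# representation, where every sample needs an embedding
assert batch_sizes(9, 4, drop_last=True) == [4, 4]
assert batch_sizes(9, 4, drop_last=False) == [4, 4, 1]
```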

* ✨ split up large global environment

- separate environments for PIMMS models and R-based models
- global environment should still work

* 🐛 test if adding jupyter is sufficient to install further packages in R session

- only execute one job at a time in retry to see errors better

* 🐛 fix sampling to make it compatible with python >=3.11
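The Python >=3.11 incompatibility is likely the removal of `random.sample()`'s deprecated support for sets; a hedged sketch of the fix, with hypothetical IDs:

```python
import random

protein_ids = {"P12345", "Q67890", "O54321"}  # hypothetical set of IDs

# Python 3.11 removed the deprecated support for sampling from a set:
# random.sample(protein_ids, 2) now raises TypeError. Convert to a
# sequence first (sorted here for reproducibility with a fixed seed).
random.seed(42)
drawn = random.sample(sorted(protein_ids), k=2)
assert set(drawn) <= protein_ids
```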

* ⬆️ remove pytorch upper dependency

* ✅ Test workflow v2 on Alzheimer dataset

- once this passes, add ALD analysis to website (for a reasonable subset of models)
- maybe only showcase PIMMS models with a handful of other models

* 🐛 update path to execute run, speed-up

- also remove two slowest models

* 🎨 hide code in rendered notebooks of workflow, sort imports

- hide code cells for generated report
- isort imports

* ✨ Functionality for plot source data (ALD study)

- add some functionality required to collect source data for reporting on saved figures.

* ✨ Run differential analysis workflow in CI on Alzheimer data

- several adaptations to the slightly different design between the ALD and Alzheimer data

* 🐛 specify folder_experiment from global space

- ... and not as wildcard

* 🎨 rename Snakefile_v2 to Snakefile_v2.smk

- Snakefiles with uncommon names should have a file ending marking them as Snakefiles.

* ✨ script to build website (execution)

- execution should work, but subfolders need their own index.rst
- need to adapt script for updating main index.rst

See if everything runs for now.

* 🐛 do not exclude diff analysis folder in conf.py

* 🎨🐛 make a strict hierarchy of headings per document

- mapping titles in sphinx (cross-referencing) otherwise does not work
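A strict hierarchy means each file uses one underline style per heading level, nested consistently; for example (the specific underline characters are a convention, Sphinx only requires consistency within a file):

```rst
Comparison workflow
===================

Differential analysis
---------------------

Figures
~~~~~~~
```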

* 🎨 collapse code in published notebooks

- e.g. for easier inspection of the generated report

* 🎨 annotate notebooks

add some comments and streamline cells.

* ✨ Test tutorial on colab

* 🐛🎨 format and briefly check colab workflow on dev branch

* 🎨 hide more inputs, downscale tutorial runner

* 📝 Update README

- 🐛 use larger image to test tutorial on colab

* 📝 update READMEs and add some hints to config files

* 📝🎨 save some ad-hoc scripts used during revisions, add and clean up notebook list

* 🐛 go back to old config indentation (and model configuration)

- rerun in codespace for inspection

* 🐛 fix issue having same model with 2 configurations

- had to set model id ("model_key") as index
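A sketch of the fix with hypothetical metrics — indexing on the model type alone collapses two configurations of the same model, while the user-chosen `model_key` stays unique:

```python
import pandas as pd

# hypothetical metrics table: two configurations of the same DAE model
metrics = pd.DataFrame(
    {
        "model_key": ["DAE_default", "DAE_wide", "VAE"],
        "model": ["DAE", "DAE", "VAE"],
        "MAE": [0.21, 0.19, 0.20],
    }
).set_index("model_key")

assert metrics.index.is_unique  # model_key disambiguates the configs
assert not metrics.set_index(metrics["model"]).index.is_unique
```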

* 📝✨ Allow users to download large HeLa protein groups dataset easily
Henry Webel authored May 31, 2024
1 parent 8706c61 commit 7377441
Showing 71 changed files with 5,178 additions and 2,182 deletions.
9 changes: 4 additions & 5 deletions .github/workflows/ci.yaml
@@ -20,13 +20,13 @@ jobs:
"macos-13",
# "windows-latest" # rrcovNA cannot be build from source on windows-server
]
python-version: ["3.8"]
python-version: ["3.8", "3.9", "3.10"]
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set up Miniconda
# ! change action https://github.com/mamba-org/setup-micromamba
uses: conda-incubator/setup-miniconda@v2
uses: conda-incubator/setup-miniconda@v3
with:
miniforge-variant: Mambaforge
# miniforge-version: latest
@@ -82,9 +82,9 @@ jobs:
snakemake -p -c4 -k --configfile config/single_dev_dataset/example/config.yaml
- name: Archive results
# https://github.com/actions/upload-artifact
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
with:
name: example-workflow-results-${{ matrix.os }}
name: ${{ matrix.os }}-${{ matrix.python-version }}-example-workflow-results
path: |
project/runs/example/
environment.yml
@@ -114,7 +114,6 @@ jobs:
- name: Run pytest
run: pytest .


publish:
name: Publish package
if: startsWith(github.ref, 'refs/tags')
55 changes: 55 additions & 0 deletions .github/workflows/ci_workflow.yaml
@@ -0,0 +1,55 @@
name: run workflow with conda envs
on:
push:
branches: [main, dev]
pull_request:
branches: [main, dev]
release:
# schedule:
# - cron: '0 2 * * 3,6'
jobs:
run-integration-tests-with-conda-install:
runs-on: ${{ matrix.os }}
defaults:
run:
shell: bash -el {0}
strategy:
fail-fast: false
matrix:
os: [
"ubuntu-latest",
"macos-13",
# "windows-latest" # rrcovNA cannot be build from source on windows-server
]
python-version: ["3.10"]
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set up Miniconda
# ! change action https://github.com/mamba-org/setup-micromamba
uses: conda-incubator/setup-miniconda@v3
with:
miniforge-variant: Mambaforge
use-mamba: true
channel-priority: disabled
python-version: ${{ matrix.python-version }}
environment-file: snakemake_env.yml
activate-environment: snakemake
auto-activate-base: true
- name: inspect-conda-environment
run: |
conda info
conda list
- name: Dry-run workflow
run: |
cd project
snakemake -p -c1 --configfile config/single_dev_dataset/example/config.yaml -n --use-conda
- name: Run demo workflow (integration test)
continue-on-error: true
run: |
cd project
snakemake -p -c4 -k --configfile config/single_dev_dataset/example/config.yaml --use-conda
- name: Run demo workflow again (in case of installation issues)
run: |
cd project
snakemake -p -c1 -k --configfile config/single_dev_dataset/example/config.yaml --use-conda
26 changes: 26 additions & 0 deletions .github/workflows/test_pkg_on_colab.yaml
@@ -0,0 +1,26 @@
name: Test that tutorial runs on latest colab image

on:
push:
branches: [dev]
pull_request:
branches: [main, dev]
schedule:
- cron: '0 2 3 * *'

jobs:
test-tutorial-on-colab:
name: Test tutorial on latest colab image
runs-on: ubuntu-latest-4core # increase disk space
# https://console.cloud.google.com/artifacts/docker/colab-images/europe/public/runtime
container:
image: europe-docker.pkg.dev/colab-images/public/runtime:latest
steps:
- uses: actions/checkout@v4
- name: Install pimms-learn and papermill
run: |
python3 -m pip install pimms-learn papermill
- name: Run tutorial
run: |
cd project
papermill 04_1_train_pimms_models.ipynb 04_1_train_pimms_models_output.ipynb
27 changes: 17 additions & 10 deletions .github/workflows/workflow_website.yaml
@@ -1,4 +1,4 @@
name: Build workflow website on smaller development dataset (for protein groups)
name: Build workflow website on public Alzheimer dataset (for protein groups)
on:
pull_request:
branches: [main, dev]
@@ -29,32 +29,39 @@ jobs:
activate-environment: vaep
auto-activate-base: true
# auto-update-conda: true
- name: Dry-run workflow
run: |
cd project
snakemake -s workflow/Snakefile_v2.smk --configfile config/alzheimer_study/config.yaml -p -c1 -n
- name: Run demo workflow (integration test)
continue-on-error: true
run: |
cd project
snakemake -p -c1 -n
snakemake -p -c4 -k
snakemake -s workflow/Snakefile_v2.smk --configfile config/alzheimer_study/config.yaml -p -c4 -k
- name: Run demo workflow again (in case of installation issues)
run: |
cd project
snakemake -p -c1 -n
snakemake -p -c4 -k
snakemake -s workflow/Snakefile_v2.smk --configfile config/alzheimer_study/config.yaml -p -c4 -k
- name: Run differential analysis workflow
run: |
cd project
snakemake -s workflow/Snakefile_ald_comparison.smk --configfile config/alzheimer_study/comparison.yaml -p -c4
- name: Install website dependencies
run: |
pip install .[docs]
- name: Build imputation comparison website
run: |
pimms-setup-imputation-comparison -f project/runs/dev_dataset_small/proteinGroups_N50/
cd project/runs/dev_dataset_small/proteinGroups_N50/
pimms-setup-imputation-comparison -f project/runs/alzheimer_study/
pimms-add-diff-comp -f project/runs/alzheimer_study/ -sf_cp project/runs/alzheimer_study/diff_analysis/AD
cd project/runs/alzheimer_study/
sphinx-build -n --keep-going -b html ./ ./_build/
- name: Archive results
uses: actions/upload-artifact@v3
with:
name: example-workflow-results-${{ matrix.os }}
path: project/runs/dev_dataset_small/proteinGroups_N50/_build/
name: alzheimer-study
path: project/runs/alzheimer_study/
- name: Publish workflow as website
uses: peaceiris/actions-gh-pages@v4
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: project/runs/dev_dataset_small/proteinGroups_N50/_build/
publish_dir: project/runs/alzheimer_study/_build/
121 changes: 79 additions & 42 deletions README.md
@@ -1,24 +1,33 @@
# PIMMS
[![Read the Docs](https://img.shields.io/readthedocs/pimms)](https://readthedocs.org/projects/pimms/) [![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/RasmussenLab/pimms/ci.yaml)](https://github.com/RasmussenLab/pimms/actions)
[![Read the Docs](https://img.shields.io/readthedocs/pimms)](https://readthedocs.org/projects/pimms/) [![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/RasmussenLab/pimms/ci.yaml)](https://github.com/RasmussenLab/pimms/actions) [![Documentation Status](https://readthedocs.org/projects/pimms/badge/?version=latest)](https://pimms.readthedocs.io/en/latest/?badge=latest)


PIMMS stands for Proteomics Imputation Modeling Mass Spectrometry
and is a homage to our dear British friends
who are missing as part of the EU for far too long already
(Pimms is also a British summer drink).
(Pimms is a British summer drink).

The pre-print is available [on biorxiv](https://doi.org/10.1101/2023.01.12.523792).
The publication is accepted in Nature Communications
and the pre-print is available [on biorxiv](https://doi.org/10.1101/2023.01.12.523792).

> `PIMMS` was called `vaep` during development.
> Before entire refactoring has to been completed the imported package will be
`vaep`.
> Until the entire refactoring is completed, the imported package will be `vaep`.
We provide functionality as a python package, an excutable workflow and notebooks.
We provide functionality as a Python package, an executable workflow, or simply in notebooks.

The models can be used with the scikit-learn interface in the spirit of other scikit-learn imputers. You can try this in colab. [![open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RasmussenLab/pimms/blob/HEAD/project/04_1_train_pimms_models.ipynb)
For any questions, please [open an issue](https://github.com/RasmussenLab/pimms/issues) or contact me directly.

## Getting started

## Python package
The models can be used with the scikit-learn interface in the spirit of other scikit-learn imputers. You can try this using our tutorial in colab:

[![open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RasmussenLab/pimms/blob/HEAD/project/04_1_train_pimms_models.ipynb)

The PIMMS models follow the scikit-learn interface. They can be fitted on the entire data or with a validation split to monitor the training process.
In our experiments overfitting wasn't a big issue, but it's easy to check.

## Install Python package

For interactive use of the models provided in PIMMS, you can use our
[python package `pimms-learn`](https://pypi.org/project/pimms-learn/).
@@ -28,7 +37,7 @@ The interface is similar to scikit-learn.
pip install pimms-learn
```

Then you can use the models on a pandas DataFrame with missing values. Try this in the tutorial on Colab:
Then you can use the models on a pandas DataFrame with missing values. You can try this in the tutorial on Colab by uploading your data:
[![open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RasmussenLab/pimms/blob/HEAD/project/04_1_train_pimms_models.ipynb)

## Notebooks as scripts using papermill
@@ -37,27 +46,71 @@ If you want to run a model on your prepared data, you can run notebooks prefixed
`01_`, i.e. [`project/01_*.ipynb`](https://github.com/RasmussenLab/pimms/tree/HEAD/project) after cloning the repository. Using jupytext, Python percent-format script versions
are saved as well.

```
```bash
# navigate to your desired folder
git clone https://github.com/RasmussenLab/pimms.git # get all notebooks
cd project # project folder as pwd
# pip install pimms-learn papermill # if not already installed
papermill 01_0_split_data.ipynb --help-notebook
papermill 01_1_train_vae.ipynb --help-notebook
```
> :warning: Mistyped argument names won't throw an error when using papermill, but a warning is printed on the console thanks to my contributions:)
> Mistyped argument names won't throw an error when using papermill; a warning is printed on the console instead.
## PIMMS comparison workflow
## PIMMS comparison workflow and differential analysis workflow

The PIMMS comparison workflow is a snakemake workflow that runs all selected PIMMS models and R-models on
a user-provided dataset and compares the results. An example for the smaller HeLa development dataset on the
a user-provided dataset and compares the results. An example for a publicly available Alzheimer dataset on the
protein groups level is re-built regularly and available at: [rasmussenlab.org/pimms](https://www.rasmussenlab.org/pimms/)

It is built on top of
- the [Snakefile_v2.smk](https://github.com/RasmussenLab/pimms/blob/HEAD/project/workflow/Snakefile_v2.smk) (v2 of the imputation workflow), specified in one configuration file
- the [Snakefile_ald_comparison](https://github.com/RasmussenLab/pimms/blob/HEAD/project/workflow/Snakefile_ald_comparison.smk) workflow for differential analysis

The associated notebooks are indexed with `01_*` for the comparison workflow and `10_*` for the differential analysis workflow. The `project` folder can be copied separately to any location if the package is installed; it is a standalone folder. Its main folders are:

```bash
# project folder:
project
│ README.md # see description of notebooks and hints on execution in project folder
|---config # configuration files for experiments ("workflows")
|---data # data for experiments
|---runs # results of experiments
|---src # source code or binaries for some R packages
|---tutorials # some tutorials for libraries used in the project
|---workflow # snakemake workflows
```

To re-execute the entire workflow locally, have a look at the [configuration files](https://github.com/RasmussenLab/pimms/tree/HEAD/project/config/alzheimer_study) for the published Alzheimer workflow:

- [`config/alzheimer_study/config.yaml`](https://github.com/RasmussenLab/pimms/blob/HEAD/project/config/alzheimer_study/config.yaml)
- [`config/alzheimer_study/comparison.yaml`](https://github.com/RasmussenLab/pimms/blob/HEAD/project/config/alzheimer_study/comparison.yaml)

To execute that workflow, follow the Setup instructions below and run the following command in the project folder:

```bash
# being in the project folder
snakemake -s workflow/Snakefile_v2.smk --configfile config/alzheimer_study/config.yaml -p -c1 -n # one core/process, dry-run
snakemake -s workflow/Snakefile_v2.smk --configfile config/alzheimer_study/config.yaml -p -c2 # two cores/process, execute
# after imputation workflow, execute the comparison workflow
snakemake -s workflow/Snakefile_ald_comparison.smk --configfile config/alzheimer_study/comparison.yaml -p -c1
# If you want to build the website locally: https://www.rasmussenlab.org/pimms/
pip install .[docs]
pimms-setup-imputation-comparison -f project/runs/alzheimer_study/
pimms-add-diff-comp -f project/runs/alzheimer_study/ -sf_cp project/runs/alzheimer_study/diff_analysis/AD
cd project/runs/alzheimer_study/
sphinx-build -n --keep-going -b html ./ ./_build/
# open ./_build/index.html
```

## Setup workflow and development environment

### Setup comparison workflow

The core functionality is available as standalone software on PyPI under the name `pimms-learn`. However, running the entire snakemake workflow is enabled using
conda (or mamba) and pip to set up an analysis environment. For a detailed description of setting up
conda (or mamba), see [instructions on setting up a virtual environment](https://github.com/RasmussenLab/pimms/blob/HEAD/docs/venv_setup.md).

Download the repository
Download the repository:

```
git clone https://github.com/RasmussenLab/pimms.git
@@ -74,14 +127,14 @@ mamba env create -n pimms -f environment.yml # faster, less then 5mins

If on Mac M1/M2, or otherwise having issues using your accelerator (e.g. GPUs): Install the pytorch dependencies first, then the rest of the environment:

### Install development dependencies
### Install pytorch first (M-chips)

Check how to install pytorch for your system [here](https://pytorch.org/get-started).

- select the version compatible with your CUDA version if you have an NVIDIA GPU or a Mac M-chip.

```bash
conda create -n vaep python=3.8 pip
conda create -n vaep python=3.9 pip
conda activate vaep
# Follow instructions on https://pytorch.org/get-started
# conda env update -f environment.yml -n vaep # should not install the rest.
@@ -95,29 +148,17 @@ papermill 04_1_train_pimms_models.ipynb 04_1_train_pimms_models_test.ipynb # sec
python 04_1_train_pimms_models.py # just execute the code
```

### Entire development installation


```bash
conda create -n pimms_dev -c pytorch -c nvidia -c fastai -c bioconda -c plotly -c conda-forge --file requirements.txt --file requirements_R.txt --file requirements_dev.txt
pip install -e . # other pip dependencies missing
snakemake --configfile config/single_dev_dataset/example/config.yaml -F -n
```

or if you want to update an existing environment
### Let Snakemake handle installation

If you only want to execute the workflow, you can use snakemake to build the environments for you:

```
conda update -c defaults -c conda-forge -c fastai -c bioconda -c plotly --file requirements.txt --file requirements_R.txt --file requirements_dev.txt
```
> The Snakefile workflow for imputation v1 only supports that at the moment.
or using the environment.yml file (can fail on certain systems)

```
conda env create -f environment.yml
```bash
snakemake -p -c1 --configfile config/single_dev_dataset/example/config.yaml --use-conda -n # dry-run
snakemake -p -c1 --configfile config/single_dev_dataset/example/config.yaml --use-conda # execute with one core
```


### Troubleshooting

Troubleshoot your R installation by opening JupyterLab:
@@ -127,16 +168,16 @@ Trouble shoot your R installation by opening jupyter lab
jupyter lab # open 01_1_train_NAGuideR.ipynb
```

## Run an analysis
## Run example

Change to the [`project` folder](./project) and see it's [README](project/README.md)
You can subselect models by editing the config file: [`config.yaml`](project/config/single_dev_dataset/proteinGroups_N50/config.yaml) file.
You can subselect models by editing the config file: [`config.yaml`](https://github.com/RasmussenLab/pimms/tree/HEAD/project/config/single_dev_dataset/proteinGroups_N50).

```
conda activate pimms # activate virtual environment
cd project # go to project folder
pwd # should be in ./pimms/project
snakemake -c1 -p -n # dryrun demo workflow
snakemake -c1 -p -n # dry-run demo workflow, potentially add --use-conda
snakemake -c1 -p
```

@@ -228,7 +269,3 @@ From the brief description in the table the exact procedure is not always clear.
| MSIMPUTE_MNAR | msImpute | BIOCONDUCTOR | | Missing not at random algorithm using low rank approximation
| ~~grr~~ | DreamAI | - | Fails to install | Ridge regression
| ~~GMS~~ | GMSimpute | tar file | Fails on Windows | Lasso model


## Build status
[![Documentation Status](https://readthedocs.org/projects/pimms/badge/?version=latest)](https://pimms.readthedocs.io/en/latest/?badge=latest)
