diff --git a/lectures/putting-it-together/NBISweden_project_template.png b/lectures/putting-it-together/images/NBISweden_project_template.png
similarity index 100%
rename from lectures/putting-it-together/NBISweden_project_template.png
rename to lectures/putting-it-together/images/NBISweden_project_template.png
diff --git a/lectures/putting-it-together/calvin_hobbes_past_corresponding.png b/lectures/putting-it-together/images/calvin_hobbes_past_corresponding.png
similarity index 100%
rename from lectures/putting-it-together/calvin_hobbes_past_corresponding.png
rename to lectures/putting-it-together/images/calvin_hobbes_past_corresponding.png
diff --git a/lectures/putting-it-together/my_project_repo.png b/lectures/putting-it-together/images/my_project_repo.png
similarity index 100%
rename from lectures/putting-it-together/my_project_repo.png
rename to lectures/putting-it-together/images/my_project_repo.png
diff --git a/lectures/putting-it-together/my_project_repo_2.png b/lectures/putting-it-together/images/my_project_repo_2.png
similarity index 100%
rename from lectures/putting-it-together/my_project_repo_2.png
rename to lectures/putting-it-together/images/my_project_repo_2.png
diff --git a/lectures/putting-it-together/whats_in_it_for_me.png b/lectures/putting-it-together/images/whats_in_it_for_me.png
similarity index 100%
rename from lectures/putting-it-together/whats_in_it_for_me.png
rename to lectures/putting-it-together/images/whats_in_it_for_me.png
diff --git a/lectures/putting-it-together/putting-it-together.Rmd b/lectures/putting-it-together/putting-it-together.Rmd
deleted file mode 100644
index 553b22c0..00000000
--- a/lectures/putting-it-together/putting-it-together.Rmd
+++ /dev/null
@@ -1,322 +0,0 @@
----
-title: "Putting it all together"
-subtitle: "Tools for Reproducible Research NBIS course"
-output:
-  xaringan::moon_reader:
-    self-contained: true
-    seal: false
-    css: ["default", "../template.css"]
-    nature:
-      slideNumberFormat: ""
----
-
-```{r Setup, include = FALSE}
-# Chunk options
-knitr::opts_chunk$set(include = FALSE,
-                      echo = FALSE)
-```
-
-layout: true
-
-
-
----
-
-class: center, middle
-.HUGE[Putting it all together]
-
----
-
-class: center, middle
-
-Take control of your research project
-by making its different components reproducible
-
-
-
----
-
-class: center, middle
-
-By working reproducibly you will also make your life a lot easier!
-
-
-
----
-
-# What have we learned?
-
-
-
-* How to use the version control system .green[Git] to track changes to code
-* How to use the package and environment manager .green[Conda]
-* How to use the workflow managers .green[Snakemake] and .green[Nextflow]
-* How to use .green[R Markdown] and .green[Jupyter] to generate automated reports and to document your analyses
-* How to use .green[Docker] and .green[Singularity] to distribute containerized
-  computational environments
-
----
-
-# Divide your work into distinct projects
-
---
-
-* Keep all .green[files] needed to go from raw data to final results in a dedicated directory
-
---
-
-* Use relevant .green[subdirectories]
-
---
-
-* Use .green[Git] to version control your projects
-
---
-
-* Do not store data and results/output in your Git repository
-
---
-
-* When in doubt, commit often instead of seldom
-
----
-
-# Find your own project structure
-
-.pull-left[
-.small[An example .green[Snakemake]-based project:]
-```no-highlight
-project/
- ├── code/
- ├── data/
- │   ├── meta/
- │   ├── raw_external/
- │   └── raw_internal/
- ├── doc/
- ├── intermediate/
- ├── logs/
- ├── notebooks/
- │   └── Untitled.ipynb
- ├── results/
- │   ├── figures/
- │   ├── reports/
- │   └── tables/
- ├── scratch/
- ├── .gitignore
- ├── config.yml
- ├── environment.yml
- ├── Dockerfile
- ├── README.md
- └── Snakefile
-```
-]
-
-.pull-right[
-.small[An example .green[Nextflow]-based project:]
-```no-highlight
-project/
- ├── bin/
- │   └── report.qmd
- ├── data/
- │   └── metadata.csv
- ├── doc/
- ├── env/
- │   ├── Dockerfile
- │   └── environment.yml
- ├── results/
- ├── .gitignore
- ├── main.nf
- ├── nextflow.config
- └── README.md
-```
-]
-
- * https://github.com/NBISweden/project_template
- * https://github.com/fasterius/nbis-support-template
- * https://github.com/snakemake-workflows/snakemake-workflow-template
-
----
-
-# Treasure your data
-
---
-
-* Keep your input data .green[read-only] - consider it static
-
---
-
-* Don't create different versions of the input data - write a .green[script],
-  .green[R Markdown] document, .green[Jupyter] notebook or a .green[Snakemake]
-  / .green[Nextflow] workflow if you need to pre-process your input data so that
-  the steps can be recreated
-
---
-
-* .green[Backup]! Keep redundant copies in different physical locations
-
---
-
-* Upload your raw data as soon as possible to a .green[public data repository]
-
----
-
-# Organize your coding
-
---
-
-* Avoid generating files .green[interactively] or doing things .green[by hand]
-
-  - There is no way to track how they were made
-
---
-
-* Write .green[scripts], .green[R Markdown] documents, .green[Jupyter]
-  notebooks or .green[Snakemake] / .green[Nextflow] workflows for reproducible
-  results to connect raw data to final results
-
---
-
-* Keep the .green[parameters] separate (*e.g.* at top of file or in a separate configuration file)
-
----
-
-# What is reasonable for your project?
-
---
-
-.pull-left[
-
-.green[Minimal]: write code in a reproducible way and track your environment
-]
-
---
-
-.pull-right[
-* Track your projects with a .green[Git] repository each; publish code along
-  with your results on *e.g.*
-
-* Use .green[Conda] to install software in environments that can be exported and
-  installed on a different system; also publish your `environment.yml` file
-  along with your code
-]
-
----
-
-# What is reasonable for your project?
-
-.pull-left[
-
-.green[Minimal]: write code in a reproducible way and track your environment
-
-.green[Good]: structure and document your code with notebooks
-]
-
---
-
-.pull-right[
-* Use .green[R Markdown] or .green[Jupyter] notebooks to better keep track of
-  and document your code
-]
-
----
-
-# What is reasonable for your project?
-
-.pull-left[
-
-.green[Minimal]: write code in a reproducible way and track your environment
-
-.green[Good]: structure and document your code with notebooks
-
-.green[Great]: track the _full_ environment and connect your code in a workflow
-]
-
---
-
-.pull-right[
-* Go one step beyond in tracking your environment using .green[Docker] and/or
-  .green[Singularity/Apptainer]
-
-* Convert your code into a .green[Snakemake] / .green[Nextflow] workflow
-]
-
----
-
-# Alternatives
-
---
-
-.green[Version control]
-
-* **Git** – Widely used and a lot of tools available + GitHub/BitBucket.
-* **Mercurial** – Distributed model just like Git, close to sourceforge.
-* **Subversion** – Centralized model unlike git/mercurial; no local repository on your computer and somewhat easier to use.
-
----
-
-# Alternatives
-
-.green[Environment / package managers]
-
-* **Conda** – General purpose environment and package manager. Community-hosted collections of tools at bioconda or conda-forge.
-* **Pip** – Package manager for Python, has a large repository at pypi.
-* **Apt/yum/brew** – Native package managers for different OS. Integrated in OS and might deal with *e.g.* update notifications better.
-* **Virtualenv** – Environment manager used to set up semi-isolated python environments.
-
----
-
-# Alternatives
-
-.green[Workflow managers]
-
-* **Snakemake** – Based on Python, easily understandable format, relies on file names.
-* **Nextflow** – Based on Groovy, uses data pipes rather than file names to construct the workflow.
-* **Make** – Used in software development and has been around since the 70s. Flexible but notoriously obscure syntax.
-* **Galaxy** - attempts to make computational biology accessible to researchers without programming experience by using a GUI.
-
----
-
-# Alternatives
-
-.green[Literate programming]
-
-* **Jupyter** – Create and share notebooks in a variety of languages and formats
-  by using a web browser.
-* **R Markdown** – Developed by Posit (previously Rstudio), focused on
-  generating high-quality documents.
-* **Quarto** - Dveloped by Posit (previously RStudio), command-line tool focused
-  on generating high-quality documents in a language-agnostic way
-* **Zeppelin** – Developed by Apache. Closely integrated with Spark for
-  distributed computing and Big Data applications.
-* **Beaker** – Newcomer based on Ipython, just as Jupyter. Has a focus on
-  integrating multiple languages in the same notebook.
-
----
-
-# Alternatives
-
-.green[Containerization / virtualization]
-
-* **Docker** – Used for packaging and isolating applications in containers. Dockerhub allows for convenient sharing. Requires root access.
-* **Singularity/Apptainer** – Simpler Docker alternative geared towards high performance computing. Does not require root.
-* **Podman** - open source daemonless container tool similar to docker in many regards
-* **Shifter** – Similar ambition as Singularity, but less focus on mobility and more on resource management.
-* **VirtualBox/VMWare** – Virtualization rather than containerization. Less lightweight, but no reliance on host kernel.
-
----
-
-class: center, middle
-
-# "What's in it for me?"
-
-
-
----
-
-# NBIS Bioinformatics drop-in
-
-Any questions related to reproducible research tools and concepts? Talk to an NBIS expert!
-
-* Online (.green[Zoom])
-* Every .green[Tuesday, 14.00-15.00] (except public holidays)
-* Check .green[www.nbis.se/events] for Zoom link and more info
diff --git a/lectures/putting-it-together/putting-it-together.html b/lectures/putting-it-together/putting-it-together.html
new file mode 100644
index 00000000..e278a234
--- /dev/null
+++ b/lectures/putting-it-together/putting-it-together.html
@@ -0,0 +1,2590 @@

Putting it all together

Working reproducibly will make your research life a lot easier!

[Image: images/whats_in_it_for_me.png]

Take control of your research by making its different components reproducible

[Image: ../../pages/images/reproducibility_overview.png]

What have we learned?

[Image: ../../pages/images/reproducibility_overview_with_logos.png]

• How to use the version control system Git to track changes to code
• How to use the package and environment manager Conda
• How to use the workflow managers Snakemake and Nextflow
• How to use Quarto and Jupyter to generate automated reports and to document your analyses
• How to use Docker and Apptainer to distribute containerized computational environments

Divide your work into distinct projects

• Keep all files needed to go from raw data to final results in a dedicated directory
• Use relevant subdirectories
• Use Git to version control your projects
• Do not store data and results/output in your Git repository
• When in doubt, commit often rather than not

Find your own project structure

For example:

code/             Code needed to go from input files to final results
data/             Raw data - this should never be edited
doc/              Documentation of the project
env/              Environment-related files, e.g. Conda environments or Dockerfiles
results/          Output from workflows and analyses
README.md         Project description and instructions


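More examples:

• https://github.com/NBISweden/project_template
• https://github.com/fasterius/nbis-support-template
• https://github.com/snakemake-workflows/snakemake-workflow-template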
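
The example layout above could be set up with a short script. This is only a minimal sketch: the function name `create_project_skeleton` and the README placeholder text are assumptions made for illustration and are not part of the course material.

```python
from pathlib import Path

# Directory names follow the example layout above; adjust them to your own project.
SUBDIRS = ["code", "data", "doc", "env", "results"]


def create_project_skeleton(root):
    """Create the example subdirectories and a README.md under `root`.

    Hypothetical helper for illustration; returns the sorted entry names.
    """
    root = Path(root)
    for name in SUBDIRS:
        # exist_ok=True makes the function safe to re-run on an existing project
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        # Placeholder text only; replace with a real project description
        readme.write_text("# Project description and instructions\n")
    return sorted(p.name for p in root.iterdir())
```

Calling `create_project_skeleton("my_project")` a second time leaves existing files untouched, so the script can be kept in the repository and re-run safely.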

Treasure your data

• Keep your raw data read-only and static
• Don’t create different versions of the input data - write a script, Quarto document, Jupyter notebook or a Snakemake / Nextflow workflow if you need to pre-process your input data so that the steps can be recreated
• Backup! Keep redundant copies in different physical locations
• Upload your raw data as soon as possible to a public data repository
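
One way to keep raw data read-only is to strip the write permission from the files once they are in place. The sketch below assumes a POSIX-style file system; the helper name `make_read_only` is hypothetical and not part of the course material.

```python
import stat
from pathlib import Path


def make_read_only(data_dir):
    """Remove write permission from every file below `data_dir`.

    Hypothetical helper for illustration; returns the files it changed.
    """
    changed = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            mode = path.stat().st_mode
            # Clear the write bits for user, group and others
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            changed.append(path)
    return changed
```

Directories are left writable here so that new raw files can still be added; only existing files are protected against accidental edits.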

Organize your coding

• Avoid generating files interactively or doing things by hand
• Write scripts, R Markdown documents, Jupyter notebooks or Snakemake / Nextflow workflows for reproducible results to connect raw data to final results
• Keep the parameters separate (e.g. at top of file or in a separate configuration file)
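
As a rough illustration of keeping parameters separate, a script can read them from a configuration file instead of hard-coding them. The sketch uses JSON from the standard library purely to stay dependency-free (a YAML file would work the same way with a YAML parser); the parameter names `min_quality` and `threads` are invented for this example.

```python
import json
from pathlib import Path

# Hypothetical defaults for illustration; not taken from the course material
DEFAULT_PARAMS = {"min_quality": 20, "threads": 1}


def load_params(config_path):
    """Return the default parameters overridden by values from a JSON config file."""
    params = dict(DEFAULT_PARAMS)
    params.update(json.loads(Path(config_path).read_text()))
    return params
```

The analysis code then only ever sees the returned dictionary, so changing a parameter means editing the tracked configuration file rather than the code.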

What is reasonable for your project?

[Image: ../../pages/images/reproducibility_overview_with_logos.png]

What is reasonable for your project?

Minimal

Write code in a reproducible way and track your environment

• Track your projects with a Git repository each; publish code with your results on e.g. GitHub
• Use Conda to install software in environments that can be exported and installed on a different system
• Publish your environment.yml file along with your code

What is reasonable for your project?

Good

Structure and document your code with notebooks

• Use Quarto or Jupyter notebooks to better keep track of and document your code
• Track your notebooks with Git

What is reasonable for your project?

Great

Track the full environment and connect your code in a workflow

• Go one step beyond in tracking your environment using Docker or Apptainer
• Convert your code into a Snakemake / Nextflow workflow
• Track both your image definitions (e.g. Dockerfiles) as well as your workflows with Git

Alternatives

Version control

• Git – Widely used and a lot of tools available + GitHub/BitBucket.
• Mercurial – Distributed model just like Git, close to sourceforge.
• Subversion – Centralized model unlike git/mercurial; no local repository on your computer and somewhat easier to use.

Alternatives

Environment / package managers

• Conda – General purpose environment and package manager. Community-hosted collections of tools at bioconda or conda-forge.
• Pip – Package manager for Python, has a large repository at pypi.
• Apt/yum/brew – Native package managers for different OS. Integrated in OS and might deal with e.g. update notifications better.
• Virtualenv – Environment manager used to set up semi-isolated python environments.

Alternatives

Workflow managers

• Snakemake – Based on Python, easily understandable format, relies on file names.
• Nextflow – Based on Groovy, uses data pipes rather than file names to construct the workflow.
• Make – Used in software development and has been around since the 70s. Flexible but notoriously obscure syntax.
• Galaxy - attempts to make computational biology accessible to researchers without programming experience by using a GUI.

Alternatives

Literate programming

• Quarto - Developed by Posit (previously RStudio), command-line tool focused on generating high-quality documents in a language-agnostic way
• Jupyter – Create and share notebooks in a variety of languages and formats by using a web browser.
• R Markdown – Developed by Posit (previously RStudio), focused on generating high-quality documents.
• Zeppelin – Developed by Apache. Closely integrated with Spark for distributed computing and Big Data applications.
• Beaker – Newcomer based on Ipython, just as Jupyter. Has a focus on integrating multiple languages in the same notebook.

Alternatives

Containerization / virtualization

• Docker – Used for packaging and isolating applications in containers. Dockerhub allows for convenient sharing. Requires root access.
• Singularity/Apptainer – Simpler Docker alternative geared towards high performance computing. Does not require root.
• Podman - open source daemonless container tool similar to docker in many regards
• Shifter – Similar ambition as Singularity, but less focus on mobility and more on resource management.
• VirtualBox/VMWare – Virtualization rather than containerization. Less lightweight, but no reliance on host kernel.

class: center, middle

“What’s in it for me?”

[Image: images/calvin_hobbes_past_corresponding.png]

NBIS Bioinformatics drop-in

Any questions related to reproducible research tools and concepts? Talk to an NBIS expert!

• Online (Zoom)
• Every Tuesday, 14.00-15.00 (except public holidays)
• Check www.nbis.se/events for Zoom link and more info

Questions?

\ No newline at end of file
diff --git a/lectures/putting-it-together/putting-it-together.pdf b/lectures/putting-it-together/putting-it-together.pdf
deleted file mode 100644
index 54970ed5..00000000
Binary files a/lectures/putting-it-together/putting-it-together.pdf and /dev/null differ
diff --git a/lectures/putting-it-together/putting-it-together.qmd b/lectures/putting-it-together/putting-it-together.qmd
new file mode 100644
index 00000000..1d31d02c
--- /dev/null
+++ b/lectures/putting-it-together/putting-it-together.qmd
@@ -0,0 +1,191 @@
+---
+title: "Putting it all together"
+format: nbis-revealjs
+---
+
+## Working reproducibly will make your research life a lot easier!
+
+
+
+![](images/whats_in_it_for_me.png){height=400 fig-align=center}
+
+## Take control of your research by making its different components reproducible
+
+
+
+![](../../pages/images/reproducibility_overview.png){height=400 fig-align=center}
+
+## What have we learned? {.smaller}
+
+![](../../pages/images/reproducibility_overview_with_logos.png){height=300 fig-align=center}
+
+- How to use the version control system [Git]{.green} to track changes to code
+- How to use the package and environment manager [Conda]{.green}
+- How to use the workflow managers [Snakemake]{.green} and [Nextflow]{.green}
+- How to use [Quarto]{.green} and [Jupyter]{.green} to generate automated
+  reports and to document your analyses
+- How to use [Docker]{.green} and [Apptainer]{.green} to distribute
+  containerized computational environments
+
+## Divide your work into distinct projects
+
+::: {.incremental}
+- Keep all [files]{.green} needed to go from raw data to final results in a dedicated directory
+- Use relevant [subdirectories]{.green}
+- Use [Git]{.green} to version control your projects
+- Do not store data and results/output in your Git repository
+- When in doubt, [commit often]{.green} rather than not
+:::
+
+## Find your own project structure
+
+For example:
+```bash
+code/             Code needed to go from input files to final results
+data/             Raw data - this should never be edited
+doc/              Documentation of the project
+env/              Environment-related files, e.g. Conda environments or Dockerfiles
+results/          Output from workflows and analyses
+README.md         Project description and instructions
+```
+
+
+
+More examples:
+
+- [https://github.com/NBISweden/project_template]()
+- [https://github.com/fasterius/nbis-support-template]()
+- [https://github.com/snakemake-workflows/snakemake-workflow-template]()
+
+## Treasure your data
+
+::: {.incremental}
+- Keep your raw data [read-only]{.green} and static
+- Don't create different versions of the input data - write a [script]{.green},
+  [Quarto]{.green} document, [Jupyter]{.green} notebook or a [Snakemake]{.green}
+  / [Nextflow]{.green} workflow if you need to pre-process your input data so that
+  the steps can be recreated
+- [Backup]{.green}! Keep redundant copies in different physical locations
+- Upload your raw data as soon as possible to a [public data repository]{.green}
+:::
+
+## Organize your coding
+
+::: {.incremental}
+- Avoid generating files [interactively]{.green} or doing things [by
+  hand]{.green}
+- Write [scripts]{.green}, [R Markdown]{.green} documents, [Jupyter]{.green}
+  notebooks or [Snakemake]{.green} / [Nextflow]{.green} workflows for
+  reproducible results to connect raw data to final results
+- Keep the [parameters]{.green} separate (_e.g._ at top of file or in a separate
+  configuration file)
+:::
+
+## What is reasonable for your project?
+
+
+
+![](../../pages/images/reproducibility_overview_with_logos.png){height=450 fig-align=center}
+
+## What is reasonable for your project?
+
+[Minimal]{.green}
+
+_Write code in a reproducible way and track your environment_
+
+- Track your projects with a [Git]{.green} repository each; publish code with
+  your results on _e.g._ [GitHub](https://github.com)
+- Use [Conda]{.green} to install software in environments that can be exported
+  and installed on a different system
+- Publish your `environment.yml` file along with your code
+
+## What is reasonable for your project?
+
+[Good]{.green}
+
+_Structure and document your code with notebooks_
+
+- Use [Quarto]{.green} or [Jupyter]{.green} notebooks to better keep track of
+  and document your code
+- Track your notebooks with Git
+
+## What is reasonable for your project?
+
+[Great]{.green}
+
+_Track the **full** environment and connect your code in a workflow_
+
+- Go one step beyond in tracking your environment using [Docker]{.green} or
+  [Apptainer]{.green}
+- Convert your code into a [Snakemake]{.green} / [Nextflow]{.green} workflow
+- Track both your image definitions (_e.g._ Dockerfiles) as well as your
+  workflows with Git
+
+## Alternatives
+
+[Version control]{.green}
+
+- **Git** – Widely used and a lot of tools available + GitHub/BitBucket.
+- **Mercurial** – Distributed model just like Git, close to sourceforge.
+- **Subversion** – Centralized model unlike git/mercurial; no local repository on your computer and somewhat easier to use.
+
+## Alternatives
+
+[Environment / package managers]{.green}
+
+- **Conda** – General purpose environment and package manager. Community-hosted collections of tools at bioconda or conda-forge.
+- **Pip** – Package manager for Python, has a large repository at pypi.
+- **Apt/yum/brew** – Native package managers for different OS. Integrated in OS and might deal with _e.g._ update notifications better.
+- **Virtualenv** – Environment manager used to set up semi-isolated python environments.
+
+## Alternatives
+
+[Workflow managers]{.green}
+
+- **Snakemake** – Based on Python, easily understandable format, relies on file names.
+- **Nextflow** – Based on Groovy, uses data pipes rather than file names to construct the workflow.
+- **Make** – Used in software development and has been around since the 70s. Flexible but notoriously obscure syntax.
+- **Galaxy** - attempts to make computational biology accessible to researchers without programming experience by using a GUI.
+
+## Alternatives
+
+[Literate programming]{.green}
+
+- **Quarto** - Developed by Posit (previously RStudio), command-line tool focused
+  on generating high-quality documents in a language-agnostic way
+- **Jupyter** – Create and share notebooks in a variety of languages and formats
+  by using a web browser.
+- **R Markdown** – Developed by Posit (previously RStudio), focused on
+  generating high-quality documents.
+- **Zeppelin** – Developed by Apache. Closely integrated with Spark for
+  distributed computing and Big Data applications.
+- **Beaker** – Newcomer based on Ipython, just as Jupyter. Has a focus on
+  integrating multiple languages in the same notebook.
+
+## Alternatives
+
+[Containerization / virtualization]{.green}
+
+- **Docker** – Used for packaging and isolating applications in containers. Dockerhub allows for convenient sharing. Requires root access.
+- **Singularity/Apptainer** – Simpler Docker alternative geared towards high performance computing. Does not require root.
+- **Podman** - open source daemonless container tool similar to docker in many regards
+- **Shifter** – Similar ambition as Singularity, but less focus on mobility and more on resource management.
+- **VirtualBox/VMWare** – Virtualization rather than containerization. Less lightweight, but no reliance on host kernel.
+
+class: center, middle
+
+## "What's in it for me?"
+
+
+
+![](images/calvin_hobbes_past_corresponding.png){height=450 fig-align=center}
+
+## NBIS Bioinformatics drop-in
+
+Any questions related to reproducible research tools and concepts? Talk to an NBIS expert!
+
+- Online ([Zoom]{.green})
+- Every [Tuesday, 14.00-15.00]{.green} (except public holidays)
+- Check [www.nbis.se/events]{.green} for Zoom link and more info
+
+# Questions?