Add subsampling #50

nggvs · 2024-10-28T15:27:59Z

Closes #28.
The default sample_size of 0 means that no subsampling is performed.
The subsampling step is performed at the very beginning of the pipeline.

Regarding this:

Additionally, the module should be modified to allow for an optional value channel to specify the seed to make the subsampling deterministic and reproducible. This is required for testing and also useful in certain real-world scenarios.

SeqKit sample already uses a default seed according to the documentation (see the -s option):
https://bioinf.shenwei.me/seqkit/usage/#sample

Snapshots updated after merging #49

github-actions · 2024-10-28T15:29:39Z

`nf-core pipelines lint` overall result: Passed ✅ ⚠️

Posted for pipeline commit 5a0a3e8

✅ 190 tests passed
❔ 1 tests were ignored
❗ 21 tests had warnings

❗ Test warnings:

readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
pipeline_todos - TODO string in nextflow.config: Specify your pipeline's command line flags
pipeline_todos - TODO string in nextflow.config: Optionally, you can add a pipeline-specific nf-core config at https://github.com/nf-core/configs
pipeline_todos - TODO string in README.md: TODO nf-core:
pipeline_todos - TODO string in README.md: Include a figure that guides the user through the major workflow steps. Many nf-core
pipeline_todos - TODO string in README.md: Fill in short bullet-pointed list of the default steps in the pipeline
pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed
pipeline_todos - TODO string in test.config: Specify the paths to your test data on nf-core/test-datasets
pipeline_todos - TODO string in test.config: Give any required params for the test so that command line flags are not needed
pipeline_todos - TODO string in base.config: Check the defaults for all processes
pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required

❔ Tests ignored:

files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md

✅ Tests passed:

files_exist - File found: .gitattributes
files_exist - File found: .gitignore
files_exist - File found: .nf-core.yml
files_exist - File found: .editorconfig
files_exist - File found: .prettierignore
files_exist - File found: .prettierrc.yml
files_exist - File found: CHANGELOG.md
files_exist - File found: CITATIONS.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_exist - File found: nextflow_schema.json
files_exist - File found: nextflow.config
files_exist - File found: README.md
files_exist - File found: .github/.dockstore.yml
files_exist - File found: .github/CONTRIBUTING.md
files_exist - File found: .github/ISSUE_TEMPLATE/bug_report.yml
files_exist - File found: .github/ISSUE_TEMPLATE/config.yml
files_exist - File found: .github/ISSUE_TEMPLATE/feature_request.yml
files_exist - File found: .github/PULL_REQUEST_TEMPLATE.md
files_exist - File found: .github/workflows/branch.yml
files_exist - File found: .github/workflows/ci.yml
files_exist - File found: .github/workflows/linting_comment.yml
files_exist - File found: .github/workflows/linting.yml
files_exist - File found: assets/email_template.html
files_exist - File found: assets/email_template.txt
files_exist - File found: assets/sendmail_template.txt
files_exist - File found: assets/nf-core-seqinspector_logo_light.png
files_exist - File found: conf/modules.config
files_exist - File found: conf/test.config
files_exist - File found: conf/test_full.config
files_exist - File found: docs/images/nf-core-seqinspector_logo_light.png
files_exist - File found: docs/images/nf-core-seqinspector_logo_dark.png
files_exist - File found: docs/output.md
files_exist - File found: docs/README.md
files_exist - File found: docs/README.md
files_exist - File found: docs/usage.md
files_exist - File found: main.nf
files_exist - File found: assets/multiqc_config.yml
files_exist - File found: conf/base.config
files_exist - File found: conf/igenomes.config
files_exist - File found: conf/igenomes_ignored.config
files_exist - File found: .github/workflows/awstest.yml
files_exist - File found: .github/workflows/awsfulltest.yml
files_exist - File found: modules.json
files_exist - File not found check: .github/ISSUE_TEMPLATE/bug_report.md
files_exist - File not found check: .github/ISSUE_TEMPLATE/feature_request.md
files_exist - File not found check: .github/workflows/push_dockerhub.yml
files_exist - File not found check: .markdownlint.yml
files_exist - File not found check: .nf-core.yaml
files_exist - File not found check: .yamllint.yml
files_exist - File not found check: bin/markdown_to_html.r
files_exist - File not found check: conf/aws.config
files_exist - File not found check: docs/images/nf-core-seqinspector_logo.png
files_exist - File not found check: lib/Checks.groovy
files_exist - File not found check: lib/Completion.groovy
files_exist - File not found check: lib/NfcoreTemplate.groovy
files_exist - File not found check: lib/Utils.groovy
files_exist - File not found check: lib/Workflow.groovy
files_exist - File not found check: lib/WorkflowMain.groovy
files_exist - File not found check: lib/WorkflowSeqinspector.groovy
files_exist - File not found check: parameters.settings.json
files_exist - File not found check: pipeline_template.yml
files_exist - File not found check: Singularity
files_exist - File not found check: lib/nfcore_external_java_deps.jar
files_exist - File not found check: .travis.yml
nextflow_config - Found nf-schema plugin
nextflow_config - Config variable found: manifest.name
nextflow_config - Config variable found: manifest.nextflowVersion
nextflow_config - Config variable found: manifest.description
nextflow_config - Config variable found: manifest.version
nextflow_config - Config variable found: manifest.homePage
nextflow_config - Config variable found: timeline.enabled
nextflow_config - Config variable found: trace.enabled
nextflow_config - Config variable found: report.enabled
nextflow_config - Config variable found: dag.enabled
nextflow_config - Config variable found: process.cpus
nextflow_config - Config variable found: process.memory
nextflow_config - Config variable found: process.time
nextflow_config - Config variable found: params.outdir
nextflow_config - Config variable found: params.input
nextflow_config - Config variable found: validation.help.enabled
nextflow_config - Config variable found: manifest.mainScript
nextflow_config - Config variable found: timeline.file
nextflow_config - Config variable found: trace.file
nextflow_config - Config variable found: report.file
nextflow_config - Config variable found: dag.file
nextflow_config - Config variable found: validation.help.beforeText
nextflow_config - Config variable found: validation.help.afterText
nextflow_config - Config variable found: validation.help.command
nextflow_config - Config variable found: validation.summary.beforeText
nextflow_config - Config variable found: validation.summary.afterText
nextflow_config - Config variable (correctly) not found: params.nf_required_version
nextflow_config - Config variable (correctly) not found: params.container
nextflow_config - Config variable (correctly) not found: params.singleEnd
nextflow_config - Config variable (correctly) not found: params.igenomesIgnore
nextflow_config - Config variable (correctly) not found: params.name
nextflow_config - Config variable (correctly) not found: params.enable_conda
nextflow_config - Config variable (correctly) not found: params.max_cpus
nextflow_config - Config variable (correctly) not found: params.max_memory
nextflow_config - Config variable (correctly) not found: params.max_time
nextflow_config - Config variable (correctly) not found: params.validationFailUnrecognisedParams
nextflow_config - Config variable (correctly) not found: params.validationLenientMode
nextflow_config - Config variable (correctly) not found: params.validationSchemaIgnoreParams
nextflow_config - Config variable (correctly) not found: params.validationShowHiddenParams
nextflow_config - Config timeline.enabled had correct value: true
nextflow_config - Config report.enabled had correct value: true
nextflow_config - Config trace.enabled had correct value: true
nextflow_config - Config dag.enabled had correct value: true
nextflow_config - Config manifest.name began with nf-core/
nextflow_config - Config variable manifest.homePage began with https://github.com/nf-core/
nextflow_config - Config dag.file ended with .html
nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
nextflow_config - Config manifest.version ends in dev: 1.0dev
nextflow_config - Config params.custom_config_version is set to master
nextflow_config - Config params.custom_config_base is set to https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Lines for loading custom profiles found
nextflow_config - nextflow.config contains configuration profile test
nextflow_config - Config default value correct: params.sample_size= 0
nextflow_config - Config default value correct: params.igenomes_base= s3://ngi-igenomes/igenomes/
nextflow_config - Config default value correct: params.custom_config_version= master
nextflow_config - Config default value correct: params.custom_config_base= https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Config default value correct: params.publish_dir_mode= copy
nextflow_config - Config default value correct: params.max_multiqc_email_size= 25.MB
nextflow_config - Config default value correct: params.validate_params= true
nextflow_config - Config default value correct: params.pipelines_testdata_base_path= https://raw.githubusercontent.com/nf-core/test-datasets/
files_unchanged - .gitattributes matches the template
files_unchanged - .prettierrc.yml matches the template
files_unchanged - CODE_OF_CONDUCT.md matches the template
files_unchanged - LICENSE matches the template
files_unchanged - .github/.dockstore.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/bug_report.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/config.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/feature_request.yml matches the template
files_unchanged - .github/PULL_REQUEST_TEMPLATE.md matches the template
files_unchanged - .github/workflows/branch.yml matches the template
files_unchanged - .github/workflows/linting_comment.yml matches the template
files_unchanged - .github/workflows/linting.yml matches the template
files_unchanged - assets/email_template.html matches the template
files_unchanged - assets/email_template.txt matches the template
files_unchanged - assets/sendmail_template.txt matches the template
files_unchanged - assets/nf-core-seqinspector_logo_light.png matches the template
files_unchanged - docs/images/nf-core-seqinspector_logo_light.png matches the template
files_unchanged - docs/images/nf-core-seqinspector_logo_dark.png matches the template
files_unchanged - docs/README.md matches the template
files_unchanged - .gitignore matches the template
files_unchanged - .prettierignore matches the template
actions_ci - '.github/workflows/ci.yml' is triggered on expected events
actions_ci - '.github/workflows/ci.yml' checks minimum NF version
actions_awstest - '.github/workflows/awstest.yml' is triggered correctly
actions_awsfulltest - .github/workflows/awsfulltest.yml is triggered correctly
actions_awsfulltest - .github/workflows/awsfulltest.yml does not use -profile test
readme - README Nextflow minimum version badge matched config. Badge: 24.04.2, Config: 24.04.2
plugin_includes - No wrong validation plugin imports have been found
pipeline_name_conventions - Name adheres to nf-core convention
template_strings - Did not find any Jinja template strings (0 files)
schema_lint - Schema lint passed
schema_lint - Schema title + description lint passed
schema_lint - Input mimetype lint passed: 'text/csv'
schema_params - Schema matched params returned from nextflow config
system_exit - No System.exit calls found
actions_schema_validation - Workflow validation passed: nf-test.yml
actions_schema_validation - Workflow validation passed: linting.yml
actions_schema_validation - Workflow validation passed: branch.yml
actions_schema_validation - Workflow validation passed: fix-linting.yml
actions_schema_validation - Workflow validation passed: release-announcements.yml
actions_schema_validation - Workflow validation passed: awsfulltest.yml
actions_schema_validation - Workflow validation passed: template_version_comment.yml
actions_schema_validation - Workflow validation passed: download_pipeline.yml
actions_schema_validation - Workflow validation passed: ci.yml
actions_schema_validation - Workflow validation passed: clean-up.yml
actions_schema_validation - Workflow validation passed: awstest.yml
actions_schema_validation - Workflow validation passed: linting_comment.yml
merge_markers - No merge markers found in pipeline files
modules_json - Only installed modules found in modules.json
multiqc_config - assets/multiqc_config.yml found and not ignored.
multiqc_config - assets/multiqc_config.yml contains report_section_order
multiqc_config - assets/multiqc_config.yml contains export_plots
multiqc_config - assets/multiqc_config.yml contains report_comment
multiqc_config - assets/multiqc_config.yml follows the ordering scheme of the minimally required plugins.
multiqc_config - assets/multiqc_config.yml contains a matching 'report_comment'.
multiqc_config - assets/multiqc_config.yml contains 'export_plots: true'.
modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'
base_config - conf/base.config found and not ignored.
modules_config - conf/modules.config found and not ignored.
modules_config - SEQTK_SAMPLE found in conf/modules.config and Nextflow scripts.
modules_config - FASTQC found in conf/modules.config and Nextflow scripts.
modules_config - MULTIQC_GLOBAL found in conf/modules.config and Nextflow scripts.
modules_config - MULTIQC_PER_TAG found in conf/modules.config and Nextflow scripts.
nfcore_yml - Repository type in .nf-core.yml is valid: pipeline
nfcore_yml - nf-core version in .nf-core.yml is set to the latest version: 3.0.2

Run details

nf-core/tools version 3.0.2
Run at 2024-10-30 09:43:06

MatthiasZepper

Thanks a lot for your contribution! The subsampling functionality will be very important to Seqinspector, so it is great that you decided to work on this!

Unfortunately, there is still some room for improvement, but I am optimistic that all of those issues can be resolved during the next two days of the Hackathon.

Generally, I think most users would want to select a fraction of reads to subsample rather than decide on an absolute number of reads. Your initial idea of using SeqKit sample, which has direct support for probabilities, would therefore make a lot of sense, but it would require writing the module first. If you go with Seqtk, I think it would be fine to merge this PR with absolute numbers only. If you feel enthusiastic, though, you could patch the module to include a read-counting step and set the sample size according to the number of input reads.
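The read-counting idea in the comment above can be sketched outside the pipeline as follows. This is only an illustration under stated assumptions, not code from the PR or the module: the function names `count_fastq_reads` and `absolute_sample_size` are hypothetical, the FASTQ input is assumed to use exactly four lines per record, and rounding down (with a minimum of one read) is an arbitrary choice.

```python
import math


def count_fastq_reads(lines):
    """Count reads in an iterable of FASTQ lines (assumes 4 lines per record)."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


def absolute_sample_size(n_reads, fraction):
    """Turn a fraction in (0, 1] into an absolute read count.

    Rounds down, but keeps at least one read when the input is not empty
    (an assumption made for this sketch).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if n_reads == 0:
        return 0
    return max(1, math.floor(n_reads * fraction))


# Toy input: two FASTQ records held in memory instead of a file.
toy_fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "TTGA", "+", "IIII"]
n_reads = count_fastq_reads(toy_fastq)
sample_size = absolute_sample_size(n_reads, 0.5)
```

With a real file, the same counter could be fed an open file handle, and the resulting number passed on as the absolute sample size.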

MatthiasZepper · 2024-10-28T18:08:29Z

modules/nf-core/seqtk/sample/main.nf

+    script:
+    def args   = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    if (!(args ==~ /.*-s[0-9]+.*/)) {


Do you think it would be useful to expose the random seed of the tool as a (possibly hidden) pipeline parameter?

I'm not sure about this. Is there any example you have in mind?

I am unsure as well. My experience from the RNA-seq pipeline is that people usually struggle with custom configuration files.

The extra_* convenience parameters of that pipeline (extra_star_align_args or extra_kallisto_quant_args), which allow modifying the arguments of those tools without a custom config, are very popular.

Therefore, I thought about setting the seed via a parameter, but on second thought, I agree that requiring a custom config is probably fine.

I can understand the difficulties with custom configuration files. The extra arguments are OK, but on the other hand I have the impression that they might easily be forgotten when setting up the workflow. I think it may be better to check whether there are paired reads, either by checking the input channel or just by checking whether there are _1 and _2 files, and if so, enable the -s option?
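For readers unfamiliar with the pattern in the hunk at the top of this thread, the check can be mirrored in a few lines of Python. This is a sketch, not the module's code: the hunk does not show what happens when no seed is found, so appending a default seed is an assumption, the name `ensure_seed` is hypothetical, and the default of 100 is borrowed from the `-s100` used elsewhere in this PR. Groovy's `==~` matches the whole string, which `re.fullmatch` imitates here.

```python
import re

# Same pattern as in the module hunk: "-s" directly followed by digits.
SEED_FLAG = re.compile(r".*-s[0-9]+.*")


def ensure_seed(args, default_seed=100):
    """Return args unchanged if they already carry a seed, else append one.

    Appending a default is an assumption of this sketch.
    """
    if SEED_FLAG.fullmatch(args):
        return args
    return f"{args} -s{default_seed}".strip()


print(ensure_seed(""))        # no seed given, default appended
print(ensure_seed("-s42"))    # user-supplied seed is kept
print(ensure_seed("-2"))      # other flags are preserved
```

Note that the pattern only recognises the seed when the digits follow `-s` without a space, so `-s 42` would not count as a seed under this check.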

MatthiasZepper · 2024-10-28T18:21:38Z

workflows/seqinspector.nf

+    // MODULE: Run Seqkit sample to perform subsampling
+    //
+    if (params.sample_size > 0 ) {
+        ch_sample_sized = SEQTK_SAMPLE(ch_samplesheet.map {


Unfortunately, I don't think the module is capable of processing paired files. The description of the tool suggests that it has to be invoked with the same random seed on each FastQ file separately:

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

and all of the module tests only use a single file.

When you invoke the module, you will have to account for that.
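Why the same seed keeps mates together can be shown with a toy model. The sketch below does not reproduce seqtk's sampling algorithm; it uses Python's `random` module as a stand-in, and the `subsample` helper and the read names are hypothetical. The point is only that two independent runs over inputs of equal length, seeded identically, select the same positions.

```python
import random


def subsample(records, n, seed):
    """Keep n records, chosen by a generator seeded with `seed`, in input order."""
    if n >= len(records):
        return list(records)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(records)), n))
    return [records[i] for i in keep]


# Toy stand-ins for two paired FASTQ files with reads in the same order.
read1 = [f"pair{i}/1" for i in range(1000)]
read2 = [f"pair{i}/2" for i in range(1000)]

# Two separate invocations, one per file, with the same seed.
sub1 = subsample(read1, 10, seed=100)
sub2 = subsample(read2, 10, seed=100)

# Same seed and same input length give the same indices, so mates stay together.
paired = all(a[:-2] == b[:-2] for a, b in zip(sub1, sub2))
print(paired)
```

If the two calls used different seeds, or if the files differed in length or read order, the selected positions would no longer be guaranteed to line up.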

OK, do you know whether at least one of the tests has paired reads, so that I can see what the input channel looks like and also test this functionality?
I have found this post about custom-configuration-files and, looking at the module, the '-s' option can be set through ext.args.

I have asked in Barcelona, and apparently there isn't a paired-end test yet, so it has been suggested that I open an issue to add one: #55
For now I have added the seed argument in conf/modules.config.

nggvs

Thanks for the review!

nggvs · 2024-10-29T00:56:40Z

workflows/seqinspector.nf

+    // MODULE: Run Seqkit sample to perform subsampling
+    //
+    if (params.sample_size > 0 ) {
+        ch_sample_sized = SEQTK_SAMPLE(ch_samplesheet.map {


OK, do you know whether at least one of the tests has paired reads, so that I can see what the input channel looks like and also test this functionality?
I have found this post about custom-configuration-files and, looking at the module, the '-s' option can be set through ext.args.

MatthiasZepper

I think the major issues are addressed, and the other things can either be polished later or are a matter of taste anyway. Thanks for your contribution!

MatthiasZepper · 2024-10-29T15:57:25Z

conf/modules.config

@@ -18,6 +18,10 @@ process {
        saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
    ]

+    withName: SEQTK_SAMPLE {
+        ext.args = '-s100'
+    }


I have no opinion on whether the pipeline should publish the subsampled files. But if you want to output them in the way you describe in output.md, you will probably need to define a corresponding publishDir directive here. (There is also a new feature called Workflow Output Schema, which I am not yet familiar with. Did you use that without me noticing?)

I think they are always published, because there is a general statement here:

seqinspector/conf/modules.config

Line 15 in a2d589f

publishDir = [

Or do you mean something different? I think it is nice to have the reads that were actually used in the rest of the analysis.

nggvs added 7 commits October 28, 2024 14:38

add subsampling

88bb65d

add subsampling nf-test

dbd64bf

add snapshots

44fb554

better name

d781db1

update docs

ab30119

change default to 0

b8befe4

update snapshot

135d6a3

nggvs added 3 commits October 28, 2024 15:43

prettify docs

36d80d3

update changelog

96dc9f7

typo

0fd6a95

MatthiasZepper requested changes Oct 28, 2024

nggvs added 3 commits October 29, 2024 08:47

suggestions

a417ad3

switch to original reads for fastqc

801220f

update snapshot

e9ae94c

nggvs commented Oct 29, 2024

nggvs added 3 commits October 29, 2024 11:14

back to the past

afb2a61

update README

80e49aa

add module config

3b29496

nggvs mentioned this pull request Oct 29, 2024

Add relative sample size input in subsampling #60

Open

MatthiasZepper approved these changes Oct 29, 2024

nggvs added 4 commits October 30, 2024 09:13

Merge branch 'dev' into add-subsampling

bfa26af

fix test

88fa0dc

add citation

5079047

fix test

5a0a3e8

nggvs merged commit 5e56fc3 into nf-core:dev Oct 30, 2024
9 checks passed
