Add subsampling #50

Merged
merged 20 commits into from
Oct 30, 2024

@nggvs nggvs commented Oct 28, 2024

Closes #28.
Default 0 means that no subsampling is performed.
The subsampling step is performed at the very beginning of the pipeline.

Regarding this:

Additionally, the module should be modified to allow for an optional value channel to specify the seed to make the subsampling deterministic and reproducible. This is required for testing and also useful in certain real-world scenarios.

SeqKit sample already uses a default seed according to its documentation (see the -s option):
https://bioinf.shenwei.me/seqkit/usage/#sample

Snapshots updated after merging #49

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/seqinspector branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nf-test test main.nf.test -profile test,docker).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

github-actions bot commented Oct 28, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 5a0a3e8

+| ✅ 190 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗  21 tests had warnings |!

❗ Test warnings:

  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
  • pipeline_todos - TODO string in nextflow.config: Specify your pipeline's command line flags
  • pipeline_todos - TODO string in nextflow.config: Optionally, you can add a pipeline-specific nf-core config at https://github.com/nf-core/configs
  • pipeline_todos - TODO string in README.md: TODO nf-core:
  • pipeline_todos - TODO string in README.md: Include a figure that guides the user through the major workflow steps. Many nf-core
  • pipeline_todos - TODO string in README.md: Fill in short bullet-pointed list of the default steps in the pipeline
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
  • pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in test.config: Specify the paths to your test data on nf-core/test-datasets
  • pipeline_todos - TODO string in test.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-10-30 09:43:06

Member

@MatthiasZepper MatthiasZepper left a comment

Thanks a lot for your contribution! The subsampling functionality will be very important to Seqinspector, therefore it is great that you decided to work along this line!

Unfortunately, there is still some room for improvement, but I am optimistic that all of those issues can be resolved during the next two days of the Hackathon.

In general, I think most users would want to select a fraction of reads to subsample rather than an absolute number of reads. Your initial idea of using SeqKit sample, which directly supports probabilities, would therefore make a lot of sense, but it would require writing the module first. If you go with Seqtk, I think it would be fine to merge this PR with absolute numbers only. If you feel enthusiastic, though, you could patch the module to include a read-counting step and set the sample size according to the number of input reads.
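
A read-counting step of this kind could look roughly like the following shell sketch. It assumes an uncompressed FASTQ file with exactly four lines per record; the helper names count_reads and sample_size_from_fraction are hypothetical and are not part of the Seqtk module.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and assume an uncompressed
# FASTQ file with exactly four lines per record.

# Print the number of records in the FASTQ file given as $1.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print the absolute sample size for $1 reads and fraction $2 (rounded down).
sample_size_from_fraction() {
    awk -v n="$1" -v f="$2" 'BEGIN { printf "%d\n", n * f }'
}

# Possible use (requires seqtk; file names and fraction are placeholders):
#   n=$(sample_size_from_fraction "$(count_reads reads.fq)" 0.1)
#   seqtk sample -s100 reads.fq "$n" > sub.fq
```

Rounding down keeps the requested size from exceeding the number of available reads when the fraction is at most 1.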

script:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
if (!(args ==~ /.*-s[0-9]+.*/)) {
Member

Do you think it would be useful to expose the random seed of the tool as (possibly hidden) pipeline parameter?

Author

I'm not sure about this, is there any example you have in mind?

Member

@MatthiasZepper MatthiasZepper Oct 29, 2024

I am unsure as well. My experience from the RNA-seq pipeline is that people usually struggle with custom configuration files.

The extra_* convenience parameters of that pipeline (extra_star_align_args or extra_kallisto_quant_args), which allow modifying the arguments of those tools without a custom config, are very popular.

Therefore, I thought about setting the seed via a parameter, but on second thought, I agree that requiring a custom config is probably fine.

Author

I can understand the difficulties with custom configuration files. The extra arguments are OK, but on the other hand I have the impression that they might easily be forgotten when setting up the workflow. Maybe it would be better to detect paired reads, either by checking the input channel or just by looking for _1 and _2 files, and to enable the -s option if they are present?

// MODULE: Run Seqkit sample to perform subsampling
//
if (params.sample_size > 0 ) {
ch_sample_sized = SEQTK_SAMPLE(ch_samplesheet.map {
Member

Unfortunately, I don't think the module is capable of processing paired files. The description of the tool suggests that it has to be invoked with the same random seed on each FastQ file separately:

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

and all of the module tests only use a single file.

When you invoke the module, you will have to account for that.
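
One way to account for this is sketched below: call seqtk sample once per mate and pass the same seed to both calls, as in the tool's own example. The wrapper name subsample_pair and its argument order are assumptions for illustration and are not taken from the module.

```shell
#!/usr/bin/env bash
# Sketch only: subsample_pair is a hypothetical wrapper, not part of seqtk.
# Usage: subsample_pair <seed> <n_reads> <in_R1> <in_R2> <out_R1> <out_R2>
# Both calls share one seed, which the seqtk documentation recommends to keep
# pairing. This assumes both inputs list their mates in the same order.
subsample_pair() (
    seed="$1"
    n_reads="$2"
    seqtk sample -s"$seed" "$3" "$n_reads" > "$5" &&
    seqtk sample -s"$seed" "$4" "$n_reads" > "$6"
)

# Example (requires seqtk on the PATH):
#   subsample_pair 100 10000 read1.fq read2.fq sub1.fq sub2.fq
```

The function body runs in a subshell, so the seed and n_reads variables do not leak into the calling shell.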

Author

OK, do you know whether at least one of the tests has paired reads, so that I can see what the input channel looks like and also test this functionality?
I have found this post about custom-configuration-files, and looking at the module, the '-s' option can be set through ext.args.

Author

I have asked in Barcelona, and apparently there isn't a paired-end test yet, so it was suggested that I open an issue to add one: #55
For now, I have added the seed argument in conf/modules.config.
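
For users who want a different seed than the one set in conf/modules.config, a custom config could override ext.args for the process. The sketch below writes such a file from the shell; the helper name write_seed_config, the file name and the seed 42 are illustrative assumptions, and the override is expected (not verified here) to take precedence over the pipeline's own setting.

```shell
#!/usr/bin/env bash
# Sketch only: write_seed_config is a hypothetical helper.
# Usage: write_seed_config <seed> <output_config_path>
# Writes a Nextflow config that sets ext.args for the SEQTK_SAMPLE process.
write_seed_config() (
    seed="$1"
    out="$2"
    cat > "$out" <<EOF
process {
    withName: 'SEQTK_SAMPLE' {
        ext.args = '-s${seed}'
    }
}
EOF
)

# Example (seed and file name are arbitrary):
#   write_seed_config 42 custom_seed.config
#   nextflow run . -profile test,docker --outdir <OUTDIR> \
#       --sample_size 1000 -c custom_seed.config
```

The sample size of 1000 in the commented command is an arbitrary example value.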

Author

@nggvs nggvs left a comment

Thanks for the review!

Member

@MatthiasZepper MatthiasZepper left a comment

I think the major issues are addressed, and the remaining things can be polished later or are a matter of taste anyway. Thanks for your contribution!

@@ -18,6 +18,10 @@ process {
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]

withName: SEQTK_SAMPLE {
ext.args = '-s100'
}
Member

I have no opinion on whether the pipeline should publish the subsampled files. But if you want to output them in the way you describe in output.md, you will probably need to define a corresponding publishDir directive here. (There is also a new option called Workflow Output Schema, which I am not yet familiar with. Did you use that without me noticing?)

Author

I think they are always output, because there is a general statement here:

publishDir = [

Or do you mean something different? I think it is nice to have the reads that have actually been used in the rest of the analysis.

@nggvs nggvs merged commit 5e56fc3 into nf-core:dev Oct 30, 2024
9 checks passed