Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cutadapt fixes and improvements #5954

Merged
merged 19 commits into from
May 17, 2024
Merged

Conversation

wm75
Copy link
Contributor

@wm75 wm75 commented Apr 16, 2024

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

Now here's a list of changes:

  • makes 1 the default for --minimum-length because of the reason described here: cutadapt fixes and improvements #5954 (comment)
  • groups forward and rv filtering and trimming options together, which were previously separated quite a bit in the tool interface
  • groups and orders parameters in a more intuitive way (I believe)
  • fixes the --discard-casava and --max-aer options that previously had no effect
  • fixes the edge case where --minimum_length2 would get ignored when minimum_length wasn't set
  • fixes same issue for --maximum_length2
  • switches the tool to use short options where possible to shorten the generated command line
  • adds support for the "new" --pair-adapters option
  • fixes output filter expressions to not trigger silent errors like:
    dataset output filter (library['type'] == 'paired' and 'multiple_output' not in output_selector) failed: argument of type 'NoneType' is not iterable

@bernt-matthias
Copy link
Contributor

bernt-matthias commented Apr 16, 2024

Would named macro tokens be an option, i.e. using <yield name="xyz"/>?

Then we could insert the read2 parameters only if needed .. and at the correct positions.

--minimum-length=$filter_options.minimum_length:$library.minimum_length2
#else if str($filter_options.minimum_length):
--minimum-length=$filter_options.minimum_length
#if $paired and str($filter_options.minimum_length2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should minimum_length2 be treated the same way as minimum_length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it shouldn't. That part is a bit confusing, but minimum_length2 should normally be unset, in which case the value of minimum_length will also be used for rv reads in PE data.
Only if the user sets it explicitly, it special-cases treatment of rv reads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if a user specifies a value for minimum_length2 we need to use the colon syntax to pass both params. Otherwise, we need to pass only minimum_length unless its set to the command line default anyway.
The old version was buggy because it always ignored minimum_length2 when the user didn't specify minimum_length.

@@ -227,32 +284,19 @@ $read_mod_options.zero_cap
<param argument="--times" type="integer" min="1" value="1" label="Match times" help="Try to remove adapters at most COUNT times. Useful when an adapter gets appended multiple times." />
<param argument="--overlap" type="integer" min="1" value="3" label="Minimum overlap length" help="Minimum overlap length. If the overlap between the adapter and the sequence is shorter than LENGTH, the read is not modified. This reduces the number of bases trimmed purely due to short random adapter matches." />
<param argument="--match-read-wildcards" type="boolean" checked="false" truevalue="--match-read-wildcards" falsevalue="" label="Match wilcards in reads" help="Interpret IUPAC wildcards in reads"/>
<param argument="--no-match-adapter-wildcards" type="boolean" checked="true" truevalue="" falsevalue="--no-match-adapter-wildcards" label="Match wilcards in adapters" help="Interpret IUPAC wildcards in adapters."/>
<param argument="--revcomp" type="boolean" checked="false" truevalue="--revcomp" falsevalue="" label="Look for adapters in the reverse complement" help="Check both the read and its reverse complement for adapter matches. If match is on reverse-complemented version, output that one. Default: check only read." />
<param argument="--no-match-adapter-wildcards" type="boolean" checked="true" truevalue="" falsevalue="-N" label="Match wildcards in adapters" help="Interpret IUPAC wildcards in adapters."/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the long one? Bit confusing, or?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could sure, but the current vversion just produces unnecessary long command lines, which need to be stored by Galaxy. Not sold on using short option names throughout though. @bgruening what's your opinion here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we use argument, it will be part of the help text. From an educational point of view it would be nice to show the same param in the help and in the CLI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. https://galaxy-iuc-standards.readthedocs.io/en/latest/best_practices/tool_xml.html#parameter-name-argument-and-help says that argument should use the long form of the wrapped option, which then means that we would always use the long form on the command line, too.
If that's the consensus, then I'll revert the commit introducing all the short options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

<section name="filter_options" title="Read Filtering Options">
<param argument="--discard-trimmed" type="boolean" checked="false" truevalue="--discard-trimmed" falsevalue="" label="Discard Trimmed Reads" help="Discard reads that contain the adapter instead of trimming them. Use the 'Minimum overlap length' option in order to avoid throwing away too many randomly matching reads!" />
<param argument="--discard_untrimmed" type="boolean" checked="false" truevalue="--discard-untrimmed" falsevalue="" label="Discard Untrimmed Reads" help="Discard reads that do not contain the adapter." />
<param argument="--minimum-length" type="integer" min="0" value="1" label="Minimum length (R1)" help="Discard reads that, after processing, are shorter than LENGTH. Note: You can set this parameter to zero to keep empty reads (with zero-length sequence and quality string) in the output, but some downstream tools may have problems with these. Default: 1" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the default of the command line tool 1? I most of the time go with those defaults .. but I see the argument for not doing so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted :-)

The problem with the 0 default of the command line tool was what triggered this whole PR.
With this setting it's possible that the trimmed output contains empty reads like here:

@prefix:1_13_1259/1
+
@prefix:1_13_1440/1

We learnt this week that such output causes failures in RNAStar and picard Fastq2Sam, and floods the stdout of HISAT2 with warnings. Enough reason for us to change the default.

Sorry, wasn't finished yet with an explanation of the PR and didn't expect so fast a review ;-)

These tests need -m 0, which was the old default for the param.
@wm75
Copy link
Contributor Author

wm75 commented Apr 16, 2024

Would named macro tokens be an option, i.e. using <yield name="xyz"/>?

Then we could insert the read2 parameters only if needed .. and at the correct positions.

That would require a huge macro though that would span like 2/3 of the whole wrapper if I understand you correctly. Would also change the nesting level of tons of params.

Do we want go that far to provide the best user experience?

@wm75
Copy link
Contributor Author

wm75 commented Apr 17, 2024

@bernt-matthias ready for another round of comments/review :-)

wm75 added 3 commits April 23, 2024 12:44
Switch back to use long options where possible to match param argument
values. Also use = as param/value separator because this gives nicer
cutadapt json output.
@bgruening
Copy link
Member

LGTM, @bernt-matthias?

<option value="both">Both: filtering criteria must apply to both reads in order for a read pair to be discarded. </option>
<option value="first">First: will make a decision about the read pair by inspecting whether the filtering criterion applies to the first read, ignoring the second read.</option>

<section name="other_trimming_options" title="Other Read Trimming Options">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to require rechecking every parameter within the section changes in workflows, as we can't track that automatically. Is that worth it ? Not saying it isn't, but it's something that looks like a good amount of work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a point here, which is worth considering.
Personally, I think all this restructuring is worth it because it will make usage of the tool a lot clearer for the average user, but that's just my opinion.
The specific change you've highlighted affects just 3 parameters I think, but, yes, I've moved around quite some others between section as well.
My point would be that it's better to move everything in one update, then hope that we'll be able to keep things constant for a while, than doing lots of incremental changes that each break workflow updates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specific change you've highlighted affects just 3 parameters I think,

Got lost trying to read the diff so that's not true. More like 10 params.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very unlikely that any WF deviates from the default for a lot of them at once, but still you need to check whether that's the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we could do as a compromise is to combine things in the existing "filter_options" section, change its title to sth like "Other trimming and filtering options" and simply reorder things properly within that section so that trimming options come before filtering ones. You're right that that would minimize effort during WF updates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now that I looked at this a bit more carefully: the options that I've moved into that new Other Trimming Options section all come from the Read Modifications section, which really mixed up different concepts.

image

The problem there is that the trimming options (cutting, trimming, shortening) should be presented before the filtering options because trimming also happens before filtering, but the remaining read modification options are really quite exotic and should go last I think. So a split of this section then seems unavoidable.

Currently, the split moves 6 out of 10 options to the new section and leaves 4 in the original one. Also these 6 are more commonly used ones, then the 4 that are left behind.
What I could do is to move the existing section up above filtering and split the 4 more exotic options out into a new bottom section. This way "only" 4 instead of 6 options would have to be rechecked on WF updates.
The downside is that the section that I wanted to name "other_trimming_options" will then be called "read_mod_options", and we'd have to think about a new name for the bottom section.

So al in all, a moderate gain in terms of WF maintainability vs somewhat inadequate internal section names. I'm undecided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what things would look like with the proposed split for comparison:

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do change the parameters I would stick with changing all at once, whether you check 4 or 10 options doesn't make a difference I think.

@bgruening
Copy link
Member

@mvdbeek any feedback to Wolfgangs latest comments?

@bgruening bgruening merged commit a5b6cb4 into galaxyproject:main May 17, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants