[DOC] Improve documentation for multiple: true #752

Open
DriesSchaumont opened this issue Jul 24, 2024 · 0 comments

DriesSchaumont commented Jul 24, 2024

The documentation for multiple: true in the upcoming release candidate is quite confusing:
[Screenshot: unnamed (1)]
(Note the multiple_sep values: ;, : and , are all being used.)

Additionally, the documentation could clarify the expected behavior and intended use of multiple: true in combination with direction: output.

a) I've noticed some confusion about the expected logic when comparing multiple: false to multiple: true (with respect to direction: output). For multiple: false, viash expects the component's code to create a file with that exact name (e.g. when using --output_fastq a.fastq, a file a.fastq should be created). For multiple: true, there are two options:

1. The component creates a number of files, and the value provided to the argument only acts as a mask/sieve to capture the desired output files. Here, we expect the user to know the format of the output files that are created. If the user passes --output_fastq *.fa but the component returns *.fastq files, that is a problem. It gets more complicated when the generated files depend on other input (e.g. the component generates *.fastq.gz files when --compression gzip is used, but *.fastq when compression is not enabled). So the user either needs to look at the internals of the component to know which glob to specify, or this needs to be really well documented per component.

2. The component takes the value provided by the user (e.g. *.fastq or *.fq) and creates output files that match the provided value. The benefit is that this corresponds to the behavior of multiple: false. The downside is that the implementer of the component is responsible for filling in the wildcard and creating files with the correct names. This is no different from multiple: false (where the correct output file must also be created), but it requires extra logic from the developer, and viash cannot check whether the correct number of files has been created. However, the question becomes: if we want to convert the glob value provided on the command line to filenames within the script code, what can we expect from the glob pattern that is provided to the script (for example, which wildcard characters are taken into account)?

This ties into the next question (see b) below).
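To make the mismatch in option 1 concrete, here is a minimal Python sketch. It is not viash code: the file names and the capture_outputs helper are hypothetical, and Python's fnmatch only stands in for whatever matching viash actually performs.

```python
from fnmatch import fnmatch

def capture_outputs(created_files, pattern):
    """Return the created files whose names match the user-supplied pattern.

    Hypothetical helper: it only illustrates the "mask/sieve" idea.
    """
    return [name for name in created_files if fnmatch(name, pattern)]

# Assumption: the component wrote these two files.
created = ["sample1.fastq", "sample2.fastq"]

# The user guessed the right extension: both files are captured.
matched_ok = capture_outputs(created, "*.fastq")

# The user guessed "*.fa": nothing is captured, although files were created.
matched_wrong = capture_outputs(created, "*.fa")
```

In the second call the component did its job, yet no output is captured, which is exactly the situation a user cannot easily diagnose without knowing the component's internals.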

When I look at the two options above, I am in favour of recommending option 2, because it most closely resembles the behaviour expected with multiple: false: the names of the output files reflect the value provided for the argument. However, if I am not mistaken, there is no real way to validate that this is being done, because it is script logic (i.e. it is up to the developer to do this). We could, however, choose an option, document a recommendation, and apply that recommendation in, for example, biobox.
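As a rough illustration of option 2, the script inside a component could fill in the wildcard itself. The sketch below is only one possible convention, not an existing viash feature: fill_wildcard is a hypothetical helper, and it assumes the pattern contains exactly one * and that the sample names are known to the script.

```python
def fill_wildcard(pattern: str, name: str) -> str:
    """Replace the single '*' in the user-supplied pattern with a name.

    Hypothetical helper; assumes '*' is the only wildcard character.
    """
    if pattern.count("*") != 1:
        raise ValueError(f"expected exactly one '*' in pattern: {pattern!r}")
    return pattern.replace("*", name)

# Assumption: the user passed --output_fastq 'out/*.fastq' and the
# component produces one file per sample.
samples = ["sample1", "sample2"]
output_paths = [fill_wildcard("out/*.fastq", sample) for sample in samples]
```

With this convention the created file names always reflect the provided value, as they do for multiple: false, but the burden of doing so correctly stays with the component developer.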

Perhaps a bit of background on why I am bringing this up: I have seen people assume that something like --fastq_output a.fastq;b.fastq can be used, only to be presented with an error message stating that a wildcard character must be used. Of course, this format cannot be used, because it does not generalize when the number of output files is variable. The confusion probably originates from familiarity with arguments with direction: input. I think the error message might benefit from a more elaborate explanation, because it triggers the follow-up questions: why is the wildcard needed, and how do I choose a correct value for it? Answering the latter question is not easy, for the reasons outlined above. One additional option that sprang to mind (just leaving it here as a mental note) is to introduce a different type for the argument (something like type: glob), just to indicate to the user that the behavior of direction: output with multiple: true really is different from all the other variants of type: file.

b) Currently, the only wildcard character that is checked in the code is * (see BashWrapper and Nextflow). If this is intended, I think we should make it explicit in the documentation, for example by rewording "a wildcard character" to "the wildcard character '*'". This way, the code for the component can also work on the assumption that only the * character should be interpreted as a wildcard (and not ?, [] or other bash globs).
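If the documentation did state that only * is a wildcard, component code could match or validate names accordingly. The following Python sketch shows one way to do that; star_only_regex and star_only_match are hypothetical names, and the choice that * does not match a path separator is an assumption borrowed from bash, not confirmed viash behavior.

```python
import re

def star_only_regex(pattern: str) -> re.Pattern:
    """Compile a pattern in which '*' is the only wildcard.

    Every other character, including '?' and '[', is matched literally.
    Assumption: '*' matches any run of characters except '/'.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("[^/]*".join(literal_parts))

def star_only_match(pattern: str, path: str) -> bool:
    """Return True if the whole path matches the star-only pattern."""
    return star_only_regex(pattern).fullmatch(path) is not None

matches_star = star_only_match("*.fastq", "a.fastq")
# '?' is literal here, so it does not match the digit '1'.
question_is_literal = not star_only_match("sample_?.fastq", "sample_1.fastq")
```

A library such as fnmatch would also treat ? and [] as wildcards, so a star-only rule needs a dedicated translation like this one.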
