Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[proposal] Move some publishDir attributes to process-level directives #4661

Closed
ewels opened this issue Jan 16, 2024 · 13 comments
Closed

[proposal] Move some publishDir attributes to process-level directives #4661

ewels opened this issue Jan 16, 2024 · 13 comments

Comments

@ewels
Copy link
Member

ewels commented Jan 16, 2024

Based on a comment originally posted in #4205 (comment)

publishDir supports multiple attributes that control file publishing behaviour. The problem with having them structured this way is that if any need to be overwritten, then entire publishDir statement has to be repeated. This leads to some attributes being repeated verbatim many times in pipeline configs.

This issue is to suggest moving attributes out to new directives, so that they can be set to a single value for the entire pipeline. This reduces duplicated configuration code. Could be all current attributes, but main priorities (most duplicated) are:

  • path (ideally split into two, with "base" and current path attribute)
  • mode
  • saveAs

If possible, this would be under a publishDir scope, though I don't know if this is possible technically.

So ideally, it'd be something like taking the following example:

publishDir = [
    path: { "${params.outdir}/genome" },
    mode: params.publish_dir_mode,
    saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
    enabled: params.save_reference
]

And it becoming:

// Defaults for all processes
process {
    publishDir { 
        base = params.outdir
        mode = params.publish_dir_mode
        saveAs = { filename -> filename.equals('versions.yml') ? null : filename }
    }
}
// Specific process overrides
process withName: "trimData" {
    publishDir.path: "trimmed"
}
process withName: "prepGenome" {
    publishDir.path: "genome",
    publishDir.enabled: params.save_reference
}
@bentsherman
Copy link
Member

publishDir defaults are tricky because it is a list of publish "commands" that can target specific outputs. I would suggest something like this:

// process directive
process FASTQC {
    publish 'results', mode: 'copy', saveAs: { ... }

    // ...
}

// config option
process {
    publish.path = 'results'
    publish.mode = 'copy'
    publish.saveAs = { ... }
}

It is essentially giving up the ability to target specific outputs differently, instead the setting would apply to the entire process. It seems to be what people do almost every time, but I would like to investigate it more.

But I do suggest making a new process directive, even something as simple as publishDir -> publish, so that users have to make an explicit transition, because this new directive would be structured differently in the runtime.

@ewels
Copy link
Member Author

ewels commented Jan 17, 2024

It is essentially giving up the ability to target specific outputs differently, instead the setting would apply to the entire process.

I don't really follow what the difference is here.. You mean that with the current publishDir syntax, the different options can be set on a per-file basis somehow? Is that not what you're still doing with the saveAs squiggly brackets in the example above?

But I do suggest making a new process directive,

Yup, absolutely - good point 👍🏻

@bentsherman
Copy link
Member

I was thinking of the pattern option which allows you to select which outputs to publish. But I guess you're right that you can do the same thing with saveAs.

@pditommaso
Copy link
Member

pditommaso commented Jan 17, 2024

it's a -1 for me, we are going to deprecate publishDir and keep these info in the (module) output schema

@bentsherman
Copy link
Member

@pditommaso do you think the publish mode i.e. copy vs symlink should be defined in the output schema?

@pditommaso
Copy link
Member

I'd see a top level (default) setting, there's no really need to configure task by task

@bentsherman
Copy link
Member

I could live with that, although I have a feeling there are users out there who are setting it differently per process

@ewels
Copy link
Member Author

ewels commented Jan 17, 2024

Definitely, I know of at least one use case for having mode process specific: some tools hate symlinks so you might want to set mode to copy for a single process (not the whole pipeline as that wastes disk space).

The other option in that issue is enabled - this is nearly always process-specific. For example, a process might be preparing a reference genome index - you don't want to save this as an output by default, but users on HPC might want to keep it so they set --save_reference_genome which selectively sets enabled back on again for those processes.

Setting them as defaults at top-level that can be overwritten with process-specific config would be great though. That's basically what this issue is all about ☝🏻 But when it's a single directive, I don't think that individual attributes can be set as default / overwritten like that, which is why I was proposing to split them up. Am I wrong about that?

@marcodelapierre
Copy link
Member

marcodelapierre commented Feb 19, 2024

it's a -1 for me, we are going to deprecate publishDir and keep these info in the (module) output schema

I would see value in also having an upgraded publishDir like configuration, aside of the schema, rather than deprecating completely. As such, I like Ben's suggestion above for instance.

My impression is that having the schema as the sole way to specify outputs (and inputs) can constitute a barrier to adoption for new users. Beginners and simple pipelines would benefit greatly from having a fall-back publishDir-like option . Thoughts?

@pditommaso
Copy link
Member

@ewels My take is that these directives are needed because you still continue to think in publishDir model in which the pipeline output is defined at process level and it was designed pre-DLS2.

We need to move away from this model where the workflow output is buried in the process definition and then there's a huge config file to control all these settings.

I don't exclude that some of these settings can become a directive, but we have to see case by case to address were specific needs.

@marcodelapierre good point, it should not definitely become a mandatory requirement to create a schema file to be able to have your workflow writing some output file. I agree it would create too much friction.

@ewels
Copy link
Member Author

ewels commented Feb 21, 2024

Agreed - the above proposal would basically be a stop-gap: hopefully quite easy to implement / a minor change with immediate benefit. Then it can be replaced by a new system when it's ready.

@pditommaso
Copy link
Member

A stop-gap is foverever .. We need to move fast with #4670, instead.

@bentsherman
Copy link
Member

Closing in favor of #4784

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants