Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve WrrocRenderer #33

Open
wants to merge 9 commits into
base: workflow-run-crate
Choose a base branch
from

Conversation

fbartusch
Copy link

Add workflow file to crate.
Correct path to output files in metadata file.
Add CreateActions for tool executions.
Add ControlActions.
Add SoftwareApplications for processes.
Add HowToSteps.
Add agent information (orcid, name) to nextflow.config that is incorporated into metadata file.

Add workflow file to crate.
Correct path to output files in metadata file.
Add CreateActions for tool executions.
Add ControlActions.
Add SoftwareApplications for processes.
Add HowToSteps.
Add agent information (orcid, name) to nextflow.config that is incorporated into metadata file.
@simleo
Copy link

simleo commented Jun 20, 2024

I've installed the plugin with make install and used the following nextflow.config:

plugins {
    id '[email protected]'
}

params {
    outdir = 'results'
}

prov {
    enabled = true
    formats {
        wrroc {
            file = "${params.outdir}/ro-crate-metadata.json"
            overwrite = true
            agent {
                name = "John Doe"
                orcid = "https://orcid.org/0000-0000-0000-0000"
            }
        }
    }
}

I had to pin the plugin's version because otherwise Nextflow downloads nf-prov-1.2.2 and uses that (which fails because it does not support wrroc as a format). Then I ran nextflow run tests/test.nf and got a results RO-Crate. Here are the problems I've found:

  • The CreateAction corresponding to test.nf (#606ce08c-cac1-4777-8a1b-83f497a0c63d) has an "id" property which should be "@id" instead, so the RO-Crate is not valid. I've manually changed that to "@id" in order to continue with the review.

  • The other CreateAction entities have a "result" property that has a nested list, which shouldn't be there. For instance, this:

"result": [
    [
        {
            "@id": "r1.foo.1.txt"
        },
        {
            "@id": "r1.foo.2.txt"
        }
    ]
],

should be:

"result": [
    {
        "@id": "r1.foo.1.txt"
    },
    {
        "@id": "r1.foo.2.txt"
    }
],

I made these changes manually to move on with the review. At this point, Runcrate reads the crate (runcrate report results) without crashing. Here's the output:

action: #606ce08c-cac1-4777-8a1b-83f497a0c63d
  instrument: test.nf (['File', 'SoftwareSourceCode', 'ComputationalWorkflow', 'HowTo'])
  started: 2024-06-20T11:59:14.123266278+02:00
  ended: 2024-06-20T11:59:14.897242324+02:00
  outputs:
    r1.foo.1.txt
    r1.foo.2.txt
    r2.foo.2.txt
    r2.foo.1.txt
    r3.foo.1.txt
    r3.foo.2.txt

action: #71e33a99d3115bb57a7cdfc1b220fbbf
  step: test.nf#main/RNG
  instrument: #Script_923b8caac4a54f39@66e17eff (SoftwareApplication)
  outputs:
    r1.foo.1.txt
    r1.foo.2.txt

action: #a9b6314416b001fb728ff45fa29d6f5e
  step: test.nf#main/RNG
  instrument: #Script_923b8caac4a54f39@66e17eff (SoftwareApplication)
  outputs:
    r2.foo.1.txt
    r2.foo.2.txt

action: #45b19d72e22ba819cf02b88e4977bc86
  step: test.nf#main/RNG
  instrument: #Script_923b8caac4a54f39@66e17eff (SoftwareApplication)
  outputs:
    r3.foo.1.txt
    r3.foo.2.txt
  • The "@id" of ControlAction 3968b3e5-b5c8-4b24-9e03-d118801e589e should be #3968b3e5-b5c8-4b24-9e03-d118801e589e.

  • The ComputationalWorkflow has outdir as an input parameter, but the only parameter you can set when launching the workflow is constant (which is not listed as an input parameter in the RO-Crate metadata). If I try to change the value of outdir from results to crate (either in nextflow.config or on the command line with --outdir crate), files are still output into results; moreover, in this case the ro-crate-metadata.json and test.nf files are missing (they can be found in the crate dir but only if you create it manually before running the workflow).

I've also noticed a weird thing: if you run rm results -rf && nextflow run tests/test.nf (without changing the value of outdir in nextflow.config) repeatedly, sometimes (like 1 in 20 times) the ro-crate-metadata.json file is missing from the results dir.

@fbartusch
Copy link
Author

Thanks for the comprehensive feedback @simleo . The next time I will check if runcrate can parse the document before asking for feedback ...

I fixed most of the issues, but this causes some headache:

sometimes (like 1 in 20 times) the ro-crate-metadata.json file is missing from the results dir.

I can reproduce the problem.
I wrote a script that removes the results directory, reruns the pipeline and checks if ro-crate-metadata.json ist present. It seems that the workflowOutputs mapping is missing entries for some output files.

Next I will use the newest Nextflow version and maybe the newest nf-prov version, maybe this problem is addressed there. For me it doesn't seem that the problem originates from the WrrocRenderer itself.

Use outdir from config as publishDir
Remove nested list for createAction results
Correct controlAction id
@simleo
Copy link

simleo commented Jul 25, 2024

I've run the workflow again with the plugin at ac05da2, using the same nextflow.config. The issues I've reported in my previous comment seem to be solved (the sometimes-missing metadata file problem should be solved after the integration of c92b21f, if I understand correctly). Great job! This is another round of review.

  • if the workflow is run without passing parameters (nextflow run tests/test.nf), the #constant FormalParameter is missing from the RO-Crate metadata (only #outdir is present). If it's run with nextflow run tests/test.nf --constant bar, #constant is present. It should be present in both cases.

  • The CreateAction corresponding to the workfow execution has an empty "object". It should contain references to PropertyValue entities corresponding to the values taken by the parameters during the specific run. For instance, when run with nextflow run tests/test.nf --outdir crate --constant bar, the result should look like this:

{
    "@id": "#outdir",
    "@type": "FormalParameter",
    "name": "outdir",
    ...
},
{
    "@id": "#constant",
    "@type": "FormalParameter",
    "name": "constant",
    ...
},
{
    "@id": "#b5236af9-95e9-42b8-9121-a6a57755ee1b",
    "@type": "CreateAction",
    "instrument": {
        "@id": "test.nf"
    },
    "object": [
        {"@id": "#outdir-pv"},
        {"@id": "#constant-pv"}
    ],
    "result": [
        {"@id": "r3.bar.2.txt"},
        {"@id": "r3.bar.1.txt"},
        {"@id": "r2.bar.1.txt"},
        {"@id": "r2.bar.2.txt"},
        {"@id": "r1.bar.1.txt"},
        {"@id": "r1.bar.2.txt"}
    ]
    ...
},
{
    "@id": "#outdir-pv",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#outdir"},
    "name": "outdir",
    "value": "crate"
},
{
    "@id": "#constant-pv",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#constant"},
    "name": "constant",
    "value": "bar"
}

When run with nextflow run tests/test.nf the outcome should be the same except for the "value" fields of the PropertyValue entities, which should be "results" and "foo" (the defaults). Note that the PropertyValue instances link to the corresponding FormalParameters via exampleOfWork.

  • The three CreateActions corresponding to the executions of RNG should also have an "object" pointing to PropertyValue instances, which in turn should be connected via exampleOfWork to FormalParameters of the SoftwareApplication corresponding to RNG.

  • Since the output RO-Crate conforms to the Workflow RO-Crate profile, its Root Data Entity must have a "license". This can be specified in nextflow.config, like agent.

  • nextflow.config should be included in the output RO-Crate. In the metadata file, it should be linked via object from the OrganizeAction, as specified in the Provenance Run Crate profile.

Add license information according to Workflow RO-Crate
Handle parameters set in workflow
Add PropertyValue for FormalParameters
Add PropertyValue to object field for CreateAction of workflow
Add nextflow_schema.json to outdir if present
@fbartusch
Copy link
Author

First, thank you for the review @simleo.

  • Parameters are now added to RO-Crate metadata regardless if they are specified in the configuration or inside Nextflow's workflow files. This was simple, just use session.getBinding().getParams() this should return all parameters that are active during workflow execution.

  • CreateAction of the workflow references now PropertyValues via object. PropertyValue link to the FormalParameter via exampleOfWork.

  • If a license is specified in the configuration (e.g. license = "Apache-2.0") it's added to the Root Data Entity. If the specified string is a valid URL it's added with "license": "{@ id: < URL >}". But the CreativeWork entry for the URL of the license is missing in the RO-Crate metadata. Is this needed or does the URL is enough? If yes I try to add it.

  • I started to think about the nextflow.config thing. There several places where the configuration can reside. The documentation explains it . There can be four different configuration files that take precedence in a specific order. Maybe it doesn't suffice to just copy all the files into the crate, because how can we decide which one takes precedence over the others? I'm thinking about compiling one 'effective' configuration file containing the values that are valid during workflow exeuction and put that one into the crate.

  • I still have to do the PropertyValue stuff to the three CreateActions corresponding to the RNG executions.

@bentsherman
Copy link
Member

session.config should contain the resolved configuration. it also contains the resolved params as session.config.params

@simleo
Copy link

simleo commented Aug 29, 2024

I've tested the latest version: big improvements! As reported above, the main thing that's missing now is completing the information related to the CreateActions corresponding to the execution of RNG. A remark on this version:

  • When the license is not specified in nextflow.config, the root data entity is populated with a "license": "null" property. In this case there should not be any "license" property at all instead.

@simleo
Copy link

simleo commented Aug 29, 2024

  • If a license is specified in the configuration (e.g. license = "Apache-2.0") it's added to the Root Data Entity. If the specified string is a valid URL it's added with "license": "{@ id: < URL >}". But the CreativeWork entry for the URL of the license is missing in the RO-Crate metadata. Is this needed or does the URL is enough? If yes I try to add it.

If a "license": {"@id": "< URL >"} property is added to the root data entity, then it would be appropriate to add an entity with that id to the crate. For instance:

{
    "@id": "./",
    "@type": "Dataset",
    "license": {
        "@id": "https://spdx.org/licenses/Apache-2.0"
    },
    ...
},
{
    "@id": "https://spdx.org/licenses/Apache-2.0",
    "@type": "CreativeWork"
}

* Add objects to CreateActions
* Add license entity if license information in nextflow.config is a valid URL
* Fix bug: Don't add unpublished files to CreateAction results. Causes NullPointerException, as they aren't contained in workflowOutputs map
* (Hopefully) all input/output files are now copied into the crate
* Add Intermediate results ro-crate-metadata.json
* Add nextflow.config to ro-crate-metadata.json
* Add datePublished to Dataset
* valid RO-Crate according to rocrate-validator

Signed-off-by: fbartusch <[email protected]>
* Change HowToStep position to string
* Remove nextflow.config object from OrganizeAction
* add author to root data entity
* passes roc-validator with severity REQUIRED

Signed-off-by: fbartusch <[email protected]>
@fbartusch
Copy link
Author

I tested the created RO-Crate with the new 0.4.0 version of rocrate-validator and fixed the last requirements, @simleo.
Added datePublished to Datasetand author to the root data entity. You can find the newest metadata file for the nf-prov example workflow here.

rocrate-validator validate --requirement-severity REQUIRED --profile-identifier provenance-run-crate-0.5 --no-fail-fast --output-format json --output-file /tmp/nf-prov/1.1.0/ro-crate-validation.json  <crate>

{
    "rocrate": "/tmp/nf-prov/1.1.0/demo/",
    "validation_settings": {
        "data_path": "/tmp/nf-prov/1.1.0/demo/",
        "profiles_path": "/home/felix/miniconda3/envs/roc-validator_python3.9_env/lib/python3.9/site-packages/rocrate_validator/profiles",
        "profile_identifier": "provenance-run-crate-0.5",
        "inherit_profiles": true,
        "requirement_severity": "REQUIRED",
        "abort_on_first": false
    },
    "passed": true,
    "issues": []
}

I ran rocrate-validator with --requirement-severity RECOMMENDED and get a very long list of of required and recommended things I should add. I'm working on them now.

@simleo
Copy link

simleo commented Oct 28, 2024

I ran rocrate-validator with --requirement-severity RECOMMENDED and get a very long list of of required and recommended things I should add. I'm working on them now.

Hold on, @fbartusch, there seems to be a problem with the JSON output format. It looks like it's reporting as REQUIRED some issues that actually correspond to RECOMMENDED shapes. @kikkomep is investigating the problem.

@kikkomep
Copy link

I ran rocrate-validator with --requirement-severity RECOMMENDED and get a very long list of of required and recommended things I should add. I'm working on them now.

Hold on, @fbartusch, there seems to be a problem with the JSON output format. It looks like it's reporting as REQUIRED some issues that actually correspond to RECOMMENDED shapes. @kikkomep is investigating the problem.

The issue has been fixed in the latest release, version 0.4.2

Comment on lines 273 to 282
final propertyValues = params
.collect { name, value ->
[
"@id" : "#${name}-pv",
"@type" : "PropertyValue",
"exampleOfWork": ["@id": "#${name}"],
"name" : name,
"value" : value
]
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During our Hackathon the following issue came up:
Most nf-core pipelines deal with reference genome options (mainly igenomes). The way the #genome-pv section of the final ro-crate json is created leads to the whole rendered igenomes.config (example here) to be added as value which makes "value" : {} nested and thus leads to an invalid crate. For our minimal example we just removed the reference genome option which then led to a non-nested crate again.

Our idea was to catch the genomes-pv case with an if statement, then copy the rendered igenomes.config to the results directory and point the crate value entry to that file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for identifying this problem. It didn’t occur during my tests with the demo pipeline. Catching the genomes-pv case might help for nf-core workflows that use iGenomes, but it doesn’t address the underlying issue. Off the top of my head, I’d suggest putting the entire content into one big string and storing it as a value. I’ll try that this afternoon and check if the validator accepts it.

Copy link

@famosab famosab Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We removed the igenomes reference from our small test pipeline for now. But there has to be a more elegant way :) I also tried some stuff on top of what you did and expanded the ro-crate (see #37).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran rnaseq 3.14.0 on our cluster (where we have igenomes) and validated ro-crate-metadata.json with the newest validator 0.4.5. The validator does not complain about the value in the #genome-pv section.
I pasted the section in a JSON validator and it's also valid JSON.
Can you upload your problematic RO-Crate somewhere and tell me which rocrate-validator version you used?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mburridge96 I think you have the better overview about the issue with the nested crates. Maybe you can explain :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whilst it's valid JSON, the RO-Crate spec requires every object within the @graph array to be flat. So every object that's in the array must have an "@id", "@type" and then any additional key:value pairs. If nested data is required, you then point to another object that describes it.

So below is valid:

"@graph" : [{
"@id" : "URI/#unique_local",
"@type": "schema_type", 
"name":  "example",
"description": "description of example",
"about": {"@id": "linktoabout"},
}, {
"@id": "linktoabout",
"@type": "schema_type"
}]

but then the following is not:

"@graph" : [{
"@id" : "URI/#unique_local",
"@type": "schema_type", 
"name":  "example",
"description": "description of example",
"about": {
"@id": "linktoabout",
"@type": "schema_type"
}}]

The issue is if we dump everything in the value key it won't conform to the RO-Crate spec and I don't think (could be wrong) that the validator yet checks for that yet which is another problem in itself.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@famosab & @mburridge96 I think I fixed this in 4d22b84.
I'm checking if the value is a Groovy List or Map and save it as String instead of nested JSON.
It works with the latest validator.

* Check for nested PropertyValues
* If nested, serialize it to JSON and write the serialization as one string into the PropertyValue value field.
* current rocrate-validator (0.4.6) is happy with this solution :)
* Add organization to ro-crate-metadata.json and to agent

Signed-off-by: fbartusch <[email protected]>
* Add publisher option to configuration
* Add the publisher to the ro-crate-metadata.json if the ID corresponds to an organization or agent

Signed-off-by: fbartusch <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants