pauldg
Contributor

@pauldg pauldg commented Jan 19, 2023

Example Galaxy Workflow Run RO-Crate that includes Galaxy features: collections and parameters

@simleo
Collaborator

simleo commented Jan 23, 2023

The CreateAction needs to be linked to from the root data entity via mentions:

{
    "@id": "./",
    "@type": "Dataset",
    "mentions": {"@id": "#b91b07ec-5752-465d-a0c4-912e0312abc0"},
    ...
}
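
A generator could check this link automatically. The following is a minimal Python sketch, not part of the reviewed code: the function names are hypothetical, and it assumes the root data entity has the @id "./" (strictly, it should be located through the metadata descriptor's about property).

```python
def as_list(value):
    """JSON-LD allows a single value or a list; always return a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def unmentioned_actions(metadata):
    """Return the @id of every CreateAction that the root dataset does not mention."""
    graph = metadata["@graph"]
    # Assumption: the root data entity is the one with @id "./".
    root = next(e for e in graph if e.get("@id") == "./")
    mentioned = {m["@id"] for m in as_list(root.get("mentions"))}
    return [
        e["@id"]
        for e in graph
        if "CreateAction" in as_list(e.get("@type")) and e["@id"] not in mentioned
    ]


crate = {
    "@graph": [
        {"@id": "./", "@type": "Dataset"},
        {"@id": "#b91b07ec-5752-465d-a0c4-912e0312abc0", "@type": "CreateAction"},
    ]
}
print(unmentioned_actions(crate))  # ['#b91b07ec-5752-465d-a0c4-912e0312abc0']
```

An empty result means every CreateAction is reachable from the root data entity.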

@simleo
Collaborator

simleo commented Jan 23, 2023

The CreateAction has no startTime or endTime; it should have at least an endTime. Is this info available somehow? E.g., latest output creation date.

{
"@id": "#lineNum-param",
"@type": "FormalParameter",
"additionalType": "None",
Collaborator

This should be "Integer". If it's not straightforward to convert from the Galaxy type, better not add the "additionalType" property at all ("None" is not in the RO-Crate context).

{
"@id": "#advanced-param",
"@type": "FormalParameter",
"additionalType": "None",
Collaborator

This should probably be "Text" (looking at the value in the corresponding pv). If it's not straightforward to convert from the Galaxy type, better not add the "additionalType" property at all ("None" is not in the RO-Crate context).

"valueRequired": true
},
{
"@id": "urn:uuid:eabc423a-227e-4096-8e14-74d0088c8ef9",
Collaborator

This needs to be changed to "#eabc423a-227e-4096-8e14-74d0088c8ef9", same for other IDs like this one
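
One way to apply this rename consistently is to rewrite every "@id" value in the metadata, so that entity definitions and the references to them change together. A minimal Python sketch follows; the helper name localize_ids is hypothetical, and it assumes all urn:uuid: identifiers in the file are meant to become local "#" identifiers.

```python
PREFIX = "urn:uuid:"


def localize_ids(node):
    """Recursively replace "urn:uuid:<x>" with "#<x>" in every "@id" value."""
    if isinstance(node, list):
        return [localize_ids(item) for item in node]
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if key == "@id" and isinstance(value, str) and value.startswith(PREFIX):
                # Same UUID, but expressed as a local identifier.
                result[key] = "#" + value[len(PREFIX):]
            else:
                result[key] = localize_ids(value)
        return result
    return node


graph = [
    {"@id": "urn:uuid:eabc423a-227e-4096-8e14-74d0088c8ef9", "@type": "PropertyValue"},
    {"@id": "./", "mentions": [{"@id": "urn:uuid:eabc423a-227e-4096-8e14-74d0088c8ef9"}]},
]
print(localize_ids(graph)[0]["@id"])  # #eabc423a-227e-4096-8e14-74d0088c8ef9
```

Because the function returns a new structure instead of modifying its input, the original metadata stays available for comparison.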

@stain
Contributor

stain commented Feb 22, 2023

conformsTo should also be expanded to match the now-released profile at https://www.researchobject.org/workflow-run-crate/profiles/workflow_run_crate

        "@id": "./",
        "@type": "Dataset",
        "conformsTo": [
            {"@id": "https://w3id.org/ro/wfrun/process/0.1"},
            {"@id": "https://w3id.org/ro/wfrun/workflow/0.1"},
            {"@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0"}
        ],

and their static contextual entities:

    {   "@id": "https://w3id.org/ro/wfrun/process/0.1",
        "@type": "CreativeWork",
        "name": "Process Run Crate",
        "version": "0.1"
    },
    {   "@id": "https://w3id.org/ro/wfrun/workflow/0.1",
        "@type": "CreativeWork",
        "name": "Workflow Run Crate",
        "version": "0.1"
    },
    {   "@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0",
        "@type": "CreativeWork",
        "name": "Workflow RO-Crate",
        "version": "1.0"
    },

@stain
Contributor

stain commented Feb 22, 2023

The various _attrs.txt files seem useful for Galaxy debugging, but they don't appear in the RO-Crate metadata JSON, so it's a bit cryptic what they are for or what they relate to.

They seem to be JSON files but have the .txt extension, so the metadata can use encodingFormat to explain that. Ideally, they can also link to their own conformsTo if there is some documentation about each.

"additionalType": "None",
"description": "select 3 lines",
"name": "select lines parameter",
"valueRequired": true
Collaborator

@simleo simleo Feb 22, 2023

This should be "valueRequired": "True", i.e., pointing to True (same for other occurrences)

@pauldg
Contributor Author

pauldg commented Mar 2, 2023

I'm not sure if there's a way to reply with the updated output comment by comment?
In any case, please let me know if these changes are adequate.

@simleo
Collaborator

simleo commented Mar 3, 2023

please let me know if these changes are adequate

The changes look good! However, there's another problem I hadn't noticed before: the representation of inputs and outputs does not match the workflow's structure. The workflow takes two collections and merges them into a single one, then concatenates the datasets from the merged collection into a single dataset and finally selects some lines from the concatenated dataset. The input parameters of the workflow should be the two input collections and the parameter that controls the number of lines for the final selection (plus the advanced parameter, which seems to control the handling of merge conflicts), while the output should be the file containing the selected lines. The current metadata file has individual files from the collections as inputs instead; also, inputs include the concatenated dataset, which is an intermediate output.

The workflow's input and output should look like this:

{
    ...
    "input": [
        {"@id": "#lineNum-param"},
        {"@id": "#advanced-param"},
        {"@id": "#collection1-param"},
        {"@id": "#collection2-param"}
    ],
    "output": [
        {"@id": "#4a0f4078-5aff-4e02-9f9c-4ad510050e54"}
    ],
    ...
},
{
    "@id": "#collection1-param",
    "@type": "FormalParameter",
    "additionalType": "Collection",
    "name": "collection 1"
},
{
    "@id": "#collection2-param",
    "@type": "FormalParameter",
    "additionalType": "Collection",
    "name": "collection 2"
}

The action's object and result should look like this:

{
    ...
    "object": [
        {"@id": "#lineNum-pv"},
        {"@id": "#advanced-pv"},
        {"@id": "#dataset_collection-11"},
        {"@id": "#dataset_collection-10"}
    ],
    "result": [
        {"@id": "datasets/Select_first_on_data_48_49.txt"}
    ],
    ...
}

with #dataset_collection-11 pointing to #collection1-param via exampleOfWork, and similar for the other collection. Individual input files should not have an exampleOfWork (they participate as members of their collections). The intermediate merged collection should not be in the crate at all.

That was the main thing. I've also found two minor issues:

  • The collections should not have the "list" additionalType. That's not understood in the RO-Crate context. If there is a URI that leads to a description of the list type in Galaxy, that would be a good value. If not, better remove the additionalType entry.
  • The abstract CWL version of the workflow needs to be referred to from the main workflow via subjectOf

I have pushed here the full expected metadata file with all the changes.

@pauldg
Contributor Author

pauldg commented Mar 6, 2023

Thanks Simone, that clears up some of the details of the format. I'll continue with it.

@pauldg
Contributor Author

pauldg commented Mar 19, 2023

I've made further updates to the code addressing the required changes. Unfortunately the diff with the previous version is a bit difficult to make since I moved around some parts.

A few things to note:

  • The intermediary collection and the concatenated collection are both defined in the workflow as outputs, and thus they are listed as outputs. So far I have seen this to be the case for all intermediary outputs in Galaxy workflows, but perhaps there is an option to exclude intermediary outputs from the final outputs of the workflow.
  • The .gxwf.yml is the new standard representation for Galaxy workflows, so I've made that the main entity, and the CWL representation is connected to it using subjectOf.
  • The advanced parameter, which controls the handling of merge conflicts, is a tool parameter rather than a workflow parameter, which means that I'm not able to provide a different value for it when (re-)running the workflow. It became "hardcoded" in the workflow definition when I created the workflow, so the only way to change its value would be to change the workflow definition. For the num_lines_param, on the other hand, I have enabled this to be a workflow parameter, so I can provide a different value for it every time I rerun the workflow (the value is then provided to the tool at runtime). The question is thus whether the advanced parameter should be included in the RO-Crate at all.

@simleo
Collaborator

simleo commented Mar 20, 2023

The intermediary collection and the concatenated collection are both defined in the workflow as an output and thus they are listed as outputs.

OK. If they are workflow outputs for Galaxy, they have to be listed as workflow outputs in the RO-Crate as well.

Regarding the "advanced" parameter, if it's not enabled as a workflow parameter then it should not be included in the RO-Crate. However, I have some comments regarding its representation, which would become relevant in those cases where such parameters would have to be included. In the current version of the example, the PropertyValue is:

{
    "@id": "#advanced-pv",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#advanced-param"},
    "name": "merge collections tool PropertyValue",
    "value": {
        "conflict": {
            "__current_case__": 0,
            "duplicate_options": "suffix_conflict",
            "suffix_pattern": "_#"
        }
    }
}

I.e., the value has been inserted as JSON and merged with the overall JSON structure, making the RO-Crate invalid. In the previous version, instead, the value was inserted as a string, which is OK:

    "value": "{\"conflict\": {\"__current_case__\": 0, \"duplicate_options\": \"suffix_conflict\", \"suffix_pattern\": \"_#\"}}"
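
The string form can be produced with the standard json module. The following is a minimal Python sketch of that step, not the generator's actual code: make_property_value is a hypothetical helper, and it assumes structured tool state arrives as a Python dict or list.

```python
import json


def make_property_value(entity_id, name, param_id, value):
    """Build a PropertyValue entity whose "value" is never a nested JSON object."""
    if isinstance(value, (dict, list)):
        # Structured Galaxy tool state: store it as a JSON-encoded string so it
        # is not merged into the surrounding JSON-LD structure.
        value = json.dumps(value)
    return {
        "@id": entity_id,
        "@type": "PropertyValue",
        "exampleOfWork": {"@id": param_id},
        "name": name,
        "value": value,
    }


tool_state = {
    "conflict": {
        "__current_case__": 0,
        "duplicate_options": "suffix_conflict",
        "suffix_pattern": "_#",
    }
}
pv = make_property_value(
    "#advanced-pv", "merge collections tool PropertyValue", "#advanced-param", tool_state
)
print(pv["value"])
```

A consumer that needs the structure back can call json.loads on the value.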

Also, I think that "advanced" refers to the whole set of "hidden" parameters in the Galaxy interface, and there could be more than one. So the parameter should actually be called "conflict", leading to something like:

{
    "@id": "#conflict-pv",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#conflict-param"},
    "name": "conflict",
    "value": "{\"__current_case__\": 0, \"duplicate_options\": \"suffix_conflict\", \"suffix_pattern\": \"_#\"}"
},
{
    "@id": "#conflict-param",
    "@type": "FormalParameter",
    "additionalType": "Text",
    "name": "conflict",
    "valueRequired": "False"
},

But, again, in this specific case the parameter should not be included.

Here's a list of issues I've found in the current version of the example:

  • Some exampleOfWork links are broken because they are missing the leading hash mark. For instance, dataset_collection-10-param should be #dataset_collection-10-param.
  • There's a reference to #num_lines_param-param, but the entity is not in the crate
  • The additionalType for formal parameters corresponding to collections should be Collection
  • There's a duplicate reference to datasets/hello_33.txt in #dataset_collection-13
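
The first two kinds of problem (a missing hash mark and a reference to an entity that is not in the crate) both show up as references that do not resolve. A minimal Python sketch of such a check follows. The function names are hypothetical, the sample entities are illustrative rather than copied from the example crate, and the sketch assumes every reference should resolve to an entity in @graph, so references to external URIs without a contextual entity would also be reported.

```python
def iter_refs(node):
    """Yield every {"@id": ...} reference nested inside a property value."""
    if isinstance(node, dict):
        if set(node) == {"@id"}:
            yield node["@id"]
        else:
            for value in node.values():
                yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def dangling_refs(metadata):
    """Return (entity id, property, target) for references that do not resolve."""
    graph = metadata["@graph"]
    defined = {entity["@id"] for entity in graph}
    problems = []
    for entity in graph:
        for key, value in entity.items():
            if key.startswith("@"):
                continue  # skip JSON-LD keywords such as @id and @type
            for ref in iter_refs(value):
                if ref not in defined:
                    problems.append((entity["@id"], key, ref))
    return problems


# Illustrative entities: one link lacks the leading "#", one target is absent.
sample = {
    "@graph": [
        {"@id": "#dataset_collection-10-param", "@type": "FormalParameter"},
        {"@id": "#dataset_collection-10", "@type": "Collection",
         "exampleOfWork": {"@id": "dataset_collection-10-param"}},
        {"@id": "#lineNum-pv", "@type": "PropertyValue",
         "exampleOfWork": {"@id": "#num_lines_param-param"}},
    ]
}
for problem in dangling_refs(sample):
    print(problem)
```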

I've pushed the expected metadata file according to the above changes here.

@pauldg
Contributor Author

pauldg commented Mar 24, 2023

I agree with all the changes; the only one I have doubts about is this:

  • There's a duplicate reference to datasets/hello_33.txt in #dataset_collection-13

This is intended since the two input collections reference the same input dataset and the input datasets use the filename as the id:

{
    "@id": "#dataset_collection-11",
    "@type": "Collection",
    "hasPart": [
        {"@id": "datasets/hello_33.txt"},
        {"@id": "datasets/world_34.txt"}
    ]
},

and

{
    "@id": "#dataset_collection-10",
    "@type": "Collection",
    "hasPart": [
        {"@id": "datasets/hello_33.txt"},
        {"@id": "datasets/universe_31.txt"}
    ]
},

@simleo
Collaborator

simleo commented Mar 27, 2023

This is intended since the two input collections reference the same input dataset and the input datasets use the filename as the id

Yes, but when you merge the collections, only one of the datasets with a repeated name is included in the merged collection. This is explained here (in the "Merge collections" subsection). The same holds from the RO-Crate metadata file's standpoint, where duplicate entries do not make sense: although multiple values are represented as JSON lists, their JSON-LD semantics are essentially those of sets.
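
A generator can enforce this by dropping repeated references when it builds hasPart. The following is a minimal Python sketch with a hypothetical helper name; it keeps the first occurrence of each @id and preserves the original order, which is an assumption about the desired behaviour, not something the profile prescribes.

```python
def dedupe_refs(refs):
    """Drop repeated {"@id": ...} references, keeping the first occurrence."""
    seen = set()
    unique = []
    for ref in refs:
        if ref["@id"] not in seen:
            seen.add(ref["@id"])
            unique.append(ref)
    return unique


merged = [
    {"@id": "datasets/hello_33.txt"},
    {"@id": "datasets/world_34.txt"},
    {"@id": "datasets/hello_33.txt"},
    {"@id": "datasets/universe_31.txt"},
]
for ref in dedupe_refs(merged):
    print(ref["@id"])
```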

@pauldg
Contributor Author

pauldg commented Mar 27, 2023

Updated the code base to address the required changes.

About the merged collections: in the example workflow, the collections are merged using the advanced parameter that handles conflicts (see the screenshot below). So there are two references to one dataset, but the two elements of the collection do receive a unique "element identifier". Should this be expressed somehow in the RO-Crate metadata?

image

@simleo
Collaborator

simleo commented Mar 27, 2023

About the merged collections: in the example workflow, the collections are merged using the advanced parameter that handles conflicts (see the screenshot below). So there are two references to one dataset, but the two elements of the collection do receive a unique "element identifier". Should this be expressed somehow in the RO-Crate metadata?

Since the selected conflict handler appends suffixes to conflicted element identifiers, the same can be done in the RO-Crate: the generator can add two copies of datasets/hello_33.txt to the crate, named datasets/hello_33_1.txt and datasets/hello_33_2.txt, then the hasPart of the merged collection can be:

"hasPart": [
    {
        "@id": "datasets/hello_33_1.txt"
    },
    {
        "@id": "datasets/world_34.txt"
    },
    {
        "@id": "datasets/hello_33_2.txt"
    },
    {
        "@id": "datasets/universe_31.txt"
    }
]

@simleo
Collaborator

simleo commented Mar 28, 2023

An alternative is to slightly change the example workflow to use the default conflict resolution, which keeps only one copy.

@pauldg
Contributor Author

pauldg commented Mar 28, 2023

An alternative is to slightly change the example workflow to use the default conflict resolution, which keeps only one copy.

I've changed the conflict resolution parameter for merging collections to the default, "keep first".
Some of the elements in the RO-Crate metadata have been reordered and renamed.

@simleo
Collaborator

simleo commented Mar 30, 2023

Looks good! Merging. For the Zenodo upload, please use a zip file: if you do that, Zenodo recognizes the format and generates a summary of the contained files to display on the record's page. If you can, use the .crate.zip extension, so that it's compatible with WorkflowHub. Note that the zip needs to directly contain the contents of the RO-Crate, so that ro-crate-metadata.json is at the top level.
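
A zip with that layout can be built with Python's standard zipfile module. This is a minimal sketch, assuming the crate lives in a local directory; zip_crate is a hypothetical helper, and the output path should sit outside the crate directory so the archive does not include itself.

```python
import zipfile
from pathlib import Path


def zip_crate(crate_dir, zip_path):
    """Zip the contents of crate_dir without a wrapping top-level directory."""
    crate_dir = Path(crate_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(crate_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the crate root, so that
                # ro-crate-metadata.json ends up at the top level of the zip.
                archive.write(path, path.relative_to(crate_dir).as_posix())
    return zip_path
```

For example, zip_crate("my_crate", "my_run.crate.zip") would store my_crate/ro-crate-metadata.json as ro-crate-metadata.json inside the archive (both names here are placeholders).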

@simleo simleo merged commit 5390cd7 into ResearchObject:main Mar 30, 2023