standards for wrapping the command line client #4

kdaily · 2019-03-08T21:56:03Z

We have a handful of implementations on wrappers to upload and download from Synapse in a workflow. Given this is such a fundamental piece, we have the opportunity to be very intentional about how we do it. The command line client provides a natural place for this kind of work to happen in a workflow. We should discuss the pros and cons of how to wrap the command line client in CWL tools.

https://github.com/GA4GH-DREAM/dockstore-tool-synapse-get is really nice, but uses a separate shell script to parse options.

https://github.com/Sage-Bionetworks/Andrew_cwl_workflows/tree/master/utils/python_synapse_client has most implemented, but with a different command line interface than the actual command line client; relies on it's own docker file too.

https://github.com/Sage-Bionetworks/ChallengeWorkflowTemplates/blob/master/download_from_synapse.cwl - similar to the above

Some thoughts:

which docker container (specify by version)
how to provide the Synapse config
where to put subcommand logic (e.g., synapse get -q as a separate tool, or use the above method for writing a shell wrapper to parse the arguments?)
minimize number of tests needed and rely on the clients test harness

The text was updated successfully, but these errors were encountered:

jaeddy · 2019-03-11T19:50:51Z

Thanks for kickstarting this discussion!

In terms of high level design philosophy for CWL, I like this comment from John Pellman's blog post:

CWL also enforces good practices in scientific computing, by emphasizing interfaces over implementations. When you wrap a command line tool in CWL, you not only have something practical that allows you to construct a workflow in Rabix or Cromwell, you also have a specification of what inputs and outputs a typical scientist in your field would want to have for a particular step in an analysis. If you have such a specification and suddenly find that current tools in your field are inadequate for some reason (speed, method has inadequate precision, incorrect outputs), you can create another tool with the same name that takes in the same inputs and outputs but uses totally different logic under the hood. In this way, a tool that would benefit from GPU computing can be converted without breaking a previously defined workflow or a tool that was originally was CPU-limited can be rewritten to leverage FPGAs or ASICs for great performance gains.

I think this has a view implications:

By specifying our CWL to the interface we think is most important to end users, we reduce (at least some) complexity in the descriptor itself, while leaving ourselves more flexible to change implementations. For example, the CLI (Python) might be the most obvious entrypoint to a Synapse client today, but maybe someone comes along and writes an amazingly performant library in Node.js... we'd want to be able to take advantage and switch how the tool works under the hood, without having to rewrite a bunch of existing workflows that use the tool (besides maybe a version bump). In other words, the CWL is a sort of "contract" for how the tool should operate — the folks at CWL have even discussed further formalizing this idea, but I'm not sure what the current status is.
The separation of interface and implementation also lowers the burden for developers. For example, when we use Python or bash scripts to translate CWL instructions into the commands that are actually processed by the underlying software, that's generally more efficient than trying to exhaustively encode the detailed logic directly in CWL.
Based on the above points, I'd advocate renaming this repo as synapse-client-cwl-tools ... :-)

For example/reference, if one were to create a CWL tool for synapse get based on the CLI as specified...

(generating CWL from `argparse` — detailed steps)

Create a conda environment:

conda create -n syn-cwl python=3.6 ipython
conda activate syn-cwl

Install the Synapse client and CWL:

pip install synapseclient cwlref-runner

Clone and install argparse2tool:

git clone https://github.com/erasche/argparse2tool
cd argparse2tool
pip install .

Modify the synapseclient:__main__.py code to enable running as a script:

cd ~/anaconda/envs/syn-cwl/lib/python3.6/site-packages/synapseclient
vi __main__.py

import argparse
import os
import collections
import sys
import synapseclient
import synapseutils
- from . import Activity
+ from synapseclient.activity import Activity
import signal
import json
- from .exceptions import * 
+ from synapseclient.exceptions import *
- from .wiki import Wiki
+ from synapseclient.wiki import Wiki

Run the following command to generate CWL:

PYTHONPATH=$(argparse2tool_check_path -q) python __main__.py get --generate_cwl_tool

(see the argparse2tool repo for more info and options)

... here's what you get (note: outputs would need to be specified subsequently):

inputs:

  queryString:
    type: ["null", string]
    doc: Optional query parameter, will fetch all of the entities returned by a query (see query for help).
    inputBinding:
      prefix: --query

  version:
    type: ["null", int]
    doc: Synapse version number of entity to retrieve. Defaults to most recent version.
    inputBinding:
      prefix: --version

  recursive:
    type: ["null", boolean]
    default: False
    doc: Fetches content in Synapse recursively contained in the parentId specified by id.
    inputBinding:
      prefix: --recursive

  followLink:
    type: ["null", boolean]
    default: False
    doc: Determines whether the link returns the target Entity.
    inputBinding:
      prefix: --followLink

  limitSearch:
    type: ["null", string]
    doc: Synapse ID of a container such as project or folder to limit search for files if using a path.
    inputBinding:
      prefix: --limitSearch

  downloadLocation:
    type: ["null", string]
    default: ./
    doc: Directory to download file to [default - %(default)s].
    inputBinding:
      prefix: --downloadLocation

  id:
    type: ["null", string]
    doc: Synapse ID of form syn123 of desired data object.
    inputBinding:
      position: 1

outputs:
    []

That seems like a pretty decent scope to me. Part of the reason that @thomasyu888 and I originally added the wrapper shell script was because, at the time, there wasn't an option to specify a path for your Synapse config file (i.e., it had to be saved at ~/.synapseConfig). Dealing with files vs. directories might have been an issue as well. Looking at this today, however, I'm fairly confident this could be handled well enough by pure CWL — while I'd propose a few tweaks, synapse-get-tool.cwl is a good start.

For what it's worth, I think it would ultimately make more sense to generate our CWL "interface" based on the API specification (not the client) — but I'm not sure what tooling exists around this.

As to @kdaily's thoughts:

which docker container (specify by version)

I certainly think "best practice" for writing CWL tools should be to include the tagged version of a specific docker container (e.g., dockerPull: sagebionetworks/synapsepythonclient:v1.9.2-rc). As for which container, I think it depends on what is needed to translate between an interface and the implementation: if you need to add a wrapper script, then that should be included as part of the Dockerfile and used to define a new image. Keeping track of versions and dependencies in a transparent way (e.g., through GitHub or Dockstore) becomes important, but is hopefully something we could improve/automate with CI pipelines.

how to provide the Synapse config

For today, because it's currently an accepted option for the CLI, I think we can just provide it with a corresponding parameter and binding. That said, I realized the argparse2tool conversion above does not include "global" options for the client — e.g.:

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -u SYNAPSEUSER, --username SYNAPSEUSER
                        Username used to connect to Synapse
  -p SYNAPSEPASSWORD, --password SYNAPSEPASSWORD
                        Password used to connect to Synapse
  -c CONFIGPATH, --configPath CONFIGPATH
                        Path to configuration file used to connect to Synapse
                        [default: /Users/jaeddy/.synapseConfig]
  --debug
  -s, --skip-checks     suppress checking for version upgrade messages and
                        endpoint redirection

... so we might want to add those. We could be slightly opinionated though, and opt not to support user/password.

For longer term handling of credentials, I elect to punt for now.

where to put subcommand logic (e.g., synapse get -q as a separate tool, or use the above method for writing a shell wrapper to parse the arguments?)

This is probably more subjective — distinguishing between what should be considered a "module" or "operation" vs. a specific option or parameter thereof. For the Synapse client, I don't think we need to get more granular than the top level commands.

minimize number of tests needed and rely on the clients test harness

Agreed. If we think of the CWL tools as "clients of the client" we can think about structuring tests in a similar way: the job of the tool is to form "inputs" and handle "outputs" correctly. It'd be great to discuss what these tests might look like in practice.

sgosline · 2019-03-12T16:58:13Z

so, where does this leave us?

kdaily assigned jaeddy, andrewelamb and thomasyu888 Mar 8, 2019

andrewelamb removed their assignment Sep 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

standards for wrapping the command line client #4

standards for wrapping the command line client #4

kdaily commented Mar 8, 2019

jaeddy commented Mar 11, 2019

sgosline commented Mar 12, 2019

standards for wrapping the command line client #4

standards for wrapping the command line client #4

Comments

kdaily commented Mar 8, 2019

jaeddy commented Mar 11, 2019

sgosline commented Mar 12, 2019