Implement support for new-style imports #70

Open
2 of 4 tasks
holtgrewe opened this issue Nov 1, 2023 · 3 comments · May be fixed by #72
holtgrewe commented Nov 1, 2023

Is your feature request related to a problem? Please describe.
New-style imports (based on depositing files in an external storage and registering the case as phenopackets) are currently unsupported in VarFish.

Describe the solution you'd like
Implement the import.

  • implement parsing of per-project TOML configuration
  • implement varfish-cli projects project-load-config PROJECT_UUID (see below)
  • implement validating phenopackets YAML files
  • implement submission of phenopackets YAML files

Describe alternatives you've considered
N/A

Additional context
N/A

@holtgrewe holtgrewe self-assigned this Nov 1, 2023
@holtgrewe holtgrewe added this to the 0.6.0 milestone Nov 1, 2023

holtgrewe commented Nov 1, 2023

Specification: Extending Client Configuration

New-style imports deposit files in external storage. We thus need to make projects known to VarFish. This should be done in the ~/.varfishrc.toml file.

Here is how to create a list of projects in TOML in general:

# ...

[[projects]]
uuid = "..."

[[projects]]
uuid = "..."

This will be loaded as {'projects': [{'uuid': '...'}, {'uuid': '...'}]} in JSON/Python.

Users can configure projects with the following schema:

[[projects]]
title = "..."  # optional; user-readable project title
uuid = "..."  # SODAR project UUID
# protocol to use for import
import_data_protocol = "s3" # one of "s3" | "http" | "https" | "file"
import_data_path = "..."      # path prefix to use
import_data_port = 80         # optional; port to use for connecting on import
import_data_user = "user"     # user/S3 access key
import_data_password = "key"  # password/S3 secret key to use

Users should be able to download these settings with the following command. It should fetch the settings described above from the server and append them to the projects configuration in the TOML file.

varfish-cli projects project-load-config PROJECT_UUID


holtgrewe commented Nov 1, 2023

Specification: Manifest Files

This follows the phenopackets YAML format supported by VarFish Server.

General note on files:

  • list of designations should be taken from varfish-server code and documented also in varfish-cli
  • same for mimetypes

Notes on individuals' files:

  • only one BAM file supported for each individual
  • the first file for each individual is the sequencing kit
    • this file has special meaning: its URI should start with s3://varfish-server/seqmeta/enrichment-kits and it refers to the internal files
  • BAM files will only be registered as external files

Notes on family files:

  • only one seqvars VCF is allowed; all strucvars VCFs will be merged
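
The per-individual rules above could be checked roughly like this. It is a sketch that works on plain dicts as they would come out of a YAML parser; the function name is hypothetical, and identifying BAM files by the `read_alignments` designation is an assumption based on the example manifest.

```python
# Prefix taken from the note on sequencing kits above.
KIT_PREFIX = "s3://varfish-server/seqmeta/enrichment-kits"
# Assumption: BAM files are recognized by this designation.
BAM_DESIGNATION = "read_alignments"


def check_individual_files(individual_id: str, files: list[dict]) -> list[str]:
    """Return human-readable errors for one individual's `files` list."""
    errors = []
    first_uri = files[0].get("uri", "") if files else ""
    if not first_uri.startswith(KIT_PREFIX):
        errors.append(
            f"{individual_id}: first file must be a sequencing kit under {KIT_PREFIX}"
        )
    bam_count = sum(
        1
        for entry in files
        if entry.get("fileAttributes", {}).get("designation") == BAM_DESIGNATION
    )
    if bam_count > 1:
        errors.append(f"{individual_id}: {bam_count} BAM files given, only one supported")
    return errors
```

Returning a list of messages instead of raising on the first problem lets the client report everything wrong with a manifest in one pass.
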

phenopackets YAML example:

# family with only metadata field
family:
  proband:
    id: index
    subject:
      id: index
      sex: MALE
      karyotypicSex: XY
    phenotypicFeatures:
      - type:
          id: "HP:0012469"
          label: "Infantile spasms"
        excluded: false
        modifiers:
          - id: "HP:0031796"
            label: "Recurrent"
    measurements:
      - assay:
          id: NCIT:C158253
          label: Targeted Genome Sequencing
        value:
          ontologyClass:
            id: NCIT:C171177
            label: Sequencing Data File
    files:
      - uri: s3://varfish-server/seqmeta/enrichment-kits/ataxia-panel.bed.gz
        individualToFileIdentifiers:
          index: index-PANEL
        fileAttributes:
          checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
          designation: sequencing_targets
          genomebuild: grch38
          mimetype: text/x-bed+x-bgzip
      - uri: s3://data-for-import/example/index.bam
        individualToFileIdentifiers:
          index: index-PANEL
        fileAttributes:
          checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
          designation: read_alignments
          genomebuild: grch38
          mimetype: text/x-bam+x-bgzip
    diseases:
      - term:
          id: OMIM:164400
          label: "SPINOCEREBELLAR ATAXIA 1; SCA1"
        excluded: false
    metaData: &metadata-prototype
      created: "2019-07-21T00:25:54.662Z"
      createdBy: Peter R.
      resources:
        - id: hp
          name: human phenotype ontology
          url: http://purl.obolibrary.org/obo/hp.owl
          version: "2018-03-08"
          namespacePrefix: HP
          iriPrefix: hp
      phenopacketSchemaVersion: "2.0"
  relatives:
    - id: mother
      subject:
        id: mother
        sex: FEMALE
        karyotypicSex: XX
      phenotypicFeatures:
        - type:
            id: "HP:0012469"
            label: "Infantile spasms"
          excluded: true
      measurements:
        - assay:
            id: NCIT:C158253
            label: Targeted Genome Sequencing
          value:
            ontologyClass:
              id: NCIT:C171177
              label: Sequencing Data File
      files:
        - uri: s3://varfish-server/seqmeta/enrichment-kits/ataxia-panel.bed.gz
          individualToFileIdentifiers:
            mother: mother-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: sequencing_targets
            genomebuild: grch38
            mimetype: text/x-bed+x-bgzip
        - uri: s3://data-for-import/example/mother.bam
          individualToFileIdentifiers:
            mother: mother-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: read_alignments
            genomebuild: grch38
            mimetype: text/x-bam+x-bgzip
      diseases:
        - term:
            id: OMIM:164400
            label: "SPINOCEREBELLAR ATAXIA 1; SCA1"
          excluded: true
      metaData: *metadata-prototype
    - id: father
      subject:
        id: father
        sex: MALE
        karyotypicSex: XY
      phenotypicFeatures:
        - type:
            id: "HP:0012469"
            label: "Infantile spasms"
          excluded: true
      measurements:
        - assay:
            id: NCIT:C158253
            label: Targeted Genome Sequencing
          value:
            ontologyClass:
              id: NCIT:C171177
              label: Sequencing Data File
      files:
        - uri: s3://varfish-server/seqmeta/enrichment-kits/ataxia-panel.bed.gz
          individualToFileIdentifiers:
            father: father-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: sequencing_targets
            genomebuild: grch38
            mimetype: text/x-bed+x-bgzip
        - uri: s3://data-for-import/example/father.bam
          individualToFileIdentifiers:
            father: father-PANEL
          fileAttributes:
            checksum: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
            designation: read_alignments
            genomebuild: grch38
            mimetype: text/x-bam+x-bgzip
      diseases:
        - term:
            id: OMIM:164400
            label: "SPINOCEREBELLAR ATAXIA 1; SCA1"
          excluded: true
      metaData: *metadata-prototype
  pedigree:
    persons:
      - familyId: Case
        individualId: index
        paternalId: father
        maternalId: mother
        sex: MALE
        affectedStatus: AFFECTED
      - familyId: Case
        individualId: father
        paternalId: "0"
        maternalId: "0"
        sex: MALE
        affectedStatus: UNAFFECTED
      - familyId: Case
        individualId: mother
        paternalId: "0"
        maternalId: "0"
        sex: FEMALE
        affectedStatus: UNAFFECTED
  files:
    - uri: file://cases_import/tests/data/sample-brca1.vcf.gz
      individualToFileIdentifiers:
        index: NA12878-PCRF450-1
      fileAttributes:
        checksum: sha256:4042c2afa59f24a327b3852bfcd0d8d991499d9c4eb81e7a7efe8d081e66af82
        designation: variant_calls
        variant_type: seqvars
        genomebuild: grch37
        mimetype: text/plain+x-bgzip+x-variant-call-format
    - uri: file://cases_import/tests/data/sample-brca1.vcf.gz.tbi
      individualToFileIdentifiers:
        index: NA12878-PCRF450-1
      fileAttributes:
        checksum: sha256:6b137335b7803623c3389424e7b64d704fb1c9f3f55792db2916d312e2da27ef
        designation: variant_calls
        variant_type: seqvars
        genomebuild: grch37
        mimetype: application/octet-stream+x-tabix-tbi-index
  metaData: *metadata-prototype


Specification: Client Side of Import Process

Precondition:

  • project is configured in ~/.varfishrc.toml

Then:

  1. read YAML as phenopackets
  2. check that kit specification BED is there
  3. check designations and mimetypes of files are known
  4. check that at most one BAM file is there for each sample
  5. check that at most one seqvars VCF file is there
  6. check that the files exist in the storage
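
Steps 3, 5 and 6 could be sketched as below, operating on the already-parsed `family` dict; the per-individual checks of steps 2 and 4 are omitted for brevity. The sets of known designations and mimetypes only contain the values from the example manifest and are an assumption, since the authoritative lists are to be taken from varfish-server. The storage lookup is passed in as a callable so that the same check works for every protocol; the function name `check_family` is hypothetical.

```python
from typing import Callable

# Assumption: only the values seen in the example manifest are listed here.
KNOWN_DESIGNATIONS = {"sequencing_targets", "read_alignments", "variant_calls"}
VCF_MIMETYPE = "text/plain+x-bgzip+x-variant-call-format"
KNOWN_MIMETYPES = {
    "text/x-bed+x-bgzip",
    "text/x-bam+x-bgzip",
    VCF_MIMETYPE,
    "application/octet-stream+x-tabix-tbi-index",
}


def check_family(family: dict, exists: Callable[[str], bool]) -> list[str]:
    """Return errors for steps 3, 5 and 6; `exists` answers for one URI."""
    errors = []
    family_files = list(family.get("files", []))
    members = [family.get("proband", {})] + list(family.get("relatives", []))
    all_files = family_files + [f for m in members for f in m.get("files", [])]

    for entry in all_files:
        uri = entry.get("uri", "")
        attrs = entry.get("fileAttributes", {})
        if attrs.get("designation") not in KNOWN_DESIGNATIONS:
            errors.append(f"{uri}: unknown designation {attrs.get('designation')!r}")
        if attrs.get("mimetype") not in KNOWN_MIMETYPES:
            errors.append(f"{uri}: unknown mimetype {attrs.get('mimetype')!r}")
        if not exists(uri):
            errors.append(f"{uri}: file not found in storage")

    # Count VCFs only, so that a tabix index next to the VCF is not an error.
    seqvars_vcfs = [
        entry
        for entry in family_files
        if entry.get("fileAttributes", {}).get("variant_type") == "seqvars"
        and entry.get("fileAttributes", {}).get("mimetype") == VCF_MIMETYPE
    ]
    if len(seqvars_vcfs) > 1:
        errors.append(f"{len(seqvars_vcfs)} seqvars VCF files given, at most one allowed")
    return errors
```

For a quick dry run, `check_family(family, exists=lambda uri: True)` skips the storage lookup while still exercising the other checks.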

@holtgrewe holtgrewe linked a pull request Nov 3, 2023 that will close this issue
@holtgrewe holtgrewe added the enhancement New feature or request label Jan 24, 2024
@holtgrewe holtgrewe moved this to In progress in Release Planning Jan 24, 2024
@holtgrewe holtgrewe moved this from In progress to Stalled in Release Planning Jan 25, 2024