Annotation
The Bystro annotator is a program that adds information/features/annotations to genetic variants from sources such as refSeq, gnomAD, and CADD.
The features that Bystro can annotate are defined in a YAML configuration file. This is a static definition of the maximum set of features that Bystro can provide, and this YAML is passed as an argument during annotation.
A somewhat simplified view of the YAML configuration is below:

    ---
    assembly: hg38
    database_dir: /path/to/embedded/database
    tracks:
      outputOrder:
        - ref
        - refSeq
        - cadd
        - gnomad.genomes
      tracks:
        - name: ref
          type: reference
        - name: cadd
          type: cadd
        - features:
            - name
            - name2
            - description
            - kgID
            - mRNA
            - spID
            - spDisplayID
            - protAcc
            - rfamAcc
            - tRnaName
            - ensemblID
          name: refSeq
          type: gene
        - features:
            - alt
            - id
            - af: number
            - an: number
            - an_afr: number
            - an_amr: number
            - an_asj: number
            - an_eas: number
            - an_fin: number
            - an_nfe: number
            - an_oth: number
            - an_sas: number
            - an_male: number
            - an_female: number
            - af_afr: number
            - af_amr: number
            - af_asj: number
            - af_eas: number
            - af_fin: number
            - af_nfe: number
            - af_oth: number
            - af_sas: number
            - af_male: number
            - af_female: number
          name: gnomad.genomes
          type: vcf

There are a number of moving pieces here, so let's focus on the piece related to adding or removing annotation features:
- There is a top-level `tracks` object, which has 2 keys: `outputOrder` and `tracks`.
- The inner `tracks` is an array of track definitions. You can think of a track as a set of features that come from one input source, e.g. CADD, dbSNP, gnomAD, etc.
- Each track must contain the following key properties:
  - `name`, which defines the track name
  - `type`, one of a series of types (e.g. sparse, gene, vcf, cadd, etc.), which we'll describe in a separate section of this document
- For tracks that have more than 1 grouping of information, in addition to `name` and `type` they will include:
  - `features`, which lists each grouping of information as a separate field (for refSeq, `features` include the transcript labels in `name`, the gene labels in `name2`, the transcript descriptions in `description`, and so on)
- The `outputOrder` must contain every listed `name` from the inner `tracks` array.
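
The rules in the list above can be checked mechanically before running an annotation. The following is a minimal sketch, not part of Bystro itself: it assumes the YAML has already been parsed into a plain Python dictionary (for example with PyYAML's `yaml.safe_load`), and the function name `validate_tracks` is hypothetical.

```python
from typing import Any


def validate_tracks(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the rules hold.

    Hypothetical helper for illustration only, not a Bystro API.
    """
    problems: list[str] = []
    outer = config.get("tracks", {})
    output_order = outer.get("outputOrder", [])
    track_defs = outer.get("tracks", [])

    names = []
    for index, track in enumerate(track_defs):
        # Every track definition needs both a name and a type.
        for key in ("name", "type"):
            if key not in track:
                problems.append(f"track #{index} is missing '{key}'")
        if "name" in track:
            names.append(track["name"])

    # outputOrder must mention every track name from the inner array.
    for name in names:
        if name not in output_order:
            problems.append(f"track '{name}' is missing from outputOrder")
    return problems


# A cut-down config in which 'cadd' was left out of outputOrder.
config = {
    "assembly": "hg38",
    "tracks": {
        "outputOrder": ["ref", "refSeq"],
        "tracks": [
            {"name": "ref", "type": "reference"},
            {"name": "refSeq", "type": "gene", "features": ["name", "name2"]},
            {"name": "cadd", "type": "cadd"},
        ],
    },
}

print(validate_tracks(config))
```

Here the check reports that the `cadd` track is absent from `outputOrder`. Adding `cadd` to that list makes the returned list empty.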
YAML configuration files can be modified. Adding new annotations to them (defining new tracks or features) requires a build step, which pre-compiles the track's input data into a super fast embedded database, which enables millions of queries per minute on even modest machines.
Removing annotations is much simpler. Let's say you were using the above YAML and didn't need the `description` annotation in the `refSeq` track, which contains a long-form description of a transcript. To remove this, simply remove the line `- description` from the `name: refSeq` track's `features` array.
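
If you would rather script this edit than make it by hand, the same removal can be sketched in Python. This is an illustration under assumptions: the configuration is taken to be already loaded into a dictionary, the helper names `feature_name` and `remove_feature` are hypothetical, and writing the result back to a YAML file is not shown. Note that a feature may be a plain string (`description`) or a one-key mapping (`af: number`), so both forms are handled.

```python
from typing import Any


def feature_name(feature: Any) -> str:
    """A feature is a plain string or a one-key mapping such as {'af': 'number'}."""
    if isinstance(feature, dict):
        return next(iter(feature))
    return str(feature)


def remove_feature(config: dict[str, Any], track_name: str, feature: str) -> bool:
    """Remove `feature` from the named track in place.

    Hypothetical helper, not a Bystro API. Returns True if a feature was removed.
    """
    for track in config["tracks"]["tracks"]:
        if track.get("name") != track_name:
            continue
        features = track.get("features", [])
        kept = [f for f in features if feature_name(f) != feature]
        removed = len(kept) != len(features)
        if removed:
            track["features"] = kept
        return removed
    return False  # no track with that name


# A cut-down version of the configuration shown earlier.
config = {
    "tracks": {
        "outputOrder": ["refSeq", "gnomad.genomes"],
        "tracks": [
            {"name": "refSeq", "type": "gene",
             "features": ["name", "name2", "description"]},
            {"name": "gnomad.genomes", "type": "vcf",
             "features": ["alt", "id", {"af": "number"}]},
        ],
    }
}

remove_feature(config, "refSeq", "description")
print(config["tracks"]["tracks"][0]["features"])
```

After the call, the `refSeq` track keeps only `name` and `name2`. `outputOrder` is untouched because the track itself still exists.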
Similarly, entire tracks can be dropped. If we wanted to annotate our VCF without CADD scores, we would remove the following lines:

    - name: cadd
      type: cadd

from the inner `tracks` array, and also remove `- cadd` from `outputOrder`.
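
Dropping a track touches two places, the inner `tracks` array and `outputOrder`, so it is easy to update one and forget the other. A small sketch of doing both together, assuming the configuration is already loaded into a dictionary; `drop_track` is a hypothetical name, not something Bystro provides:

```python
from typing import Any


def drop_track(config: dict[str, Any], track_name: str) -> None:
    """Remove a track definition and its outputOrder entry in place.

    Hypothetical helper for illustration only.
    """
    outer = config["tracks"]
    # Remove the track definition from the inner array.
    outer["tracks"] = [t for t in outer["tracks"] if t.get("name") != track_name]
    # Remove the matching entry from outputOrder so the two stay consistent.
    outer["outputOrder"] = [n for n in outer["outputOrder"] if n != track_name]


config = {
    "tracks": {
        "outputOrder": ["ref", "refSeq", "cadd", "gnomad.genomes"],
        "tracks": [
            {"name": "ref", "type": "reference"},
            {"name": "cadd", "type": "cadd"},
            {"name": "refSeq", "type": "gene"},
            {"name": "gnomad.genomes", "type": "vcf"},
        ],
    }
}

drop_track(config, "cadd")
print(config["tracks"]["outputOrder"])
print([t["name"] for t in config["tracks"]["tracks"]])
```

Both lists come back without `cadd`, and the remaining tracks keep their original order.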