Somatic Filtering
`gridss_somatic_filter.R` performs a number of filtering steps.
The `LOCAL_LINKED_BY` and `REMOTE_LINKED_BY` fields contain identifiers indicating structural variant phasing. Breakends sharing an identifier have been determined to be phased. The following prefixes are currently used:
prefix | phasing | meaning |
---|---|---|
asm | definitely cis | assembly contig spans across both breakpoints. |
tra | definitely cis | transitive breakpoint found. The presence of A-B, B-C, and an imprecise A-C breakpoint call indicates that A-B and B-C are phased and the A-C is due to read pairs spanning across B. |
bpbp | likely cis | nearby breakpoints in opposite orientations consistent with a translocation/templated insertion of distal sequence |
bpbe | likely cis | nearby breakpoint and breakend in opposite orientations. Likely to be a templated insertion. |
inv | likely cis | breakpoints appear to be a simple inversion. |
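
As a rough illustration of how these fields can be consumed, the sketch below groups breakends by shared identifier. It assumes each field value is a comma-separated list of identifiers whose leading letters are the prefix (for example `asm12`). The exact identifier format is not specified here, and the function names and sample IDs are hypothetical.

```python
import re
from collections import defaultdict

def group_by_link(linked_by):
    """Map each link identifier to the set of breakend IDs that carry it.

    linked_by: dict of breakend ID -> raw field value (assumed comma-separated).
    """
    groups = defaultdict(set)
    for breakend_id, raw in linked_by.items():
        for ident in filter(None, (s.strip() for s in raw.split(","))):
            groups[ident].add(breakend_id)
    return dict(groups)

def link_prefix(ident):
    """Return the leading alphabetic prefix of an identifier (assumed format)."""
    match = re.match(r"[A-Za-z]+", ident)
    return match.group(0) if match else ""

# Hypothetical values, for illustration only.
example = {"bnd1o": "asm12", "bnd2o": "asm12,inv3", "bnd3o": "inv3", "bnd4o": ""}
phased = group_by_link(example)
```

Breakends that end up in the same group are the ones reported as phased, and `link_prefix` recovers which row of the table applies.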
For breakpoints more than 1000bp apart, the `SC` field can be used to determine whether two nearby events are phased in trans. If the (homology-adjusted) width of the CIGAR encoding the interval of support spans past the position of the adjacent breakpoint, then the events are not adjacent on the same chromatid.
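
One way to read this rule is sketched below: compute the reference width of the `SC` CIGAR, subtract the homology length, and test whether the resulting interval reaches past the neighbouring breakpoint. The function names, the `anchor_on_left` flag and the way homology is subtracted are assumptions made for illustration, not the script's actual logic.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")

def cigar_ref_width(cigar):
    """Number of reference bases covered by a CIGAR string."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in REF_CONSUMING)

def support_spans_past(pos, anchor_on_left, sc_cigar, homology_len, neighbour_pos):
    """True if the anchored support interval extends beyond neighbour_pos.

    anchor_on_left: True when the supporting sequence lies at or before pos.
    homology_len: bases subtracted from the width (assumed adjustment).
    """
    width = max(cigar_ref_width(sc_cigar) - homology_len, 0)
    if anchor_on_left:
        return pos - width < neighbour_pos < pos
    return pos < neighbour_pos < pos + width
```

Under these assumptions, a `True` result means the support reads straight through the neighbouring breakpoint position, so the two events would be called trans.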
Complicating matters further, if a breakpoint has been amplified and multiple copies exist, it could be simultaneously cis and trans with another breakpoint if only some of the copies are adjacent.
A `TAF` field is added that is an estimate of the average variant allele fraction across all tumour samples. This field is not purity adjusted.
A description for each VCF filter can be found in the VCF header of the somatic filtering script.
- If a reference genome is not supplied, the `small.replacement.fp` filter will not be applied as, without knowing the reference sequence, it is not possible to determine if the replacement sequence corresponds to a simple inversion.
In cancer genomes, SVs frequently occur in close proximity (less than 500bp). In such cases, read pairs can span across one or more short inner fragments. Transitive call reduction filters out the transitive calls, and annotates the spanned calls.
For example, if DNA segments A - B - C - D are connected, then, with small B and C, transitive read pair support will be present for A-C, A-D, and B-D, and imprecise variant calls will be made for these. In this example, the A-C, A-D, and B-D variants will be filtered, and A-B, B-C, and C-D will be annotated as linked by the filtered transitive calls.
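
The A - B - C - D example can be sketched as a path search: an imprecise call is treated as transitive when a chain of two or more precise calls connects the same two segments. This is a simplified illustration only. It ignores fragment sizes, orientations and positions, and `transitive_calls` is a hypothetical name rather than a function of the script.

```python
from collections import defaultdict, deque

def transitive_calls(precise, imprecise):
    """Return imprecise calls explained by a chain of two or more precise calls.

    precise, imprecise: iterables of (from_segment, to_segment) tuples.
    Result maps each transitive call to the precise calls it spans.
    """
    graph = defaultdict(list)
    for a, b in precise:
        graph[a].append(b)

    spanned = {}
    for start, end in imprecise:
        # Breadth-first search for the shortest precise path start -> end.
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, path = queue.popleft()
            if node == end and len(path) >= 2:
                spanned[(start, end)] = path
                break
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [(node, nxt)]))
    return spanned

links = transitive_calls(
    precise=[("A", "B"), ("B", "C"), ("C", "D")],
    imprecise=[("A", "C"), ("A", "D"), ("B", "D")],
)
```

Each key of `links` is a call that would be filtered, and its value lists the precise calls that would be annotated as linked by it.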
In some circumstances, a variant may be incorrectly reported as multiple variants.
For example, a breakpoint in which one side occurs in low mappability sequence may have a single breakend variant call with a high QUAL score, and a breakpoint call to one of the candidate locations with a low mappability score (typically these are due to the aligner overestimating the mapq of some reads). To adjust for this, variants sharing a breakend (within 5bp), whose sequences have an edit distance of less than 0.1 per base, will be annotated in the `LOCAL_LINKED_BY` field with an `eqv` prefix.
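
The equivalence test can be sketched as a position check followed by a normalised edit distance. The 5bp and 0.1 thresholds come from the description above. Normalising by the longer of the two sequences, and the function names, are assumptions. Both breakends are taken to be on the same chromosome.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def is_equivalent(pos_a, seq_a, pos_b, seq_b, max_gap=5, max_rate=0.1):
    """Assumed test: breakends within max_gap bp and per-base edit distance below max_rate."""
    if abs(pos_a - pos_b) > max_gap:
        return False
    length = max(len(seq_a), len(seq_b))
    if length == 0:
        return True
    return edit_distance(seq_a, seq_b) / length < max_rate
```

For two 20bp sequences, a single mismatch gives 0.05 per base and passes, while two mismatches give exactly 0.1 and do not.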
As breakpoint sequence determination requires a reference genome, this step is not performed if a reference genome is not supplied.
If a GRIDSS assembly breakend spans across multiple structural variants, these variants can be phased as cis. Assembly-phased variants are annotated with an `asm` prefix in the `LOCAL_LINKED_BY` and `REMOTE_LINKED_BY` fields.
Events occurring nearby can sometimes be linked according to known variant types. The following event type linkages are annotated:
- `bebe`: two adjacent breakends. Indicative of a simple insertion of non-reference or repetitive sequence. Common for LINE insertions.
- `bebp`: templated insertion in which only one side can be unambiguously placed. Common for LINE insertions due to the poly(A) tail causing assembly truncation and a single breakend variant on one side.
- `inv`: simple inversion
- `dsb`: likely double-stranded break
Variants whose supporting fragment counts differ by more than `gridss.min_rescue_portion` will not be linked unless `AAAAAA` occurs in the inserted sequence of either variant.
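
A minimal sketch of this rule, assuming "differ by more than" means that the smaller fragment count falls below `min_rescue_portion` times the larger one. That interpretation and the function name `may_link` are assumptions, and the portion is passed in explicitly rather than read from the configuration.

```python
POLY_A = "AAAAAA"

def may_link(support_a, support_b, inserted_a, inserted_b, min_rescue_portion):
    """Assumed rule: link when support is comparable or a poly(A) run is inserted."""
    if POLY_A in inserted_a or POLY_A in inserted_b:
        return True
    low, high = sorted((support_a, support_b))
    if high == 0:
        return False
    return low / high >= min_rescue_portion
```

With a portion of 0.2, supports of 10 and 100 fragments would not be linked, unless one of the variants carries `AAAAAA` in its inserted sequence.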
In some cases, a variant will be linked via multiple mechanisms or variants. In such cases, only the linkage to the highest QUAL event will be kept.
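
Keeping only the best linkage can be sketched as a per-variant maximum over QUAL. The tuple layout and the name `best_links` are hypothetical. On a tie, this sketch keeps the first link encountered.

```python
def best_links(candidate_links, qual):
    """For each variant keep only the link to its highest-QUAL partner.

    candidate_links: iterable of (variant_id, partner_id, link_id).
    qual: dict of variant ID -> QUAL score.
    """
    best = {}
    for variant, partner, link_id in candidate_links:
        current = best.get(variant)
        if current is None or qual[partner] > qual[current[0]]:
            best[variant] = (partner, link_id)
    return best

# Hypothetical example: v1 has two candidate links and keeps the one to v3.
kept = best_links(
    [("v1", "v2", "asm1"), ("v1", "v3", "inv2"), ("v2", "v1", "asm1")],
    {"v1": 500, "v2": 300, "v3": 900},
)
```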
Finally, event links to variants that are PON filtered are removed.
Low quality variants are rescued and included in the high confidence somatic call set if they are linked to a variant included in the high confidence call set by a mechanism other than equivalence (`eqv`).
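
The rescue rule can be sketched as a single pass over the links that skips any identifier starting with `eqv`. The data layout and the name `rescue` are hypothetical. The sketch does not apply the support-portion limit described in the configuration settings, and it does not let a rescued variant rescue others.

```python
def rescue(high_confidence, low_quality, links):
    """Return low-quality variant IDs linked to a high-confidence variant by a non-eqv link.

    links: iterable of (variant_id, partner_id, link_id) tuples.
    """
    rescued = set()
    for variant, partner, link_id in links:
        if link_id.startswith("eqv"):
            continue  # equivalence links never rescue
        if variant in low_quality and partner in high_confidence:
            rescued.add(variant)
        elif partner in low_quality and variant in high_confidence:
            rescued.add(partner)
    return rescued
```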
If one breakend in a breakpoint is filtered, the other breakend is also filtered.
The somatic filtering script uses a number of configuration settings from `gridss_config.R`.
Distance between breakends for defining a small event. Distance is defined in terms of the nominal position (i.e. the middle of any range of uncertainty).
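
To make the nominal-position definition concrete, the sketch below takes the midpoint of a VCF-style `CIPOS` interval around `POS` and compares two breakends on the same chromosome. The function names, the `max_distance` parameter, the rounding direction and the use of `<=` are assumptions.

```python
def nominal_position(pos, cipos=(0, 0)):
    """Middle of the uncertainty interval [pos + cipos[0], pos + cipos[1]]."""
    return pos + (cipos[0] + cipos[1]) // 2

def is_small_event(pos_a, cipos_a, pos_b, cipos_b, max_distance):
    """Compare nominal positions of two breakends against a distance threshold."""
    distance = abs(nominal_position(pos_a, cipos_a) - nominal_position(pos_b, cipos_b))
    return distance <= max_distance
```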
Allowable level of contamination of tumour reads in the normal. The default is 3% (0.03).
Note that we have found that a small amount of flow cell cross-contamination occurs on Illumina sequencers, so a few reads from amplified tumour regions can be seen in all samples on the same sequencing run.
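
One plausible reading of this setting is a comparison of normal support against a fraction of tumour support, sketched below. The exact quantities the script compares are not stated here, so the function name and both support arguments are assumptions. Only the 0.03 default is taken from the text.

```python
def exceeds_contamination(normal_support, tumour_support, max_contamination=0.03):
    """Assumed check: normal support above the allowed fraction of tumour support."""
    return normal_support > max_contamination * tumour_support
```

Under this reading, one supporting read in the normal against 100 in the tumour would be tolerated as contamination, while four would not.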
Minimum depth across the breakend in the matched normal.
Minimum number of reads providing direct support for the variant.
Maximum length of an exact sequence homology across a breakpoint.
gridss.max_inexact_homology_length = 50
Maximum length of an inexact sequence homology across a breakpoint.
Maximum allowable strand bias in the soft clipped/split read support for short events.
Minimum QUAL score required to report a breakpoint
Minimum QUAL score multiplier required for single breakend calling.
Minimum tumour allele fraction
Minimum number of panel of normal samples required to filter a variant
Maximum gap (in either direction) between breakends to annotate a pair of breakpoints as a double-stranded break.
Maximum gap (in either direction) between breakends to annotate a pair of variants as an insertion.
The default of 35 matches the logic used by manta.
Maximum gap (in either direction) between breakpoints to annotate a pair of variants as a simple inversion.
Maximum size of a translocation/templated insertion to consider when performing transitive reduction.
Filters that will cause a variant to be excluded from both the somatic and the full somatic output files.
Valid values are any VCF filter except for `qual`.
Filters that do not exclude the variant from the somatic output file.
Valid values are any VCF filter except for `qual`.
Minimum support a rescued variant must have, expressed as a portion of the rescuing variant's support. This limit prevents a noise variant from being rescued by a high quality variant. Portion is calculated using total supporting read count, not QUAL score.
Minimum simple event size to report. The default matches the minimum event size reported by GRIDSS (so, using default settings, this filter does nothing).