@gmod/gtf

GTF, the General Transfer Format, is identical to GFF version 2. This module reads and writes GTF data and aims to be a complete implementation of the GTF specification.

  • streaming parsing and streaming formatting
  • creates transcript features with child_features
  • only compatible with GTF

Note: For JBrowse, we generally encourage GFF3 over GTF.

For GFF3, check out the @gmod/gff-js package.

Install

$ npm install --save @gmod/gtf

Usage

import fs from 'fs'
import gtf from '@gmod/gtf'

// parse a file from a file name
gtf.parseFile('path/to/my/file.gtf', { parseAll: true })
  .on('data', data => {
    if (data.directive) {
      console.log('got a directive', data)
    } else if (data.comment) {
      console.log('got a comment', data)
    } else if (data.sequence) {
      console.log('got a sequence from a FASTA section')
    } else {
      console.log('got a feature', data)
    }
  })

// parse a stream of GTF text
fs.createReadStream('path/to/my/file.gtf')
  .pipe(gtf.parseStream())
  .on('data', data => {
    console.log('got item', data)
  })
  .on('end', () => {
    console.log('done parsing!')
  })

// parse a string of GTF synchronously
const stringOfGTF = fs.readFileSync('my_annotations.gtf').toString()
const arrayOfThings = gtf.parseStringSync(stringOfGTF)

// format an array of items to a string
const formattedGTF = gtf.formatSync(arrayOfThings)

// format a stream of things to a stream of text.
// inserts sync marks automatically.
// note: this could create new GTF lines for transcript features
myStreamOfGTFObjects
  .pipe(gtf.formatStream())
  .pipe(fs.createWriteStream('my_new.gtf'))

// format a stream of things and write it to
// a GTF file. inserts sync marks.
// note: this could create new GTF lines for transcript features.
// formatFile takes the stream as its first argument and returns a promise
gtf.formatFile(myStreamOfGTFObjects, 'path/to/destination.gtf')

Object format

features

Because GTF cannot represent a three-level hierarchy (gene -> transcript -> exon), we parse GTF by creating transcript features with child features.

We do not create features from the gene_id. Values that are . in the GTF are null in the output.

ctgA	bare_predicted	CDS	10000	11500	.	+	0	transcript_id "Apple1";

Note that the parser creates an additional transcript feature from the transcript_id when featureType is not 'transcript'. It then creates a child CDS feature from the line of GTF shown above.

[
    [
        {
            "seq_name": "ctgA",
            "source": "bare_predicted",
            "featureType": "transcript",
            "start": 10000,
            "end": 11500,
            "score": null,
            "strand": "+",
            "frame": "0",
            "attributes": { "transcript_id": [ "\"Apple1\"" ] },
            "child_features": [[
                {
                    "seq_name": "ctgA",
                    "source": "bare_predicted",
                    "featureType": "CDS",
                    "start": 10000,
                    "end": 11500,
                    "score": null,
                    "strand": "+",
                    "frame": "0",
                    "attributes": { "transcript_id": [ "\"Apple1\"" ] },
                    "child_features": [],
                    "derived_features": []
                }
            ]],
            "derived_features": []
        }
    ]
]
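
Each parsed item is an array of feature objects, and each entry of child_features is again such an array, so walking the result takes a small recursive helper. The following is a minimal sketch of one way to do that. flattenFeatures is a hypothetical helper name, not part of @gmod/gtf, and the sample data is abbreviated from the output above.

```javascript
// Sample shaped like the parser output shown above (abbreviated).
const parsed = [
  [
    {
      seq_name: 'ctgA',
      featureType: 'transcript',
      start: 10000,
      end: 11500,
      attributes: { transcript_id: ['"Apple1"'] },
      child_features: [
        [
          {
            seq_name: 'ctgA',
            featureType: 'CDS',
            start: 10000,
            end: 11500,
            attributes: { transcript_id: ['"Apple1"'] },
            child_features: [],
            derived_features: [],
          },
        ],
      ],
      derived_features: [],
    },
  ],
]

// Hypothetical helper: flatten the nested arrays into flat records that
// remember how deep in the hierarchy each feature was found.
function flattenFeatures(items, depth = 0, out = []) {
  for (const item of items) {
    if (Array.isArray(item)) {
      // an array groups the feature objects that belong to one item
      flattenFeatures(item, depth, out)
    } else {
      out.push({ depth, type: item.featureType, start: item.start, end: item.end })
      flattenFeatures(item.child_features || [], depth + 1, out)
    }
  }
  return out
}

const flat = flattenFeatures(parsed)
for (const f of flat) {
  // indent children under their parent transcript
  console.log(`${'  '.repeat(f.depth)}${f.type} ${f.start}-${f.end}`)
}
```

The same traversal would work on the output of parseStringSync, assuming it returns items in the shape documented here.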

directives, comments, sequences

parseDirective("##gtf\n")
// returns
{
  "directive": "gtf"
}

parseComment('# hi this is a comment\n')
// returns
{
  "comment": "hi this is a comment"
}

// Sequence objects come from any embedded `##FASTA` section in the GTF file.
{
  "id": "ctgA",
  "description": "test contig",
  "sequence": "ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA"
}

API

parseStream

Parse a stream of text data into a stream of feature, directive, and comment objects.

Parameters

  • options Object optional options object (optional, default {})

    • options.encoding string text encoding of the input GTF. default 'utf8'
    • options.parseAll boolean default false. if true, will parse all items. overrides other flags
    • options.parseFeatures boolean default true
    • options.parseDirectives boolean default false
    • options.parseComments boolean default false
    • options.parseSequences boolean default true
    • options.bufferSize Number maximum number of GTF lines to buffer. defaults to 1000

Returns ReadableStream stream (in objectMode) of parsed items
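
To make the interaction between the flags concrete, here is a small sketch of how the documented defaults and the parseAll override could be combined. resolveParseOptions and DEFAULTS are hypothetical names used only for illustration; the library's internal option handling may differ.

```javascript
// Defaults as listed in the parameter list above.
const DEFAULTS = {
  encoding: 'utf8',
  parseAll: false,
  parseFeatures: true,
  parseDirectives: false,
  parseComments: false,
  parseSequences: true,
  bufferSize: 1000,
}

// Hypothetical helper: merge user options over the defaults, then let
// parseAll switch on every item type regardless of the other flags.
function resolveParseOptions(options = {}) {
  const resolved = { ...DEFAULTS, ...options }
  if (resolved.parseAll) {
    resolved.parseFeatures = true
    resolved.parseDirectives = true
    resolved.parseComments = true
    resolved.parseSequences = true
  }
  return resolved
}

// parseAll wins even though parseComments was explicitly set to false
console.log(resolveParseOptions({ parseAll: true, parseComments: false }))
```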

parseFile

Read and parse a GTF file from the filesystem.

Parameters

  • filename string the filename of the file to parse

  • options Object optional options object

    • options.encoding string the file's string encoding, defaults to 'utf8'
    • options.parseAll boolean default false. if true, will parse all items. overrides other flags
    • options.parseFeatures boolean default true
    • options.parseDirectives boolean default false
    • options.parseComments boolean default false
    • options.parseSequences boolean default true
    • options.bufferSize Number maximum number of GTF lines to buffer. defaults to 1000

Returns ReadableStream stream (in objectMode) of parsed items

parseStringSync

Synchronously parse a string containing GTF and return an array of the parsed items.

Parameters

  • str string

  • inputOptions Object optional options object (optional, default {})

    • inputOptions.parseAll boolean default false. if true, will parse all items. overrides other flags
    • inputOptions.parseFeatures boolean default true
    • inputOptions.parseDirectives boolean default false
    • inputOptions.parseComments boolean default false
    • inputOptions.parseSequences boolean default true

Returns Array array of parsed features, directives, and/or comments

formatSync

Format an array of GTF items (features, directives, comments) into a string of GTF. Does not insert synchronization (###) marks. Does not insert a directive if it's not already there.

Parameters

  • items

Returns String the formatted GTF

formatStream

Format a stream of items (of the type produced by this script) into a stream of GTF text.

Inserts synchronization (###) marks automatically.

Parameters

  • options Object

    • options.minSyncLines Number minimum number of lines between sync (###) marks. default 100
    • options.insertVersionDirective Boolean if the first item in the stream is not a ##gtf directive, insert one. default false
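
As a rough illustration of what a minimum line count between sync marks means, the sketch below appends a ### line once at least minSyncLines lines have been written since the previous mark. This is an assumption about the general idea only: insertSyncMarks is a hypothetical helper, and the rule the library actually uses to place marks may differ.

```javascript
// Hypothetical sketch: formattedItems is an array of strings, each holding
// the newline-terminated GTF text for one top-level item.
function insertSyncMarks(formattedItems, minSyncLines = 100) {
  const out = []
  let linesSinceSync = 0
  for (const text of formattedItems) {
    out.push(text)
    // count the non-empty lines this item contributed
    linesSinceSync += text.split('\n').filter(Boolean).length
    if (linesSinceSync >= minSyncLines) {
      // only place a mark between whole items, never inside one
      out.push('###\n')
      linesSinceSync = 0
    }
  }
  return out.join('')
}

console.log(insertSyncMarks(['a\n', 'b\n', 'c\n'], 2))
```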

formatFile

Format a stream of items (of the type produced by this script) into a GTF file and write it to the filesystem.

Inserts synchronization (###) marks and a ##gtf directive automatically (if one is not already present).

Parameters

  • stream ReadableStream the stream to write to the file

  • filename String the file path to write to

  • options Object (optional, default {})

    • options.encoding String default 'utf8'. encoding for the written file
    • options.minSyncLines Number minimum number of lines between sync (###) marks. default 100
    • options.insertVersionDirective Boolean if the first item in the stream is not a ##gtf directive, insert one. default false

Returns Promise promise for the written filename

util

unescape

Unescape a string/text value used in a GTF attribute. Textual attributes should be surrounded by double quotes.

Source info: https://mblab.wustl.edu/GTF22.html and https://en.wikipedia.org/wiki/Gene_transfer_format

Parameters

Returns String

_escape

Escape a value for use in a GTF attribute value.

Parameters

Returns String

escapeColumn

Escape a value for use in a GTF column value.

Parameters

Returns String

parseAttributes

Parse the 9th column (attributes) of a GTF feature line.

Parameters

Returns Object
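
For orientation, here is a minimal sketch of turning a 9th-column string into the attributes shape shown under "Object format" (each key maps to an array of values, with the double quotes kept). sketchParseAttributes is a hypothetical stand-in, not the library's parseAttributes, and it naively splits on every semicolon, so it would mishandle a quoted value that itself contains one.

```javascript
// Hypothetical helper, not the library's parseAttributes.
function sketchParseAttributes(column) {
  const attributes = {}
  if (!column || column === '.') return attributes
  for (const part of column.split(';')) {
    const trimmed = part.trim()
    if (!trimmed) continue // skip the empty piece after the final ";"
    const space = trimmed.indexOf(' ')
    const key = space === -1 ? trimmed : trimmed.slice(0, space)
    const value = space === -1 ? '' : trimmed.slice(space + 1).trim()
    // repeated keys accumulate, which is why every value is an array
    if (!Object.prototype.hasOwnProperty.call(attributes, key)) attributes[key] = []
    attributes[key].push(value)
  }
  return attributes
}

console.log(sketchParseAttributes('transcript_id "Apple1";'))
```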

parseFeature

Parse a GTF feature line.

Parameters

  • line String

Returns Object the parsed line
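
The sketch below shows how one tab-separated GTF line could map onto the field names used under "Object format", including the rule that a . column becomes null. sketchParseFeatureLine and rawAttributes are hypothetical names for illustration, not the library's parseFeature or its output fields. Leaving score and frame as strings is an assumption based on the "frame": "0" value in the example output.

```javascript
// Column names as they appear in the "Object format" section.
const COLUMNS = ['seq_name', 'source', 'featureType', 'start', 'end', 'score', 'strand', 'frame']

// Hypothetical helper, not the library's parseFeature.
function sketchParseFeatureLine(line) {
  const fields = line.replace(/\r?\n$/, '').split('\t')
  const feature = {}
  COLUMNS.forEach((name, i) => {
    const value = fields[i]
    // "." marks an empty column and becomes null
    feature[name] = value === undefined || value === '.' ? null : value
  })
  // start and end are numbers in the documented output
  if (feature.start !== null) feature.start = parseInt(feature.start, 10)
  if (feature.end !== null) feature.end = parseInt(feature.end, 10)
  // the 9th column is kept raw here; attribute parsing is a separate step
  feature.rawAttributes = fields[8] === undefined ? null : fields[8]
  return feature
}

console.log(
  sketchParseFeatureLine('ctgA\tbare_predicted\tCDS\t10000\t11500\t.\t+\t0\ttranscript_id "Apple1";\n'),
)
```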

parseDirective

Parse a GTF directive/comment line.

Parameters

Returns Object the information in the directive

formatAttributes

Format an attributes object into a string suitable for the 9th column of GTF.

Parameters

formatFeature

Format a feature object or array of feature objects into one or more lines of GTF.

Parameters

  • featureOrFeatures

formatDirective

Format a directive into a line of GTF.

Parameters

Returns String

formatComment

Format a comment into a GTF comment. Yes I know this is just adding a # and a newline.

Parameters

Returns String

formatSequence

Format a sequence object as FASTA

Parameters

Returns String formatted single FASTA sequence
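
As an illustration of the FASTA shape involved, the sketch below formats a sequence object like the one shown under "directives, comments, sequences". sketchFormatSequence is a hypothetical stand-in for the library function, and writing the sequence on a single unwrapped line is an assumption.

```javascript
// Hypothetical helper, not the library's formatSequence.
function sketchFormatSequence(seq) {
  // FASTA header: ">" + id, optionally followed by a space and a description
  const header = seq.description ? `>${seq.id} ${seq.description}` : `>${seq.id}`
  return `${header}\n${seq.sequence}\n`
}

const fasta = sketchFormatSequence({
  id: 'ctgA',
  description: 'test contig',
  sequence: 'ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA',
})
console.log(fasta)
```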

formatItem

Format a directive, comment, or feature, or array of such items, into one or more lines of GTF.

Parameters

Notes and resources

License

MIT © Robert Buels