This document describes the PlasCAD file format: a compact way of storing DNA sequences, features, primers, metadata,
and related information. It's a binary format divided into discrete packets, and uses the .pcad
file extension.
Code implementing the format can be found in pcad.rs.
Most data structures are serialized with the Bincode library. This is convenient for this program's purposes, but makes external interoperability more challenging.
Byte order is big endian.
The starting two bytes of a PlasCAD file are always 0xca, 0xfe.
The remaining bytes are divided into adjacent packets. Packets can be found in any order.
Each packet is laid out as follows:
- Byte 0: Always 0x11
- Bytes 1-4: A 32-bit unsigned integer of payload size, in bytes.
- Byte 5: An 8-bit unsigned integer that indicates the packet's type. (See the Packets sections below for this mapping.)
- Bytes 6-end: The payload; how this is encoded depends on packet type.
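The layout above can be sketched as a small reader that splits a file into raw packets. This is only an illustration, not the code in pcad.rs: the names `RawPacket` and `split_packets` are hypothetical, and it assumes the payload size counts only the bytes after the 6-byte header.

```rust
/// Magic bytes that open every PlasCAD file.
const MAGIC: [u8; 2] = [0xca, 0xfe];
/// Marker byte that opens every packet.
const PACKET_START: u8 = 0x11;
/// Header size: start byte + 4-byte payload size + type byte.
const HEADER_LEN: usize = 6;

/// One raw packet: its type byte and the still-encoded payload (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct RawPacket<'a> {
    pub packet_type: u8,
    pub payload: &'a [u8],
}

/// Splits a whole file into raw packets without interpreting the payloads.
pub fn split_packets(file: &[u8]) -> Result<Vec<RawPacket<'_>>, String> {
    if !file.starts_with(&MAGIC) {
        return Err("missing 0xca, 0xfe magic bytes".to_string());
    }
    let mut packets = Vec::new();
    let mut rest = &file[2..];
    while !rest.is_empty() {
        if rest.len() < HEADER_LEN {
            return Err("truncated packet header".to_string());
        }
        if rest[0] != PACKET_START {
            return Err(format!("expected 0x11, found {:#04x}", rest[0]));
        }
        // Bytes 1-4: payload size, big endian.
        // Assumption: the size excludes the 6 header bytes.
        let size = u32::from_be_bytes([rest[1], rest[2], rest[3], rest[4]]) as usize;
        let packet_type = rest[5];
        let end = HEADER_LEN
            .checked_add(size)
            .filter(|&e| e <= rest.len())
            .ok_or_else(|| "payload extends past end of file".to_string())?;
        packets.push(RawPacket {
            packet_type,
            payload: &rest[HEADER_LEN..end],
        });
        // Packets are adjacent, so the next one starts right after this payload.
        rest = &rest[end..];
    }
    Ok(packets)
}
```

Because packets can appear in any order, a caller would typically loop over the returned list and dispatch on `packet_type` rather than rely on position.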
Contains a DNA sequence.
Bytes 0-3: A 32-bit unsigned integer of sequence length, in nucleotides.
Remaining data: Every two bits encode one nucleotide, packed four per byte starting from the most significant bits. This packet is always a whole number of bytes; bits in the final byte that would extend past the sequence length are ignored. Bit assignments for nucleotides are as follows:
- T: 0b00
- C: 0b01
- A: 0b10
- G: 0b11
This is the same nucleotide mapping as the .2bit format.
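As a rough sketch of the encoding side, the bit assignments above could be packed like this. The function names `nt_to_bits` and `pack_sequence` are made up for illustration, and rejecting any character other than uppercase T, C, A, or G is an assumption of this sketch rather than documented behavior.

```rust
/// Maps a nucleotide letter to its two-bit code (T=00, C=01, A=10, G=11).
fn nt_to_bits(nt: char) -> Option<u8> {
    match nt {
        'T' => Some(0b00),
        'C' => Some(0b01),
        'A' => Some(0b10),
        'G' => Some(0b11),
        // Assumption: anything else is rejected rather than mapped.
        _ => None,
    }
}

/// Builds a sequence payload: a 4-byte big-endian length, then packed nucleotides.
pub fn pack_sequence(seq: &str) -> Option<Vec<u8>> {
    let count = seq.chars().count();
    if count > u32::MAX as usize {
        return None; // The length field is only 32 bits wide.
    }
    let mut out = (count as u32).to_be_bytes().to_vec();
    let mut current = 0u8;
    for (i, nt) in seq.chars().enumerate() {
        // The first nucleotide of each group of four sits in the top two bits.
        let shift = 6 - 2 * (i % 4);
        current |= nt_to_bits(nt)? << shift;
        if i % 4 == 3 {
            out.push(current);
            current = 0;
        }
    }
    // Flush a partially filled final byte; its unused low bits stay zero.
    if count % 4 != 0 {
        out.push(current);
    }
    Some(out)
}
```

Returning `None` on an unexpected character keeps the sketch short; a real implementation might prefer a descriptive error type.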
Example: the sequence CTGATTTCTG would serialize as follows, using 7 total bytes:
- Bytes 0-3: [0, 0, 0, 10], to indicate the sequence length of 10 nucleotides.
Three additional bytes encode the sequence; each byte fits 4 nucleotides:
- Byte 4: CTGA = 0b01_00_11_10
- Byte 5: TTTC = 0b00_00_00_01
- Byte 6: TG = 0b00_11_00_00
On the final byte, note the 0-fill on the right; the sequence length tells us not to decode these padding bits as T.
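The role of the length field is easiest to see on the decoding side. The following is a minimal sketch, with hypothetical names `bits_to_nt` and `unpack_sequence`, that reads exactly as many nucleotides as the length says and so never interprets the zero padding as T.

```rust
/// Maps a two-bit code back to its nucleotide letter.
fn bits_to_nt(bits: u8) -> char {
    match bits & 0b11 {
        0b00 => 'T',
        0b01 => 'C',
        0b10 => 'A',
        _ => 'G',
    }
}

/// Decodes a sequence payload into a string of nucleotide letters.
pub fn unpack_sequence(payload: &[u8]) -> Option<String> {
    if payload.len() < 4 {
        return None; // Not enough bytes for the length field.
    }
    // Bytes 0-3: sequence length in nucleotides, big endian.
    let len = u32::from_be_bytes([payload[0], payload[1], payload[2], payload[3]]) as usize;
    let packed = &payload[4..];
    // Four nucleotides per byte, rounded up.
    let needed = len / 4 + usize::from(len % 4 != 0);
    if packed.len() < needed {
        return None; // Payload is shorter than the declared length requires.
    }
    // Only the first `len` two-bit groups are read; padding bits are skipped.
    let seq: String = (0..len)
        .map(|i| {
            let shift = 6 - 2 * (i % 4);
            bits_to_nt(packed[i / 4] >> shift)
        })
        .collect();
    Some(seq)
}
```

Feeding it the seven bytes from the example gives back CTGATTTCTG, with the four trailing zero bits of byte 6 left unread.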
A bincode serialization of a Vec<Feature>
A bincode serialization of a Vec<Primer>
A bincode serialization of a Metadata
A bincode serialization of an IonConcentrations
A bincode serialization of a Portions
A bincode serialization of an Option<PathBuf>
A bincode serialization of a Topology