Kaitai Struct is a DSL (domain-specific language) designed to
describe binary data structures in a human- and machine-readable
way. This reference is meant to be used as a complete spec for .ksy
files (which are used as input files for the KS compiler): it describes
how they work, which parts they consist of, and how they’re used and
processed. It contains a lot of technical detail, so it’s mostly suited
for those who want to write their own tools. If you just want to use
KS as an end user, please refer to the user guide instead.
The basic idea behind Kaitai Struct is very simple:
-
One can describe a certain data structure using the KS language (this only needs to be done once).
-
This description can be translated into source code for many supported programming languages using a compiler, without the need to write language- and platform-specific code every time.
-
The generated code can be plugged right into a project in the target language (usually as a native module or a library) and used right away.
The KS compiler takes one or several .ksy files as input.
Kaitai Struct data structure (format) descriptions (KSY files) are
simple YAML files and are usually saved with the .ksy extension to
differentiate them from other .yaml files.
Every .ksy file MUST be a valid YAML file, and is expected to be
parsed with generic YAML parsing libraries. Inside, every file MUST
provide a map of strings (keys) to some values. Each file is
essentially a single User-defined type spec.
User-defined type specification is an essential component of the KSY specification. It declares a single user-defined type, which may include the elements described below: meta, doc, doc-ref, seq, types, instances and enums.
Note
User-defined type spec is recursive and can include other
user-defined type specs inside the types element.
Any .ksy file is a single user-defined type (exactly the same as any
nested subtype), with two minor differences:
-
the top-level type spec MUST include the meta/id key, which gives the top-level type its name,
-
nested types MUST NOT have that key (they already get their ID from the map key name given in types, the declaration of subtypes).
The meta key is a map of strings to objects that provides
meta-information relevant to the current user-defined type or to the KSY file as a
whole. It can also be used to assign some defaults and provide some
configuration options for the compiler.
Example:
meta:
  id: foo_arc
  title: Foo Archive
  application: Foo Archiver v1.23
  file-extension:
    - fooarc
    - fooarcz
  license: CC0-1.0
  ks-version: 0.9
  imports:
    - common/archive_header
    - common/compressed_file
  encoding: UTF-8
  endian: le
-
Contents: a string that follows rules for all identifiers
-
Purpose: identifier for a primary structure described in top-level map
-
Influences: it would be converted to suit general formatting rules of a language and used as the name of class
-
Mandatory: yes
-
Contents: a string
-
Purpose: free-form text string that is a longer title of this .ksy file
-
Influences: nothing
-
Mandatory: no
-
Contents: a string
-
Purpose: free-form text string that describes application that’s associated with this particular format, if it’s a format used by single application
-
Influences: nothing
-
Mandatory: no
-
Contents: sequence of strings which contain valid filesystem characters (generally A-Z, a-z, 0-9, _, - and /) corresponding to a relative or absolute path to another .ksy file (without the .ksy extension)
-
Purpose: identify one or more .ksy files which will be imported
-
Influences: allows types defined within the imported .ksy files to be used in the current context
-
Mandatory: no
-
Contents: a string which is a user-defined encoding scheme, for example ASCII, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, or a Name from the IANA character sets registry
-
Purpose: sets a default string encoding for this file
-
Influences: if set, str and strz data types will have their encoding set to this value by default
-
Mandatory: no
-
Contents: le (for little-endian) or be (for big-endian)
-
Purpose: sets a default endianness for this type and all nested subtypes
-
Influences: if set, primitive data types like u4 are treated as aliases to u4le / u4be (depending on the setting); if not set, an attempt to use abbreviated types like u4 (i.e. without a full endianness qualifier) yields a compile-time error
-
Mandatory: no
-
Contents: a string which contains a Kaitai Struct version number
-
Purpose: sets the minimum version of Kaitai Struct Compiler (KSC) required to interpret this .ksy file
-
Influences: prevents this .ksy file from being read by older versions of KSC which may not understand newer syntax of this .ksy file
-
Mandatory: no
-
Contents: true or false (default)
-
Purpose: advise the Kaitai Struct Compiler (KSC) to use debug mode
-
Influences: when set to true, KSC will generate classes as if --debug mode was specified on the command line
-
Mandatory: no
-
Contents: true or false (default)
-
Purpose: advise the Kaitai Struct Compiler (KSC) to ignore missing types in the .ksy file, and assume that these types are already provided externally by the environment the classes are generated for
-
Influences: when set to true, KSC will generate classes as if --opaque-types=true was specified on the command line
-
Mandatory: no
-
Contents: a string which matches one of the identifiers within the SPDX license list
-
Purpose: identify the copyright license of this .ksy file
-
Influences: nothing
-
Mandatory: no
The doc element is used to give a more detailed description of a
user-defined type. In most target languages, it is used as a
docstring (i.e. a special comment which is exported as part of code
documentation), compatible with tools like Javadoc, Doxygen, JSDoc,
.NET XML documentation comments, etc.
-
Contents: free-form string (note that multiple lines are allowed and newlines would be respected during compilation)
-
Purpose: provide longer description of a type for a developer that will use it
-
Influences: generated docstring comments
-
Mandatory: no
Example:
doc: |
  A variable-length unsigned integer using base128 encoding. 1-byte groups
  consist of a 1-bit continuation flag and a 7-bit value, and are ordered
  "most significant group first", i.e. in "big-endian" manner.

  This particular encoding is specified and used in:

  * Standard MIDI file format
  * ASN.1 BER encoding
The doc-ref element can be used to provide a reference to the original
documentation, if your KSY file is actually an implementation of some
documented format.
-
Contents: one of: a URL as text; an arbitrary string; or a URL as text + space + arbitrary string
-
Purpose: provide a reference to the original documentation (either in HTML form, available to be referenced by a certain URL, or just a free-form reference that can be used to point at printed manuals, etc.)
-
Influences: generated docstring comments, usually in the form of a "see also" reference. If only text is provided, it is rendered as plain text. If a URL is provided, it is rendered as an active hyperlink, if possible. If both a URL and text are provided, an active hyperlink leading to the URL is created, with the provided text as its visible caption.
-
Mandatory: no
Examples:
doc-ref: 'http://example.org/file-format-spec/1.0#header'
doc-ref: ECMA-119 standard, section 4.18 "Volume Set"
doc-ref: http://example.org/some-spec Header section
-
Contents: a sequence of Attribute spec elements
-
Purpose: describe the attributes of this type that are parsed sequentially, in the order they are listed
-
Influences: would be translated into parsing method in a target class
-
Mandatory: no
-
Contents: map of strings to User-defined type spec
-
Purpose: declare types for sub-structures that can be referenced in an Attribute spec in any seq or instances element
-
Influences: would be translated into distinct classes (usually nested into main one, if target language allows it)
-
Mandatory: no
-
Contents: map of strings to Instance spec
-
Purpose: description of data that lies outside of normal sequential parsing flow (for example, that requires seeking somewhere in the file) or just needs to be loaded only by special request
-
Influences: would be translated into distinct methods (that read desired data on demand) in current class
-
Mandatory: no
-
Contents: map of strings to Enum spec
-
Purpose: set up named enums, each essentially a mapping from integer constants to symbolic names; these enums can be used in integer attributes via the enum key, converting a simple integer attribute into a proper enum constant
-
Influences: would be represented as enum-like construct (or closest equivalent, if target language doesn’t support enums), nested or namespaced in current type/class
-
Mandatory: no
Attribute specification describes how to read and write one particular attribute: typically a single number, a string, an array of bytes, etc. An attribute can also be a complex structure, specified with a User-defined type spec. Each attribute is typically compiled into equivalent reading / writing instruction(s) in the target language.
Every attribute MUST be a map that maps certain keys to values. Some of these keys are common to every possible attribute spec, some are only valid for certain types.
Examples:
id: coord_x
type: f8
doc: X coordinate of a node.

id: body_len_64
type: u8
if: body_len_32 == 0
doc: |
  Additional value that designates length of the body as 64-bit
  integer. To save space in common cases where 32-bit store is enough,
  present only if `body_len_32` is set to 0.

id: body
type: encoded_body
size: '(body_len_32 == 0) ? body_len_64 : body_len_32'
process: zlib
-
Contents: a string that matches /^[a-z][a-z0-9_]*$/, i.e. it starts with a lowercase letter and may then contain lowercase letters, digits and underscores
-
Purpose: identify the attribute among others
-
Influences: used as variable / field name in the target programming language
-
Mandatory: yes for attributes in a seq (sequence of attributes); forbidden for attributes in instances
-
Contents: one of: a string in UTF-8 encoding; or an array whose elements are bytes in decimal representation, bytes in hexadecimal representation (starting with 0x), or strings in UTF-8 encoding
-
Purpose: specify fixed contents that the parser should encounter at this point
-
Influences: the parser checks whether the specified content exists at the given point in the stream; if everything matches, parsing continues; if the content in the stream doesn’t match the bytes specified in contents, a parsing exception is raised, signalling that something went wrong and it’s meaningless to continue parsing
-
Mandatory: no
Examples:
-
foo expects bytes 66 6f 6f
-
[foo, 0, A, 0xa, 42] expects bytes 66 6f 6f 00 41 0a 2a
-
[1, 0x55, '▒,3', 3] expects bytes 01 55 e2 96 92 2c 33 03
Note
You can use either JSON or YAML array syntax, and quotes are optional in YAML syntax.
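The way a contents value maps to expected bytes can be sketched in Python. This is only an illustration, not the compiler's implementation: the function name contents_to_bytes is hypothetical, and it assumes the YAML parser has already turned numeric items such as 0xa into integers.

```python
def contents_to_bytes(contents):
    """Flatten a `contents` value (a string or a list) into the expected bytes.

    Hypothetical helper: integers are taken as single byte values,
    strings are encoded as UTF-8.
    """
    if isinstance(contents, str):
        return contents.encode("utf-8")
    out = bytearray()
    for item in contents:
        if isinstance(item, int):
            out.append(item)  # raises ValueError if outside 0..255
        else:
            out += item.encode("utf-8")
    return bytes(out)


def check_contents(actual, contents):
    """Raise if the bytes read from the stream differ from the expected ones."""
    expected = contents_to_bytes(contents)
    if actual != expected:
        raise ValueError(f"expected {expected.hex(' ')}, got {actual.hex(' ')}")
```

Called on the three examples above, contents_to_bytes yields the byte sequences listed next to them.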
-
Contents: one of primitive data types or a name of User-defined type spec
-
Purpose: define a data type for an attribute
-
Influences: how many bytes are read, and the data type and contents of the variable in the target programming language
-
Mandatory: no; if type is not specified, the attribute is treated as a generic byte sequence
If type is used to reference a User-defined type spec, then the following
algorithm is used to find which type is referred to, given the name:
-
It tries to find the given type by name in the current type’s types map (declaration of subtypes).
-
If that fails, it checks whether the current type itself has that name and, if it does, uses the current type recursively. Both a name given as a key in types and the name of the top-level type given with meta/id work.
-
If that fails too, it goes one level up in the hierarchy of nested types and tries to resolve the name there.
This mechanism is similar to the type name resolution algorithm used by C++, Java, Ruby, etc., and allows one to effectively use types as namespaces for subtypes. For example, this is legal:
meta:
  id: top_level
seq:
  - id: foo
    type: header
    # resolves to /top_level/header
  - id: bar
    type: body1
  - id: baz
    type: body2
types:
  header: # ... (this is /top_level/header)
  body1:
    seq:
      - id: foo
        type: header
        # resolves to /top_level/header
  body2:
    seq:
      - id: foo
        type: header
        # resolves to /top_level/body2/header
    types:
      header: # ... (this is /top_level/body2/header)
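The three resolution steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the compiler's code: the function resolve_type and the nested-dict layout of root are hypothetical, with each type represented as a dict that may hold a "types" map.

```python
def resolve_type(path, name, root):
    """Resolve `name` as seen from the type located at `path`.

    `path` is a list of type names starting with the top-level meta/id.
    `root` is a nested dict such as {"types": {"header": {}, ...}}.
    Returns the absolute path of the resolved type, or None if not found.
    """
    def node_at(p):
        node = root
        for part in p[1:]:
            node = node["types"][part]
        return node

    current = list(path)
    while current:
        node = node_at(current)
        # Step 1: look in the current type's `types` map.
        if name in node.get("types", {}):
            return current + [name]
        # Step 2: the current type itself may carry that name.
        if current[-1] == name:
            return current
        # Step 3: go one level up in the hierarchy and retry.
        current = current[:-1]
    return None


root = {"types": {"header": {}, "body1": {}, "body2": {"types": {"header": {}}}}}
print(resolve_type(["top_level", "body1"], "header", root))
print(resolve_type(["top_level", "body2"], "header", root))
```

The two calls mirror the body1 and body2 cases of the example: the first falls through to the header declared at the top level, the second finds the header nested inside body2.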
-
Contents: expr, eos or until
-
Purpose: designate a repeated attribute in a structure: if repeat: expr is used, the attribute is repeated the number of times specified in the repeat-expr key; if repeat: eos is used, the attribute is repeated until the end of the current stream; if repeat: until is used, the attribute is repeated until the given expression becomes true (one may use a reference to the last parsed element in such an expression)
-
Influences: the attribute is read as an array / list / sequence, executing the parsing code multiple times
-
Mandatory: no
-
Contents: expression, expected to be of integer type
-
Purpose: specify number of repetitions for repeated attribute
-
Influences: number of times attribute is parsed
-
Mandatory: yes, if repeat: expr is used
-
Contents: expression, expected to be of boolean type
-
Purpose: specify an expression that is checked each time after an element of the requested type is parsed; while the expression is false (i.e. until it becomes true), more elements are parsed and added to the resulting array; one can use _ in the expression as a special variable that references the last read element
-
Influences: number of times attribute is parsed
-
Mandatory: yes, if repeat: until is used
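The three repeat modes can be sketched as a small Python loop. This is an illustrative model only: read_repeated, read_u1 and at_eos are hypothetical names, and the repeat_until callback plays the role of the repeat-until expression, receiving the last read element the way _ does.

```python
import io


def at_eos(stream):
    """Return True if the stream position is at the end of the stream."""
    pos = stream.tell()
    end = stream.seek(0, io.SEEK_END)
    stream.seek(pos)
    return pos >= end


def read_u1(stream):
    """Read one unsigned byte (stand-in for any element parser)."""
    b = stream.read(1)
    if not b:
        raise EOFError("unexpected end of stream")
    return b[0]


def read_repeated(stream, read_one, repeat, repeat_expr=None, repeat_until=None):
    """Read a list of elements according to the given repeat mode."""
    items = []
    if repeat == "expr":
        for _ in range(repeat_expr):
            items.append(read_one(stream))
    elif repeat == "eos":
        while not at_eos(stream):
            items.append(read_one(stream))
    elif repeat == "until":
        while True:
            item = read_one(stream)
            items.append(item)  # the element is added before the check
            if repeat_until(item):
                break
    else:
        raise ValueError(f"unknown repeat mode: {repeat}")
    return items
```

Note that in the until branch the element that makes the expression true has already been parsed and added to the array when the loop stops.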
-
Contents: expression, expected to be of boolean type
-
Purpose: mark the attribute as optional
-
Influences: the attribute is parsed only if the condition specified in the if key evaluates (at runtime) to true
-
Mandatory: no
If no type is specified, the attribute is read as a plain sequence of bytes from the stream. Thus, one has to decide how many bytes to read. There are two ways:
-
Specify the number of bytes to read in the size key. One can specify an integer constant or an expression in this field (for example, if the number of bytes to read depends on some other attribute).
-
Set size-eos: true, which reads all the bytes up to the end of the current stream.
It is possible to apply some algorithmic processing to a byte buffer before accessing it. This can be done using the Processing spec syntax.
One can map an integer to some Enum spec value with the enum key.
type: str specifies a fixed-length string: first a designated number of bytes is read, then the bytes are converted to characters using the specified encoding. There are two ways to specify the amount of data to read:
-
Specify the number of bytes to read directly in the size key. One can specify an integer constant or an expression in this field (for example, if the number of bytes to read depends on some other attribute).
-
Set size-eos: true, which reads all the bytes up to the end of the current stream.
type: strz specifies parsing a string until a terminator byte (i.e. C-style strings terminated with 0).
-
Contents: integer that represents terminating byte
-
Purpose: string reading stops when this byte is encountered
-
Influences: the point in the stream at which string reading stops
-
Mandatory: no, default is 0
-
Contents: boolean
-
Purpose: specify whether the terminator byte should be "consumed" when reading: if consume is true, the stream pointer will point to the byte after the terminator byte; if consume is false, the stream pointer will point to the terminator byte itself
-
Influences: stream position after reading the string
-
Mandatory: no, default is true
-
Contents: boolean
-
Purpose: specify if terminator byte should be considered a part of string read and thus appended to it
-
Influences: the parsed string: if true, the resulting string is 1 byte longer and its last byte is the terminator byte
-
Mandatory: no, default is false
-
Contents: boolean
-
Purpose: allow a missing terminator to be ignored (disabling error reporting)
-
Influences: normally (if eos-error is true), reading a stream without encountering the terminator byte raises an end-of-stream exception; if eos-error is false, string reading stops successfully either when the terminator is encountered or when the end of the stream is reached
-
Mandatory: no, default is true
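How terminator, consume, include and eos-error interact can be sketched in Python. This is a simplified model rather than the actual runtime code: read_str_term is a hypothetical name, its keyword arguments mirror the keys described above with their stated defaults, and the ASCII default encoding is an assumption made only for this example.

```python
import io


def read_str_term(stream, terminator=0, consume=True, include=False,
                  eos_error=True, encoding="ASCII"):
    """Read bytes up to `terminator` and decode them as a string."""
    buf = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            # End of stream reached without seeing the terminator.
            if eos_error:
                raise EOFError("terminator byte not found before end of stream")
            break
        if b[0] == terminator:
            if include:
                buf += b  # the terminator becomes part of the string
            if not consume:
                stream.seek(-1, io.SEEK_CUR)  # leave the pointer on the terminator
            break
        buf += b
    return buf.decode(encoding)


stream = io.BytesIO(b"abc\x00def")
name = read_str_term(stream)
print(name, stream.tell())
```

With the defaults, the stream pointer ends up on the byte after the terminator, so a following read would start at "def".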
There are several data types predefined in Kaitai Struct. They are used as basic building blocks for more complex data types.
Note
Reading and writing of primitive data types is usually very fast
and efficient, as it is implemented in the most "native" way possible in the
target language/platform. For example, if you need to read a 2-byte
integer, it is usually much more efficient to just use the u2 type
instead of doing two u1 reads and then composing these two bytes
using a value instance.
Generally, integer type specification follows this pattern: ([us])(1|2|4|8)(le|be)
-
First letter (u or s) specifies either an unsigned or a signed integer, respectively
-
Second group (1, 2, 4 or 8) specifies the width of the integer in bytes
-
Third group (le or be) specifies little-endian or big-endian encoding, respectively; it can be omitted if a default endianness is specified in meta/endian in the type spec.
For the sake of completeness, here’s the full table of available integer types:
type | Width, bits | Signed? | Endianness | Min value | Max value
---|---|---|---|---|---
u1 | 8 | No | N/A | 0 | 255
u2le | 16 | No | Little | 0 | 65535
u2be | 16 | No | Big | 0 | 65535
u4le | 32 | No | Little | 0 | 4294967295
u4be | 32 | No | Big | 0 | 4294967295
u8le | 64 | No | Little | 0 | 18446744073709551615
u8be | 64 | No | Big | 0 | 18446744073709551615
s1 | 8 | Yes | N/A | -128 | 127
s2le | 16 | Yes | Little | -32768 | 32767
s2be | 16 | Yes | Big | -32768 | 32767
s4le | 32 | Yes | Little | -2147483648 | 2147483647
s4be | 32 | Yes | Big | -2147483648 | 2147483647
s8le | 64 | Yes | Little | -9223372036854775808 | 9223372036854775807
s8be | 64 | Yes | Big | -9223372036854775808 | 9223372036854775807
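The naming pattern maps naturally onto format strings of Python's standard struct module. The sketch below is only an illustration of the pattern and of the meta/endian default. The names int_struct_format and read_int, and the default_endian parameter standing in for meta/endian, are hypothetical.

```python
import re
import struct

_INT_TYPE = re.compile(r"([us])(1|2|4|8)(le|be)?")
_CODES = {1: "b", 2: "h", 4: "i", 8: "q"}  # signed struct codes by width in bytes


def int_struct_format(type_name, default_endian=None):
    """Translate a type name such as 'u2le' or 's4' into a struct format."""
    m = _INT_TYPE.fullmatch(type_name)
    if m is None:
        raise ValueError(f"not an integer type: {type_name}")
    sign, width, suffix = m.group(1), int(m.group(2)), m.group(3)
    if width == 1 and suffix:
        raise ValueError("1-byte integers take no endianness suffix")
    endian = suffix or default_endian
    if width > 1 and endian is None:
        # Mirrors the compile-time error for e.g. `u4` without meta/endian.
        raise ValueError(f"{type_name}: no endianness given and no default set")
    code = _CODES[width].upper() if sign == "u" else _CODES[width]
    return (">" if endian == "be" else "<") + code


def read_int(data, type_name, default_endian=None):
    """Decode one integer of the given type from the start of `data`."""
    fmt = int_struct_format(type_name, default_endian)
    return struct.unpack(fmt, data[:struct.calcsize(fmt)])[0]


print(read_int(b"\x01\x02", "u2le"), read_int(b"\x01\x02", "u2be"))
```

The same two bytes decode to different values depending on the endianness suffix, which is why an abbreviated name like u2 needs a default to fall back on.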
To specify integers with a non-standard number of bits, one
can use the following pattern: b(\d+), where \d+ is the number of
bits allocated.
Floating point number specification also follows the general pattern: f(4|8)(le|be)
-
First letter (f) specifies a floating point type
-
Second group (4 or 8) specifies the width of the floating point number in bytes
-
Third group (le or be) specifies little-endian or big-endian encoding, respectively; it can be omitted if a default endianness is specified in meta/endian in the type spec.
The general format of floats follows the IEEE 754 standard.
The full list of possible floating point types is thus:
type | Width, bits | Endianness | Mantissa bits | Exponent bits
---|---|---|---|---
f4be | 32 | Big | 24 | 8
f4le | 32 | Little | 24 | 8
f8be | 64 | Big | 53 | 11
f8le | 64 | Little | 53 | 11
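As an illustration, the four floating point types correspond to the IEEE 754 formats of Python's standard struct module. The FLOAT_FORMATS table and the read_float helper are hypothetical names used only for this sketch.

```python
import struct

# Type name -> struct format: '>' is big-endian, '<' is little-endian,
# 'f' is a 4-byte float and 'd' is an 8-byte float.
FLOAT_FORMATS = {"f4be": ">f", "f4le": "<f", "f8be": ">d", "f8le": "<d"}


def read_float(data, type_name):
    """Decode one floating point number from the start of `data`."""
    fmt = FLOAT_FORMATS[type_name]
    return struct.unpack(fmt, data[:struct.calcsize(fmt)])[0]


print(read_float(bytes.fromhex("3f800000"), "f4be"))
```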
Byte arrays are used as a generic "fallback" solution, where no type is defined but there is some means of determining the size of the data. This means that one of the following is defined:
-
size (fixed size byte array)
Strings are built on top of byte arrays, inheriting all the properties
that designate the size of the underlying byte array. To designate an
attribute as a string, use type: str and provide encoding info,
either by specifying the encoding key in the attribute or
by applying a type-wide or file-wide default encoding in
meta/encoding.
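Note
type: strz is shorthand for type: str with terminator: 0, so the following two attributes are read the same way:
- id: foo
  type: str
  terminator: 0
- id: bar
  type: strz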
Sometimes the data you’re working on is not only packed in some
structure, but also somehow encoded, obfuscated, encrypted,
compressed, etc. To be able to parse such data, one has to remove
this layer of encryption / obfuscation / compression / etc. This is
called "processing" in Kaitai Struct and it is supported with a range
of process directives. These can be applied to raw byte buffers or
user-typed fields in the following way:
seq:
  - id: buf1
    size: 0x1000
    process: zlib
This declares a field named buf1. When parsing this structure, KS
will read exactly 0x1000 bytes from the source stream and then apply
zlib processing, i.e. decompression of a zlib-compressed
stream. Afterwards, accessing buf1 returns the decompressed stream
(which is most likely longer than 0x1000 bytes), and
accessing the _raw_buf1 property returns the raw (originally
compressed) stream, exactly 0x1000 bytes long.
The following processing directives are available in Kaitai Struct.
Applies a bitwise XOR (bitwise exclusive "or", written as ^ in most C-like languages) to every byte of the stream. The length of the output stays exactly the same as the length of the input. There is one mandatory argument: the key to use for the XOR operation. It can be:
-
a single byte value: in this case, this value is XORed with every byte of the input stream
-
an array of bytes: in this case, the first byte of the input is XORed with the first byte of the key, the second byte of the input with the second byte of the key, etc. If the key is shorter than the input, the key is reused, starting from its first byte.
For example, given the 3-byte key [b0, b1, b2] and the input [x0, x1, x2, x3, x4, …], the output will be:
[x0 ^ b0, x1 ^ b1, x2 ^ b2,
 x3 ^ b0, x4 ^ b1, ...]
Examples:
-
process: xor(0xaa) XORs every byte with 0xaa
-
process: xor([7, 42]) XORs every odd (1st, 3rd, 5th, …) byte with 7, and every even (2nd, 4th, 6th, …) byte with 42
-
process: xor(key_buf) XORs bytes using a key stored in a field named key_buf
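The XOR processing described above can be sketched in a few lines of Python. The function name process_xor is hypothetical; the sketch accepts either a single byte value or a sequence of byte values as the key.

```python
def process_xor(data, key):
    """XOR every byte of `data` with the key, reusing the key cyclically."""
    if isinstance(key, int):
        key = [key]  # a single byte value is a 1-byte key
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


masked = process_xor(b"hello", [7, 42])
print(process_xor(masked, [7, 42]))  # applying the same key again restores the input
```

Because XOR is its own inverse, the same routine serves for both applying and removing the mask.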
Does a circular shift operation on a buffer, rotating every byte by key bits left (rol) or right (ror).
Examples:
-
process: rol(5) rotates every byte 5 bits left: every bit combination b0-b1-b2-b3-b4-b5-b6-b7 becomes b5-b6-b7-b0-b1-b2-b3-b4
-
process: ror(some_val) rotates every byte right by the number of bits determined by the some_val attribute (which might be either parsed previously or calculated on the fly)
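A byte-wise circular shift can be sketched in Python as follows. The names process_rol and process_ror are hypothetical, and the sketch assumes each byte is rotated independently, as the description above states.

```python
def process_rol(data, amount):
    """Rotate every byte of `data` left by `amount` bits."""
    amount %= 8
    return bytes(((b << amount) | (b >> (8 - amount))) & 0xFF for b in data)


def process_ror(data, amount):
    """Rotate every byte right; equivalent to rotating left by 8 - amount."""
    return process_rol(data, 8 - (amount % 8))


rotated = process_rol(b"\x81", 1)
print(rotated.hex())
```

Rotating right by the same amount undoes a left rotation, so process_ror(process_rol(data, n), n) returns the original data.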
Applies zlib decompression to the input buffer, expecting it to be a full-fledged zlib stream, i.e. one with a regular 2-byte zlib header. Decompression parameters are chosen automatically from it. Typical zlib header values:
-
78 01: no compression or low compression
-
78 9C: default compression
-
78 DA: best compression
The length of the output buffer is usually larger than the length of the input. This processing method might throw an exception if the given data is not a valid zlib stream.
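In Python terms, this processing corresponds roughly to the standard zlib module. The sketch below is an illustration, not the generated code: process_zlib is a hypothetical name, and the raw / decoded variables stand in for the _raw_buf1 and buf1 values of the earlier example.

```python
import zlib


def process_zlib(raw):
    """Decompress a complete zlib stream (2-byte header included)."""
    return zlib.decompress(raw)  # raises zlib.error on invalid input


raw = zlib.compress(b"hello " * 100)  # what would be stored in the file
decoded = process_zlib(raw)           # what the parsed field would expose
print(len(raw), len(decoded), raw[:2].hex(" "))
```

The printed header bytes show which of the typical header values the compressor chose for this stream.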
Instance specification is very close to Attribute spec (and inherits all its properties), but it specifies an attribute that lies beyond the regular parsing sequence. Typically, each instance is compiled into a lazy reader function/method that parses (or calculates) the requested data on demand, caches the result, and returns the cached value on subsequent calls.
Everything described in Attribute spec can be used, except for id, which is redundant because every instance already has a name given by its map key.
Enum specification allows one to set up an enum (or the closest equivalent) construct in the target language source file, which can then be referenced in attribute specs using the enum key.
A given type can have multiple named enums, each of which is essentially a map from integers to strings. For example:
enums:
  ip_protocol:
    1: icmp
    6: tcp
    0x11: udp
  port:
    22: ssh
    25: smtp
    80: http
This defines 2 named enums (ip_protocol and port, respectively), which can be referenced in attributes like this:
seq:
  - id: src_port
    type: u2
    enum: port
Enum-mapped fields can also be used in Expressions. One can
compare such a field to enum constants, referencing them using the
enum_name::enum_string syntax:
seq:
  - id: http_version
    type: u1
    if: src_port == port::http
or one can convert it back into an integer, for example:
seq:
  - id: field_for_privileged_port
    type: u1
    if: src_port.to_i < 1024
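For a rough idea of how these two expressions behave, the port enum can be modelled with Python's standard IntEnum. This is a sketch of the semantics, not the code KSC generates: the class name Port and the variable src_port are assumptions for this example.

```python
from enum import IntEnum


class Port(IntEnum):
    """Python counterpart of the `port` enum declared above."""
    ssh = 22
    smtp = 25
    http = 80


src_port = Port(80)                 # integer read from the stream, mapped to the enum

is_http = src_port == Port.http     # like `src_port == port::http`
is_privileged = int(src_port) < 1024  # like `src_port.to_i < 1024`
print(src_port.name, is_http, is_privileged)
```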
Some fields (for example, repeat-expr, size, or if) allow one to specify either
a constant value (for example, 123) or an expression that can
reference other attributes or instances.
A very typical example would be:
seq:
  - id: filename_len
    type: u4
  - id: filename
    type: str
    size: filename_len
    encoding: UTF-8
Here we do two things:
-
First, we read a 4-byte unsigned integer and store it in the filename_len attribute
-
Second, we read a UTF-8 encoded string exactly filename_len bytes long, where filename_len is a reference to the previous attribute
These expressions form a fairly powerful expression language that would be translated into a relevant expression in target programming language.
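As a rough picture of what the filename example translates to, here is a hand-written Python equivalent. It is a sketch, not actual compiler output: read_filename is a hypothetical name, and little-endian byte order is an assumption, since the example does not show a meta/endian setting.

```python
import io
import struct


def read_filename(stream):
    """Read a length-prefixed UTF-8 string: u4 length, then that many bytes."""
    # `type: u4`, assumed little-endian for this sketch
    (filename_len,) = struct.unpack("<I", stream.read(4))
    # `size: filename_len` with `encoding: UTF-8`
    return stream.read(filename_len).decode("utf-8")


stream = io.BytesIO(struct.pack("<I", 5) + b"a.txt" + b"trailing data")
print(read_filename(stream))
```

The expression in the size key becomes an ordinary variable reference in the target language, which is the kind of translation the paragraph above refers to.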