Kaitai Struct: KSY reference

Table of Contents

KSY files
User-defined type spec
Attribute spec
- Common keys
  - id
  - doc
  - doc-ref
  - contents
  - type
  - repeat
  - repeat-expr
  - repeat-until
  - if
- Byte array keys
  - size
  - size-eos
  - process
- Integer keys
  - enum
- String keys
  - size
  - size-eos
  - encoding
- strz keys
  - terminator
  - consume
  - include
  - eos-error
Primitive data types
Processing spec
Instance spec
- pos
- io
- value
Enum spec
Expressions

Kaitai Struct is a DSL (domain-specific language), designed to describe binary data structures in human- and machine-readable way. This reference is meant to be used as a complete spec on .ksy files (that are used as input files for KS compiler): it will describe how they work, which parts they consist of and how they’re used / processed. It contains a lot of technical info, so it’s mostly suited for those who want to write their own tools. If you just want to use KS as end-user, please refer to user guide instead.

The basic idea behind Kaitai Struct is very simple:

One can describe a certain data structure using KS language (this is only needed to be done once)
This description can be translated into a source code for many supported programming languages using a compiler, without the need to write language- and platform-specific code every time.
Generated code can be plugged right into a project in target language (usually as a native module or a library) and used right away.

KS compiler gets one or several .ksy files for input.

KSY files

Kaitai Struct data structure (format) descriptions (KSY files) are simple YAML files and are usually saved using .ksy extension to differentiate them from the rest of .yaml files.

Every .ksy file MUST be a valid YAML file, and is expected to be parsed with generic YAML parsing libraries. Inside, every file MUST provide a map of strings (keys) to some values. Each file is essentially a single User-defined type spec.

User-defined type spec

User-defined type specification is an essential component of KSY specification. It declares a single user-defined type, which may include:

meta — meta-information
doc — docstrings
seq — sequence of attributes
instances
enums
types — declaration of subtypes
[params]

Note	User-defined type spec is recursive and can include other user-defined type specs inside `types` element.

Any .ksy file is a single user-defined type (exactly the same as any nested subtypes), with two minor differences:

top-level type spec MUST include meta/id key that is used to give a name for top-level type,
all nested types MUST NOT have that key (as they already have a certain ID from the map key name provided in types — declaration of subtypes).

`meta` — meta-information

meta key is a map of string to objects that provides meta-information relevant the current user-defined type or KSY file in whole. It also can be used to assign some defaults and provide some configuration options for compiler.

Example:

meta:
  id: foo_arc
  title: Foo Archive
  application: Foo Archiver v1.23
  file-extension:
    - fooarc
    - fooarcz
  license: CC0-1.0
  ks-version: 0.9
  imports:
    - common/archive_header
    - common/compressed_file
  encoding: UTF-8
  endian: le

id

Contents: a string that follows rules for all identifiers
Purpose: identifier for a primary structure described in top-level map
Influences: it would be converted to suit general formatting rules of a language and used as the name of class
Mandatory: yes

title

Contents: a string
Purpose: free-form text string that is a longer title of this .ksy file
Influences: nothing
Mandatory: no

application

Contents: a string
Purpose: free-form text string that describes application that’s associated with this particular format, if it’s a format used by single application
Influences: nothing
Mandatory: no

imports

Contents: sequence of strings which contain valid filesystem characters (generally A-Z, a-z, 0-9, _, - and /) corresponding to a relative or absolute path to another .ksy file (without the .ksy extension)
Purpose: identify one or more .ksy files which will be imported
Influences: allows types defined within the imported .ksy files to be used in the current context
Mandatory: no

encoding

Contents: a string which is a user-defined encoding scheme, for example ASCII, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE or a Name from the IANA character sets registry
Purpose: sets a default string encoding for this file
Influences: if set, str and strz data types will have their encoding by default set to this value
Mandatory: no

endian

Contents: le (for little-endian) or be (for big-endian)
Purpose: sets a default endianness for this type and all nested subtypes
Influences: if set, primitive data types like u4 would be treated as aliases to u4le / u4be (depending on the setting); if not set, attempt to use abbreviated types like u4 (i.e. without full endianness qualifier) will yield compile-time error.
Mandatory: no

ks-version

Contents: a string which contains a Kaitai Struct version number
Purpose: sets the minimum version of Kaitai Struct Compiler (KSC) required to interpret this .ksy file
Influences: prevents this .ksy file from being read by older versions of KSC which may not understand newer syntax of this .ksy file
Mandatory: no

ks-debug

Contents: true or false (default)
Purpose: advise the Kaitai Struct Compiler (KSC) to use debug mode
Influences: when set to true, KSC will generate classes as if --debug mode was specified in the command line
Mandatory: no

ks-opaque-types

Contents: true or false (default)
Purpose: advise the Kaitai Struct Compiler (KSC) to ignore missing types in the .ksy file, and assume that these types are already provided externally by the environment the classes are generated for
Influences: when set to true, KSC will generate classes as if --opaque-types=true mode was specified in the command line
Mandatory: no

license

Contents: a string which matches one of the identifiers within the SPDX license list
Purpose: identify the copyright license of this .ksy file
Influences: nothing
Mandatory: no

file-extension

Contents: a string or an array of strings
Purpose: roughly identify which files can be parsed with this format by filename extension
Influences: may be used for navigation purposes by browsing applications
Mandatory: no

`doc` — docstrings

doc element is used to give a more detailed description of a user-defined type. In most target languages, it will be used as docstring (i.e. a special comment which is exported as part of code documentation), compatible with tools like Javadoc, Doxygen, JSDoc, .NET XML documentation comments, etc.

Contents: free-form string (note that multiple lines are allowed and newlines would be respected during compilation)
Purpose: provide longer description of a type for a developer that will use it
Influences: generated docstring comments
Mandatory: no

Example:

doc: |
  A variable-length unsigned integer using base128 encoding. 1-byte groups
  consists of 1-bit flag of continuation and 7-bit value, and are ordered
  "most significant group first", i.e. in "big-endian" manner.

  This particular encoding is specified and used in:

  * Standard MIDI file format
  * ASN.1 BER encoding

`doc-ref` — documentation reference

doc-ref element can be used to provide reference to original documentation, if your KSY file is actually an implementation of some documented format.

Contents: one of:
- URL as text
- Arbitrary string
- URL as text + space + arbitrary string
Purpose: provide reference to original documentation (either in HTML form, available to be referenced by certain URL, or just a free-form reference that can be used to address printed manuals, etc)
Influences: generated docstring comments, usually in a form of "see also".
- If only text is provided, it will be rendered as neutral text.
- If an URL is provided, it will be rendered an active hyperlink, if possible.
- If both URL and text is provided, it will create an active hyperlink that leads to URL, with a visible caption equal to provided text.
Mandatory: no

Examples:

doc-ref: 'http://example.org/file-format-spec/1.0#header'
doc-ref: ECMA-119 standard, section 4.18 "Volume Set"
doc-ref: http://example.org/some-spec Header section

`seq` — sequence of attributes

Contents: a sequence of Attribute spec elements
Purpose: identifier for a primary structure described in top-level map
Influences: would be translated into parsing method in a target class
Mandatory: no

`types` — declaration of subtypes

Contents: map of strings to User-defined type spec
Purpose: declare types for sub-structures that could be referenced in Attribute spec in any seq or instances element
Influences: would be translated into distinct classes (usually nested into main one, if target language allows it)
Mandatory: no

`instances`

Contents: map of strings to Instance spec
Purpose: description of data that lies outside of normal sequential parsing flow (for example, that requires seeking somewhere in the file) or just needs to be loaded only by special request
Influences: would be translated into distinct methods (that read desired data on demand) in current class
Mandatory: no

`enums`

Contents: map of strings to Enum spec
Purpose: allow to set up named enums: essentially a mapping between integer constants to some symbolic names; these enums can be used in integer attributes using enum key, thus converting it from simple integer attribute into a proper enum constant
Influences: would be represented as enum-like construct (or closest equivalent, if target language doesn’t support enums), nested or namespaced in current type/class
Mandatory: no

Attribute spec

Attribute specification describes how to read and write one particular attribute — typically, a single number, a string, array of bytes, etc. Attribute can also be a complex structure, specified with a User-defined type spec. Each attribute is typically compiled into equivalent reading / writing instruction(s) in target language.

Every attribute MUST BE a map that maps certain keys to values. Some of these keys are common to every possible attribute spec, some are only valid for certain types.

Examples:

id: coord_x
type: f8
doc: X coordinate of a node.

id: body_len_64
type: u8
if: body_len_32 == 0
doc: |
  Additional value that designates length of the body as 64-bit
  integer. To save space in common cases where 32-bit store is enough,
  present only if `body_len_32` is set to 0.

id: body
type: encoded_body
size: (body_len_32 == 0) ? body_len_64 : body_len_32
process: zlib

Common keys

id

Contents: a string that matches /^[a-z][a-z0-9_]*$/ — i.e. starts with lowercase letter and then may contain lowercase letters, numbers and underscore
Purpose: identify attribute among others
Influences: used as variable / field name in target programming language
Mandatory:
- yes (for attributes in a seq — sequence of attributes)
- forbidden (for attributes in instances)

doc

doc-ref

Contents: one of:
a string in UTF-8 encoding
an array of:
bytes in decimal representation
bytes in hexadecimal representation, starting with 0x
strings in UTF-8 encoding
Purpose: specify fixed contents that should be encountered by parser at this point
Influences: parser checks if specified content exists at a given point in stream; if everything matches, then parsing continues; if content in the stream doesn’t match bytes specified in given contents, it will trigger a parsing exception, thus signalling that something went terribly wrong and it’s meaningless to continue parsing.
Mandatory: no

Examples:

foo — expect bytes 66 6f 6f
[foo, 0, A, 0xa, 42] — expect bytes 66 6f 6f 00 41 0a 2a
[1, 0x55, '▒,3', 3] — expect bytes 01 55 e2 96 92 2c 33 03

Note	You can use either JSON or YAML array syntax, and quotes are optional in YAML syntax.

type

Contents: one of primitive data types or a name of User-defined type spec
Purpose: define a data type for an attribute
Influences: how much bytes would be read, data type and contents of a variable in target programming language
Mandatory: no — if type is not specified, then attribute is considered [a generic byte sequence](#no-type-specified)

If type is used to reference a User-defined type spec, then the following algorithm it used to find which type is referred to, given the name:

It tries to find a given type by name in current type’s types — declaration of subtypes map.
If that fails, it checks if current type actually has that name and if it does, uses current type recursively. Both type names given using a key in types — declaration of subtypes and type name of top-level type given with meta/id work.
If that fails too, it goes one level up in the hierarchy of nested types and tries to resolve it there.

This mechanism is similar to the type name resolution algorithm that is used by C++, Java, Ruby, etc, and allows one to effectively use types as namespaces for subtypes, i.e. for example, this is legal:

meta
  id: top_level
seq:
  - id: foo
    type: header
    # resolves to /top_level/header ──┐
  - id: bar     #                     │
    type: body1 #                     │
  - id: baz     #                     │
    type: body2 #                     │
types:          #                     │
  header: # ... <─────────────────────┘ <─┐
  body1:             #                    │
    seq:             #                    │
      - id: foo      #                    │
        type: header #                    │
        # resolves to /top_level/header ──┘
  body2:
    seq:
      - id: foo
        type: header
        # resolves to /top_level/second_level/header ──┐
    types: #                                           │
      header: # ... <──────────────────────────────────┘

repeat

Contents: expr or eos or until
Purpose: designate repeated attribute in a structure;
- if repeat: expr is used, then attribute is repeated the number of times specified in repeat-expr key;
- if repeat: eos is used, then attribute is repeated until the end of current stream
- if repeat: until is used, then attribute is repeated until given expression becomes true (one may use a reference to last parsed element in such expression)
Influences: attribute would be read as array / list / sequence, executing parsing code multiple times
Mandatory: no

repeat-expr

Contents: expression, expected to be of integer type
Purpose: specify number of repetitions for repeated attribute
Influences: number of times attribute is parsed
Mandatory: yes, if repeat: expr

repeat-until

Contents: expression, expected to be of boolean type
Purpose: specify expression that would be checked each time after an element of requested type is parsed; while expression is false (i.e. until it becomes true), more elements would be parsed and added to resulting array; one can use _ in expression as a special variable that references last read element
Influences: number of times attribute is parsed
Mandatory: yes, if repeat: until

if

Contents: expression, expected to be of boolean type
Purpose: mark the attribute as optional
Influences: attribute would be parsed only if condition specified in if key evaluates (in runtime) to true
Mandatory: no

Byte array keys

If there’s no type specified, attribute will be read just as a sequence of bytes from a stream. Thus, one has to decide on how many bytes to read. There are two ways:

Specify amount of bytes to read in size key. One can specify an integer constant or an [[expression|expressions]] in this field (for example, if the number of bytes to read depends on some other attribute).
Set size-eos: true, thus ordering to read all the bytes till the end of current stream.

size

size-eos

process

It is possible to apply some algorithmic processing to a byte buffer before accessing it. This can be done using Processing spec syntax.

Integer keys

One can map an integer to some Enum spec value with an enum attribute.

enum

Contents: name of existing enum
Purpose: apply mapping of parsed integer using a given enum dictionary into some sort of named constant
Influences: field data type becomes given enum
Mandatory: no

String keys

Specifies a fixed-length string, i.e. first it reads a designated number of bytes, then it tries to convert bytes to characters using a specified encoding. There are 2 ways to specify amount of data to read:

Specify number of bytes to read directly in size key. One can specify an integer constant or an [[expression|expressions]] in this field (for example, if the number of bytes to read depends on some other attribute).
Set size-eos: true, thus ordering to read all the bytes till the end of current stream.

size

size-eos

encoding

`strz` keys

Specifies parsing a string until a terminator byte (i.e. C-style strings terminated with 0).

terminator

Contents: integer that represents terminating byte
Purpose: string reading will stop when this byte will be encountered
Influences: field data type becomes given enum
Mandatory: no, default is 0

consume

Contents: boolean
Purpose: specify if terminator byte should be "consumed" when reading - that is:
if consume is true, stream pointer will point to the byte after the terminator byte
if consume is false, stream pointer will point to the terminator byte itself
Influences: stream position after reading of string
Mandatory: no, default is true

include

Contents: boolean
Purpose: specify if terminator byte should be considered a part of string read and thus appended to it
Influences: string parsed: if true, then resulting string would be 1 byte longer and that byte would be terminator byte
Mandatory: no, default is false

eos-error

Contents: boolean
Purpose: allow ignoring of lack of terminator (disabling error reporting)
Influences:
normally (if eos-error is true), reading a stream without encountering the terminator byte would result in end-of-stream exception being raised;
if eos-error is false, string reading will stop successfully at: either:
terminator being encountered, or
end of stream is reached string parsed: if true, then resulting string would be 1 byte longer and that byte would be terminator byte
Mandatory: no, default is true

Primitive data types

There are several data types predefined in Kaitai Struct. They are used as basic building blocks for more complex data types.

Note

Usually reading and writing of primitive data types is very fast and efficient, as it is implemented in most "native" way possible in a target language/platform. For example, if you need to read 2-byte integer, it is usually much more efficient to just use u2 type, instead of doing two u1 reads and then composing these two bytes using value instance".

Integers

Generally, integer type specification follows this pattern: ([us])(1|2|4|8)(le|be)

First letter — u or s — specifies either unsigned or signed integer respectively
Second group — 1, 2, 4 or 8 — specifies width of an integer in bytes
Third group — le or be — specifies little-endian or big-endian encoding respectively; it can be omitted if default endianness specified in meta/endian in a type spec.

For the sake of completeness, here’s the full table of available integer types:

`type`	Width, bits	Signed?	Endianness	Min value	Max value
`u1`	8	No	N/A	0	255
`u2le`	16	No	Little	0	65535
`u2be`	16	No	Big	0	65535
`u4le`	32	No	Little	0	4294967295
`u4be`	32	No	Big	0	4294967295
`u8le`	64	No	Little	0	18446744073709551615
`u8be`	64	No	Big	0	18446744073709551615
`s1`	8	Yes	N/A	-128	127
`s2le`	16	Yes	Little	-32768	32767
`s2be`	16	Yes	Big	-32768	32767
`s4le`	32	Yes	Little	-2147483648	2147483647
`s4be`	32	Yes	Big	-2147483648	2147483647
`s8le`	64	Yes	Little	-9223372036854775808	9223372036854775807
`s8be`	64	Yes	Big	-9223372036854775808	9223372036854775807

Bit-size integers

To specify integers having non-standard number of bits in them, one can use the following pattern: b(\d+), where \d+ is the number of bits allocated.

Floats

Floating point number specification also follows the general pattern: f(4|8)(le|be)

First letter — f — specifies floating point type
Second group — 4 or 8 — specifies width of an integer in bytes
Third group — le or be — specifies little-endian or big-endian encoding respectively; it can be omitted if default endianness specified in meta/endian in a type spec.

The general format of float follows IEEE 754 standard.

The full list of possible floating point type is thus:

`type`	Width, bits	Endianness	Mantissa bits	Exponents bits
`f4be`	32	Big	24	8
`f4le`	32	Little	24	8
`f8be`	64	Big	53	11
`f8le`	64	Little	53	11

Byte arrays

Byte arrays are used as generic "fallback" solution, where no type is defined, but we have some means to understand the size of the data. This means that one of the following is defined:

[attribute-size] — fixed size byte array
[attribute-size-eos]
[attribute-terminator]

Strings

Strings are built on top of byte arrays, inheriting all the properties that allow to designate size of underlying byte array. To designate attribute as string type, use type: str and provide encoding info, either by specifying [attribute-encoding] key in the attribute, or by applying type or file-wide default encoding in meta/encoding.

Note	`type: strz` can be also used as a shortcut to define a null-terminated string (C-style). I.e. these `foo` and `bar` attributes are equivalent: - id: foo type: str terminator: 0 - id: bar type: strz

Processing spec

Sometimes the data you’re working on is not only packed in some structure, but also somehow encoded, obfuscated, encrypted, compressed, etc. So, to be able to parse such data, one has to remove this layer of encryption / obfuscation / compression / etc. This is called "processing" in Kaitai Struct and it is supported with a range of process directives. These can be applied to raw byte buffers or user-typed fields in the following way:

seq:
  - id: buf1
    size: 0x1000
    process: zlib

This declares a field named buf1. When parsing this structure, KS will read exactly 0x1000 bytes from a source stream and then apply zlib processing, i.e. decompression of zlib-compressed stream. Afterwards, accessing buf1 would return decompressed stream (which would be most likely larger than 0x1000 bytes long), and accessing _raw_buf1 property would return raw (originally compressed) stream, exactly 0x1000 bytes long.

There are following processing directives available in Kaitai Struct.

xor(key)

Applies a bitwise XOR (bitwise exclusive "or", written as ^ in most C-like languages) to every byte of the stream. Length of output stays exactly the same as the length of input. There is one mandatory argument - the key to use for XOR operation. It can be:

a single byte value — in this case this value would be XORed with every byte of the input stream
an array of bytes — in this case, first byte of the input would be XORed with first byte of the key, second byte of the input with second byte of the keys, etc. If the key is shorter than the input, key will be reused, starting from the first byte.

For example, given 3-byte key [b0, b1, b2] and input line [x0, x1, x2, x3, x4, …] output will be:

[x0 ^ b0, x1 ^ b1, x2 ^ b2,
 x3 ^ b0, x4 ^ b1, ...]

Examples:

process: xor(0xaa) — XORs every byte with 0xaa
process: xor([7, 42]) — XORs every odd (1st, 3rd, 5th, …) byte with 7, and every even (2nd, 4th, 6th, …) byte with 42
process: xor(key_buf) — XORs bytes using a key stored in a field named key_buf

rol(key), ror(key)

Does a circular shift operation on a buffer, rotating every byte by key bits left (rol) or right (ror).

Examples:

process: rol(5) — rotates every byte 5 bits left: every given bit combination b0-b1-b2-b3-b4-b5-b6-b7 becomes b5-b6-b7-b0-b1-b2-b3-b4
process: ror(some_val) — rotates every byte right by number of bits determined by some_val attribute (which might be either parsed previously or calculated on the fly)

zlib

Applies a zlib decompression to input buffer, expecting it to be a full-fledged zlib stream, i.e. having a regular 2-byte zlib header. Decompression parameters are chosen automatically from it. Typical zlib header values:

78 01 — no compression or low compression
78 9C — default compression
78 DA — best compression

Length of output buffer is usually larger that length of the input. This processing method might throw an exception if the data given is not a valid zlib stream.

Instance spec

Instance specification is very close to Attribute spec (and inherits all its properties), but it specifies an attribute that lies beyond regular parsing sequence. Typically, each instance is compiled into a lazy reader function/method that will parse (or calculate) requested data on demand, cache the result and return whatever’s been parsed previously on subsequent calls.

Everything that described in Attribute spec can be used, except for id, which is useless, because all instances already have name due to map string key.

pos

Specifies position in a stream from which the value should be parsed.

io

Specifies an IO stream from which a value should be parsed.

value

Overrides any reading & parsing. Instead, just calculates function specified in value and returns the result as this instance. Can be used for multitude of purposes, such as data conversion while reading, etc.

Enum spec

Enum specification allows to set up a enum (or closest equivalent) construct in target language source file, which can then be referenced in attribute specs using enum key.

A given type can have multiple named enums, each of which is essentially a map from integers to strings. For example:

enums:
  ip_protocol:
    1: icmp
    6: tcp
    0x11: udp
  port:
    22: ssh
    25: smtp
    80: http

This one defines 2 named enums (named ip_protocol and port respectively), which can be referenced in attributes like that:

seq:
  - id: src_port
    type: u2
    enum: port

Enum-mapped fields can be also used in Expressions. One can compare it to enum constants, referencing it using enum_name::enum_string syntax:

seq:
  - id: http_version
    type: u1
    if: src_port == port::http

or one can convert them back into an integer, for example:

seq:
  - id: field_for_privileged_port
    type: u1
    if: src_port.to_i < 1024

Expressions

Some fields (for example, repeat-expr, [attribute-size], or if) allow to specify either constant values (for example, 123) or an expression that could reference another attributes or instances.

A very typical example would be:

seq:
  - id: filename_len
    type: u4
  - id: filename
    type: str
    size: filename_len
    encoding: UTF-8

Here we do two things:

First, we read 4-byte unsigned integer is read and store it in filename_len attribute
Second, we read an UTF-8 encoded string exactly filename_len bytes long, where filename_len is a reference the previous attribute

These expressions form a fairly powerful expression language that would be translated into a relevant expression in target programming language.

Files

ksy_reference.adoc

Latest commit

History

ksy_reference.adoc

File metadata and controls

Kaitai Struct: KSY reference

KSY files

User-defined type spec

meta — meta-information

id

title

application

imports

encoding

endian

ks-version

ks-debug

ks-opaque-types

license

file-extension

doc — docstrings

doc-ref — documentation reference

seq — sequence of attributes

types — declaration of subtypes

instances

enums

Attribute spec

Common keys

id

doc

doc-ref

contents

type

repeat

repeat-expr

repeat-until

if

Byte array keys

size

size-eos

process

Integer keys

enum

String keys

size

size-eos

encoding

strz keys

terminator

consume

include

eos-error

Primitive data types

Integers

Bit-size integers

Floats

Byte arrays

Strings

Processing spec

xor(key)

rol(key), ror(key)

zlib

Instance spec

pos

io

value

Enum spec

Expressions

`meta` — meta-information

`doc` — docstrings

`doc-ref` — documentation reference

`seq` — sequence of attributes

`types` — declaration of subtypes

`instances`

`enums`

`strz` keys