Kaitai Struct: KSY reference

Kaitai Struct is a DSL (domain-specific language) designed to describe binary data structures in a human- and machine-readable way. This reference is meant to serve as a complete specification of .ksy files (the input files for the KS compiler): it describes how they work, which parts they consist of, and how they’re used and processed. It contains a lot of technical detail, so it’s mostly suited for those who want to write their own tools. If you just want to use KS as an end user, please refer to the user guide instead.

The basic idea behind Kaitai Struct is very simple:

  • One describes a certain data structure using the KS language (this needs to be done only once).

  • A compiler translates this description into source code for any of the supported programming languages, without the need to write language- and platform-specific code every time.

  • The generated code can be plugged right into a project in the target language (usually as a native module or a library) and used right away.

The KS compiler takes one or more .ksy files as input.

KSY files

Kaitai Struct data structure (format) descriptions (KSY files) are simple YAML files and are usually saved using .ksy extension to differentiate them from the rest of .yaml files.

Every .ksy file MUST be a valid YAML file, and is expected to be parsed with generic YAML parsing libraries. Inside, every file MUST provide a map of strings (keys) to some values. Each file is essentially a single User-defined type spec.

User-defined type spec

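User-defined type specification is an essential component of the KSY specification. It declares a single user-defined type, which may include:

  • meta: meta-information

  • doc: docstrings

  • doc-ref: documentation reference

  • seq: sequence of attributes

  • types: declaration of subtypes

  • instances

  • enums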

Note
User-defined type spec is recursive and can include other user-defined type specs inside types element.

Any .ksy file is a single user-defined type (exactly the same as any nested subtypes), with two minor differences:

  • top-level type spec MUST include meta/id key that is used to give a name for top-level type,

  • all nested types MUST NOT have that key (as they already have a certain ID from the map key name provided in types — declaration of subtypes).

meta — meta-information

The meta key is a map of strings to objects that provides meta-information relevant to the current user-defined type or to the KSY file as a whole. It can also be used to assign defaults and to provide configuration options for the compiler.

Example:

meta:
  id: foo_arc
  title: Foo Archive
  application: Foo Archiver v1.23
  file-extension:
    - fooarc
    - fooarcz
  license: CC0-1.0
  ks-version: 0.9
  imports:
    - common/archive_header
    - common/compressed_file
  encoding: UTF-8
  endian: le

id

  • Contents: a string that follows rules for all identifiers

  • Purpose: identifier for a primary structure described in top-level map

  • Influences: it would be converted to suit general formatting rules of a language and used as the name of class

  • Mandatory: yes

title

  • Contents: a string

  • Purpose: free-form text string that is a longer title of this .ksy file

  • Influences: nothing

  • Mandatory: no

application

  • Contents: a string

  • Purpose: free-form text string that describes application that’s associated with this particular format, if it’s a format used by single application

  • Influences: nothing

  • Mandatory: no

imports

  • Contents: sequence of strings which contain valid filesystem characters (generally A-Z, a-z, 0-9, _, - and /) corresponding to a relative or absolute path to another .ksy file (without the .ksy extension)

  • Purpose: identify one or more .ksy files which will be imported

  • Influences: allows types defined within the imported .ksy files to be used in the current context

  • Mandatory: no

encoding

  • Contents: a string which is a user-defined encoding scheme, for example ASCII, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE or a Name from the IANA character sets registry

  • Purpose: sets a default string encoding for this file

  • Influences: if set, str and strz data types will have their encoding by default set to this value

  • Mandatory: no

endian

  • Contents: le (for little-endian) or be (for big-endian)

  • Purpose: sets a default endianness for this type and all nested subtypes

  • Influences: if set, primitive data types like u4 are treated as aliases of u4le / u4be (depending on the setting); if not set, any attempt to use abbreviated types like u4 (i.e. without a full endianness qualifier) yields a compile-time error.

  • Mandatory: no

ks-version

  • Contents: a string which contains a Kaitai Struct version number

  • Purpose: sets the minimum version of Kaitai Struct Compiler (KSC) required to interpret this .ksy file

  • Influences: prevents this .ksy file from being read by older versions of KSC which may not understand newer syntax of this .ksy file

  • Mandatory: no

ks-debug

  • Contents: true or false (default)

  • Purpose: advise the Kaitai Struct Compiler (KSC) to use debug mode

  • Influences: when set to true, KSC will generate classes as if --debug mode was specified in the command line

  • Mandatory: no

ks-opaque-types

  • Contents: true or false (default)

  • Purpose: advise the Kaitai Struct Compiler (KSC) to ignore missing types in the .ksy file, and assume that these types are already provided externally by the environment the classes are generated for

  • Influences: when set to true, KSC will generate classes as if --opaque-types=true mode was specified in the command line

  • Mandatory: no

license

  • Contents: a string which matches one of the identifiers within the SPDX license list

  • Purpose: identify the copyright license of this .ksy file

  • Influences: nothing

  • Mandatory: no

file-extension

  • Contents: a string or an array of strings

  • Purpose: roughly identify which files can be parsed with this format by filename extension

  • Influences: may be used for navigation purposes by browsing applications

  • Mandatory: no

doc — docstrings

doc element is used to give a more detailed description of a user-defined type. In most target languages, it will be used as docstring (i.e. a special comment which is exported as part of code documentation), compatible with tools like Javadoc, Doxygen, JSDoc, .NET XML documentation comments, etc.

  • Contents: free-form string (note that multiple lines are allowed and newlines would be respected during compilation)

  • Purpose: provide longer description of a type for a developer that will use it

  • Influences: generated docstring comments

  • Mandatory: no

Example:

doc: |
  A variable-length unsigned integer using base128 encoding. 1-byte groups
  consist of a 1-bit continuation flag and a 7-bit value, and are ordered
  "most significant group first", i.e. in "big-endian" manner.

  This particular encoding is specified and used in:

  * Standard MIDI file format
  * ASN.1 BER encoding

doc-ref — documentation reference

doc-ref element can be used to provide reference to original documentation, if your KSY file is actually an implementation of some documented format.

  • Contents: one of:

    • URL as text

    • Arbitrary string

    • URL as text + space + arbitrary string

  • Purpose: provide reference to original documentation (either in HTML form, available to be referenced by certain URL, or just a free-form reference that can be used to address printed manuals, etc)

  • Influences: generated docstring comments, usually in a form of "see also".

    • If only text is provided, it will be rendered as neutral text.

    • If a URL is provided, it will be rendered as an active hyperlink, if possible.

    • If both URL and text are provided, it will create an active hyperlink that leads to the URL, with a visible caption equal to the provided text.

  • Mandatory: no

Examples:

doc-ref: 'http://example.org/file-format-spec/1.0#header'
doc-ref: ECMA-119 standard, section 4.18 "Volume Set"
doc-ref: http://example.org/some-spec Header section

seq — sequence of attributes

  • Contents: a sequence of Attribute spec elements

  • Purpose: describe the attributes that make up this type, in the order in which they are read from the stream

  • Influences: would be translated into parsing method in a target class

  • Mandatory: no

types — declaration of subtypes

  • Contents: map of strings to User-defined type spec

  • Purpose: declare types for sub-structures that could be referenced in Attribute spec in any seq or instances element

  • Influences: would be translated into distinct classes (usually nested into main one, if target language allows it)

  • Mandatory: no

instances

  • Contents: map of strings to Instance spec

  • Purpose: description of data that lies outside of normal sequential parsing flow (for example, that requires seeking somewhere in the file) or just needs to be loaded only by special request

  • Influences: would be translated into distinct methods (that read desired data on demand) in current class

  • Mandatory: no

enums

  • Contents: map of strings to Enum spec

  • Purpose: set up named enums, each of which is essentially a mapping from integer constants to symbolic names; these enums can be applied to integer attributes using the enum key, thus converting a simple integer attribute into a proper enum constant

  • Influences: would be represented as enum-like construct (or closest equivalent, if target language doesn’t support enums), nested or namespaced in current type/class

  • Mandatory: no

Attribute spec

Attribute specification describes how to read and write one particular attribute — typically, a single number, a string, array of bytes, etc. Attribute can also be a complex structure, specified with a User-defined type spec. Each attribute is typically compiled into equivalent reading / writing instruction(s) in target language.

Every attribute MUST be a map that maps certain keys to values. Some of these keys are common to every possible attribute spec, some are only valid for certain types.

Examples:

id: coord_x
type: f8
doc: X coordinate of a node.

id: body_len_64
type: u8
if: body_len_32 == 0
doc: |
  Additional value that designates length of the body as 64-bit
  integer. To save space in common cases where 32-bit store is enough,
  present only if `body_len_32` is set to 0.

id: body
type: encoded_body
size: (body_len_32 == 0) ? body_len_64 : body_len_32
process: zlib

Common keys

id

  • Contents: a string that matches /^[a-z][a-z0-9_]*$/ — i.e. starts with lowercase letter and then may contain lowercase letters, numbers and underscore

  • Purpose: identify attribute among others

  • Influences: used as variable / field name in target programming language

  • Mandatory:

doc

doc-ref

contents

  • Contents: one of:

    • a string in UTF-8 encoding

    • an array of:

      • bytes in decimal representation

      • bytes in hexadecimal representation, starting with 0x

      • strings in UTF-8 encoding

  • Purpose: specify fixed contents that should be encountered by parser at this point

  • Influences: parser checks if specified content exists at a given point in stream; if everything matches, then parsing continues; if content in the stream doesn’t match bytes specified in given contents, it will trigger a parsing exception, thus signalling that something went terribly wrong and it’s meaningless to continue parsing.

  • Mandatory: no

Examples:

  • foo — expect bytes 66 6f 6f

  • [foo, 0, A, 0xa, 42] — expect bytes 66 6f 6f 00 41 0a 2a

  • [1, 0x55, '▒,3', 3] — expect bytes 01 55 e2 96 92 2c 33 03

Note
You can use either JSON or YAML array syntax, and quotes are optional in YAML syntax.
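
The byte sequences in the examples above can be reproduced with a small sketch. It assumes the contents value has already been loaded by a YAML parser, so numbers arrive as integers and everything else as strings; the function names contents_to_bytes and check_contents are hypothetical and are not part of any Kaitai Struct runtime.

```python
import io


def contents_to_bytes(contents):
    """Flatten a parsed `contents` value (string or list) into expected bytes."""
    if isinstance(contents, str):
        return contents.encode("utf-8")
    out = bytearray()
    for item in contents:
        if isinstance(item, int):
            # Each number stands for exactly one byte.
            if not 0 <= item <= 255:
                raise ValueError(f"byte value out of range: {item}")
            out.append(item)
        elif isinstance(item, str):
            # Strings contribute their UTF-8 encoding.
            out += item.encode("utf-8")
        else:
            raise TypeError(f"unsupported contents item: {item!r}")
    return bytes(out)


def check_contents(stream, contents):
    """Read as many bytes as expected and fail if they differ."""
    expected = contents_to_bytes(contents)
    actual = stream.read(len(expected))
    if actual != expected:
        raise ValueError(f"expected {expected.hex()}, got {actual.hex()}")
    return actual


# Usage: a stream that starts with the magic bytes "foo".
magic = check_contents(io.BytesIO(b"foo and more"), "foo")
```

Raising an error on the first mismatch mirrors the described behavior: once fixed contents do not match, continuing to parse is meaningless.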

type

  • Contents: one of primitive data types or a name of User-defined type spec

  • Purpose: define a data type for an attribute

  • Influences: how many bytes would be read, data type and contents of a variable in target programming language

  • Mandatory: no; if type is not specified, the attribute is considered a generic byte sequence (see Byte array keys)

If type is used to reference a User-defined type spec, then the following algorithm is used to find which type is referred to, given the name:

  1. It tries to find a given type by name in current type’s types — declaration of subtypes map.

  2. If that fails, it checks if current type actually has that name and if it does, uses current type recursively. Both type names given using a key in types — declaration of subtypes and type name of top-level type given with meta/id work.

  3. If that fails too, it goes one level up in the hierarchy of nested types and tries to resolve it there.

This mechanism is similar to the type name resolution algorithm used by C++, Java, Ruby, etc., and allows one to effectively use types as namespaces for subtypes. For example, this is legal:

meta:
  id: top_level
seq:
  - id: foo
    type: header
    # resolves to /top_level/header ──┐
  - id: bar     #
    type: body1 #
  - id: baz     #
    type: body2 #
types:          #
  header: # ... <─────────────────────┘ <─┐
  body1:             #
    seq:             #
      - id: foo      #
        type: header #
        # resolves to /top_level/header ──┘
  body2:
    seq:
      - id: foo
        type: header
        # resolves to /top_level/body2/header ─────────┐
    types: #
      header: # ... <──────────────────────────────────┘
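
The three lookup steps can be sketched as a loop that walks from the current type towards the top-level type. This is an illustration only: the TypeSpec class and the resolve function are hypothetical names, not the compiler's actual data structures.

```python
class TypeSpec:
    """Minimal stand-in for a user-defined type with nested subtypes."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.types = {}

    def add(self, name):
        child = TypeSpec(name, self)
        self.types[name] = child
        return child

    def path(self):
        prefix = self.parent.path() if self.parent else ""
        return f"{prefix}/{self.name}"


def resolve(current, name):
    scope = current
    while scope is not None:
        if name in scope.types:   # step 1: look in this type's own subtypes
            return scope.types[name]
        if scope.name == name:    # step 2: the type may refer to itself
            return scope
        scope = scope.parent      # step 3: go one level up and retry
    raise LookupError(f"unable to resolve type {name!r}")


# The hierarchy from the example above.
top = TypeSpec("top_level")
top.add("header")
body1 = top.add("body1")
body2 = top.add("body2")
body2.add("header")

print(resolve(body1, "header").path())  # /top_level/header
print(resolve(body2, "header").path())  # /top_level/body2/header
```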

repeat

  • Contents: expr or eos or until

  • Purpose: designate repeated attribute in a structure;

    • if repeat: expr is used, then attribute is repeated the number of times specified in repeat-expr key;

    • if repeat: eos is used, then attribute is repeated until the end of current stream

    • if repeat: until is used, then attribute is repeated until given expression becomes true (one may use a reference to last parsed element in such expression)

  • Influences: attribute would be read as array / list / sequence, executing parsing code multiple times

  • Mandatory: no

repeat-expr

  • Contents: expression, expected to be of integer type

  • Purpose: specify number of repetitions for repeated attribute

  • Influences: number of times attribute is parsed

  • Mandatory: yes, if repeat: expr

repeat-until

  • Contents: expression, expected to be of boolean type

  • Purpose: specify expression that would be checked each time after an element of requested type is parsed; while expression is false (i.e. until it becomes true), more elements would be parsed and added to resulting array; one can use _ in expression as a special variable that references last read element

  • Influences: number of times attribute is parsed

  • Mandatory: yes, if repeat: until
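
As a rough illustration of the three repetition modes, the sketch below reads one-byte unsigned integers from an in-memory stream. All function names are hypothetical, and the sketch assumes that with repeat: until the element that satisfies the condition is kept in the resulting array, since the condition is checked only after the element has been parsed.

```python
import io


def read_u1(stream):
    data = stream.read(1)
    if len(data) != 1:
        raise EOFError("unexpected end of stream")
    return data[0]


def is_eof(stream):
    # Peek one byte and restore the position.
    pos = stream.tell()
    at_end = stream.read(1) == b""
    stream.seek(pos)
    return at_end


def repeat_expr(stream, read_one, count):
    # repeat: expr, with `count` being the value of repeat-expr
    return [read_one(stream) for _ in range(count)]


def repeat_eos(stream, read_one):
    # repeat: eos
    items = []
    while not is_eof(stream):
        items.append(read_one(stream))
    return items


def repeat_until(stream, read_one, condition):
    # repeat: until, with `condition` receiving the last read element (`_`)
    items = []
    while True:
        item = read_one(stream)
        items.append(item)
        if condition(item):
            return items


data = bytes([3, 1, 2, 0, 9])
print(repeat_expr(io.BytesIO(data), read_u1, 2))                 # [3, 1]
print(repeat_eos(io.BytesIO(data), read_u1))                     # [3, 1, 2, 0, 9]
print(repeat_until(io.BytesIO(data), read_u1, lambda _: _ == 0)) # [3, 1, 2, 0]
```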

if

  • Contents: expression, expected to be of boolean type

  • Purpose: mark the attribute as optional

  • Influences: attribute would be parsed only if condition specified in if key evaluates (in runtime) to true

  • Mandatory: no

Byte array keys

If there’s no type specified, attribute will be read just as a sequence of bytes from a stream. Thus, one has to decide on how many bytes to read. There are two ways:

  • Specify the number of bytes to read in the size key. One can specify an integer constant or an expression (see Expressions) in this field (for example, if the number of bytes to read depends on some other attribute).

  • Set size-eos: true, thus ordering to read all the bytes till the end of current stream.

size

size-eos

process

It is possible to apply some algorithmic processing to a byte buffer before accessing it. This can be done using Processing spec syntax.

Integer keys

One can map an integer to some Enum spec value with an enum attribute.

enum

  • Contents: name of existing enum

  • Purpose: apply mapping of parsed integer using a given enum dictionary into some sort of named constant

  • Influences: field data type becomes given enum

  • Mandatory: no

String keys

Specifies a fixed-length string, i.e. first it reads a designated number of bytes, then it tries to convert bytes to characters using a specified encoding. There are 2 ways to specify amount of data to read:

  • Specify the number of bytes to read directly in the size key. One can specify an integer constant or an expression (see Expressions) in this field (for example, if the number of bytes to read depends on some other attribute).

  • Set size-eos: true, thus ordering to read all the bytes till the end of current stream.

size

size-eos

encoding

strz keys

Specifies parsing a string until a terminator byte (i.e. C-style strings terminated with 0).

terminator

  • Contents: integer that represents terminating byte

  • Purpose: string reading will stop when this byte is encountered

  • Influences: where the string ends, and thus how many bytes are read from the stream

  • Mandatory: no, default is 0

consume

  • Contents: boolean

  • Purpose: specify if terminator byte should be "consumed" when reading, that is:

    • if consume is true, stream pointer will point to the byte after the terminator byte

    • if consume is false, stream pointer will point to the terminator byte itself

  • Influences: stream position after reading of string

  • Mandatory: no, default is true

include

  • Contents: boolean

  • Purpose: specify if terminator byte should be considered a part of string read and thus appended to it

  • Influences: string parsed: if true, then resulting string would be 1 byte longer and that byte would be terminator byte

  • Mandatory: no, default is false

eos-error

  • Contents: boolean

  • Purpose: allow ignoring of lack of terminator (disabling error reporting)

  • Influences:

    • normally (if eos-error is true), reading a stream without encountering the terminator byte would result in an end-of-stream exception being raised;

    • if eos-error is false, string reading will stop successfully at either:

      • the terminator being encountered, or

      • the end of stream being reached

  • Mandatory: no, default is true
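
The interplay of terminator, consume, include and eos-error can be sketched as follows. This is an illustration of the rules listed above under the assumption of a seekable stream; read_strz is a hypothetical helper and not a Kaitai Struct runtime API.

```python
import io


def read_strz(stream, encoding, terminator=0, consume=True,
              include=False, eos_error=True):
    buf = bytearray()
    while True:
        b = stream.read(1)
        if b == b"":
            # End of stream reached without seeing the terminator.
            if eos_error:
                raise EOFError("terminator not found before end of stream")
            break
        if b[0] == terminator:
            if include:
                buf += b                       # terminator becomes part of the string
            if not consume:
                stream.seek(-1, io.SEEK_CUR)   # leave pointer on the terminator
            break
        buf += b
    return bytes(buf).decode(encoding)


stream = io.BytesIO(b"abc\x00def")
print(read_strz(stream, "ASCII"))  # abc
print(stream.tell())               # 4, i.e. just past the terminator
```

With consume set to false, the same call would leave the stream position at 3, so the next read would see the terminator byte again.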

Primitive data types

There are several data types predefined in Kaitai Struct. They are used as basic building blocks for more complex data types.

Note
Usually reading and writing of primitive data types is very fast and efficient, as it is implemented in the most "native" way possible in a target language/platform. For example, if you need to read a 2-byte integer, it is usually much more efficient to just use the u2 type instead of doing two u1 reads and then composing these two bytes using a value instance.

Integers

Generally, integer type specification follows this pattern: ([us])(1|2|4|8)(le|be)

  • First letter (u or s) specifies an unsigned or a signed integer respectively

  • Second group (1, 2, 4 or 8) specifies the width of the integer in bytes

  • Third group (le or be) specifies little-endian or big-endian encoding respectively; it can be omitted if a default endianness is specified in meta/endian in a type spec.

For the sake of completeness, here’s the full table of available integer types:

type   Width, bits   Signed?   Endianness   Min value              Max value
u1     8             No        N/A          0                      255
u2le   16            No        Little       0                      65535
u2be   16            No        Big          0                      65535
u4le   32            No        Little       0                      4294967295
u4be   32            No        Big          0                      4294967295
u8le   64            No        Little       0                      18446744073709551615
u8be   64            No        Big          0                      18446744073709551615
s1     8             Yes       N/A          -128                   127
s2le   16            Yes       Little       -32768                 32767
s2be   16            Yes       Big          -32768                 32767
s4le   32            Yes       Little       -2147483648            2147483647
s4be   32            Yes       Big          -2147483648            2147483647
s8le   64            Yes       Little       -9223372036854775808   9223372036854775807
s8be   64            Yes       Big          -9223372036854775808   9223372036854775807
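
The naming pattern and the value ranges in the table can be cross-checked with a short sketch. The helper names are hypothetical; the default_endian parameter stands in for a meta/endian setting, and the sketch rejects an abbreviated multi-byte type when no default is given, matching the compile-time error described for meta/endian.

```python
import re

INT_TYPE = re.compile(r"([us])(1|2|4|8)(le|be)?")


def parse_int_type(name, default_endian=None):
    """Return (signed, width_in_bytes, endian) for a type name like 'u4le'."""
    m = INT_TYPE.fullmatch(name)
    if m is None:
        raise ValueError(f"not an integer type: {name}")
    signed = m.group(1) == "s"
    width = int(m.group(2))
    endian = m.group(3)
    if width == 1:
        if endian is not None:
            raise ValueError(f"{name}: 1-byte types take no endianness")
        return signed, width, "be"  # byte order is irrelevant for one byte
    endian = endian or default_endian
    if endian is None:
        raise ValueError(f"{name}: endianness required (no default set)")
    return signed, width, endian


def value_range(name, default_endian=None):
    signed, width, _ = parse_int_type(name, default_endian)
    bits = width * 8
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def decode_int(name, data, default_endian=None):
    signed, width, endian = parse_int_type(name, default_endian)
    if len(data) != width:
        raise ValueError(f"{name} needs exactly {width} byte(s)")
    byteorder = "little" if endian == "le" else "big"
    return int.from_bytes(data, byteorder, signed=signed)


print(value_range("s2le"))                  # (-32768, 32767)
print(hex(decode_int("u2le", b"\x01\x02")))  # 0x201
print(hex(decode_int("u2be", b"\x01\x02")))  # 0x102
```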

Bit-size integers

To specify integers having non-standard number of bits in them, one can use the following pattern: b(\d+), where \d+ is the number of bits allocated.

Floats

Floating point number specification also follows the general pattern: f(4|8)(le|be)

  • First letter (f) specifies a floating point type

  • Second group (4 or 8) specifies the width of the floating point number in bytes

  • Third group (le or be) specifies little-endian or big-endian encoding respectively; it can be omitted if a default endianness is specified in meta/endian in a type spec.

The general format of float follows IEEE 754 standard.

The full list of possible floating point types is thus:

type   Width, bits   Endianness   Mantissa bits   Exponent bits
f4be   32            Big          24              8
f4le   32            Little       24              8
f8be   64            Big          53              11
f8le   64            Little       53              11

Byte arrays

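Byte arrays are used as a generic "fallback" solution, where no type is defined, but we have some means to determine the size of the data. This means that one of the following is defined:

  • the size key, giving the number of bytes to read

  • size-eos: true, meaning that all bytes up to the end of the current stream are read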

Strings

Strings are built on top of byte arrays, inheriting all the properties that designate the size of the underlying byte array. To designate an attribute as a string, use type: str and provide encoding info, either by specifying the encoding key in the attribute, or by applying a type-wide or file-wide default encoding in meta/encoding.

Note

type: strz can also be used as a shortcut to define a null-terminated string (C-style). I.e. these foo and bar attributes are equivalent:

- id: foo
  type: str
  terminator: 0
- id: bar
  type: strz

Processing spec

Sometimes the data you’re working on is not only packed in some structure, but also somehow encoded, obfuscated, encrypted, compressed, etc. So, to be able to parse such data, one has to remove this layer of encryption / obfuscation / compression / etc. This is called "processing" in Kaitai Struct and it is supported with a range of process directives. These can be applied to raw byte buffers or user-typed fields in the following way:

seq:
  - id: buf1
    size: 0x1000
    process: zlib

This declares a field named buf1. When parsing this structure, KS will read exactly 0x1000 bytes from the source stream and then apply zlib processing, i.e. decompression of a zlib-compressed stream. Afterwards, accessing buf1 returns the decompressed stream (which is most likely longer than 0x1000 bytes), and accessing the _raw_buf1 property returns the raw (originally compressed) stream, exactly 0x1000 bytes long.

The following processing directives are available in Kaitai Struct.

xor(key)

Applies a bitwise XOR (bitwise exclusive "or", written as ^ in most C-like languages) to every byte of the stream. Length of output stays exactly the same as the length of input. There is one mandatory argument - the key to use for XOR operation. It can be:

  • a single byte value — in this case this value would be XORed with every byte of the input stream

  • an array of bytes — in this case, first byte of the input would be XORed with first byte of the key, second byte of the input with second byte of the key, etc. If the key is shorter than the input, key will be reused, starting from the first byte.

For example, given 3-byte key [b0, b1, b2] and input line [x0, x1, x2, x3, x4, …​] output will be:

[x0 ^ b0, x1 ^ b1, x2 ^ b2,
 x3 ^ b0, x4 ^ b1, ...]

Examples:

  • process: xor(0xaa) — XORs every byte with 0xaa

  • process: xor([7, 42]) — XORs every odd (1st, 3rd, 5th, …​) byte with 7, and every even (2nd, 4th, 6th, …​) byte with 42

  • process: xor(key_buf) — XORs bytes using a key stored in a field named key_buf
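
A minimal sketch of the XOR rule described above, covering both the single-byte key and the repeating multi-byte key. The function name process_xor is hypothetical and does not refer to a runtime library call.

```python
def process_xor(data, key):
    """XOR every input byte with the key, reusing the key cyclically."""
    if isinstance(key, int):
        key = bytes([key])   # a single byte is a key of length 1
    if len(key) == 0:
        raise ValueError("key must not be empty")
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


plain = b"hello"
masked = process_xor(plain, [7, 42])
# XOR is its own inverse, so applying the same key again restores the input.
restored = process_xor(masked, [7, 42])
```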

rol(key), ror(key)

Does a circular shift operation on a buffer, rotating every byte by key bits left (rol) or right (ror).

Examples:

  • process: rol(5) — rotates every byte 5 bits left: every given bit combination b0-b1-b2-b3-b4-b5-b6-b7 becomes b5-b6-b7-b0-b1-b2-b3-b4

  • process: ror(some_val) — rotates every byte right by number of bits determined by some_val attribute (which might be either parsed previously or calculated on the fly)
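
The per-byte rotation can be sketched with shifts and a mask. The names rol8, ror8, process_rol and process_ror are hypothetical; the sketch treats b0 in the notation above as the most significant bit.

```python
def rol8(byte, bits):
    """Rotate one byte left by `bits` positions."""
    bits %= 8
    return ((byte << bits) | (byte >> (8 - bits))) & 0xFF


def ror8(byte, bits):
    """Rotate one byte right by `bits` positions."""
    bits %= 8
    return ((byte >> bits) | (byte << (8 - bits))) & 0xFF


def process_rol(data, bits):
    return bytes(rol8(b, bits) for b in data)


def process_ror(data, bits):
    return bytes(ror8(b, bits) for b in data)


# 0b00000111 rotated left by 5 moves the three low bits to the top.
print(bin(rol8(0b00000111, 5)))  # 0b11100000
```

Rotating right by the same amount undoes a left rotation, which is why the two directives form a pair.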

zlib

Applies a zlib decompression to input buffer, expecting it to be a full-fledged zlib stream, i.e. having a regular 2-byte zlib header. Decompression parameters are chosen automatically from it. Typical zlib header values:

  • 78 01 — no compression or low compression

  • 78 9C — default compression

  • 78 DA — best compression

The output buffer is usually larger than the input. This processing method might throw an exception if the data given is not a valid zlib stream.
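
What this directive does to a buffer can be approximated with Python's standard zlib module. The helper names are hypothetical; the header check follows the zlib format rule that the first two bytes, read as a big-endian 16-bit number, are a multiple of 31 and that the low nibble of the first byte is 8 (deflate).

```python
import zlib


def looks_like_zlib_header(data):
    """Cheap plausibility check of the 2-byte zlib header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


def process_zlib(raw):
    # Raises zlib.error if `raw` is not a valid zlib stream.
    return zlib.decompress(raw)


original = b"hello " * 100
raw = zlib.compress(original)
decoded = process_zlib(raw)
```

Here raw plays the role of the _raw_ buffer and decoded the role of the processed field value.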

Instance spec

Instance specification is very close to Attribute spec (and inherits all its properties), but it specifies an attribute that lies beyond regular parsing sequence. Typically, each instance is compiled into a lazy reader function/method that will parse (or calculate) requested data on demand, cache the result and return whatever’s been parsed previously on subsequent calls.

Everything described in Attribute spec can be used, except for id, which is redundant because every instance already gets its name from its string key in the map.

pos

Specifies position in a stream from which the value should be parsed.

io

Specifies an IO stream from which a value should be parsed.

value

Overrides any reading & parsing. Instead, just calculates function specified in value and returns the result as this instance. Can be used for multitude of purposes, such as data conversion while reading, etc.
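
The lazy, cached behavior of instances might look roughly like this in a target language. This is a hand-written sketch, not compiler output: the Container class, its field names and the decision to restore the stream position after a pos-based read are all assumptions made for illustration.

```python
import io


class Container:
    def __init__(self, stream):
        self._io = stream
        self._cache = {}
        self.parse_count = 0                 # only here to show the caching
        self.header_len = stream.read(1)[0]  # regular seq attribute

    @property
    def footer(self):
        """Instance with `pos: header_len` and `size: 2`."""
        if "footer" not in self._cache:
            saved = self._io.tell()
            self._io.seek(self.header_len)   # jump to the requested position
            self._cache["footer"] = self._io.read(2)
            self._io.seek(saved)             # sequential parsing is not disturbed
            self.parse_count += 1
        return self._cache["footer"]

    @property
    def footer_as_int(self):
        """Value instance: computed from other data, nothing is read."""
        if "footer_as_int" not in self._cache:
            self._cache["footer_as_int"] = int.from_bytes(self.footer, "big")
        return self._cache["footer_as_int"]


c = Container(io.BytesIO(bytes([4, 0, 0, 0, 0x12, 0x34])))
first = c.footer    # parsed on first access
second = c.footer   # served from the cache
```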

Enum spec

Enum specification sets up an enum (or closest equivalent) construct in the target language source file, which can then be referenced in attribute specs using the enum key.

A given type can have multiple named enums, each of which is essentially a map from integers to strings. For example:

enums:
  ip_protocol:
    1: icmp
    6: tcp
    0x11: udp
  port:
    22: ssh
    25: smtp
    80: http

This defines 2 named enums (named ip_protocol and port respectively), which can be referenced in attributes like this:

seq:
  - id: src_port
    type: u2
    enum: port

Enum-mapped fields can also be used in Expressions. One can compare such a field to enum constants, referencing them using enum_name::enum_string syntax:

seq:
  - id: http_version
    type: u1
    if: src_port == port::http

or one can convert them back into an integer, for example:

seq:
  - id: field_for_privileged_port
    type: u1
    if: src_port.to_i < 1024
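
In a target language with native enums, the mapping and the two kinds of comparison could look like the sketch below. The class names are derived by hand from the enum names above, and the big-endian byte order for the u2 field is an assumption, since the example relies on a default endianness that is not shown.

```python
from enum import IntEnum


class IpProtocol(IntEnum):
    icmp = 1
    tcp = 6
    udp = 0x11


class Port(IntEnum):
    ssh = 22
    smtp = 25
    http = 80


def read_src_port(data):
    # type: u2 with enum: port; big-endian is assumed here.
    return Port(int.from_bytes(data, "big"))


src_port = read_src_port(b"\x00\x50")

# Equivalent of `src_port == port::http`
is_http = src_port == Port.http

# Equivalent of `src_port.to_i < 1024`
is_privileged = int(src_port) < 1024
```

Note that this sketch raises ValueError for integers that have no enum name; how generated code treats such values is not covered here.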

Expressions

Some fields (for example, repeat-expr, size, or if) accept either a constant value (for example, 123) or an expression that can reference other attributes or instances.

A very typical example would be:

seq:
  - id: filename_len
    type: u4
  - id: filename
    type: str
    size: filename_len
    encoding: UTF-8

Here we do two things:

  • First, we read a 4-byte unsigned integer and store it in the filename_len attribute

  • Second, we read a UTF-8 encoded string exactly filename_len bytes long, where filename_len is a reference to the previous attribute

These expressions form a fairly powerful expression language that would be translated into a relevant expression in target programming language.
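
For the filename example above, the translated reading logic might resemble the following sketch. It is not actual compiler output: read_filename is a hypothetical name, and little-endian byte order is assumed for the u4 field because the example relies on a default endianness that is not shown.

```python
import io
import struct


def read_filename(stream):
    # filename_len: 4-byte unsigned integer (little-endian assumed)
    (filename_len,) = struct.unpack("<I", stream.read(4))
    # filename: exactly filename_len bytes, decoded as UTF-8
    raw = stream.read(filename_len)
    if len(raw) != filename_len:
        raise EOFError("stream ended inside filename")
    return filename_len, raw.decode("utf-8")


name = "naïve.txt"
payload = name.encode("utf-8")
data = struct.pack("<I", len(payload)) + payload + b"rest of file"
length, filename = read_filename(io.BytesIO(data))
```

The size counts bytes, not characters: the 9-character name in this usage example occupies 10 bytes in UTF-8.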