Sereal - Protocol definition
This document describes the format and encoding of a Sereal data packet.
This is the Sereal specification version 4.00.
The integer part of the document version corresponds to the Sereal protocol version. For details on incompatible changes between major protocol versions, see the "PROTOCOL CHANGES" below.
A serialized structure is converted into a "document". A document is made up of two parts, the document header and the document body.
A compliant Sereal decoder must detect invalid documents and handle them as a safe exception in the respective implementation language. That is to say, without a crash or worse.
All numeric data is in little endian format.
Floating points types are in IEEE format.
Heavy use is made of a variable length integer encoding commonly called a "varint" (Google calls it a Varint128). This encoding uses the high bit of each byte to signal there is another byte worth of data coming, and the last byte always having the high bit off. The data is in little endian order with the low seven bits in the first byte, and the next 7 in the next etc.
See Google's description.
A header consists of multiple components:
<MAGIC> <VERSION-TYPE> <HEADER-SUFFIX-SIZE> <OPT-SUFFIX>
A "magic string" that identifies a document as being in the Sereal format. In protocol version 1 and 2, the value of this string is "=srl", and when decoded as an unsigned 32 bit integer on a little endian machine has a value of 0x6C72733D. Version 1 and 2 of the protocol require this magic string.
In protocol version 3 and higher the magic string has been changed to "=\xF3rl", where \xF3 is "s" with the high bit set. The little endian integer form of this string is 0x6C72F33D. Having a hight bit set in the magic string makes it easy to detect when a Sereal document has been accidentally UTF-8 encoded because the \xF3 is translated to \xC3\xB3.
Decoders are required to support the magic string associated to the protocol versions they can decode, so if a decoder can handle v1, v2, v3 and v4 then it should handle both magic header.
It is an error to use a new magic header on a v1 or v2 packet, and it is an error to use the old magic header in v3 or later.
A single byte, of which the high 4 bits are used to represent the "type" of the document, and the low 4 bits used to represent the version of the Sereal protocol the document complies with.
Up until now there have been versions 1, 2, 3 and 4 of the Sereal protocol. So the low four bits will be one of those values in little-endian.
Currently only five types are defined:
- 0
-
Raw Sereal format. The data can be processed verbatim.
- 1
-
This is not a valid document type for Sereal protocol version 2 and up!
In Sereal protocol version 1, this used to be "Compressed Sereal format, using Google's Snappy compression internally." It has long been advised to prefer 2, "incremental-decoding-enabled compressed Sereal," wherever possible.
- 2
-
Compressed Sereal format, using Google's Snappy compression internally as format 1, but supporting incremental-parsing. Long preferred over 1 as this is considered a bug fix in the Snappy compression support.
The format is:
<Varint><Snappy Blob>
where the varint signifies the length of the Snappy-compressed blob following it. See "NOTES ON IMPLEMENTATION" below for a discussion on how to implement this efficiently.
- 3
-
Compressed Sereal format, using zlib compression. This does similar framing as the incremental Snappy compression (2):
<Varint><Varint><Zlib Blob>
where the first varint indicates the length of the uncompressed document, the second varint indicates the length of the compressed document. See "NOTES ON IMPLEMENTATION" below for a discussion on how to implement this efficiently.
This compression format is new in v3 of the specification.
- 4
-
Compressed Sereal format, using Zstandard compression internally as format 4.
The format is:
<Varint><Zstd Blob>
where the varint signifies the length of the zstd-compressed blob following it. See "NOTES ON IMPLEMENTATION" below for a discussion on how to implement this efficiently.
This compression format is new in v4 of the specification.
Additional compression types are envisaged and will be assigned type numbers by the maintainers of the protocol.
The structure of the header includes support for embedding additional data. This is accomplished by specifying the length of the suffix in the header with a varint. Headers with no suffix will set this to a binary 0. This is intended for future format extensions that retain some level of compatibility for old decoders (which know how to skip the extended header due to the embedded length).
The suffix may contain whatever data the encoder wishes to embed in the header. In version 1 of the protocol the decoder never looked inside this data. Later versions may introduce additional rules for this field. Starting from version 2 of the protocol, this variable-length part of the header may be empty or have the following format:
<8bit-BITFIELD> <OPT-USER-META-DATA>
- 8bit-BITFIELD
-
If not present, all bits are assumed off. In version 2 and 3 of the protocol, only the least significant bit is meaningful: If set, the bitfield is followed by the
<USER-META-DATA>
. If not set, there is no user meta data. - OPT-USER-META-DATA
-
If the least significant bit of the preceding bitfield is set, this may be an arbitrary Sereal document body. Like any other Sereal document body, it is self-contained and can be deserialized independently from any other document bodies in the Sereal document. This document body is NEVER compressed.
This is intended for embedding small amounts of meta data, such as routing information, in a document that allows users to avoid deserializing very large document bodies needlessly or having to call into decompression logic.
The body is made up of tagged data items:
<TAG> <OPT-DATA>
Tagged items can be containers that hold other tagged items. At the top level, the body holds only ONE tagged item (often an array or hash) that holds others.
A tag is a single byte which specifies the type of the data being decoded.
The high bit of each tag is used to signal to the decoder that the deserialized data needs to be stored and tracked and will be reused again elsewhere in the serialization. This is sometimes called the "track flag" or the "F-bit" in code and documentation. Its status should be ignored when processing a tag, meaning code should mask off the high bit and only use the low 7 bits.
Some tags, such as POS, NEG and SHORT_BINARY contain embedded in them either the data (in the case of POS and NEG) or the length of the OPT-DATA section (in the case of SHORT_BINARY).
This field may contain an arbitrary set of bytes, either determined implicitly by the tag (such as for FLOAT), explicitly in the tag (as in SHORT_BINARY) or in a varint following the tag (such as for STRING).
When referring to an offset below, what's meant is a varint encoded absolute integer byte position in the document body. That is, an offset of 10 refers to the tenth byte in the Sereal document body (ie. excluding its header). Sereal version 1 used to mandate offsets from the start of the document header.
Tag | Char | Dec | Hex | Binary | Follow
------------------+------+-----+------+----------- |-----------------------------------------
POS_0 | | 0 | 0x00 | 0b00000000 | small positive integer - value in low 4 bits (identity)
POS_1 | | 1 | 0x01 | 0b00000001 |
POS_2 | | 2 | 0x02 | 0b00000010 |
POS_3 | | 3 | 0x03 | 0b00000011 |
POS_4 | | 4 | 0x04 | 0b00000100 |
POS_5 | | 5 | 0x05 | 0b00000101 |
POS_6 | | 6 | 0x06 | 0b00000110 |
POS_7 | "\a" | 7 | 0x07 | 0b00000111 |
POS_8 | "\b" | 8 | 0x08 | 0b00001000 |
POS_9 | "\t" | 9 | 0x09 | 0b00001001 |
POS_10 | "\n" | 10 | 0x0a | 0b00001010 |
POS_11 | | 11 | 0x0b | 0b00001011 |
POS_12 | "\f" | 12 | 0x0c | 0b00001100 |
POS_13 | "\r" | 13 | 0x0d | 0b00001101 |
POS_14 | | 14 | 0x0e | 0b00001110 |
POS_15 | | 15 | 0x0f | 0b00001111 | small positive integer - value in low 4 bits (identity)
NEG_16 | | 16 | 0x10 | 0b00010000 | small negative integer - value in low 4 bits (k+32)
NEG_15 | | 17 | 0x11 | 0b00010001 |
NEG_14 | | 18 | 0x12 | 0b00010010 |
NEG_13 | | 19 | 0x13 | 0b00010011 |
NEG_12 | | 20 | 0x14 | 0b00010100 |
NEG_11 | | 21 | 0x15 | 0b00010101 |
NEG_10 | | 22 | 0x16 | 0b00010110 |
NEG_9 | | 23 | 0x17 | 0b00010111 |
NEG_8 | | 24 | 0x18 | 0b00011000 |
NEG_7 | | 25 | 0x19 | 0b00011001 |
NEG_6 | | 26 | 0x1a | 0b00011010 |
NEG_5 | "\e" | 27 | 0x1b | 0b00011011 |
NEG_4 | | 28 | 0x1c | 0b00011100 |
NEG_3 | | 29 | 0x1d | 0b00011101 |
NEG_2 | | 30 | 0x1e | 0b00011110 |
NEG_1 | | 31 | 0x1f | 0b00011111 | small negative integer - value in low 4 bits (k+32)
VARINT | " " | 32 | 0x20 | 0b00100000 | <VARINT> - Varint variable length integer
ZIGZAG | "!" | 33 | 0x21 | 0b00100001 | <ZIGZAG-VARINT> - Zigzag variable length integer
FLOAT | "\"" | 34 | 0x22 | 0b00100010 | <IEEE-FLOAT>
DOUBLE | "#" | 35 | 0x23 | 0b00100011 | <IEEE-DOUBLE>
LONG_DOUBLE | "\$" | 36 | 0x24 | 0b00100100 | <IEEE-LONG-DOUBLE>
UNDEF | "%" | 37 | 0x25 | 0b00100101 | None - Perl undef var; eg my $var= undef;
BINARY | "&" | 38 | 0x26 | 0b00100110 | <LEN-VARINT> <BYTES> - binary/(latin1) string
STR_UTF8 | "'" | 39 | 0x27 | 0b00100111 | <LEN-VARINT> <UTF8> - utf8 string
REFN | "(" | 40 | 0x28 | 0b00101000 | <ITEM-TAG> - ref to next item
REFP | ")" | 41 | 0x29 | 0b00101001 | <OFFSET-VARINT> - ref to previous item stored at offset
HASH | "*" | 42 | 0x2a | 0b00101010 | <COUNT-VARINT> [<KEY-TAG> <ITEM-TAG> ...] - count followed by key/value pairs
ARRAY | "+" | 43 | 0x2b | 0b00101011 | <COUNT-VARINT> [<ITEM-TAG> ...] - count followed by items
OBJECT | "," | 44 | 0x2c | 0b00101100 | <STR-TAG> <ITEM-TAG> - class, object-item
OBJECTV | "-" | 45 | 0x2d | 0b00101101 | <OFFSET-VARINT> <ITEM-TAG> - offset of previously used classname tag - object-item
ALIAS | "." | 46 | 0x2e | 0b00101110 | <OFFSET-VARINT> - alias to item defined at offset
COPY | "/" | 47 | 0x2f | 0b00101111 | <OFFSET-VARINT> - copy of item defined at offset
WEAKEN | "0" | 48 | 0x30 | 0b00110000 | <REF-TAG> - Weaken the following reference
REGEXP | "1" | 49 | 0x31 | 0b00110001 | <PATTERN-STR-TAG> <MODIFIERS-STR-TAG>
OBJECT_FREEZE | "2" | 50 | 0x32 | 0b00110010 | <STR-TAG> <ITEM-TAG> - class, object-item. Need to call "THAW" method on class after decoding
OBJECTV_FREEZE | "3" | 51 | 0x33 | 0b00110011 | <OFFSET-VARINT> <ITEM-TAG> - (OBJECTV_FREEZE is to OBJECT_FREEZE as OBJECTV is to OBJECT)
NO | "4" | 52 | 0x34 | 0b00110100 | SvIsBOOL() == PL_No, 5.36 and later only (json false)
YES | "5" | 53 | 0x35 | 0b00110101 | SvIsBOOL() == PL_Yes, 5.36 and later only (json true)
RESERVED_0 | "6" | 54 | 0x36 | 0b00110110 |
RESERVED_1 | "7" | 55 | 0x37 | 0b00110111 |
FLOAT_128 | "8" | 56 | 0x38 | 0b00111000 | quadmath _float128
CANONICAL_UNDEF | "9" | 57 | 0x39 | 0b00111001 | undef (PL_sv_undef) - "the" Perl undef (see notes)
FALSE | ":" | 58 | 0x3a | 0b00111010 | false (PL_sv_no)
TRUE | ";" | 59 | 0x3b | 0b00111011 | true (PL_sv_yes)
MANY | "<" | 60 | 0x3c | 0b00111100 | <LEN-VARINT> <TYPE-BYTE> <TAG-DATA> - repeated tag (not done yet, will be implemented in version 3)
PACKET_START | "=" | 61 | 0x3d | 0b00111101 | (first byte of magic string in header)
EXTEND | ">" | 62 | 0x3e | 0b00111110 | <BYTE> - for additional tags
PAD | "?" | 63 | 0x3f | 0b00111111 | (ignored tag, skip to next byte)
ARRAYREF_0 | "\@" | 64 | 0x40 | 0b01000000 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
ARRAYREF_1 | "A" | 65 | 0x41 | 0b01000001 |
ARRAYREF_2 | "B" | 66 | 0x42 | 0b01000010 |
ARRAYREF_3 | "C" | 67 | 0x43 | 0b01000011 |
ARRAYREF_4 | "D" | 68 | 0x44 | 0b01000100 |
ARRAYREF_5 | "E" | 69 | 0x45 | 0b01000101 |
ARRAYREF_6 | "F" | 70 | 0x46 | 0b01000110 |
ARRAYREF_7 | "G" | 71 | 0x47 | 0b01000111 |
ARRAYREF_8 | "H" | 72 | 0x48 | 0b01001000 |
ARRAYREF_9 | "I" | 73 | 0x49 | 0b01001001 |
ARRAYREF_10 | "J" | 74 | 0x4a | 0b01001010 |
ARRAYREF_11 | "K" | 75 | 0x4b | 0b01001011 |
ARRAYREF_12 | "L" | 76 | 0x4c | 0b01001100 |
ARRAYREF_13 | "M" | 77 | 0x4d | 0b01001101 |
ARRAYREF_14 | "N" | 78 | 0x4e | 0b01001110 |
ARRAYREF_15 | "O" | 79 | 0x4f | 0b01001111 | [<ITEM-TAG> ...] - count of items in low 4 bits (ARRAY must be refcnt=1)
HASHREF_0 | "P" | 80 | 0x50 | 0b01010000 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
HASHREF_1 | "Q" | 81 | 0x51 | 0b01010001 |
HASHREF_2 | "R" | 82 | 0x52 | 0b01010010 |
HASHREF_3 | "S" | 83 | 0x53 | 0b01010011 |
HASHREF_4 | "T" | 84 | 0x54 | 0b01010100 |
HASHREF_5 | "U" | 85 | 0x55 | 0b01010101 |
HASHREF_6 | "V" | 86 | 0x56 | 0b01010110 |
HASHREF_7 | "W" | 87 | 0x57 | 0b01010111 |
HASHREF_8 | "X" | 88 | 0x58 | 0b01011000 |
HASHREF_9 | "Y" | 89 | 0x59 | 0b01011001 |
HASHREF_10 | "Z" | 90 | 0x5a | 0b01011010 |
HASHREF_11 | "[" | 91 | 0x5b | 0b01011011 |
HASHREF_12 | "\\" | 92 | 0x5c | 0b01011100 |
HASHREF_13 | "]" | 93 | 0x5d | 0b01011101 |
HASHREF_14 | "^" | 94 | 0x5e | 0b01011110 |
HASHREF_15 | "_" | 95 | 0x5f | 0b01011111 | [<KEY-TAG> <ITEM-TAG> ...] - count in low 4 bits, key/value pairs (HASH must be refcnt=1)
SHORT_BINARY_0 | "`" | 96 | 0x60 | 0b01100000 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
SHORT_BINARY_1 | "a" | 97 | 0x61 | 0b01100001 |
SHORT_BINARY_2 | "b" | 98 | 0x62 | 0b01100010 |
SHORT_BINARY_3 | "c" | 99 | 0x63 | 0b01100011 |
SHORT_BINARY_4 | "d" | 100 | 0x64 | 0b01100100 |
SHORT_BINARY_5 | "e" | 101 | 0x65 | 0b01100101 |
SHORT_BINARY_6 | "f" | 102 | 0x66 | 0b01100110 |
SHORT_BINARY_7 | "g" | 103 | 0x67 | 0b01100111 |
SHORT_BINARY_8 | "h" | 104 | 0x68 | 0b01101000 |
SHORT_BINARY_9 | "i" | 105 | 0x69 | 0b01101001 |
SHORT_BINARY_10 | "j" | 106 | 0x6a | 0b01101010 |
SHORT_BINARY_11 | "k" | 107 | 0x6b | 0b01101011 |
SHORT_BINARY_12 | "l" | 108 | 0x6c | 0b01101100 |
SHORT_BINARY_13 | "m" | 109 | 0x6d | 0b01101101 |
SHORT_BINARY_14 | "n" | 110 | 0x6e | 0b01101110 |
SHORT_BINARY_15 | "o" | 111 | 0x6f | 0b01101111 |
SHORT_BINARY_16 | "p" | 112 | 0x70 | 0b01110000 |
SHORT_BINARY_17 | "q" | 113 | 0x71 | 0b01110001 |
SHORT_BINARY_18 | "r" | 114 | 0x72 | 0b01110010 |
SHORT_BINARY_19 | "s" | 115 | 0x73 | 0b01110011 |
SHORT_BINARY_20 | "t" | 116 | 0x74 | 0b01110100 |
SHORT_BINARY_21 | "u" | 117 | 0x75 | 0b01110101 |
SHORT_BINARY_22 | "v" | 118 | 0x76 | 0b01110110 |
SHORT_BINARY_23 | "w" | 119 | 0x77 | 0b01110111 |
SHORT_BINARY_24 | "x" | 120 | 0x78 | 0b01111000 |
SHORT_BINARY_25 | "y" | 121 | 0x79 | 0b01111001 |
SHORT_BINARY_26 | "z" | 122 | 0x7a | 0b01111010 |
SHORT_BINARY_27 | "{" | 123 | 0x7b | 0b01111011 |
SHORT_BINARY_28 | "|" | 124 | 0x7c | 0b01111100 |
SHORT_BINARY_29 | "}" | 125 | 0x7d | 0b01111101 |
SHORT_BINARY_30 | "~" | 126 | 0x7e | 0b01111110 |
SHORT_BINARY_31 | | 127 | 0x7f | 0b01111111 | <BYTES> - binary/latin1 string, length encoded in low 5 bits of tag
The protocol uses a combination of the offset of a tracked tag and the flag bit to be able to encode and reconstruct cyclic structures in a single pass.
An encoder must track duplicated items and generate the appropriate ALIAS or REFP tags to reconstruct them, and when it does so ensure that the high bit of the original tag has been set.
When a decoder encounters a tag with its flag set it will remember the offset of the tag in the output packet and the item that was decoded from that tag. At a later point in the packet there may be an ALIAS or REFP instruction which will refer to the item by its offset, and the decoder will reuse it as needed.
Sometimes it is convenient to be able to reuse a previously emitted sequence in the packet to reduce duplication. For instance a data structure with many hashes with the same keys. The COPY tag is used for this. Its argument is a varint which is the offset of a previously emitted tag, and decoders are to behave as though the tag it references was inserted into the packet stream as a replacement for the COPY tag.
Note, that in this case the track flag is not set. It is assumed the decoder can jump back to reread the tag from its location alone.
COPY tags are forbidden from referring to another COPY tag, and are also forbidden from referring to anything containing a COPY tag, with the exception that a COPY tag used as a value may refer to an tag that uses a COPY tag for a classname or hash key.
Sereal supports three string representations. Two are "encodingless" and are SHORT_BINARY and BINARY, where binary means "raw bytes". The other is STR_UTF8 which is expected to contain valid canonical UTF8 encoded unicode text data. Under normal circumstances a decoder is not expected to validate that this is actually the case, and is allowed to simply extract the data verbatim.
SHORT_BINARY stores the length of the string in the tag itself and is used for strings of less than 32 characters long. Both BINARY and STR_UTF8 use a varint to indicate the number of bytes (octets) in the string.
Note that the use of the term "BINARY" simply means that we do not know what encoding the data uses, it does not imply that such data is not text, nor does STR_UTF8 imply that the data is text (although it is likely).
What STR_UTF8 does imply that the data is encoded using the rules for encoding utf8
, which is similar but subtly distinct from true UTF-8
. The main distinction between the two is that Perls utf8
can be used to represent a wider range of codepoints than UTF-8
, including codepoints that are forbidden in UTF-8
, such as codepoints U+D800
through U+DFFF
inclusive (which are used ONLY in UTF-16
), and codepoints larger than U+10FFFF
which is the maximum allowable Unicode codepoint.
Sereal does NOT validate that such STR_UTF8 strings *are* correctly encoded as it trusts Perl to do the right thing. When implementing Sereal in other languages it is up to the implementor how to handle these illegal codepoints, we do not expect such data to reliably round trip between Perl and another language as some languages, eg Javascript cannot properly handle such data. It is extremely rare to see code that actually makes use of such codepoints, and Perl will not do so on its own.
Hash keys are always one of the string types, or a COPY tag referencing a string.
Objects are serialized as a class name and a tag which represents the objects data. In Perl land this will always be a reference. Mapping Perl objects to other languages is left to the future, but the OBJECT_FREEZE and OBJECTV_FREEZE tags provide a basic method of doing that, see below.
Note that classnames MUST be a string, or a COPY tag referencing a string.
OBJECTV varints MUST reference a previously used classname, and not an arbitrary string.
Sereal implementations may choose to allow authors of classes to provide hooks for custom object serialization. Depending on the Sereal implementation, this feature may require enabling with an encoder option on the encoding side, but compliant decoders must at least recognize the OBJECT_FREEZE and OBJECTV_FREEZE tags. The interface shall be such that if enabled in the encoder, for each object in the input that has a FREEZE
method, the encoder will invoke said FREEZE
method on the object and pass in the string Sereal
to allow distinguishing from other serializers (this is inspired by the CBOR::XS
CBOR implementation). If there is no FREEZE
method available, then a normal OBJECT or OBJECTV tag is emitted, serializing the object content deeply. If invoked, the FREEZE
method must return a list of data structures that are serializable by Sereal. The encoder shall emit an OBJECT_FREEZE or OBJECTV_FREEZE tag followed by a reference (REFN) to an array (ARRAY) of the Sereal-encoded data structures that were returned from FREEZE
.
Upon decoding OBJECT_FREEZE or OBJECTV_FREEZE, a compliant decoder (unless explicitly instructed not to) will invoke the THAW
class method of the given class. (Likely, implementations should throw a fatal error if no such method exists for a class referenced by OBJECT(V)_FREEZE.) Arguments to that method will be the string Sereal
as first argument, and then the decoded data structures that were returned from the FREEZE
call. The return value of that THAW
call needs to be included in the final output structure. See the documentation of the Perl Sereal implemenation for examples of FREEZE/THAW methods.
The concept of undef is a little tricky in Perl. A variable may be undefined, in addition there is also a definitive "undef" which Perl uses in many situations. This definitive "undef" is a globally shared, immutable value. Its use is vaguely equivalent to aliasing the same, read-only copy of a Perl value that happens to be undefined.
The difference can be illustrated with the following code:
my $x;
print +(\$x == \$x) ? "same" : "different", "\n";
print +(\undef == \undef) ? "same" : "different", "\n";
print +(\$x == \undef) ? "same" : "different", "\n";
which should print out
same
same
different
In protocol versions 1 and 2 it was not possible to represent both forms of undef correctly, and Sereal defaulted to the "undefined variable" interpretation represented by the UNDEF tag in most situations.
As of protocol version 3 the CANONICAL_UNDEF tag is used to handle this special case of undef so that Perl data structures can round trip properly. Other languages are free to treat CANONICAL_UNDEF and UNDEF as is appropriate to their language semantics: If there is an equivalent to this globally shared undefined value (PL_sv_undef in Perl's implementation) then they should map CANONICAL_UNDEF accordingly, otherwise they are free to treat CANONICAL_UNDEF the same as UNDEF.
"zstd" compression (document type 4) support is added. As detailed above, its structure is
<Varint><Zstd Blob>
where the varint indicates the length of the compressed document.
In Sereal protocol version 3, the magic string has been changed to make it easier to detect UTF-8 encoded data by setting the high bit on the 's' character, thus changing the older "=srl" to "=\xF3rl". Encoders generating version 3 of the protocol or later must use the new header, and encoders generating version 1 or 2 of the protocol must use the old header.
Also new is the "zlib" compression (document type 3). As detailed above, its structure is
<Varint><Varint><Zlib Blob>
where the first varint indicates the length of the uncompressed document, the second varint indicates the length of the compressed document.
Additionally there is the new CANONICAL_UNDEF tag, used to represent Perl's canonical, shared undefined value (PL_sv_undef) in certain edge cases. See "Dealing with undefined values" for details.
In Sereal protocol version 2, offsets were changed from being relative to the start of the document (including header) to being relative to the start of the document body (ie. excluding the document header). This means that Sereal document bodies are now self-contained - relocatable within the document. Note that the offset is 1-based, which means that to point the first byte of the body its value must be 1.
Additionally, protocol version 2 introduced the 8bit bit-field (8bit-BITFIELD) in the variable-length/optional header part (OPT-SUFFIX) of the document and the user-meta-data section (OPT-USER-META-DATA) of the variable-length header.
Protocol version 2 introduces the OBJECT_FREEZE and OBJECTV_FREEZE tags in place of two previously reserved tags. The meaning and implementation of these two tags is described in the "Handling Objects" section of this document. In a nutshell, it allows application developers to have custom hooks for serializing and deserializing the instances of their classes.
With Sereal body format type 2 (see above), you need to encode (as a varint) the length of the Snappy-compressed document as a prefix to the document body. This is somewhat tricky to do efficiently since at first sight, the amount of space required to encode a varint depends on the size of the output. This means that you need to do the normal Sereal-encoding of the document body, then compress the output of that, then append the varint encoded length of the compressed data to a Sereal header, then append the compressed data. In this naive way of implementing this Snappy compression support, you may end up having to copy around the entire document up to three times (and may end up having to allocate 3x the space, too). That is very inefficient.
There is a better way, though, that's just a tiny bit subtle. Thankfully, you have an upper bound on the size of the compressed blob. It's the uncompressed blob plus the size of the Snappy header (a Snappy library call can tell you what that is in practice). What you can do is before compressing, you allocate a varint that is long enough to encode an integer that is big enough to represent the upper limit on the compressed output size. Then you proceed to point the compressor into the buffer right after the thusly preallocated varint. After compression, you'll know the real size of the compressed blob. Now, you go back to the varint and fill it in. If the reserved space for the varint is larger than what you actually need, then thanks to the way varints work, you can simply set the high bit on the last byte of the varint, and continue to set the high bits of all following padding bytes except the last, which you set to 0 (NUL). For details on why that works, please refer to the Google ProtoBuf documentation referenced earlier. With this specially crafted varint, any normal varint parsing function will treat it as a single varint and skip right to the start of theSnappy-compressed blob. The varint is a correct varint, just not in the canonical form. With this modified plan, you should only need one extra malloc, and (beyond that which the Snappy implementation does), no extra, large memcpy operations.
Yves Orton <[email protected]>
Damian Gryski
Steffen Mueller <[email protected]>
Rafaël Garcia-Suarez
Ævar Arnfjörð Bjarmason <[email protected]>
This protocol was originally developed for Booking.com. With approval from Booking.com, this document was generalized and published on github and CPAN, for which the authors would like to express their gratitude.
Copyright (C) 2012, 2013, 2014 by Steffen Mueller
Copyright (C) 2012, 2013, 2014 by Yves Orton
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.