-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdesign.tex
192 lines (131 loc) · 30.3 KB
/
design.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
\vspace{-0.8em}
\section{The Design of \sys{}}
\sys{}'s key goal is to provide {\em fine-grained and flexible typing} throughout data processing. Achieving this goal allows \sys{} to unify the document and relational models
%(embodying the best features of both simultaneously)
and enable new functionality for data introspection. \sys{}'s key technique is a new {\em data type abstraction} that represents the structure of an individual data value; achieving fine-grained yet flexible typing is a new approach. We now describe \sys{} types and how they impact \sys{}'s data model (\S\ref{ss:zed_data_model}), %runtime (\S\ref{ss:zed_type_contexts}),
query language (\S\ref{ss:zed_query_language}), and formats (\S\ref{ss:zed_formats}).
%A type in \sys{} plays a similar role as a schema of a row in a relational database or the implied type of a single JSON document.
\begin{comment}
Realizing this type abstraction requires answering several design questions, such as:
\begin{CompactItemize}
\item Type definitions - what should a type definition specify?
\item Type evolution - how can we enable users to easily create new types on the fly?
\item Type scope - what is the scope of a type definition and how can we relate types across different scopes?
\item Querying types - how can we represent types in queries and in query results?
\end{CompactItemize}
In this section, we answer these questions by overviewing \sys{}'s super-structured data model (\S\ref{ss:zed_data_model}) and query language (\S\ref{ss:zed_query_language}). In addition, we observe that a single unified data model does not necessitate a single unified data format. As pointed out by prior work, no single format is best for all use cases~\cite{one_size_fits_all_2005, one_size_fits_all_2007}, and we describe how \sys{} supports multiple formats (human-readable, columnar binary, etc.) with lossless transformations between them (\S\ref{ss:zed_formats}). Finally, we explain how \sys{} manages type scope (\S\ref{ss:zed_type_contexts}).
\end{comment}
\vspace{-1em}
\subsection{\sys{} Data Model} \label{ss:zed_data_model}
\begin{comment}
The \sys{} data model is guided by two key principles. First:
{\bf P1:} {\em Every data value should have an explicit, queryable data type.}
\noindent{}{\em Explicit} types are an essential property of the relational model. They enable efficient storage formats~\cite{parquet, orc, protobufs, avro, dremel} and faster parsing and querying than with JSON data, which is known to be difficult to parse efficiently~\cite{mison, sparser}. Types that are also {\em queryable} enable rich data introspection, queries by type, and shaping by type (\S\ref{ss:hybrid_schema}). At the same time:
{\bf P2:} {\em We should be flexible about which data types are allowed to coexist in the same file, table, or query results.}
\noindent{}This {\em flexibility} is essential to the document model and allows any data value with valid syntax to be appended to a file, regardless of what other data or data types are already present in that file. This makes it easy to accommodate heterogeneous and evolving data without predefining its schema.
\end{comment}
We first describe the \sys{} data model and then highlight some of the unconventional design decisions behind it.
\subsubsection{Data Model}
The \sys{} data model consists of an ordered sequence of typed data values. Each value's type is either a primitive type (int32, string, bool, type, etc.), a complex type (record, array, set, map, union, etc.), a named type, or the null type. All of these types have analogs in other data models, and we describe the significance of some of them below. For example, consider \sys{} records. A record consists of an ordered set of named values called ``fields''. Field names are simply strings and field values can be any \sys{} value, with the types described above. This enables nested data, as a record can contain arrays, sets, other records, etc. A row in a relational table, a record in Avro, or a JSON document could each be represented using a single \sys{} record.
For example, consider the following two records of IoT sensor data, represented in \sys{}'s text-based format, ZSON, which supports native epoch 64-bit nanosecond times in ISO format:
\noindent{}\mintinline[fontsize=\inlinefontsize]{bash}{{ts:2022-09-01T00:00:00Z,temp:68(int32)}}
\noindent{}\mintinline[fontsize=\inlinefontsize]{bash}{{ts:2022-09-01T10:12:00Z,temp:68.(float32)}}
\noindent{}The first record has type \texttt{\{ts:time,temp:int32\}} while the second record has type \texttt{\{ts:time, percent\_humidity:float32\}}.
Despite having different types, these two records can be stored in the same file because data types in \sys{} are {\bf associated with individual data values}.
%Thus, a sequence of \sys{} records can be used to represent either homogeneous data, as in a relational table or Parquet file, or heterogeneous data, as in the document model.
Types are {\bf first-class members} of the \sys{} data model. This means that a type---even the type of a record---is a single entity, which can be represented as a \sys{} value whose type is {\em type}. As a result, a query for the type of a record returns the record's type in the same \sys{} data model, rather than a separate data model as in existing systems~\cite{aurum}. \sys{} also enables users to define {\em named types}, which bind a name to a type. For example, ZSON includes decorator syntax for defining named types, e.g., \texttt{(=temperature)} to define a named \texttt{temperature} type:
\noindent{}\mintinline[fontsize=\inlinefontsize]{bash}{{ts:2022-09-01T00:00:00Z,temp:68(int32)}(=temperature)}
\noindent{}The \sys{} query language (\S\ref{ss:zed_query_language}) uses angle brackets to specify type values, thereby allowing users to refer to the type \texttt{\{time:time,temp:int32\}} using \texttt{<temperature>} in queries . For example, a query can extract all records with type \texttt{<temperature>}, the equivalent of a single relational \texttt{temperature} table.
%\sys{}'s binary data formats also leverage numeric type IDs for efficient encoding and querying (\S\ref{ss:zed_formats}).
%For example, a query for the type of the first record in Figure~\ref{f:iot_data} returns \texttt{<temp\_schema=\{time:time,temp:int32\}>}, and this result is still represented in the \sys{} data model.
Note that the data itself is defining the named types requiring the query engine to understand that a named type binding can evolve over time as data is processed. When multiple bindings for a given type name conflict with one another, the system retains each instance of each named type in the query language may include operators for forming a named sum type as a union of the conflicting types. With such operators, an ingest pipeline, for instance, can convert the the conflicting named types to a cohesive sum type.
Type definitions in \sys{} are stored {\bf inline} with the data. When a new type first appears in a \sys{} file, the type definition is stored inline at that point in the file. The \texttt{temperature} record above shows how \sys{} does this in its text-based format; we describe how \sys{} efficiently encodes type definitions in its binary formats in~\shortorlongform{\S\ref{ss:zed_formats}}{Section~\ref{ss:zed_formats}}. A consequence of inlined types is that type definitions are locally scoped, which means that values from different locally scoped type contexts must somehow be merged so that types from different contexts can coexist and be reliably compared. To do so, we carefully designed the \sys{} type system to allow merging of types to be implemented with a simple table lookup as described in Section~\ref{ss:zed_type_contexts}.
Finally, as with Parquet and Arrow, each type in \sys{} specifies the {\bf complete} and potentially nested structure of data values with that type. This contrasts with the document model, which provides only partial type information such as \texttt{OBJECT}, \texttt{JSON}, or \texttt{any}. In \sys{}, any value is nullable, e.g., a record of type \texttt{T} can omit fields of \texttt{T} by explicitly setting them to \texttt{null}; this avoids a combinatoric explosion in the number of defined types when different records omit different combinations of fields. Types are ``closed,'' meaning that a record of type \texttt{T} never contains extra fields beyond those defined by \texttt{T}. That said, it is often useful to define the non-nullability of fields as a useful data constraint, but our philosphy is to defer such constraint enforcement to additional semantic layers above the \sys{} serialization layer, e.g., in an ingest pipeline or in a query language.
Complete types may appear to be at odds with \sys{}'s goal of flexibility. For example, suppose a temperature sensor outputs \texttt{temperature} records with two fields at first, but then adds a third \texttt{unit} field~\shortorlongform{}{ to \texttt{temperature}} so that it can output data as either Fahrenheit or Celsius, as shown in Figure~\ref{f:iot_data}, which is the classic "schema evolution" problem. But as described above, in \sys{}, this simply implies a sum type of the variations encountered, which provides extreme flexibility in handling dynamically evolving eclectic data. In this approach, the sum-typed data can simply flow through a query or into the storage layer unhindered, or the condition could be detected by ingest logic and the sum-types converted to a desired target type with \sys{} logic, or the conflicting types could simply be rejected. The \sys{} philosophy is to allow all possibilities and leave such decisions to logic implemented (e.g., in the \sys{} query language) to semantic layers higher up on in the data stack, i.e., schema evolution can be flexibly implemented on top of the \sys{} data model. With this approach, the schema evolution and high-fidelity data serialization strategies become orthogonal and much easier to manage.
\begin{comment}
\begin{figure}
\begin{minted}[]{bash}
{city:"Augusta"(string),population:18662(uint32)}(=city_schema)
{city:"Bangor"(string),population:32029(uint32)}(=city_schema)
{city:"Bar Harbor"(string),population:5535(uint32)}(=city_schema)
{city:"Bath"(string),state:"ME"(string),population:8333(uint32)} (=city_schema)
{city:"Belfast"(string),state:"ME"(string),population: 6710(uint32)}(=city_schema)
\end{minted}
\vspace{-1em}
\caption{Example \sys{} data, represented in text-based ZSON.}
\label{f:city_data}
\end{figure}
\end{comment}
\begin{figure}
\begin{minted}[]{bash}
{time:09:01:00(time),temp:68(int32)}(=temperature)
{time:10:12:00(time),percent_humidity:43.7(float32)}(=humidity)
{time:17:29:00(time),temp:71(int32)}(=temperature)
{time:17:45:00(time),temp:80(int32),unit:"F"(string)}(=temperature)
{time:18:02:00(time),temp:28(int32),unit:"C"(string)}(=temperature)
\end{minted}
\vspace{-1.3em}
\caption{Example IoT data from temperature and humidity sensors, represented in text-based ZSON.\shortorlongform{}{\footnotemark}}
\label{f:iot_data}
\vspace{-1.5em}
\end{figure}
\shortorlongform{}{\footnotetext{The \texttt{time} type in \sys{} represents nanoseconds from epoch. For simplicity, we abbreviate time values in the example \sys{} data shown in this paper.}}
\subsubsection{Design Decisions}
\sys{} differs from existing data models in how it relates types to data values. In relational databases or Parquet files, a type is defined for an entire table or file and all records it contains share the same type. This achieves comprehensive types but not flexibility. In contrast, in the document model, each record is self describing and has its own implied type, and there is no way to specify that multiple records have the same type. This achieves flexibility but not comprehensive typing. By decoupling the type assignment from the data's organization into files, \sys{} can achieve comprehensive types and flexibility simultaneously.
Some existing approaches define types in schema registries, resulting in globally scoped types; a type defined in a registry may be used by any program in that administrative domain~\cite{confluent_schema_registry}. However, globally scoped types are at odds with flexibility, because ensuring compliance with a schema registry may restrict what kinds of data users can write. Scoping named types locally to a file and inlining their type definitions provides data sources complete freedom to define arbitrary new types on the fly and ensures that data consumers always have the required type information to decode data.
\sys{} also takes an unconventional approach regarding what a type definition specifies. Some existing data models flexibly accommodate eclectic data by supporting partial types such as \texttt{JSON} or \texttt{OBJECT}\shortorlongform{~\cite{snowflake}}{~\cite{snowflake, json_type_postgres, json_type_mysql}}, or by allowing ``open'' data types, where a record may contain extra fields beyond those specified by its type's type definition~\cite{asterixdb, json_schema, xml_schema}. However, incomplete type information makes it challenging to leverage efficient formats such as columnar~\cite{snowflake} and makes the results of queries about types less useful. We also observe that union types can handle cases where a single field might reasonably assume one of a few different types (e.g., an ID field that can be an int or string), and heterogeneity of type structure can be accommodated by making it easy to define new types. As a result, \sys{} takes a different approach based on complete types.
Finally, while prior work has explored expressing relation names within data~\shortorlongform{\cite{fisql_2005, relation_names, schemasql}}{\cite{fisql_2005, fisql_2007, relation_names, schemasql}}, we are aware of no existing data model that can express an entire schema or type as a single value within data.
\subsection{\sys{} Query Language} \label{ss:zed_query_language}
The \sys{} query language aims to support the same core functionalities as existing query languages for relational and document data, and to also provide new functionality by exposing type information to users so that they can query by and for types. \sys{} supports both of these---querying data and querying types---within the same language. We will briefly sketch \sys{}'s query language here; a complete description is beyond the scope of this paper.
The \sys{} language borrows heavily from existing query languages such as SQL, jq, and Lorel~\shortorlongform{\cite{lorel}}{\cite{lorel, lore_xml}}, as well as traditional UNIX shells. Similar to jq and Lorel, the \sys{} language can issue queries across heterogeneous data values and tolerate missing fields. For example, a query for \texttt{avg(temp)} over the data in Figure~\ref{f:iot_data} will quietly skip the non-matching \texttt{humidity} record and return the average of the \texttt{temp} fields in the other four records. The Zed language also supports many standard SQL operators such as \texttt{join} and \texttt{where}, as well as common UNIX commands such as \texttt{head} and \texttt{sort}.
A key novelty of the \sys{} language is that it takes well-known features of existing programming languages---type introspection and first-class types---and applies them to a query language. {\em Type introspection} is the ability to obtain the type of an individual data value. Python supports type introspection with a \texttt{type} function, and \sys{} enables it with a \texttt{typeof()} operator. This is a necessary building block to enable rich queries about types, and is feasible to implement because the \sys{} data model associates a type with every individual data value, even records. For example, a query for the type of the first record in Figure~\ref{f:iot_data} returns \texttt{<temperature=\{time:time}, \texttt{temp:int32\}>}.
%For example, issuing the query \texttt{typeof(this)} over the data \texttt{\{city: ``Broad Cove'' (string), population: 806 (uint32)\} (=city\_schema)} returns \texttt{<temp\_schema=\{time:time,temp:int32\}>}.
In contrast, existing query languages do not enable full type introspection. The \texttt{type} operator in jq, \texttt{TYPE} in N1QL~\cite{n1ql}, and \texttt{typeof} in JavaScript return strings such as \shortorlongform{``number''}{``number'' or ``boolean''} instead of a dedicated type type, and they do not capture the complete structure of complex types, simply returning ``object'' when issued over a JSON-version of each of the records in Figure~\ref{f:iot_data}.
%Unlike the \texttt{<temperature>} type above, the string ``object'' does not help users determine how to query or transform their data.
\shortorlongform{Programming languages have leveraged {\em first-class types} since the 1960s.}{Programming languages have leveraged {\em first-class types} since the 1960s~\cite{fundamental}, and there is some debate over exactly what properties are required in order for an element to be considered ``first class'' in a programming language.} Here we highlight three key first-class properties\shortorlongform{}{~\cite{pop2}} that types have in the \sys{} language:
\begin{CompactItemize}
\item Types can be returned from functions: \texttt{typeof()} returns a type
\item Types can be tested for equality: \texttt{typeof(this)==<temperature>}\footnote{The identifier \texttt{this} refers to input data values one-by-one, so a query for values with \texttt{typeof(this)==<temperature>} returns all \texttt{<temperature>} records.}
\item Types can be declared and passed by value: \newline{} \texttt{type temperature=\{time:time,temp:int32\}}
\end{CompactItemize}
These first-class types enable \sys{} to support rich data introspection, as we will show in Section~\ref{ss:zed_in_action_types}.
\begin{table}[t]
\begin{center}
\small
\begin{tabularx}{\columnwidth}{|p{2.6cm}|X|}
\hline
Operators & \texttt{cut}, \texttt{drop}, \texttt{head}, \texttt{join}, \texttt{put}, \texttt{rename}, \texttt{search}, \texttt{sort}, \texttt{tail}, \texttt{uniq}, \texttt{where} \\
\hline
Functions & \texttt{abs}, {\bf\texttt{cast}}, \texttt{ceil}, \texttt{floor}, {\bf \texttt{is}}, \texttt{log}, \texttt{lower}, {\bf \texttt{nameof}}, \texttt{sqrt}, {\bf \texttt{typeof}}, {\bf \texttt{typeunder}}, \texttt{upper}\\
\hline
Aggregation Functions & \texttt{and}, \texttt{any}, \texttt{avg}, \texttt{count}, \texttt{max}, \texttt{min}, \texttt{or}, \texttt{sum} \\
\hline
\end{tabularx}
\end{center}
\caption{Some of the operators and functions in the \sys{} query language. Bolded functions use types or type names as an input or output.}
\label{t:query_language}
\vspace{-1em}
\end{table}
% typename is another example, but this refers the named type for a string, which is a bit confusing to explain
The \sys{} query language is inspired by the dataflow pipeline pattern of traditional UNIX shells; it operates over a sequence of \sys{} values that can be piped from one operator to the next, though \Zed{} flowgraphs can split and merge the processing pipeline in the form of a directed-acyclic graph. Table~\ref{t:query_language} shows examples of the different components of the \sys{} query language. Dataflow operators take in and output a sequence of \sys{} values, functions appear in experssion context and take in zero or more \sys{} values and output a single \sys{} value, and aggregation functions operate in the context of the \texttt{summarize} operator by taking in a sequence of input values and producing an aggregated value.
Many \sys{} operators and functions leverage type information. The function \texttt{is} tests if a value is of a specified type (e.g., \texttt{is(<temperature>)}). The \texttt{nameof} function returns the string name of a value's type. The \texttt{typeunder} function returns the underlying type of a value, capturing its structure but omitting name information for named types (e.g., \texttt{typeunder} of the first record in Figure~\ref{f:iot_data} is \texttt{<{time:time,} \texttt{temp:int32}>}). Note that \sys{} allows queries to refer to types either with their name (using the \texttt{nameof} and \texttt{typeof} functions), or by their structure (using the \texttt{typeunder} function). This is useful for disambiguating between values that have the same type name but different structures (or vice versa), as with the different \texttt{<temperature>} types shown in Figure~\ref{f:iot_data}.
%These first-class types provide two main benefits in \sys{}. First, first-class types enable \sys{} to provide the same functionality as relational systems, by extracting all data records with a given type. For example, selecting all records with \texttt{typeof(this)==<temperature>} effectively extracts a relational \texttt{temperature} table out of a collection of heterogeneous records. Second, first-class types enable \sys{} to support rich introspection and shaping, as we will show (\S\ref{s:zed_in_action}). \noteamy{maybe move this whole paragraph (after listing the properties) to the section with Zed examples, \S\ref{s:zed_in_action}}
\subsection{\sys{} Format Family} \label{ss:zed_formats}
Prior work has shown that no single format is best for all use cases~\shortorlongform{\cite{one_size_fits_all_2005}}{\cite{one_size_fits_all_2005, one_size_fits_all_2007}}. Thus, \sys{} provides a family of three data formats: a text-based format and two binary formats. \sys{} also supports indexes over its binary formats. Data can be converted between these formats with no loss of information or human involvement because all three implement the \sys{} data model, including its comprehensive typing and ordering of values and fields. This is similar to converting between row-based and columnar relational data~\cite{h2o, peloton} and contrasts with converting between the document and relational models.
A key challenge in designing \sys{}'s data formats is encoding type definitions in a space-efficient way. In \sys{}'s text-based format ZSON, complete type information is implied by the textual structure as with JSON but can be elaborated inline with each individual value, as shown in Figure~\ref{f:iot_data}. Thus a ZSON file will typically be larger than a JSON file when the underlying data requires more detailed type information. In contrast, \sys{}'s binary formats only encode each type definitions efficiently, once per unique type, as each such type is enountered. When a new data type first appears in a sequence, it is assigned a numeric ID and both the ID and type definition are encoded in the stream. All subsequent values with the same type are encoded with the numeric type ID, rather than repeating the complete type definition. As a result, the amount of space required for type definitions in a binary \sys{} file scales linearly with the number of distinct types rather than the number of records. In practice, these type definitions comprise a tiny overhead compared to the data. Briefly, \sys{}'s formats are:
\para{ZSON:} ZSON is a text-based format, similar to JSON but with support for types, as shown in Figure~\ref{f:iot_data}. Moreover, ZSON is a superset of JSON, which makes compatibility with JSON-based legacy systems simple and easy. In our implementation, we have not tried to optimize ZSON to make it particularly performant as performance-critical computation are always carried out in the efficient (and information equivalent) binary formats described below.
\para{ZNG:} ZNG is a binary row-based format, somewhat like Avro~\cite{avro} but with fine-grained typing and support for heterogeneous values within the same file. ZNG interleaves encoded values and encoded type definitions. Each type definition binds the next available numeric type ID to a specific user-defined type. Values are encoded by first encoding the type ID. Then the value itself is encoded using a tag-encoding approach which consists of a tag specifying the value's length and then the value itself, similar to Protocol Buffers~\cite{protobufs}. While formats like JSON are challenging to parse efficiently~\shortorlongform{\cite{mison}}{\cite{mison, sparser}}, ZNG can be parsed quickly because each value's type and length are completely specified by its encoding. This enables a parser to skip records or parts of records whose fields are not relevant to a query.
\para{VNG:} Vector ZNG (VNG) is a binary format which generalizes existing columnar formats~\shortorlongform{\cite{parquet, dremel}}{\cite{parquet, dremel, orc}} by arranging heterogeneous values into vectors\footnote{We use the term "vector" instead of "column" since the hiearchical values are not columns of a table but rather vectors of primitive values formed from each leaf value in a hierarchical type.}. Each VNG file includes the vectors data section, a metadata section of reassembly maps, and a trailer defining the section boundaries. The {\em data section} encodes the vectors of data, where each is a vector of primitive-type values corresponding to leaf elements of a hierarchical data type (e.g., all the \texttt{unit} values in Figure~\ref{f:iot_data}) along with the encoding nulls for non-primitive elements. The {\em metadata section} enumerates all the types in the file and for each type it describes where to locate the vectors of data within the data section. The {\em trailer} includes the size of each section and additional metadata. All three sections are encoded using ZNG, contrasting with Parquet which relies on an additional format (Thrift) for encoding type definitions and other metadata.
%\begin{comment}
\subsection{\sys{∑} Type Contexts} \label{ss:zed_type_contexts}
When data values carry their complete type information, values are idempotent and can be trivially mixed together but explicitly encoding the type in every value can be inefficient. This is why, as described above, a sequence of \sys{} values embeds type definitions only for each unique type encountered in that sequence and each value is merely tagged with a reference to its previously defined type. The state that describes the types encountered is built up dynamically and is referred to as the {\em type context}. The type context binds the set of types in use with a reference to each such type (e.g., a small-integer type ID) so that each typed value in a sequence can be efficiently serialized by simply prefixing its type ID.
However, when multiple sequences with different type contexts are processed together, the type references in general will conflict. For example, in one data stream, the type \texttt{\{x:int64,y:int64\}} might have type ID 33 and in another stream it might have type ID 42. \sys{} resolves this conflict by including a mechanism to efficiently translate types from one type context to another, where streams may be combined by mapping the input type contexts into a shared output context and relabeling each value with its shared-context type.
This mapping process is efficient and involves a simple table lookup per typed value. In the example above, the shared type context might have type ID 35 for type \texttt{\{x:int64,y:int64\}} so the first stream merely needs a map from 33 to 35 and the second stream an entry in a table from 42 to 35. In this way, the output stream has a single consistent type context for all the values that have been combined. Moreover, the values themselves never need be recoded when crossing type-countext boundaries as care was taken in the design of the serialization format so that the value portion of the encoding never include type IDs and thus needs no modification as type IDs are adjusted.
The \sys{} system constructs type-context maps on the fly by adding entries for each newly encountered type using \zed{}'s canonical type value representation (which is a universal encoding not dependent on type IDs). For example, the type encoding of the ZSON type \texttt{\{x:int64,y:int64\}} is an efficient and unique binary representation of that ZSON text. The canonical type value is used as a key into a hash table that maps each type value in that context to its type ID (or to an in-memory pointer to a data structure representing the type). Since type value encodings are universal, they provide an unambiguous and unique syntax for defining any given type and relating any given type from one context to another.
Type-context maps are necessary when a query runs over multiple files simultaneously as each file forms its own type context and values must be merged into a shared context for the query runtime engine. With a shared context, \sys{} can trivially determine if each type in one scope matches a type defined in another scope. Relational systems typically only compare two schemas at once, e.g., to determine if two tables can be \texttt{UNION}ed.
\begin{comment}
\subsection{\sys{} Type Contexts} \label{ss:zed_type_contexts}
Locally scoped types raise a new challenge: how can \sys{} efficiently relate types across different scopes? For example, suppose a user wants to combine two \sys{} files into a single file. With text-based ZSON, simply concatenating the files is sufficient. However, naive concatenation of two binary \sys{} files may result in two different numeric type IDs for the same type (one from each original file), causing queries and storage formats to fail to recognize that the two types are the same. For example, the type \texttt{\{a:int32,b:string\}} may have been assigned ID 6 in the first file and 7 in the second.
\sys{} must relate types across different scopes whenever a user queries multiple files simultaneously or refers to types within a query. Relational systems typically only compare two schemas at once, e.g., to determine if two tables can be \texttt{UNION}ed. In contrast, \sys{} must determine if each value in scope B matches any type defined in scope A, a potentially quadratic number of comparisons.
\sys{} addresses this challenge using {\em type contexts}. A type context represents the set of types that are defined at a given point in a sequence of \sys{} values. As the \sys{} runtime parses a file, it builds up a representation of the type context in memory, caching a mapping from each type definition to its numeric ID and vice versa. As the runtime parses data in a second scope, for each value it encounters with type \texttt{T}, it checks the existing type context to see if \texttt{T} is already defined. For example, the runtime may discover that \texttt{\{a:int32,b:string\}} has already been assigned a type ID of 6 and then re-encode the value with the type ID of 6, instead of its original 7. If \texttt{T} is not yet defined in the type context, the runtime creates a new type definition with the next available type ID. This ensures that a given type is assigned the same ID throughout a file. Because each type context lookup takes constant time, merging type contexts takes time linear in the number of data values.
\end{comment}