-
One of the sub-goals that I think is ideal: a plan producer should be able to probe a plan consumer and get a list of function signatures. In that situation, it would be ideal if a user form of the function were automatically available. As such, you really want "name" to be something a user can type when trying to construct a plan. This suggests that names shouldn't be ugly (as they are in the case of

One of the dynamics I was torn on when producing the initial specification was whether or not it makes sense to have additional information in a key. At one point I was thinking that an additional field, "error behavior", could be useful as part of the primary key, with values like "error, overflow, null", so that a producing system could be configured globally and then bind to the appropriate sub-instance of the function depending on its need.
-
A key property of functions as currently envisioned is that they can only determine output type based on the types of the inputs. Another key concept that came up was functions that take what are akin to symbols or enums. For example, ideally we could have a cast function that is defined as CAST(<type>, <input>). However, in order for that to work you need to actually inspect the <type> value to determine the output type. I've started thinking we should allow this by defining an additional internal type called "ENUM" that only supports a predefined set of literals within that function (not calculations) and can be used when resolving the output type of an expression.
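A minimal sketch of that idea, purely as an illustration: the `EnumLiteral` class, the `CASTABLE_TYPES` set, and `cast_output_type` are hypothetical names invented here, not part of any Substrait specification. The point is only that the output type is derived from the literal value of the enum argument rather than from the argument's type.

```python
from dataclasses import dataclass

# Hypothetical set of target type names; the real list of types may differ.
CASTABLE_TYPES = frozenset({"i8", "i32", "i64", "fp64", "string"})


@dataclass(frozen=True)
class EnumLiteral:
    """A literal restricted to a predefined set of symbols (never a computed value)."""

    value: str
    allowed: frozenset

    def __post_init__(self):
        # Reject anything outside the predefined set at plan-construction time.
        if self.value not in self.allowed:
            raise ValueError(f"{self.value!r} is not one of {sorted(self.allowed)}")


def cast_output_type(target: EnumLiteral, input_type: str) -> str:
    """Derive the output type of CAST(<type>, <input>).

    The result depends on the *value* of the enum literal, which is what
    ordinary type-based output derivation cannot express. The input type is
    accepted but not used; a real system would also check castability.
    """
    return target.value


# Usage: the output type follows the enum literal, not the input type.
resolved = cast_output_type(EnumLiteral("i32", CASTABLE_TYPES), "string")
```

Because the literal is validated against a fixed set and cannot be the result of a calculation, the output type can still be resolved when the plan is built.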
-
Should options be expressed as arguments (e.g. maybe arguments that have to be scalar) or is there a special options object? Or do all options need to be "fields" capable of being scalars or arrays? At the moment, in Arrow, it can be a bit of a game to choose which one something is. For example:
-
Just to be overly clear: you are saying there is one "function" but there can still be multiple variations of said function based on the data types, correct? For example, in Arrow we have "functions" (and there is indeed just one) and each function has many different "kernels" which express the different variations.
-
Another question that comes up with function signatures is casting. For example, if a system supports I don't think Substrait needs to define This helps keep Substrait from needing to define where casting makes sense and where it doesn't. For example, I can cast most things to string but it wouldn't make sense to support find/replace on an integer just because I can cast the integer to a string. On the other hand, you could say that Substrait is interested in deciding where casting makes sense and where it does not, because then we avoid inconsistency across systems. I don't know; it's possible I haven't thought this through enough.
-
What happens in your above example if a system (maybe a mature system that isn't likely to change) only supports some subset of the variations of a function (for example, it supports add but not add_error semantics)? Should we support a "DontCare" or "Wildcard" option value to allow plan producers to write plans that are maximally consumable? In other words, could a plan ask for an add operation but be ok with either add or add_error semantics?
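To make the question concrete, here is a rough sketch of what "DontCare" binding could look like. Everything in it is an assumption for illustration: the `bind_variant` helper and the idea of expressing a wildcard as a list of acceptable variants are not taken from any existing specification.

```python
def bind_variant(acceptable, supported):
    """Return the first acceptable variant the consumer supports, or None.

    `acceptable` is ordered by producer preference; a "wildcard" request is
    simply one that lists every variant of the function family.
    """
    for name in acceptable:
        if name in supported:
            return name
    return None


# Hypothetical mature consumer that only implements silent-overflow add.
consumer_supported = {"add"}

# Strict request: only error-on-overflow semantics are acceptable.
strict = bind_variant(["add_error"], consumer_supported)

# Wildcard request: either semantics is fine, with add_error preferred.
relaxed = bind_variant(["add_error", "add"], consumer_supported)
```

Here the strict request cannot be bound, while the wildcard request falls back to the variant the consumer actually has.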
-
Where do we plan to use these function specifications? In other words, what is the use case for a serializable function definition? In theory Substrait could define what functions are supposed to do without ever creating a serializable specification for functions. That being said I can think of two use cases for serialization off the top of my head. Are there others?
-
I was doing some testing work as part of Arrow and I needed to be able to compactly represent the list of possible function permutations so that I could verify that Arrow supported all permutations. I ended up defining a simplistic DSL for functions. I'm not recommending it for Substrait as is because I'm sure it could be improved on and the goals were slightly different (compactness and being human readable/writable were higher priorities). An example of this can be found here. (FYI, don't focus too much on the output types as many of them are wrong because I haven't started testing that part yet). I think most of the things I learned are fairly compatible with the direction things are going. Once the spec is a bit closer to something serializable I hope to attempt converting my JSON Arrow representation to a Substrait representation. So this is maybe not the greatest "discussion" point, but I keep referencing "my Arrow exercise" in some of these discussions so I figured it was worth describing what that was.
-
I've updated the key parts of the description of behavior for output derivation based on our conversation last week. Would love your opinion on the updates here. Please focus on the updated text for argument types and beyond. I'll work on updating the protobuf sketch in the meantime. Thanks!
-
I believe the current approach is for "options" to be expressed as constant arguments. I don't know if this is a bad thing or not but will it lead to an abundance of extension types? For example, null selection behavior is a popular option:
Declaring this as a constant int8 option doesn't express much. I could then create an extension type which would clarify the options but it seems like this may lead to an abundance of extension types. Alternatively, do we want an enum logical type? I guess it wouldn't pass the 2-systems rule.
-
How does a consumer express partial support for function options? Again, considering my example:
What happens if I, a consumer, only support EMIT_NULL? Does a producer have to make a plan, send the plan to the consumer, and wait for the consumer to reject the plan? Or is there some way the consumer could communicate which subset of option values it supports?
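One possible shape for the second alternative, sketched under loud assumptions: the capability table, the `filter` function name, the `null_selection_behavior` option name, and the `DROP` value are all hypothetical placeholders (only `EMIT_NULL` comes from the example above). The consumer publishes the option values it accepts and the producer validates a plan locally before sending it.

```python
# Hypothetical capability advertisement published by a consumer:
# (function name, option name) -> set of option values it supports.
CONSUMER_CAPABILITIES = {
    ("filter", "null_selection_behavior"): {"EMIT_NULL"},
}


def unsupported_options(plan_calls, capabilities):
    """List every (function, option, value) in the plan the consumer cannot honor.

    `plan_calls` is a list of (function name, {option name: value}) pairs.
    An option the consumer never advertised is treated as unsupported.
    """
    problems = []
    for function, options in plan_calls:
        for option, value in options.items():
            allowed = capabilities.get((function, option), set())
            if value not in allowed:
                problems.append((function, option, value))
    return problems


# Usage: the producer checks before sending instead of waiting for a rejection.
plan = [("filter", {"null_selection_behavior": "DROP"})]
problems = unsupported_options(plan, CONSUMER_CAPABILITIES)
```

With something like this, a rejected plan becomes a local validation error on the producer side rather than a round trip to the consumer.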
-
A key part of the Substrait specification is a clear list of function definitions and their semantics. To define these semantics, each function needs to be defined using the extensions system (see #29). The current specification defines a number of properties for each scalar function definition, including argument types, output derivation strategy, and description. It is expected that the Substrait project will declare all of the standard operations that most processing engines have, along with the semantics of each definition. A discrete function signature should be declared for each variation of a function. An add operation, for example, will have plenty of variations; one might declare each of the following:
add(i8, i8) => i8 (add which silently allows overflow)
add_error(i8, i8) => i8 (add which errors on overflow)
add(i32, i32) => i32 (add which silently allows overflow)
add_error(i32, i32) => i32 (add which errors on overflow)
A function's identity key is based on:
In a particular organization, it is expected that there can be only one function with each key. When a plan declares a function evaluation, it must reference a specific function version.
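As a rough illustration of the one-function-per-key rule, the sketch below assumes, purely for the sake of the example, that the key is the pair of function name and argument types; the actual key components are not reproduced here, and `FunctionRegistry` is a hypothetical name.

```python
class FunctionRegistry:
    """Holds function signatures for one organization, one per key."""

    def __init__(self):
        self._by_key = {}

    def declare(self, name, arg_types, output_type):
        # Assumed key for illustration only: (name, argument types).
        key = (name, tuple(arg_types))
        if key in self._by_key:
            raise ValueError(f"a function with key {key} is already declared")
        self._by_key[key] = output_type

    def lookup(self, name, arg_types):
        """Resolve the specific function a plan refers to; KeyError if unknown."""
        return self._by_key[(name, tuple(arg_types))]


registry = FunctionRegistry()
registry.declare("add", ["i8", "i8"], "i8")          # silently allows overflow
registry.declare("add_error", ["i8", "i8"], "i8")    # errors on overflow
registry.declare("add", ["i32", "i32"], "i32")
registry.declare("add_error", ["i32", "i32"], "i32")

# A plan that evaluates add_error on two i32 values binds to exactly one entry.
output_type = registry.lookup("add_error", ["i32", "i32"])
```

Declaring a second function under an existing key is rejected, which is the property that lets a plan reference a function unambiguously.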
Some questions that came up while sketching things out.
I suggest that we keep the discussion of representation/extensions in #29 and keep this focused on the meat of what a signature is and how it should be interpreted.