Spec Formalization: Scalar Function Signatures #31

jacques-n · 2021-09-14T22:30:03Z

jacques-n
Sep 14, 2021
Maintainer

A key part of the Substrait specification is a clear list of function definitions and their semantics. Each function is defined through the extensions system (see #29). The current specification gives each scalar function definition a number of properties, including argument types, output derivation strategy, and description. The Substrait project is expected to declare all of the standard operations that most processing engines provide, along with the semantics of each definition. A discrete function signature should be declared for each variation of a function. An add operation, for example, will have plenty of variations; one might declare each of the following:

add(i8,i8) => i8 (add which silently allows overflow)
add_error(i8, i8) => i8 (add which errors on overflow)
add(i32,i32) => i32 (add which silently allows overflow)
add_error(i32, i32) => i32 (add which errors on overflow)

A function's identity key is based on:

name
argument types

In a particular organization, it is expected that there can be only one function with each key. When a plan declares a function evaluation, it must reference a specific function version.
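
A minimal sketch of the identity rule above, assuming a simple in-memory registry. The names `ScalarSignature` and `FunctionRegistry` are hypothetical and not part of the Substrait specification; the sketch only shows that name plus argument types forms the key and that a second registration under the same key is rejected.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical alias for illustration: (function name, argument type names).
FunctionKey = Tuple[str, Tuple[str, ...]]


@dataclass(frozen=True)
class ScalarSignature:
    name: str
    arg_types: Tuple[str, ...]
    return_type: str
    description: str = ""

    @property
    def key(self) -> FunctionKey:
        # Identity is name + argument types; the return type is not part of it.
        return (self.name, self.arg_types)


class FunctionRegistry:
    """Holds at most one signature per (name, argument types) key."""

    def __init__(self) -> None:
        self._by_key: Dict[FunctionKey, ScalarSignature] = {}

    def register(self, sig: ScalarSignature) -> None:
        if sig.key in self._by_key:
            raise ValueError(f"duplicate function key: {sig.key}")
        self._by_key[sig.key] = sig

    def lookup(self, name: str, arg_types: Tuple[str, ...]) -> ScalarSignature:
        # Raises KeyError when no signature matches the key exactly.
        return self._by_key[(name, arg_types)]


registry = FunctionRegistry()
registry.register(ScalarSignature("add", ("i8", "i8"), "i8", "silently allows overflow"))
registry.register(ScalarSignature("add_error", ("i8", "i8"), "i8", "errors on overflow"))
registry.register(ScalarSignature("add", ("i32", "i32"), "i32", "silently allows overflow"))
```

A plan would then refer to exactly one entry, for example `registry.lookup("add", ("i32", "i32"))`.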

Some questions that came up while sketching things out.

Should there be special support/handling for binary operators? (e.g. +,-,<, etc)
Should we have a function version for each function signature? What happens if someone redefines or changes the semantics of a function?
How to define complex behavior definitions? Start with only allowing definition in Substrait? Come up with some kind of way to express output type derivation using something language agnostic (e.g. a WebAssembly scripting language)? I suggest that we formalize the rest of the function specification and have a subgroup tackle this while we move on to other portions of the specification. I've started Spec Formalization: Custom Output Derivation for Function Signatures #32 to discuss this.

I suggest that we keep the discussion of representation/extensions in #29 and keep this focused on the meat of what a signature is and how it should be interpreted.

jacques-n · 2021-09-14T22:34:01Z

jacques-n
Sep 14, 2021
Maintainer Author

One of the sub-goals that I think is ideal: a plan producer should be able to probe a plan consumer and get a list of function signatures. In that situation, it would be ideal if a user form of the function would be automatically available. As such, you really want "name" to be something a user can type when trying to construct a plan. This suggests that names shouldn't be ugly (as they are in the case of add_error above).

One of the dynamics I was torn on when producing the initial specification was whether or not it makes sense to have additional information in a key. At one point I was thinking that an additional field for "error behavior" could be useful as part of the primary key, with values like "error, overflow, null", so that a producing system could be configured globally and then bind to the appropriate sub-instance of the function depending on its need.

2 replies

westonpace Sep 18, 2021
Maintainer

My understanding of what you have written is that there will need to be a unique name (key?) for each possible variation in function behavior. I'm not sure this is practical, as there is a surprising number of choices for even simple operations. So by this argument I would lean towards folding add_error into add, with add having an option that controls error behavior. I've got some additional thoughts on "options" that I'll add as separate discussion points.

jacques-n Sep 20, 2021
Maintainer Author

My thinking of key is that it would be "implicit". See my comment below where key might be add[0][error:OVERFLOW_SILENTLY]

jacques-n · 2021-09-14T22:36:35Z

jacques-n
Sep 14, 2021
Maintainer Author

A key property of functions as currently envisioned is that they can only determine output type based on the types of the inputs. Another key concept that came up was functions that take what are akin to symbols or enums. For example, ideally we could have a cast function that is defined as CAST(<type>, <input>). However, in order for that to work you need to actually inspect the <type> value to determine the output type. I've started thinking this should be allowed by us defining an additional internal type called "ENUM" that only supports a predefined set of literals within that function (not calculations) and can be used as part of resolving the output type of an expression.
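
A rough sketch of that idea, under the assumption that an enum argument is a literal drawn from a fixed list declared with the function. `EnumLiteral`, `CAST_TARGETS` and `derive_cast_output` are hypothetical names used only for illustration; the point is that output derivation may read the literal's value but never a computed value.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class EnumLiteral:
    """A constant drawn from a function-specific list of allowed literals."""
    value: str


# An argument seen by type derivation: a plain type name, or an enum literal.
Arg = Union[str, EnumLiteral]

# Hypothetical list of literals that this cast definition accepts.
CAST_TARGETS = ("i8", "i32", "i64", "fp64", "string")


def derive_cast_output(args: Sequence[Arg]) -> str:
    """Derive the output type of CAST(<type>, <input>) from its arguments."""
    target, _input_type = args
    if not isinstance(target, EnumLiteral):
        # A computed value is not allowed here, only a predefined literal.
        raise TypeError("first argument of cast must be an enum literal")
    if target.value not in CAST_TARGETS:
        raise ValueError(f"unknown cast target: {target.value}")
    return target.value
```

For instance, `derive_cast_output([EnumLiteral("i64"), "i32"])` resolves to `"i64"` without inspecting any row data.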

4 replies

cpcloud Sep 21, 2021
Maintainer

Is the ENUM approach too generic at the moment? Are there other examples of things where we'd need that approach, or can we special case cast for now?

westonpace Sep 24, 2021
Maintainer

From my exercise with Arrow I found that I needed to know the following (I'm not 100% sure I'm using the term "constant argument" correctly here; in Arrow terms this would be "function options"):

The other types
The value of potential constant arguments
The arity of the function

Some examples:

strptime has a constant argument "units" which describes the units the output timestamps should have (e.g. 'us', 'ms') [in some ways strptime is a cast though]
assume_timezone is Arrow specific (as it has a parameterized time zone) but it is the timestamp -> timestamptz function and requires knowing the destination time zone to compute the output type. This would not be a concern for a substrait system
make_struct is an Arrow function which takes in multiple arrays and a constant argument (list of field names) and returns a single struct array. Inspection of both the input types and the constant argument is needed in this case.
A truncate / round kernel for decimal types that truncates the precision to a target decimal precision
A make_fixed_list function which takes in X arguments of the same type and returns a fixed_size_list needs to know the arity of the function to create the fixed size list

jacques-n Sep 24, 2021
Maintainer Author

These are great examples to work against, thanks. Here is some categorization. I'm ignoring the fact that a couple of these aren't actually datatypes in substrait.

Constant Type	Description	Examples
Enum	A list of known values defined as part of the function definition	strptime, assume_timezone
Integer	A defined range of integer64 numbers	truncate, make_fixed_list
Type	A type declaration within substrait	make_struct, cast

Is it fair that those are three subcategories? Do we have other things that don't fall into one of those?

jacques-n Sep 24, 2021
Maintainer Author

Honestly, as I think about it more, make_struct actually falls into the category of what I was expressing as the projection equivalent of the field references "Masked Complex Expression". I'm inclined to think that we should not try to deal with that in the context of function definitions. It feels too first class.

westonpace · 2021-09-18T02:17:41Z

westonpace
Sep 18, 2021
Maintainer

Should options be expressed as arguments (e.g. maybe arguments that have to be scalar) or is there a special options object? Or do all options need to be "fields" capable of being scalars or arrays? At the moment, in Arrow, it can be a bit of a game to choose which one something is.

For example:

A list extraction kernel which takes a list array (a column where each element is a list) and a list of indices and returns a list of values. Or is the index to extract an option and the same for all rows? This one we decided made more sense as a separate argument.
Is NA treated as a null in a drop_null or a is_null function? This one is currently listed as an option.
String manipulation kernels that take in a regular expression or a find/replace pair currently treat this as an option but it could easily be seen as an argument.

3 replies

jacques-n Sep 20, 2021
Maintainer Author

I introduced the concept of "constant" arguments in the argument type definitions which I think fills half the need. The big questions are:

Can output derivation get constant values during derivation determination?
Does exposing it as an argument like this make it harder for a tool to expose signatures directly to end users? Maybe we should introduce "options" as things engines define, not users, and pass them separately?

cpcloud Sep 21, 2021
Maintainer

It seems like it would require additional work to hide constant values from output determination. Wouldn't the constants be available immediately?

jacques-n Sep 24, 2021
Maintainer Author

I would definitely say that at least in some implementations, the data is entirely separate from the types and it is NOT easier to expose that information :D

Making all constant data available to a type expression makes the type expression language much more complex and starts turning it into a new programming language.

westonpace · 2021-09-18T02:19:17Z

westonpace
Sep 18, 2021
Maintainer

In a particular organization, it is expected that there can be only one function with each key. When a plan declares a function evaluation, it must reference a specific function version.

Just to be overly clear. You are saying there is one "function" but there can still be multiple variations of said function based on the data types correct? For example, in Arrow we have "functions" (and there is indeed just one) and each function has many different "kernels" which express the different variations.

1 reply

jacques-n Sep 20, 2021
Maintainer Author

As I've thought more about this, I'm inclined to a structure something like this (as opposed to what is currently sketched in spec/extension files):

add:
  - args: [i32,i32]
    options:
      - error: [OVERFLOW_SILENTLY, FAIL, NULL_OUT]
    output: i32
  - args: [i64,i64]
    options:
      - error: [OVERFLOW_SILENTLY, FAIL, NULL_OUT]
    output: i64

so a plan would bind add[0][error:OVERFLOW_SILENTLY] according to the appropriate expectations.
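
As a sketch of how such an implicit key might be resolved, assuming the structure above is loaded into a dictionary and that both variants carry the same `error` option. The `FUNCTIONS` table, the key grammar and the `bind` helper are all hypothetical; nothing here is a defined Substrait format.

```python
import re
from typing import Dict, Tuple

OVERFLOW_CHOICES = ["OVERFLOW_SILENTLY", "FAIL", "NULL_OUT"]

# Hypothetical in-memory form of the definition sketched above.
FUNCTIONS = {
    "add": [
        {"args": ["i32", "i32"], "options": {"error": OVERFLOW_CHOICES}, "output": "i32"},
        {"args": ["i64", "i64"], "options": {"error": OVERFLOW_CHOICES}, "output": "i64"},
    ]
}

# name[index] followed by zero or more [option:VALUE] groups.
_KEY = re.compile(r"^(\w+)\[(\d+)\]((?:\[\w+:\w+\])*)$")
_OPTION = re.compile(r"\[(\w+):(\w+)\]")


def bind(key: str) -> Tuple[dict, Dict[str, str]]:
    """Resolve a key such as add[0][error:OVERFLOW_SILENTLY] to a variant."""
    match = _KEY.match(key)
    if match is None:
        raise ValueError(f"malformed function key: {key}")
    name, index, tail = match.group(1), int(match.group(2)), match.group(3)
    variants = FUNCTIONS[name]  # KeyError for an unknown function name
    if index >= len(variants):
        raise IndexError(f"{name} has no variant {index}")
    variant = variants[index]
    chosen = dict(_OPTION.findall(tail))
    for option, value in chosen.items():
        allowed = variant["options"].get(option)
        if allowed is None or value not in allowed:
            raise ValueError(f"{option}:{value} is not valid for {name}[{index}]")
    return variant, chosen
```

With this table, `bind("add[0][error:OVERFLOW_SILENTLY]")` selects the `[i32,i32]` variant together with its chosen error behavior.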

westonpace · 2021-09-18T02:36:55Z

westonpace
Sep 18, 2021
Maintainer

Another question that comes up with function signatures is casting. For example, if a system supports ADD(int16, int16) then it can support ADD(int8, int16) because it can cast up one of the inputs.

I don't think Substrait needs to define ADD(int8, int16) however because, for plans that choose to support it, they can define it as ADD(CAST(x, int16), y).

This helps keep Substrait from needing to define where casting makes sense and where it doesn't. For example, I can cast most things to string but it wouldn't make sense to support find/replace on an integer just because I can cast the integer to a string. On the other hand, you could say that Substrait is interested in deciding where casting makes sense and where it does not, because then we avoid inconsistency across systems? I don't know, it's possible I haven't thought this out too much.

3 replies

jacques-n Sep 20, 2021
Maintainer Author

My expectation is that the plan should always be fully bound (e.g. casts are explicitly included as necessary). Substrait would have declarations for cast functions but it would be up to the plan producer to insert where they want and use the appropriate functions as expected.
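
A small sketch of what "fully bound" could look like on the producer side, assuming a toy expression tree. `Column`, `Call`, `INT_WIDTH` and `bind_add` are hypothetical, and the widening policy shown is this imaginary producer's own choice, not something Substrait would dictate.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Column:
    name: str
    type: str


@dataclass(frozen=True)
class Call:
    function: str
    args: Tuple["Expr", ...]
    type: str


Expr = Union[Column, Call]

# Hypothetical producer policy: integer types it is willing to widen.
INT_WIDTH = {"i8": 8, "i16": 16, "i32": 32, "i64": 64}


def bind_add(left: Expr, right: Expr) -> Call:
    """Emit add(T, T) => T, inserting an explicit cast on the narrower side."""
    if left.type not in INT_WIDTH or right.type not in INT_WIDTH:
        raise TypeError(f"cannot bind add({left.type}, {right.type})")
    target = max(left.type, right.type, key=INT_WIDTH.__getitem__)

    def widen(expr: Expr) -> Expr:
        if expr.type == target:
            return expr
        # The cast is part of the plan, so the consumer never has to infer it.
        return Call("cast", (expr,), target)

    return Call("add", (widen(left), widen(right)), target)


plan = bind_add(Column("x", "i8"), Column("y", "i16"))
```

Here `plan` is `add(cast(x), y)` with both operands typed `i16`, so the consumer only needs the `add(i16, i16)` signature.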

cpcloud Sep 21, 2021
Maintainer

+1 for keeping specific choices about when to cast out of Substrait. Those semantics are different across systems, and I think more people will adopt the spec if the spec supports whatever casting behavior they want.

westonpace Sep 24, 2021
Maintainer

I've also come around to +1 that casts should be explicitly included as necessary. Are the two of you saying the same thing though?

Substrait would have declarations for cast functions

Those semantics are different across systems, and I think more people will adopt the spec if the spec supports whatever casting behavior they want.

I think my understanding is: Substrait will define a minimum subset of valid cast operations but it will not define when adding a cast to the plan is the right thing to do. Implementations will need to figure this out and are free to add their own internal casting options on top of Substrait's defined set of casts.

Hopefully this is in agreement with both of your points.

westonpace · 2021-09-18T02:40:04Z

westonpace
Sep 18, 2021
Maintainer

What happens in your above example if a system (maybe a mature system that isn't likely to change) only supports some subset of the variations of a function (for example, it supports add but not add_error semantics)? Should we support a "DontCare" or "Wildcard" option value to allow plan producers to write plans that are maximally consumable? In other words, could a plan ask for an add operation but be ok with either add or add_error semantics?

1 reply

jacques-n Sep 20, 2021
Maintainer Author

Rather than think of that as core to substrait specification itself, my expectation has been that the community would build a library of common plan transformations and so people could use the fallback behavior they wanted (substrait polyfills). If you don't have X, then substrait will do something with a rewrite to Y. You pick the polyfills that you're okay with.

Some polyfills would be lossy and thus a user/builder would need to decide and declare what polyfills they are applying to "roughly" consume the intention. For example, imagine a polyfill for a system that only supports fp64 but not Decimal. It could rewrite a plan to work using fp64 as opposed to Decimal. It's lossy but may be fine in certain circumstances. Another polyfill might be converting count distinct to approximate count distinct for systems that have a restricted capacity for count distinct. I can also imagine a polyfill for window frame operations when a system doesn't support it directly (I think).
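
To make the opt-in aspect concrete, here is a deliberately simplified sketch that rewrites a list of typed fields rather than a real plan. `Field`, `Polyfill`, `DECIMAL_AS_FP64` and `apply_polyfills` are hypothetical names; the only behavior illustrated is that a lossy polyfill is refused unless the user has explicitly accepted it.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Field:
    name: str
    type: str


@dataclass(frozen=True)
class Polyfill:
    name: str
    lossy: bool
    rewrite: Callable[[Field], Field]


def _decimal_to_fp64(field: Field) -> Field:
    # Lossy: fp64 cannot represent every decimal value exactly.
    if field.type.startswith("decimal"):
        return replace(field, type="fp64")
    return field


DECIMAL_AS_FP64 = Polyfill("decimal_as_fp64", lossy=True, rewrite=_decimal_to_fp64)


def apply_polyfills(
    fields: Iterable[Field],
    polyfills: Iterable[Polyfill],
    accepted_lossy: Set[str],
) -> List[Field]:
    """Apply each polyfill in order, refusing lossy ones that were not accepted."""
    out = list(fields)
    for polyfill in polyfills:
        if polyfill.lossy and polyfill.name not in accepted_lossy:
            raise ValueError(f"lossy polyfill not accepted: {polyfill.name}")
        out = [polyfill.rewrite(field) for field in out]
    return out
```

A caller that is fine with the precision loss would pass `accepted_lossy={"decimal_as_fp64"}`; any other caller gets an error instead of a silently altered plan.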

westonpace · 2021-09-24T11:58:58Z

westonpace
Sep 24, 2021
Maintainer

Where do we plan to use these function specifications? In other words, what is the use case for a serializable function definition? In theory Substrait could define what functions are supposed to do without ever creating a serializable specification for functions. That being said I can think of two use cases for serialization off the top of my head. Are there others?

Systems which wish to test that they conform to the Substrait specification may wish to consume the list of functions in a serializable format so that they can run tests to ensure their target system supports all of the possible permutations. These systems will still need to verify "correctness" of a function with some value outside of Substrait (although, given a spec, two systems could presumably be tested against each other). In this case the spec only needs to be expressive enough to capture the functions described by Substrait.
Systems will (I think but I don't know that I've seen it formally stated yet) need to inform other systems of what functions they provide. In this case this may include functions that live outside of Substrait and so the specification would need to be general enough to be useful for systems to express the full range of capabilities.

1 reply

jacques-n Sep 24, 2021
Maintainer Author

It comes down to extensibility. If we want systems to be able to work with functions that they didn't know at compile time, we need a way to communicate the behavior of those functions. It could be within a single system or in communication between two systems. A clear example is an engine that has a UDF and wants to inform a producer of that UDF. In that case, the producer wants to expose the function to an end user but validate the plan before submitting it. To do so, it needs to know the function's input and derivation rules.

westonpace · 2021-09-24T12:11:40Z

westonpace
Sep 24, 2021
Maintainer

I was doing some testing work as part of Arrow and I needed to be able to compactly represent the list of possible function permutations so that I could verify that Arrow supported all permutations. I ended up defining a simplistic DSL for functions. I'm not recommending it for Substrait as is because I'm sure it could be improved on and the goals were slightly different (compactness and human readable/writable were higher priorities). An example of this can be found here. (FYI, don't focus too much on the output types as many of them are wrong because I haven't started testing that part yet).

I think most of the things I learned are fairly compatible with the direction things are going. Once the spec is a bit closer to something serializable I hope to attempt converting my JSON Arrow representation to a Substrait representation.

So this point is maybe not the greatest "discussion" point but I keep referencing "my Arrow exercise" in some of these discussions so I figured it worth describing what that was.

2 replies

jacques-n Sep 24, 2021
Maintainer Author

Thanks for sharing. I'd say the main observation I have about your initial implementation is that it seems like your scheme is quite verbose. For example, you have: <T:string>(T)=>T. Why not just make that string => string?

westonpace Sep 30, 2021
Maintainer

Ease of parsing but I see the point

jacques-n · 2021-09-29T03:21:52Z

jacques-n
Sep 29, 2021
Maintainer Author

I've updated the key parts of the description of behavior for output derivation based on our conversation last week. Would love your opinion on the updates here. Please focus on the updated text for argument types and beyond. I'll work on updating the protobuf sketch in the meantime. Thanks!

1 reply

jacques-n Sep 29, 2021
Maintainer Author

CC @westonpace , @cpcloud @rdblue

westonpace · 2021-09-30T01:18:44Z

westonpace
Sep 30, 2021
Maintainer

I believe the current approach is for "options" to be expressed as constant arguments. I don't know if this is a bad thing or not but will it lead to an abundance of extension types? For example, null selection behavior is a popular option:

  enum NullSelectionBehavior {
    /// The corresponding filtered value will be removed in the output.
    DROP,
    /// The corresponding filtered value will be null in the output.
    EMIT_NULL,
  };

Declaring this as a constant int8 option doesn't express much. I could then create an extension type which would clarify the options but it seems like this may lead to an abundance of extension types. Alternatively, do we want an enum logical type? I guess it wouldn't pass the 2-systems rule.

4 replies

westonpace Oct 6, 2021
Maintainer

Gentle ping @jacques-n

jacques-n Oct 6, 2021
Maintainer Author

Thanks, forgot about this one.

When we had the call a week or so ago, my interpretation of the discussion was that we wouldn't have anything in the signatures themselves to support this pattern. Our conversation, though, was mostly focused on the type derivation stuff so I can see how this got lost.

I'm somewhat torn on the situation. Options I see:

Option A: Stay as currently stated, and people can use constant string arguments if they'd like. This at least makes things clearer than an i8. At the same time, it means that Substrait really can't do much in the way of validation (and neither can a frontend system that gets a list of function signatures).
Option B: Actually support enum arguments, so that FE systems (as well as Substrait intermediaries) can expose them, validate them, etc. This would not be a datatype in Substrait, but a third type of argument. Basically add another option to the current set of value and type arguments here.

Note, in the latest yaml function list, I proposed something which is slightly different: behavioral options which could be specified as part of the function selection process but are generally envisioned as internal concerns of the system as opposed to external user behavior properties.

As I look at the things I did in the latest yaml, I wonder if we'd be better off just formalizing as option B. It really feels like this is a real thing and treating it as a special kind of "enum constant" or similar would be best.

westonpace Oct 7, 2021
Maintainer

There was some discussion of this topic on Slack (https://substrait.slack.com/archives/C02D7CTQXHD/p1633566594104600). In summary, I was a little behind on what "options" were in the current functions yaml.

Neither of us was in favor of "Option A" in the above post (using string arguments). One place where this is used that could be improved was:

- name: extract
  description: Extract portion of a date/time value.
  parameters:
    - D: [timestamp, timestamp_tz, date, time]
  arguments:
    - type: D
      name: Date/time value to extract information from.
    - type: STRING
      name: The part of the value to extract.
      constant: TRUE
  return: i64

Having type: STRING for the second argument is undesirable because it makes validation difficult. For that reason it seems that enum arguments do make sense.

Once we have enum arguments then the current "options" feature becomes a little redundant. For example:

- name: '+'
  description: "Add two numeric values."
  options:
    overflow: [SILENT, SATURATE, ERROR]
  variants:
    - variant: scalar
      parameters:
        - K: [i8,i16,i32,i64,fp32,fp64]
      arguments:
        - type: K
        - type: K
      return: K

The options: overflow could easily be a third argument to add which has a type of "overflow enum".

This avoids the need for extension types.
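
For illustration, a sketch of add over i8 with the overflow behavior passed as a third, enum-typed argument. The `Overflow` enum and `add_i8` are hypothetical, and reading SILENT as two's-complement wraparound is an assumption on my part; the yaml above does not define what each choice means.

```python
from enum import Enum


class Overflow(Enum):
    SILENT = "SILENT"
    SATURATE = "SATURATE"
    ERROR = "ERROR"


I8_MIN, I8_MAX = -128, 127


def add_i8(a: int, b: int, overflow: Overflow) -> int:
    """Add two i8 values; `overflow` selects the out-of-range behavior."""
    total = a + b
    if I8_MIN <= total <= I8_MAX:
        return total
    if overflow is Overflow.SILENT:
        # Assumed meaning of SILENT: wrap around as two's complement would.
        return (total - I8_MIN) % 256 + I8_MIN
    if overflow is Overflow.SATURATE:
        # Clamp to the nearest representable bound.
        return I8_MAX if total > I8_MAX else I8_MIN
    raise OverflowError(f"i8 overflow: {a} + {b}")
```

Because the third argument is an enum rather than a string or an i8, a producer or frontend can validate it against the declared choices before the plan is sent.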

jacques-n Oct 7, 2021
Maintainer Author

I've posted #49 to try to add this functionality. We should also add the explicit resolution example you gave here.

westonpace · 2021-09-30T01:20:23Z

westonpace
Sep 30, 2021
Maintainer

How does a consumer express partial support for function options? Again, considering my example:

  enum NullSelectionBehavior {
    /// The corresponding filtered value will be removed in the output.
    DROP,
    /// The corresponding filtered value will be null in the output.
    EMIT_NULL,
  };

What happens if I, a consumer, only support EMIT_NULL. Does a producer have to make a plan, send the plan to the consumer, and wait for the consumer to reject the plan? Or is there some way the consumer could communicate which subset of option values it supports?

1 reply

westonpace Oct 7, 2021
Maintainer

From the discussion on slack:

The argument in the yaml would have an enum type
The argument would have a possible value of UNSPECIFIED (this may also be the default value)
If the producer chooses "UNSPECIFIED" then the consumer should consider the enum definition to be an ordered list of choices and should choose the first choice that the consumer supports.

For example, given an add function with an argument overflow which is an enum option with the choices UNSPECIFIED, SILENT, SATURATE, ERROR and given a consumer that only supports SATURATE and ERROR:

If the producer picks SILENT then the consumer should return an error
If the producer picks ERROR then the consumer should use ERROR
If the producer picks UNSPECIFIED then the consumer should use SATURATE

If an enum option does not have UNSPECIFIED as a choice then it is always an error if the consumer does not support it (for example, the enum used for time extraction given in the other discussion).
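
The resolution rules above can be sketched as follows. The function name `resolve_enum_option` and its parameters are hypothetical; the logic only encodes the three rules from the Slack discussion, with the enum declaration treated as an ordered list of choices.

```python
from typing import Sequence, Set

UNSPECIFIED = "UNSPECIFIED"


def resolve_enum_option(declared: Sequence[str], supported: Set[str], requested: str) -> str:
    """Pick the enum value a consumer should use for a producer's request.

    declared:  the enum's choices in declaration order (may include UNSPECIFIED)
    supported: the subset of choices this consumer implements
    requested: the value chosen by the producer
    """
    if requested not in declared:
        # Covers UNSPECIFIED when the enum does not offer it as a choice.
        raise ValueError(f"{requested} is not a declared choice")
    if requested != UNSPECIFIED:
        if requested not in supported:
            raise ValueError(f"consumer does not support {requested}")
        return requested
    # UNSPECIFIED: take the first declared choice that the consumer supports.
    for choice in declared:
        if choice != UNSPECIFIED and choice in supported:
            return choice
    raise ValueError("consumer supports none of the declared choices")


overflow_choices = ["UNSPECIFIED", "SILENT", "SATURATE", "ERROR"]
consumer_supports = {"SATURATE", "ERROR"}
```

With these values, requesting `ERROR` yields `ERROR`, requesting `UNSPECIFIED` yields `SATURATE`, and requesting `SILENT` raises an error, matching the example in this comment.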
