Spec Formalization: Custom Output Derivation for Function Signatures #32

jacques-n · 2021-09-14T23:25:29Z

jacques-n
Sep 14, 2021
Maintainer

Functions that work entirely with simple (non-compound) types typically have very simple semantics. For example: add(i32, i32) => i32. Such functions have what function signatures declare as a "direct" output type. For compound types, derivation strategies can be more complex. A couple of examples:

A function such as func(List<T>) => T
A function such as func(decimal(p1, s1), decimal(p2, s2)) => decimal(max(p1+p2, 38), min(s1, s2))

To correctly validate plans, the function signatures and the Substrait specification must support these arbitrary output type derivation strategies. We therefore need a way to define the logic of type derivation so that whenever a function is used in a plan, its output type is fully determined. So the question is: what is the simplest way to support this that remains portable across systems?

wesm · 2021-09-15T01:49:54Z

wesm
Sep 15, 2021

If the type derivations are to live outside of a Substrait expression, then in the general case, I believe you would need to define a "simple" type derivation grammar for function signatures (for example, list_append(v: list<T>, v: T) => list<T>). The downside is that implementations have to implement that grammar.
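To make the cost of "implementing that grammar" concrete, here is a minimal Python sketch of how an implementation might bind a type variable from a signature such as list_append. It is not part of any Substrait proposal: the tuple encoding of types, the `TYPE_VARS` set, and the `bind`, `substitute`, and `derive` helpers are all assumptions made for illustration.

```python
from typing import Dict, Tuple, Union

# Assumed encoding: a type is a name ("i32") or a (constructor, argument) pair
# such as ("list", "i32").
Type = Union[str, Tuple[str, "Type"]]

TYPE_VARS = {"T"}  # names treated as type variables in this sketch


def bind(pattern: Type, actual: Type, env: Dict[str, Type]) -> None:
    """Match a signature pattern against a concrete type, recording bindings."""
    if isinstance(pattern, str):
        if pattern in TYPE_VARS:
            # The first use binds the variable; later uses must agree with it.
            if env.setdefault(pattern, actual) != actual:
                raise TypeError(f"{pattern} bound to {env[pattern]}, got {actual}")
        elif pattern != actual:
            raise TypeError(f"expected {pattern}, got {actual}")
        return
    if isinstance(actual, str) or pattern[0] != actual[0]:
        raise TypeError(f"expected {pattern}, got {actual}")
    bind(pattern[1], actual[1], env)


def substitute(pattern: Type, env: Dict[str, Type]) -> Type:
    """Replace bound type variables in the output pattern."""
    if isinstance(pattern, str):
        return env.get(pattern, pattern)
    return (pattern[0], substitute(pattern[1], env))


def derive(params, output, args) -> Type:
    """Derive the output type of one call, or raise if the arguments do not bind."""
    if len(params) != len(args):
        raise TypeError("wrong number of arguments")
    env: Dict[str, Type] = {}
    for pattern, actual in zip(params, args):
        bind(pattern, actual, env)
    return substitute(output, env)


# list_append(list<T>, T) => list<T>
LIST_APPEND = ([("list", "T"), "T"], ("list", "T"))
result = derive(LIST_APPEND[0], LIST_APPEND[1], [("list", "i32"), "i32"])
```

Even this small matcher needs unification and substitution, which gives a sense of the work each implementation would take on.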

3 replies

jacques-n Sep 15, 2021
Maintainer Author

I was hoping we could find some kind of simple expression language that could be converted to various targets such as C code, LLVM bytecode, etc. I worry that a custom grammar would quickly turn into a language and become a painful constraint for function developers. I spent a little time looking at some examples, such as decimal derivation behaviors. In that case you have pretty simple math (including min/max).
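As a rough illustration of how narrow such an expression language could be, the sketch below evaluates integer expressions limited to +, -, *, min, and max over named parameters. It borrows Python's own parser through the standard `ast` module purely for convenience; the function name `eval_type_expr` and the choice of operators are assumptions for this example, not anything proposed in the thread.

```python
import ast

# Whitelisted operators and functions; everything else is rejected.
_BIN_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
}
_FUNCS = {"min": min, "max": max}


def eval_type_expr(expr: str, params: dict) -> int:
    """Evaluate an integer expression over named type parameters."""

    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.Name) and node.id in params:
            return params[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCS
            and not node.keywords
            and len(node.args) >= 2
        ):
            return _FUNCS[node.func.id](*(walk(a) for a in node.args))
        raise ValueError(f"unsupported construct: {ast.dump(node)}")

    return walk(ast.parse(expr, mode="eval").body)


# The decimal formula from the opening post, evaluated exactly as written there.
params = {"p1": 10, "s1": 2, "p2": 5, "s2": 4}
precision = eval_type_expr("max(p1+p2,38)", params)
scale = eval_type_expr("min(s1,s2)", params)
```

Because the evaluator only walks a small whitelist of node types, the same expressions could in principle be translated to other targets by swapping the `walk` function for a code generator.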

I also looked at another operation that I remember having custom derivation at Dremio. It uses loops/lists.

I also wonder if we need to be able to throw errors for some conditions which (I guess) would mean that the specific compound parameters are unbindable to that function.

westonpace Sep 22, 2021
Maintainer

It's possible that parameterization takes us 90% of the way there and we can simply define functions for the remaining behaviors. For example:

take_first:
    parameters:
        - T
    args:
        - LIST<T>
    output:
        type: T

add:
    parameters:
        - P1
        - S1
        - P2
        - S2
    args:
        - Decimal<P1, S1>
        - Decimal<P2, S2>
    output:
        function_name: "DECIMAL_PROMOTE"
        args:
            - P1
            - S1
            - P2
            - S2

Then extensions are welcome to define whatever "type functions" they want but the actual arithmetic doesn't have to be expressed in the plan itself.
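A minimal Python sketch of this idea follows, assuming the signatures above are loaded into plain dictionaries and that type functions are looked up by name in a registry. The registry, the `resolve_output` helper, and in particular the body of `DECIMAL_PROMOTE` are hypothetical; the rule used here is only a placeholder copied from the formula in the opening post.

```python
from typing import Callable, Dict

# Hypothetical registry mapping type-function names to implementations.
TYPE_FUNCTIONS: Dict[str, Callable] = {}


def type_function(name: str):
    """Decorator that registers a type function under the given name."""
    def register(fn):
        TYPE_FUNCTIONS[name] = fn
        return fn
    return register


@type_function("DECIMAL_PROMOTE")
def decimal_promote(p1, s1, p2, s2):
    # Placeholder rule taken from the opening post's example; a real
    # extension would define its own promotion logic.
    return ("decimal", max(p1 + p2, 38), min(s1, s2))


# The two signatures from the comment, reduced to the fields used below.
TAKE_FIRST = {"parameters": ["T"], "output": {"type": "T"}}
ADD = {
    "parameters": ["P1", "S1", "P2", "S2"],
    "output": {
        "function_name": "DECIMAL_PROMOTE",
        "args": ["P1", "S1", "P2", "S2"],
    },
}


def resolve_output(signature: dict, bindings: dict):
    """Compute the output type once every parameter has been bound."""
    missing = [p for p in signature["parameters"] if p not in bindings]
    if missing:
        raise TypeError(f"unbound parameters: {missing}")
    out = signature["output"]
    if "type" in out:
        # Direct case: the output is a parameter or a literal type name.
        return bindings.get(out["type"], out["type"])
    fn = TYPE_FUNCTIONS[out["function_name"]]
    return fn(*(bindings[a] for a in out["args"]))


add_type = resolve_output(ADD, {"P1": 10, "S1": 2, "P2": 5, "S2": 4})
first_type = resolve_output(TAKE_FIRST, {"T": "i32"})
```

In this shape the plan carries only the function name, so a consumer without the `DECIMAL_PROMOTE` implementation cannot compute the output type on its own.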

jacques-n Sep 22, 2021
Maintainer Author

I'd like to avoid opaque functions in type derivation (at least for as long as possible). They make it hard for tools to figure out what a function means to a plan unless they have the implementation. I think my proposed expression capabilities are pretty narrow and solve most cases.

westonpace · 2021-09-18T02:45:40Z

westonpace
Sep 18, 2021
Maintainer

A similar (but maybe easier to handle) case is a kernel like drop_null or count which supports all input types (even those extension types not yet defined).

Come to think of it, do we need to state the fact that all types are potentially nullable anywhere? (hopefully that isn't contentious)

1 reply

westonpace Sep 18, 2021
Maintainer

Actually, I guess this is sort of the same as the list example.

westonpace · 2021-09-22T03:17:01Z

westonpace
Sep 22, 2021
Maintainer

It may not add much, but for completeness' sake I'd argue there may be some cases where parameterization is required to properly define input types as well. For example:

APPEND<T>(LIST<T>, T) -> LIST<T>

One of the more bizarre kernels in Arrow at the moment is case_when which is roughly defined as...

CASE_WHEN<T, N>(STRUCT<N1:boolean, N2:boolean, ..., NN:boolean>, T, T, ... T) -> T

...although I'm not sure why it can't be...

CASE_WHEN<T>(LIST<boolean>, T, T, ... T) -> T

Even with this form though, parameterization is still arguably needed to ensure all of the T args are the same type.
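The constraint can be sketched in a few lines of Python. This assumes the second form of CASE_WHEN above and an ad hoc encoding where a type is a name or a (constructor, argument) tuple; `case_when_output` is a hypothetical helper, not an Arrow or Substrait API.

```python
def case_when_output(arg_types):
    """Return T for CASE_WHEN<T>(LIST<boolean>, T, T, ... T), or raise if unbindable."""
    if len(arg_types) < 2:
        raise TypeError("expected a condition list and at least one value")
    conditions, *values = arg_types
    if conditions != ("list", "boolean"):
        raise TypeError(f"first argument must be list<boolean>, got {conditions}")
    t = values[0]  # the first value argument binds T
    for position, value in enumerate(values[1:], start=2):
        if value != t:
            raise TypeError(f"value {position} has type {value}, expected {t}")
    return t


output_type = case_when_output([("list", "boolean"), "i32", "i32", "i32"])
```

Without the shared parameter T, nothing in the signature would reject a call that mixes, say, i32 and string value arguments.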

1 reply

jacques-n Sep 22, 2021
Maintainer Author

I've proposed CASE as its own expression type to avoid any confusion/complexities with it. Having thought more about this and the comments here, I think we can start with something fairly simple. Take a look at PR here: #40
