Extend "unfold" operation and support it in the compiler plugin #742

Draft
wants to merge 5 commits into
base: master
from

Conversation

koperagen
Collaborator

@koperagen koperagen commented Jun 18, 2024

It covers two interesting use cases:

  1. Replace a column with multiple columns, potentially nested. It provides a DSL similar to add for that.
  2. More fine-grained toDataFrame. Instead of converting 20-30 properties to 2-3 levels of nesting all at once, the user can convert with toDataFrame(maxDepth = 0) and then unfold the required properties to whatever level they need.
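
The second use case can be pictured with a small toy model. The sketch below is plain Kotlin, not the DataFrame API: Obj and this unfold function are hypothetical stand-ins, and it assumes that a depth of 0 converts only the top-level properties while nested objects stay opaque until they are unfolded explicitly.

```kotlin
// Hypothetical stand-in for an arbitrary object with named properties (not a DataFrame type).
data class Obj(val props: Map<String, Any?>)

// Converts the properties of obj into a map (a toy "column group").
// Nested Obj values are converted too, but only maxDepth levels below the top;
// anything deeper is left as an opaque Obj for a later, targeted unfold.
fun unfold(obj: Obj, maxDepth: Int): Map<String, Any?> =
    obj.props.mapValues { (_, value) ->
        if (value is Obj && maxDepth > 0) unfold(value, maxDepth - 1) else value
    }
```

For an object whose property bb holds another Obj, unfold(a, 0) keeps bb as an Obj, while unfold(a, 1) turns it into a nested map.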

On the compiler plugin side, I will add support for the other overloads later, in a different PR.

…rt them as compilation errors. Add special constructor for errors that shouldn't be caught
Interpreters need an ability to pass arguments down to DSL, so introduce new "dsl" factory function
@koperagen koperagen added the enhancement New feature or request label Jun 18, 2024
@koperagen koperagen added this to the 0.14.0 milestone Jun 18, 2024
@koperagen koperagen self-assigned this Jun 18, 2024
Contributor

Generated sources will be updated after merging this PR.
Please inspect the changes in here.

@Jolanrensen Jolanrensen self-requested a review June 25, 2024 13:35
Collaborator

@Jolanrensen Jolanrensen left a comment

I'm wondering, why did you decide to put unfold after replace?
Unfold itself, by definition, already replaces a column with a new column. I think we're making the API more complicated than it needs to be by making our users write df.replace { a }.unfold {} instead of keeping it inside the unfold operation, like df.unfold { a }.by {}. You can even allow df.unfold(maxDepth = 2) { a }. It would keep "replace with" simple and "unfold" more powerful.

return when (kind()) {
    ColumnKind.Group, ColumnKind.Frame -> this
    else -> when {
        skipPrimitive && isPrimitive() -> this
Collaborator

I was very confused: how can you unfold a primitive? But the check is isPrimitive(), which can be true for a collection too... Can we rename isPrimitive() to something like isPrimitiveOrListLike()? unfold seems to be the only operation using it.

Collaborator Author

@koperagen koperagen Jun 26, 2024

You can't. Have a look at the unfold primitive test. skipPrimitive = false is needed to make it work, and skipPrimitive = true is needed to avoid unpacking, for example, a column of String into ColumnGroup, size: Int, the same as we do for toDataFrame with overloads.

Collaborator

Yes, but can you take a look at the isPrimitive() function? That function also returns true when you run it on a collection or an array. My suggestion was only to rename the isPrimitive() function.
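
To make the naming point concrete, here is a minimal sketch in plain Kotlin. It is not the library's implementation: the exact set of types the real isPrimitive() checks is not shown in this thread, so the branches below are assumptions, and shouldUnfold is a hypothetical helper that only mirrors the shape of the skipPrimitive && isPrimitive() condition in the diff.

```kotlin
// Hypothetical predicate; the types listed here are assumptions, not the library's actual list.
fun isPrimitiveOrListLike(value: Any?): Boolean = when (value) {
    null, is Number, is String, is Boolean, is Char -> true // scalar-like values
    is Iterable<*>, is Array<*> -> true                      // collections and object arrays
    else -> false
}

// Mirrors the condition under review: with skipPrimitive = true such values are left as-is.
fun shouldUnfold(value: Any?, skipPrimitive: Boolean): Boolean =
    !(skipPrimitive && isPrimitiveOrListLike(value))
```

With this naming, a reader of the call site can see that a list is skipped for the same reason as a String, which is the confusion the rename is meant to remove.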

public inline fun <reified T> DataColumn<T>.unfold(noinline body: CreateDataFrameDsl<T>.() -> Unit): AnyCol =
    unfoldImpl(skipPrimitive = false, body)

public inline fun <T, reified C> ReplaceClause<T, C>.unfold(vararg props: KProperty<*>, maxDepth: Int = 0): DataFrame<T> =
Collaborator

I think the name unfolding would read better, or byUnfolding/withUnfolded. replace {}.unfold {} doesn't read as a sentence anymore.

Collaborator

If possible, let's avoid gravitating toward natural language; I believe it's not a goal.

@koperagen
Collaborator Author

koperagen commented Jun 26, 2024

> Unfold itself, by definition, already replaces a column with a new column

What definition?
Originally I wanted to add a replace by function with AddDsl. Then I remembered that the toDataFrame DSL is pretty similar to AddDsl and that we already have an unfold function; it just lacks an overload with a DSL. Such an overload must be a multiplex operation, right?
unfold by is interesting, but it will probably have exactly one operation forever. Sounds good, right, but its semantics are no different from replace.
I'd rather have one entry point (replace) for situations when you want to, like, replace a column with a different one.

@koperagen koperagen requested a review from Jolanrensen June 26, 2024 14:32
@Jolanrensen
Collaborator

Jolanrensen commented Jun 26, 2024

> What definition? Originally I wanted to add a replace by function with AddDsl. Then I remembered that the toDataFrame DSL is pretty similar to AddDsl and that we already have an unfold function; it just lacks an overload with a DSL. Such an overload must be a multiplex operation, right? unfold by is interesting, but it will probably have exactly one operation forever. Sounds good, right, but its semantics are no different from replace. I'd rather have one entry point (replace) for situations when you want to, like, replace a column with a different one.

I meant the definition of "unfolding": you're unfolding a column with its contents, so its type is bound to change, i.e. the column is replaced.

I can see where you're coming from, but it still may be hard for users to have two different ways to unfold, namely .replace {}.unfold {} and .unfold(), where both work a little differently and take different arguments.

So I'd either:

  • make df.replace {}.byUnfolding {} (or a name like that) the only version of unfold and deprecate the old one
  • or make just df.unfold {} more powerful
  • or keep both for discoverability, but both should be equally powerful

wdyt?

@koperagen
Collaborator Author

I'd say df.unfold should stay, because the use case of simply unfolding a column of objects remains. It's worth adding a df.unfold(maxDepth = , roots = ) overload too; I missed it.

> So I'd either:
> or make just df.unfold {} more powerful

Please write this API with needed overloads and a few examples of its usage then

public inline fun <reified T> DataColumn<T>.unfold(noinline body: CreateDataFrameDsl<T>.() -> Unit): AnyCol =
    unfoldImpl(skipPrimitive = false, body)

public inline fun <T, reified C> ReplaceClause<T, C>.unfold(vararg props: KProperty<*>, maxDepth: Int = 0): DataFrame<T> =
Collaborator

Oh, we should also use KCallable instead of KProperty to support Java classes :)
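
The difference between the two parameter types can be shown without any DataFrame code. In the sketch below, Person and selectedNames are hypothetical names introduced only for illustration; the one fact relied on is that a function reference is a KCallable but not a KProperty, so a getter-style function is only accepted by the wider type.

```kotlin
import kotlin.reflect.KCallable

class Person(val name: String) {
    // Shaped like a Java-style getter: a function, not a Kotlin property.
    fun getAge(): Int = 42
}

// Hypothetical helper: accepts property and function references alike and returns their names.
fun selectedNames(vararg props: KCallable<*>): List<String> = props.map { it.name }
```

Calling selectedNames(Person::name, Person::getAge) compiles with this signature. If the parameter were declared as vararg props: KProperty<*>, the Person::getAge argument would be rejected by the compiler.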

fun `unfold properties`() {
    val col by columnOf(A("1", 123, B(3.0)))
    val df1 = dataFrameOf(col)
    val conv = df1.replace { col }.unfold(maxDepth = 2)
Collaborator

Not specifying maxDepth now breaks, while it worked before.
Try running df1.replace { col }.with { it.unfold() } before the PR and after it.
It works before, but now it gives: java.lang.UnsupportedOperationException: Can not get nested column 'd' from ValueColumn 'bb'

@Jolanrensen
Collaborator

> I'd say df.unfold should stay, because the use case of simply unfolding a column of objects remains. It's worth adding a df.unfold(maxDepth = , roots = ) overload too; I missed it.
>
> > So I'd either:
> > or make just df.unfold {} more powerful
>
> Please write this API with needed overloads and a few examples of its usage then

I made a little sample of unfold {}.by {} with the same options as your version. Both df.unfold {} and df.unfold {}.by {} can be used. See the commit here for more details:
20cab7a

@koperagen
Collaborator Author

koperagen commented Jul 1, 2024

UnfoldingDataFrame looks good, we can use it. Supporting it on the plugin side will require some changes, but that's OK. My only worry is that, with the return type being different from DataFrame, people will be tempted to call df.unfold { col }.by(). It's the first time an intermediate object in a multiplex operation is also a DataFrame. Or the opposite problem: typing df.unfold { col }. will list the entire DataFrame API, with by sitting somewhere in the completion list, no different from, let's say, filter.
But if we go this route, I'd also add
df.replace {}.by(CreateDataFrameDsl) (only this, without the vararg props: KProperty<*> and maxDepth: Int = 0 overloads)
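
The shape being discussed, an intermediate result that is itself a frame and offers an optional second step, can be sketched with a toy model. This is not the code from the linked commit: Frame, UnfoldingFrame, and defaultUnfold are hypothetical, the columns hold a single value each, and the default unfolding (wrapping the value in a one-entry group) is an assumption made only to keep the example small.

```kotlin
// Toy stand-in for a data frame: column name -> cell value (not the DataFrame API).
open class Frame(val columns: Map<String, Any?>) {
    // First step of the two-step operation; the result is itself a Frame.
    fun unfold(column: String): UnfoldingFrame = UnfoldingFrame(this, column)
}

// Hypothetical intermediate type: it already holds the default result, and by() can override it.
class UnfoldingFrame(private val source: Frame, private val column: String) :
    Frame(source.columns + (column to defaultUnfold(source.columns[column]))) {

    // Optional second step: replace the column with whatever the transform returns.
    fun by(transform: (Any?) -> Map<String, Any?>): Frame =
        Frame(source.columns + (column to transform(source.columns[column])))
}

// Assumed default for this sketch: wrap the value in a single-entry group.
fun defaultUnfold(value: Any?): Map<String, Any?> = mapOf("value" to value)
```

Because UnfoldingFrame extends Frame, frame.unfold("a") can be used directly as a result, and frame.unfold("a").by { ... } customizes it. The same inheritance is what puts every Frame member next to by in code completion, which is the discoverability concern raised above.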

@Jolanrensen
Collaborator

Jolanrensen commented Jul 1, 2024

@koperagen Yes, I couldn't find another way to have a notation with 2 selectors and the second one being optional while keeping the DataFrame DSL style :/ but indeed, it's something new. A bit like .recursively().

Actually, since by is defined on UnfoldingDataFrame it should appear quite high on the list:
[image: screenshot of the completion list]

We could also put it inside the class to make it even more discoverable.

Alternatively, we could change the return-type of unfold to something non-dataframe-ish and let people call
df.unfold { a }.byReplacing() or something.

Imagine that XD df.replace { a }.byUnfolding() and df.unfold { a }.byReplacing() and it would do the same.

replace {}.by {} also looks interesting :)


@Test
fun `unfold properties`() {
    val col by columnOf(A("1", 123, B(3.0)))
Collaborator

Is this case "More fine-grained toDataFrame. Instead of converting 20-30 properties to 2-3 level of nesting all at once user can choose to convert toDataFrame(maxDepth = 0) and unfold required properties to whatever level they need" covered here, in this test?

Collaborator Author

Technically, yes. I intend to have a more representative example as a part of compiler plugin demo. There's a tree of objects with many properties and potentially deep nesting from konsist library. It will be a good illustration. But here it merely unfolds one specific column up to 2 levels.

val a by columnOf("123")
val df = dataFrameOf(a)

val conv = df.replace { a }.unfold {
Collaborator

Could we use replace and unfold independently? If so, could you please add a test for this? Or, if they only work together, could they be combined into one function?

@zaleslaw
Collaborator

zaleslaw commented Jul 2, 2024

Honestly, I like the idea from the use case "More fine-grained toDataFrame. Instead of converting 20-30 properties to 2-3 level of nesting all at once user can choose to convert toDataFrame(maxDepth = 0) and unfold required properties to whatever level they need": defining a level for our unfolding (which is a special case of convert).

Also, I found that we lack docs/examples for this operation: https://kotlin.github.io/dataframe/unfold.html

Some crazy ideas

replace {}.by { ::unfold }
replace {}.by { ::unfoldDeeply }

@Jolanrensen Jolanrensen marked this pull request as draft July 23, 2024 12:22
@Jolanrensen
Collaborator

Since it's a WIP I made it a draft for now

@Jolanrensen Jolanrensen added the Compiler plugin Anything related to the DataFrame Compiler Plugin label Aug 8, 2024
@Jolanrensen
Collaborator

Just a thought :) We can actually have something like replace by unfolding without changing the replace API. Replace already works like replace with, meaning we can already write something like df.replace { "data"<ColumnGroup<*>>() }.with { it.unfold() } in the current state of the library. Maybe we could expand on DataColumn.unfold to provide the Add-DSL like notation you suggest :)

Something like:

df.replace { data }.with {
    it.unfold {
        "b" from { it }
        "c" from { DataRow.readJsonStr("""{"prop": 1}""") }
    }
}
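
For readers unfamiliar with the "name from { ... }" notation, the following plain-Kotlin sketch shows how such a builder can be put together with an infix extension function inside a receiver class. UnfoldDsl and unfoldValue are hypothetical names; this is only loosely modelled on the notation above and is not the library's CreateDataFrameDsl.

```kotlin
// Hypothetical receiver for the builder lambda (not the library's CreateDataFrameDsl).
class UnfoldDsl<T>(private val value: T) {
    val columns = linkedMapOf<String, Any?>()

    // Registers a new column whose value is computed from the original cell value.
    infix fun String.from(expression: (T) -> Any?) {
        columns[this] = expression(value)
    }
}

// Runs the builder against one value and returns the columns it declared, in declaration order.
fun <T> unfoldValue(value: T, body: UnfoldDsl<T>.() -> Unit): Map<String, Any?> =
    UnfoldDsl(value).apply(body).columns
```

A call such as unfoldValue("123") { "b" from { it }; "length" from { it.length } } then yields one entry per declared column, which is the Add-DSL-like behaviour the suggestion asks for.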

@koperagen koperagen modified the milestones: 0.14.0, 0.15.0 Sep 17, 2024
@zaleslaw zaleslaw removed this from the 0.15.0 milestone Jan 6, 2025
@zaleslaw zaleslaw added this to the 0.16.0 milestone Jan 6, 2025
Labels
Compiler plugin Anything related to the DataFrame Compiler Plugin enhancement New feature or request
3 participants