Research: ColumnDataHolder/primitive arrays #712

Draft
wants to merge 19 commits into
base: master
from

Conversation

Jolanrensen
Collaborator

@Jolanrensen Jolanrensen commented Jun 3, 2024

Fixes #30, one of our oldest issues.

I introduced ColumnDataHolder to replace the List in DataColumnImpl. This interface defines how the data of a column is stored.
ColumnDataHolderImpl is the default implementation; it stores data in primitive arrays whenever possible. Other implementations might be possible in the future as well (for instance, to make DF act on top of an existing DB).
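
To make the idea concrete, here is a minimal sketch of a storage abstraction that picks a primitive array when it can. It is not the code from this PR: ColumnStorage, ListStorage, DoubleArrayStorage and storageOf are hypothetical names, and only Double is handled.

```kotlin
// Hypothetical sketch: these names are illustrative and are not the PR's actual API.
interface ColumnStorage<T> {
    val size: Int
    operator fun get(index: Int): T
}

// Generic fallback: keeps (possibly boxed, possibly null) values in a List.
class ListStorage<T>(private val values: List<T>) : ColumnStorage<T> {
    override val size: Int get() = values.size
    override fun get(index: Int): T = values[index]
}

// Primitive-backed storage: values live in a DoubleArray and are boxed only on read.
class DoubleArrayStorage(private val values: DoubleArray) : ColumnStorage<Double> {
    override val size: Int get() = values.size
    override fun get(index: Int): Double = values[index]
}

// Chooses primitive storage "whenever possible": here, when every value is a non-null Double.
fun storageOf(values: List<Any?>): ColumnStorage<*> =
    if (values.isNotEmpty() && values.all { it is Double }) {
        DoubleArrayStorage(DoubleArray(values.size) { values[it] as Double })
    } else {
        ListStorage(values)
    }
```

With this sketch, storageOf(listOf(1.0, 2.5)) ends up in a DoubleArray, while a list containing a null or a non-Double value falls back to ListStorage.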

Things to be done:

@Jolanrensen Jolanrensen added the research and performance labels Jun 3, 2024
@Jolanrensen Jolanrensen added this to the Backlog milestone Jun 3, 2024
@Jolanrensen Jolanrensen self-assigned this Jun 3, 2024
@zaleslaw
Collaborator

zaleslaw commented Jun 4, 2024

A very interesting idea and a step forward for performance. Before going deep into the implementation, I suggest starting with a synthetically generated DataFrame with 1–10 columns of Ints, Longs, or similar (preferably all of the same type) and measuring the average time and memory footprint of a few performance-relevant operations.

I see how we could save memory, but I'm not sure about the speed of operations.

It would also be interesting to compare some non-default implementations, such as Multik or DirectByteBuffers.

Something like this:

import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.openjdk.jmh.annotations.*
import java.util.concurrent.TimeUnit

@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
open class DataFrameBenchmark {

    @Param("1", "2", "5", "10")
    var columnCount: Int = 0

    private lateinit var df: DataFrame<*>

    @Setup(Level.Trial)
    fun setup() {
        df = createDataFrame(columnCount, 1000000)
    }

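    // Builds `columnCount` columns of random doubles, named col1..colN.
    // A List is used for each column because dataFrameOf takes name-to-List pairs.
    private fun createDataFrame(columnCount: Int, rowCount: Int): DataFrame<*> {
        val columns = (1..columnCount).map { "col$it" to List(rowCount) { Math.random() } }
        return dataFrameOf(*columns.toTypedArray())
    }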

    @Benchmark
    fun filter(): DataFrame<*> {
        // Row values are untyped when accessed by name, so cast before comparing.
        return df.filter { (it["col1"] as Double) > 0.5 }
    }

    @Benchmark
    fun groupBy(): DataFrame<*> {
        return df.groupBy("col1").mean()
    }

    @Benchmark
    fun sortBy(): DataFrame<*> {
        return df.sortBy("col1")
    }
}

plugins {
    kotlin("jvm") version "---"
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("org.jetbrains.kotlinx:kotlinx-dataframe:---")
    implementation("org.openjdk.jmh:jmh-core:---")
    annotationProcessor("org.openjdk.jmh:jmh-generator-annprocess:---")
    testImplementation(kotlin("test"))
}

./gradlew jmh

@Jolanrensen Jolanrensen mentioned this pull request Jul 30, 2024
@Jolanrensen
Collaborator Author

Jolanrensen commented Aug 14, 2024

I've been doing some small performance/size tests here and I've come to the following results:

50k rows

⌌---------------------------------------------------------------------⌍
|  |                  type|   creation|  processing|      size (bytes)|
|--|----------------------|-----------|------------|------------------|
| 0| BOXED_ARRAY_WITH_NULL| 1.668690ms| 40.072489ms| 14,500,481.813333|
| 1|        LIST_WITH_NULL| 9.142612ms| 41.064332ms| 14,509,001.813333|
| 2|                  LIST| 2.710987ms| 42.268814ms| 11,496,455.760000|
| 3|           BOXED_ARRAY| 2.415740ms| 42.270087ms| 11,502,541.520000|
| 4|          DOUBLE_ARRAY| 1.840757ms| 42.354001ms| 11,499,172.666667|
⌎---------------------------------------------------------------------⌏

In terms of creation, the two fastest are Array<Double?> and DoubleArray.
It makes sense that Array<Double> and List<Double> are slower in this setup, as they're converted to a DoubleArray in the ColumnDataHolder. Why List<Double?> is that much slower, I honestly don't know; it's not converted to anything, it's just stored as another list.

Processing speed is interesting: the arrays are the slowest of the bunch. I assume this is because they are stored as primitive arrays and need to box all their values before anything can be calculated with them. This also shows in the null-containing columns, which are processed the quickest.
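
The boxing cost mentioned above can be illustrated with a small sketch. Both functions compute the same sum, but the second one reads through the generic List<Double> interface, so each element is handed out as a boxed Double. The function names are made up for this example, and how much of the boxing the JIT manages to remove depends on the JVM.

```kotlin
// Sums a DoubleArray directly: values stay primitive, no object per element.
fun sumPrimitive(values: DoubleArray): Double {
    var total = 0.0
    for (v in values) total += v
    return total
}

// Sums through the generic List<Double> interface: every get() returns a boxed
// java.lang.Double that is unboxed again right away.
fun sumThroughList(values: List<Double>): Double {
    var total = 0.0
    for (i in values.indices) total += values[i]
    return total
}
```

For example, sumThroughList(DoubleArray(1_000) { it * 0.5 }.asList()) reads a primitive array through a boxing List view, which is roughly the access pattern described here.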

In terms of size, however, the world is flipped. All dataframes contain 3 columns of 50k rows. The ones that use primitive arrays to store their doubles hover around 11.5MB for the entire dataframe, while the others use about 14.5MB. This difference will likely increase with more rows.
(Edit: yep, having 100k rows results in 12.7MB / 18.7MB, and 1M rows in 34.3MB / 94.3MB, a relatively larger difference)
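
As a rough check of how the gap scales, the sketch below divides the size difference between boxed and primitive storage by the number of stored values, using the averages from the tables in this comment. It assumes every test dataframe has 3 Double columns (stated only for the 50k case); the function and constant names are illustrative.

```kotlin
// Assumption: every test dataframe has 3 Double columns, as stated for the 50k case.
const val COLUMNS = 3

// Extra bytes that boxed storage spends per stored value.
fun extraBytesPerValue(boxedSize: Double, primitiveSize: Double, rows: Int): Double =
    (boxedSize - primitiveSize) / (rows.toLong() * COLUMNS)

// Average sizes in bytes, copied from the tables in this comment.
val at50k = extraBytesPerValue(14_500_481.813333, 11_499_172.666667, 50_000)   // BOXED_ARRAY_WITH_NULL vs DOUBLE_ARRAY
val at100k = extraBytesPerValue(18_682_120.08, 12_682_198.026667, 100_000)     // LIST_WITH_NULL vs DOUBLE_ARRAY
val at1M = extraBytesPerValue(94_282_104.0, 34_282_184.0, 1_000_000)           // LIST_WITH_NULL vs LIST
```

With these inputs, all three come out at roughly 20 extra bytes per value, so the absolute gap grows linearly with the row count.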

Running the same tests on the master-branch we get:

⌌--------------------------------------------------------------⌍
|  |           type|   creation|  processing|      size (bytes)|
|--|---------------|-----------|------------|------------------|
| 0|           LIST| 2.610231ms| 49.266917ms| 14,092,495.866667|
| 1| LIST_WITH_NULL| 7.030812ms| 51.990542ms| 14,091,560.133333|
⌎--------------------------------------------------------------⌏

similar results, that's good :)

100k rows

Running the same tests again, but now with 100k rows:

⌌---------------------------------------------------------------------⌍
|  |                  type|   creation|  processing|      size (bytes)|
|--|----------------------|-----------|------------|------------------|
| 0|        LIST_WITH_NULL| 4.998340ms| 92.144092ms| 18,682,120.080000|
| 1| BOXED_ARRAY_WITH_NULL| 7.814503ms| 94.205205ms| 18,681,227.733333|
| 2|          DOUBLE_ARRAY| 2.578687ms| 95.036875ms| 12,682,198.026667|
| 3|                  LIST| 5.750163ms| 96.867539ms| 12,682,224.000000|
| 4|           BOXED_ARRAY| 5.091461ms| 99.180867ms| 12,682,224.000000|
⌎---------------------------------------------------------------------⌏

and on master:

⌌-----------------------------------------------------------⌍
|  |           type|    creation|  processing|  size (bytes)|
|--|---------------|------------|------------|--------------|
| 0|           LIST|  4.411068ms| 85.198946ms| 18,292,344.00|
| 1| LIST_WITH_NULL| 11.444154ms| 85.863901ms| 18,291,437.92|
⌎-----------------------------------------------------------⌏

We can see that the size gap increases the more rows you have. The processing time difference also changes, but not that much.

1M rows

Finally, running with 1M rows (and fewer runs, so the results might be less accurate):

⌌-------------------------------------------------------------------⌍
|  |                  type|     creation|   processing| size (bytes)|
|--|----------------------|-------------|-------------|-------------|
| 0|        LIST_WITH_NULL|  84.336402ms| 888.432970ms| 94,282,104.0|
| 1| BOXED_ARRAY_WITH_NULL|  73.450625ms| 900.390398ms| 94,286,628.0|
| 2|           BOXED_ARRAY|  65.732071ms| 925.527278ms| 34,282,184.0|
| 3|                  LIST|  84.736371ms| 932.813523ms| 34,282,184.0|
| 4|          DOUBLE_ARRAY| 174.996687ms| 1.020323820s| 34,255,380.8|
⌎-------------------------------------------------------------------⌏

and on master:

⌌------------------------------------------------------------⌍
|  |           type|     creation|   processing| size (bytes)|
|--|---------------|-------------|-------------|-------------|
| 0|           LIST|  63.146449ms| 769.020269ms|   93,892,320|
| 1| LIST_WITH_NULL| 185.790897ms| 822.444663ms|   93,863,672|
⌎------------------------------------------------------------⌏

It looks like the size difference only becomes more extreme as the row count grows, while the processing time difference is also there but less pronounced. I'm not sure what the wisest way forward is.

5M rows

⌌--------------------------------------------------------------------⌍
|  |                  type|     creation|   processing|  size (bytes)|
|--|----------------------|-------------|-------------|--------------|
| 0|          DOUBLE_ARRAY| 175.116960ms| 4.861449108s| 130,952,232.8|
| 1|        LIST_WITH_NULL| 732.860339ms| 4.905113139s| 431,018,416.8|
| 2|                  LIST| 658.093769ms| 4.995039048s| 130,916,814.4|
| 3| BOXED_ARRAY_WITH_NULL| 548.754027ms| 5.073622519s| 430,984,516.8|
| 4|           BOXED_ARRAY| 581.207400ms| 5.330943608s| 130,936,416.0|
⌎--------------------------------------------------------------------⌏

Now primitive arrays are king!

10M rows

⌌-----------------------------------------------------------⌍
|  |         type|     creation|   processing|  size (bytes)|
|--|-------------|-------------|-------------|--------------|
| 0| DOUBLE_ARRAY| 394.566783ms| 9.092918960s| 250,284,296.0|
| 1|         LIST| 1.024305912s| 9.254395530s| 250,284,256.8|
| 2|  BOXED_ARRAY|    1.065674s| 9.294058824s| 250,276,294.4|
⌎-----------------------------------------------------------⌏

Ultimately, primitive arrays are a must: increasing to 6M rows already crashes the null-containing columns with OOM errors, while the primitive arrays are fine.
Still, we would need a way to store null values, but that could probably be done in some clever way. However, this would decrease performance for smaller dataframes.

@Jolanrensen
Collaborator Author

I've updated ColumnDataHolder so that it can store a collection/array of nullable primitives:

  • nulls are filled with 0 (or the zero-equivalent of that type)
  • indices of where the nulls were are stored in an IntArray
  • the null-filled data is stored in a primitive array
  • ColumnDataHolder implements List and returns null at the stored null indices and the actual value everywhere else (needs some more tests)
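
The scheme in the list above could look roughly like this. It is a sketch for Double only, not the PR's ColumnDataHolderImpl: NullableDoubleStorage and its members are hypothetical names, and the null indices are assumed to be kept sorted so that a binary search can be used.

```kotlin
// Hypothetical sketch of the scheme above; not the PR's actual implementation.
class NullableDoubleStorage private constructor(
    private val data: DoubleArray,      // nulls are replaced by 0.0 here
    private val nullIndices: IntArray,  // sorted positions of the original nulls
) : AbstractList<Double?>() {

    override val size: Int get() = data.size

    // Binary search is valid because nullIndices is built in ascending order.
    override fun get(index: Int): Double? =
        if (nullIndices.binarySearch(index) >= 0) null else data[index]

    companion object {
        fun of(values: List<Double?>): NullableDoubleStorage {
            val data = DoubleArray(values.size) { values[it] ?: 0.0 }
            val nullIndices = values.indices.filter { values[it] == null }.toIntArray()
            return NullableDoubleStorage(data, nullIndices)
        }
    }
}
```

Because the class extends AbstractList, it can be passed anywhere a List<Double?> is expected, for example NullableDoubleStorage.of(listOf(1.5, null, 3.0)).filterNotNull().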

10M rows

⌌---------------------------------------------------------------------⌍
|  |                  type|     creation|    processing|  size (bytes)|
|--|----------------------|-------------|--------------|--------------|
| 0|          DOUBLE_ARRAY| 452.407441ms| 10.661419664s| 250,303,579.2|
| 1| BOXED_ARRAY_WITH_NULL| 1.246602198s| 10.876937912s| 250,303,867.2|
| 2|                  LIST| 1.075708642s| 10.987466189s| 250,303,651.2|
| 3|           BOXED_ARRAY| 1.109656324s| 11.206449292s| 250,308,171.2|
| 4|        LIST_WITH_NULL| 1.878721075s| 11.211828024s| 250,294,786.4|
⌎---------------------------------------------------------------------⌏

@Jolanrensen
Collaborator Author

The previous tests only contained a single null value, which seems a bit wrong. I now have tests where half of the values are null in the BOXED_ARRAY_WITH_NULL and LIST_WITH_NULL tests.

I also experimented using a boolean array to store null-indices instead of int arrays. The results are the following:

10M rows, lots of nulls, boolean array

⌌---------------------------------------------------------------------⌍
|  |                  type|     creation|    processing|  size (bytes)|
|--|----------------------|-------------|--------------|--------------|
| 0|        LIST_WITH_NULL| 1.896135155s| 12.906753380s| 280,393,248.0|
| 1| BOXED_ARRAY_WITH_NULL| 1.622306469s| 13.093053168s| 280,320,763.2|
| 2|          DOUBLE_ARRAY| 535.327248ms| 13.494416201s| 280,330,497.6|
| 3|                  LIST| 1.395451763s| 13.647339781s| 280,372,962.4|
| 4|           BOXED_ARRAY| 1.240805238s| 14.096025326s| 280,339,035.2|
⌎---------------------------------------------------------------------⌏

10M rows, lots of nulls, int array

⌌---------------------------------------------------------------------⌍
|  |                  type|     creation|    processing|  size (bytes)|
|--|----------------------|-------------|--------------|--------------|
| 0|          DOUBLE_ARRAY| 472.084569ms| 13.341951593s| 250,313,040.0|
| 1|                  LIST| 1.395223809s| 13.447386786s| 250,312,961.6|
| 2| BOXED_ARRAY_WITH_NULL| 1.672050297s| 13.528234068s| 310,318,894.4|
| 3|           BOXED_ARRAY| 1.379209011s| 13.646054496s| 250,312,883.2|
| 4|        LIST_WITH_NULL| 2.950703003s| 14.230182141s| 310,293,660.8|
⌎---------------------------------------------------------------------⌏

As we can see, the performance differences are negligible but the difference in size is not.

Using boolean arrays results in the same size for every case, whereas with int arrays the size depends on the number of nulls. Since the number of nulls in everyday cases is quite low, I think we should prefer the latter.
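
The size difference follows from simple arithmetic, sketched below. It assumes one byte per BooleanArray element and four bytes per IntArray element (typical for HotSpot), ignores array headers, and uses made-up function names.

```kotlin
// Assumptions: 1 byte per BooleanArray element, 4 bytes per IntArray element,
// array headers ignored.
const val BYTES_PER_BOOLEAN = 1L
const val BYTES_PER_INT = 4L

// A boolean mask has one entry per row, no matter how many nulls there are.
fun booleanMaskBytes(rows: Long): Long = rows * BYTES_PER_BOOLEAN

// An index array has one entry per null value only.
fun intIndexBytes(rows: Long, nullFraction: Double): Long =
    (rows * nullFraction).toLong() * BYTES_PER_INT

// Int indices are smaller while fewer than 1 in 4 values is null.
fun intIndicesAreSmaller(nullFraction: Double): Boolean =
    nullFraction * BYTES_PER_INT < BYTES_PER_BOOLEAN
```

For 10M rows and 3 columns, this gives 30 MB for boolean masks regardless of content, and 60 MB for int indices when half of the values are null. That is in line with the 280 MB and 310 MB totals in the tables above next to the 250 MB baseline. Under these assumptions, int indices are the smaller option as long as fewer than a quarter of the values are null.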

Contributor

Generated sources will be updated after merging this PR.
Please inspect the changes in here.

@Jolanrensen
Collaborator Author

Jolanrensen commented Aug 20, 2024

I've continued my investigations. I wanted to see whether it would make a difference if the ColumnDataHolder could also be used to collect the data that will make up new columns (which is what (Typed)ColumnDataCollector does).
This required migrating to the primitive ArrayLists of fastutil.

I made a wrapper around all primitive ArrayLists (which, yes, (un)boxes the values it works with, but at least stores them primitively) that can be used as a normal mutable list. If no type is given, the first element added dictates it. Null values are not allowed.
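
A stripped-down sketch of such a wrapper is shown below. It is a hypothetical stand-in, not the PR's class: instead of wrapping fastutil's primitive ArrayLists it uses hand-rolled growable arrays so that it runs on the standard library alone, it only covers Int and Double, and PrimitiveListSketch and its helpers are invented names.

```kotlin
// Hypothetical stand-in for the wrapper described above (Int and Double only).
class PrimitiveListSketch : AbstractMutableList<Any>() {
    private var ints: IntArray? = null        // used when the first element is an Int
    private var doubles: DoubleArray? = null  // used when the first element is a Double
    private var count = 0

    override val size: Int get() = count

    // Values are stored primitively but boxed again when read through the List API.
    override fun get(index: Int): Any {
        if (index !in 0 until count) throw IndexOutOfBoundsException("index $index, size $count")
        return ints?.get(index) ?: doubles!![index]
    }

    override fun set(index: Int, element: Any): Any {
        val old = get(index)
        requireMatchingType(element)
        write(index, element)
        return old
    }

    override fun add(index: Int, element: Any) {
        if (index !in 0..count) throw IndexOutOfBoundsException("index $index, size $count")
        if (ints == null && doubles == null) {
            // The first element dictates the storage type.
            when (element) {
                is Int -> ints = IntArray(8)
                is Double -> doubles = DoubleArray(8)
                else -> throw IllegalArgumentException("unsupported type ${element::class.simpleName}")
            }
        }
        requireMatchingType(element)
        // Grow when full, then shift the tail one slot to the right.
        ints?.let { a ->
            ints = (if (count == a.size) a.copyOf(a.size * 2) else a)
                .also { it.copyInto(it, index + 1, index, count) }
        }
        doubles?.let { a ->
            doubles = (if (count == a.size) a.copyOf(a.size * 2) else a)
                .also { it.copyInto(it, index + 1, index, count) }
        }
        write(index, element)
        count++
    }

    override fun removeAt(index: Int): Any {
        val old = get(index)
        // Shift the tail one slot to the left, over the removed element.
        ints?.let { it.copyInto(it, index, index + 1, count) }
        doubles?.let { it.copyInto(it, index, index + 1, count) }
        count--
        return old
    }

    private fun requireMatchingType(element: Any) {
        val matches = if (ints != null) element is Int else element is Double
        require(matches) { "type mismatch: ${element::class.simpleName}" }
    }

    private fun write(index: Int, element: Any) {
        val a = ints
        if (a != null) a[index] = element as Int else doubles!![index] = element as Double
    }
}
```

Since the element type is the non-nullable Any, nulls are rejected at compile time, and adding an Int to a list that started with a Double throws an IllegalArgumentException in this sketch.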

ColumnDataHolderImpl now takes this PrimitiveArrayList wrapper as its base list, but it can fall back to a normal ArrayList/MutableList when needed. The strictTypes argument can be set to true to prevent this fallback, but usually it's allowed.
If the primitive ArrayList is in use, nullIndices keeps track of the null values in the data and zeroValue is put in the primitive ArrayList in their place.

ColumnDataCollector now uses ColumnDataHolder to collect its values. When the result of this is turned into a column, no conversion is necessary, since it's a ColumnDataHolder already :).

And now for the test results:

5M rows, mutable ColumnDataHolder, also in Column Data Collector

⌌-------------------------------------------------------------------⌍
|  |                  type|     creation|   processing| size (bytes)|
|--|----------------------|-------------|-------------|-------------|
| 0| BOXED_ARRAY_WITH_NULL| 2.954687548s| 2.743710069s|  372,773,032|
| 1|             COLLECTOR| 5.111297654s| 2.992183792s|  380,323,708|
| 2|        LIST_WITH_NULL| 3.941255533s| 3.159062938s|  372,454,737|
| 3|         NON_PRIMITIVE| 3.735666737s| 3.330158776s|  492,572,326|
| 4|          DOUBLE_ARRAY| 288.513615ms| 8.085407327s|  132,513,736|
| 5|                  LIST| 744.363739ms| 9.264420569s|  132,590,481|
| 6|           BOXED_ARRAY| 705.545507ms| 9.365372480s|  132,505,768|
⌎-------------------------------------------------------------------⌏

and on master:

⌌------------------------------------------------------------⌍
|  |           type|     creation|   processing| size (bytes)|
|--|---------------|-------------|-------------|-------------|
| 0|      COLLECTOR| 2.370049848s| 1.794743904s|  254,129,245|
| 1| LIST_WITH_NULL| 780.404256ms| 2.071586752s|  250,365,329|
| 2|  NON_PRIMITIVE| 595.397324ms| 3.481924785s|  250,349,159|
| 3|           LIST| 483.394463ms| 8.160207220s|  430,403,480|
⌎------------------------------------------------------------⌏

Let's look at the sizes first, as those results are the easiest to explain. Using ColumnDataHolder (CDH), we get the smallest sizes for non-null values: 132MB versus 430MB on master. This does come with a performance hit. It seems like it's cheaper to query for nulls than to retrieve and unbox a value from a primitive array. Interesting... This is not a negative result, however; it's simply heavier to retrieve actual values. We'd need another test to give a proper view of the performance impact.

In terms of size, storing the indices for lots of nulls and using a primitive array can actually be larger than simply keeping a boxed list: 372MB versus 250MB on master. I guess keeping nulls in a list is cheaper than keeping a set of indices. We'd need to find a sweet spot here.

Labels
performance Something related to how fast the library can handle data research This requires a deeper dive to gather a better understanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add primitive arrays column wrappers
2 participants