Research: ColumnDataHolder / primitive arrays #712

base: master

Conversation
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/columns/DataColumnImpl.kt
Very interesting idea and a nice performance step forward. Before a deep implementation, I suggest starting with a synthetic generated DataFrame of 1–10 columns of `Int`s, `Long`s, or similar (ideally all of the same type) and measuring the average time/memory footprint of some common operations. I can see how we could save on memory, but I'm not sure about the speed of operations. It would also be interesting to compare some non-default implementations, such as Multik or DirectByteBuffers. Something like this:

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.openjdk.jmh.annotations.*
import java.util.concurrent.TimeUnit

@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
open class DataFrameBenchmark {

    @Param("1", "2", "5", "10")
    var columnCount: Int = 0

    private lateinit var df: DataFrame<*>

    @Setup(Level.Trial)
    fun setup() {
        df = createDataFrame(columnCount, 1_000_000)
    }

    // Builds a DataFrame of `columnCount` Double columns filled with random values.
    private fun createDataFrame(columnCount: Int, rowCount: Int): DataFrame<*> {
        val columns = (1..columnCount).map { "col$it" to List(rowCount) { Math.random() } }
        return dataFrameOf(*columns.toTypedArray())
    }

    @Benchmark
    fun filter(): DataFrame<*> = df.filter { "col1"<Double>() > 0.5 }

    @Benchmark
    fun groupBy(): DataFrame<*> = df.groupBy("col1").mean()

    @Benchmark
    fun sortBy(): DataFrame<*> = df.sortBy("col1")
}
```
I've been doing some small performance/size tests here and I've come to the following results:

**50k rows**

*(results table)*

In terms of creation, the two fastest are *(see table)*. Processing speed is interesting: the arrays are the slowest of the bunch. I assume this is because they are stored as primitive arrays and need to box all their values before anything can be calculated with them. This can also be seen in the *(table)*.

In terms of size, however, the world is flipped. All dataframes contain 3 columns of 50k rows. The ones that use primitive arrays to store their doubles hover around 11.5MB for the entire dataframe, while the others use about 14.5MB. This difference will likely increase with more rows.

Running the same tests on the master branch gives similar results, which is good :)

**100k rows**

Running the same tests again, but now with 100k rows *(results table)*, and on master *(results table)*: we can see that the size gap increases the more rows you have. The processing time difference also changes, but not by much.

**1M rows**

Finally, running with 1M rows (and with fewer runs, so the results might be less accurate) *(results table)*, and on master *(results table)*: it looks like the size difference only becomes more and more extreme, while the processing time difference is also there but less pronounced. I'm not sure what's the wisest way forward.

**5M rows**

*(results table)*

Now primitive arrays are king!

**10M rows**

*(results table)*

Eventually, primitive arrays are a must: increasing to 6M rows even crashes the null-containing columns with OOM errors, while the primitive arrays are fine.
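The boxing suspected above can be illustrated in isolation: values stored in a primitive `DoubleArray` but read through a generic interface (here, a `List<Double>` view) must be boxed on every access. This is a minimal sketch, independent of the DataFrame API; the class name is made up for illustration:

```kotlin
// Values stored primitively but exposed through a generic List<Double> view:
// every get() autoboxes the primitive double into a java.lang.Double.
class PrimitiveBackedList(private val data: DoubleArray) : AbstractList<Double>() {
    override val size: Int get() = data.size
    override fun get(index: Int): Double = data[index] // boxing happens here
}

fun main() {
    val data = DoubleArray(5) { it * 1.5 } // [0.0, 1.5, 3.0, 4.5, 6.0]
    val view: List<Double> = PrimitiveBackedList(data)

    // The generic API works as usual, but each element access goes through boxing.
    println(view.sum())               // 15.0
    println(view.filter { it > 2.0 }) // [3.0, 4.5, 6.0]
}
```

This is why a primitive-backed column can be both smaller and slower than a boxed one: the storage is compact, but generic operations pay a per-element boxing cost.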
I've updated *(…)*:

**10M rows**

*(results table)*
The previous tests only contained a single *(…)*. I also experimented with using a boolean array (a null mask) to store nulls instead of an int array of null indices. The results are the following:

**10M rows, lots of nulls, boolean array**

*(results table)*

**10M rows, lots of nulls, int array**

*(results table)*

As we can see, the performance differences are negligible, but the difference in size is not. Using boolean arrays results in the same size in every case, while with int arrays the size depends on the number of nulls. Since the number of nulls in everyday cases is quite low, I think we should prefer the latter (int arrays).
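The trade-off between the two null-tracking strategies can be sketched with back-of-the-envelope arithmetic. Assuming a `BooleanArray` costs ~1 byte per element and an `IntArray` ~4 bytes per element on HotSpot (these constants are assumptions, and object headers are ignored), the int-index approach wins whenever fewer than about a quarter of the rows are null:

```kotlin
// Rough size model for the two null-tracking strategies discussed above.
fun maskBytes(rowCount: Int): Int = rowCount        // one Boolean (~1 byte) per row
fun indexBytes(nullCount: Int): Int = nullCount * 4 // one Int (4 bytes) per null

fun intIndicesAreSmaller(rowCount: Int, nullCount: Int): Boolean =
    indexBytes(nullCount) < maskBytes(rowCount)

fun main() {
    // 10M rows, 1% nulls: indices need ~0.4MB, a mask ~10MB.
    println(intIndicesAreSmaller(10_000_000, 100_000))   // true
    // 10M rows, 50% nulls: indices need ~20MB, a mask still ~10MB.
    println(intIndicesAreSmaller(10_000_000, 5_000_000)) // false
}
```

Under this model the break-even point sits at roughly 25% nulls, which supports preferring int arrays for the common low-null case.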
Generated sources will be updated after merging this PR.
I've furthered my investigations. I wanted to see whether it would make a difference if the `ColumnDataHolder` could also be used to collect data that will make up new columns (what `(Typed)ColumnDataCollector` does). I made a wrapper around all primitive array lists (which, yes, (un)boxes the values it works with, but at least stores them primitively) that can be used as a normal mutable list. If given no type, the first element added dictates it. Null values are not allowed. `ColumnDataHolderImpl` now takes this *(…)*, and `ColumnDataCollector` now uses *(…)*. And now for the test results:

**5M rows, mutable ColumnDataHolder, also in ColumnDataCollector**

*(results table)*

and on master:

*(results table)*
Let's look at the sizes first, as that result is the easier to explain. Using *(…)*. In terms of size, storing the indices for lots of nulls alongside a primitive array can actually be larger than simply keeping a boxed list: 372MB versus 250MB. I guess keeping nulls in a list is cheaper than keeping a set of their indices. We'd need to find a sweet spot here.
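That sweet spot can be estimated with a very rough size model. Assuming compressed oops (4-byte references), ~16 bytes per boxed `java.lang.Double`, 8 bytes per primitive double slot, and 4 bytes per int index (all assumed constants, not measurements):

```kotlin
// Very rough size model for the boxed-vs-primitive trade-off described above.
const val REF_BYTES = 4               // assumed: compressed oops
const val BOXED_DOUBLE_BYTES = 16     // assumed: header + payload of java.lang.Double
const val PRIMITIVE_DOUBLE_BYTES = 8
const val INT_BYTES = 4

// Boxed storage: one reference per row, one Double object per non-null value
// (null entries only cost the reference).
fun boxedListBytes(rows: Long, nulls: Long): Long =
    rows * REF_BYTES + (rows - nulls) * BOXED_DOUBLE_BYTES

// Primitive storage: one double slot per row plus an int index per null.
fun primitiveWithIndicesBytes(rows: Long, nulls: Long): Long =
    rows * PRIMITIVE_DOUBLE_BYTES + nulls * INT_BYTES

fun main() {
    val rows = 5_000_000L
    // Few nulls: primitive arrays win comfortably.
    println(primitiveWithIndicesBytes(rows, 50_000) < boxedListBytes(rows, 50_000))
    // Mostly nulls: the boxed list becomes the smaller of the two, matching
    // the direction of the 372MB-vs-250MB observation above.
    println(primitiveWithIndicesBytes(rows, 4_500_000) < boxedListBytes(rows, 4_500_000))
}
```

Under this model the two representations break even around 60% nulls, so a holder could plausibly pick its representation based on the observed null ratio.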
Fixes #30, one of our oldest issues.

I introduced `ColumnDataHolder` to replace the `List` in `DataColumnImpl`. This interface can define how the data of columns is stored. `ColumnDataHolderImpl` was created as the default implementation, and it defaults to storing data in primitive arrays whenever possible. Other implementations might be possible in the future as well (to make DataFrame act on top of an existing DB, for instance).

Things to be done:

- Use `ColumnDataHolder`s directly wherever possible instead of `List`s.
- `DataColumnImpl.type` mismatches `DataColumnImpl.values`: ☂ `type: KType` in `DataColumnImpl` mismatches actual values sometimes #713