Research: ColumnDataHolder / primitive arrays #712

base: master

Conversation
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/columns/DataColumnImpl.kt
Very interesting idea and a nice performance step forward. Before a deep implementation, I suggest starting with a synthetic generated DataFrame of 1–10 columns of `Int`s, `Long`s, or similar (ideally all of the same type) and measuring the average time/memory footprint of some common operations. I can see how we could save on memory, but I'm not sure about the speed of operations. It would also be interesting to compare some non-default implementations, such as Multik or DirectByteBuffers. Something like this:

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.openjdk.jmh.annotations.*
import java.util.concurrent.TimeUnit

@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
open class DataFrameBenchmark {

    @Param("1", "2", "5", "10")
    var columnCount: Int = 0

    private lateinit var df: DataFrame<*>

    @Setup(Level.Trial)
    fun setup() {
        df = createDataFrame(columnCount, 1_000_000)
    }

    // Builds a DataFrame of `columnCount` Double columns filled with random values.
    private fun createDataFrame(columnCount: Int, rowCount: Int): DataFrame<*> {
        val columns = (1..columnCount).map { "col$it" to List(rowCount) { Math.random() } }
        return dataFrameOf(*columns.toTypedArray())
    }

    @Benchmark
    fun filter(): DataFrame<*> = df.filter { "col1"<Double>() > 0.5 }

    @Benchmark
    fun groupBy(): DataFrame<*> = df.groupBy("col1").mean()

    @Benchmark
    fun sortBy(): DataFrame<*> = df.sortBy("col1")
}
```
I've been doing some small performance/size tests here and I've come to the following results:

**50k rows**

*(results table)*

In terms of creation, the two fastest are *(see table)*. Processing speed is interesting: the arrays are the slowest of the bunch. I assume this is because they are stored as primitive arrays and need to box all their values before anything can be calculated with them. This can also be seen in the *(table)*.

In terms of size, however, the world is flipped. All dataframes contain 3 columns of 50k rows. The ones that use primitive arrays to store their doubles hover around 11.5MB for the entire dataframe, while the others use about 14.5MB. This difference will likely increase with more rows.

Running the same tests on the master branch gives similar results, which is good :)

**100k rows**

Running the same tests again, but now with 100k rows *(results table)*, and on master *(results table)*: we can see that the size gap increases the more rows you have. The processing time difference also changes, but not by much.

**1M rows**

Finally, running with 1M rows (and with fewer runs, so the results might be less accurate) *(results table)*, and on master *(results table)*: it looks like the size difference only becomes more and more extreme, while the processing time difference is also there but less pronounced. I'm not sure what's the wisest way forward.

**5M rows**

*(results table)*

Now primitive arrays are king!

**10M rows**

*(results table)*

Eventually, primitive arrays are a must: increasing to 6M rows even crashes the null-containing columns with OOM errors, while the primitive arrays are fine.
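The boxing suspected above can be illustrated in isolation: values stored in a primitive `DoubleArray` but read through a generic interface (here, a `List<Double>` view) must be boxed on every access. This is a minimal sketch, independent of the DataFrame API; the class name is made up for illustration:

```kotlin
// Values stored primitively but exposed through a generic List<Double> view:
// every get() autoboxes the primitive double into a java.lang.Double.
class PrimitiveBackedList(private val data: DoubleArray) : AbstractList<Double>() {
    override val size: Int get() = data.size
    override fun get(index: Int): Double = data[index] // boxing happens here
}

fun main() {
    val data = DoubleArray(5) { it * 1.5 } // [0.0, 1.5, 3.0, 4.5, 6.0]
    val view: List<Double> = PrimitiveBackedList(data)

    // The generic API works as usual, but each element access goes through boxing.
    println(view.sum())               // 15.0
    println(view.filter { it > 2.0 }) // [3.0, 4.5, 6.0]
}
```

This is why a primitive-backed column can be both smaller and slower than a boxed one: the storage is compact, but generic operations pay a per-element boxing cost.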
I've updated *(…)*:

**10M rows**

*(results table)*
The previous tests only contained a single *(…)*. I also experimented with using a boolean array (a null mask) to store nulls instead of an int array of null indices. The results are the following:

**10M rows, lots of nulls, boolean array**

*(results table)*

**10M rows, lots of nulls, int array**

*(results table)*

As we can see, the performance differences are negligible, but the difference in size is not. Using boolean arrays results in the same size in every case, while with int arrays the size depends on the number of nulls. Since the number of nulls in everyday cases is quite low, I think we should prefer the latter (int arrays).
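The trade-off between the two null-tracking strategies can be sketched with back-of-the-envelope arithmetic. Assuming a `BooleanArray` costs ~1 byte per element and an `IntArray` ~4 bytes per element on HotSpot (these constants are assumptions, and object headers are ignored), the int-index approach wins whenever fewer than about a quarter of the rows are null:

```kotlin
// Rough size model for the two null-tracking strategies discussed above.
fun maskBytes(rowCount: Int): Int = rowCount        // one Boolean (~1 byte) per row
fun indexBytes(nullCount: Int): Int = nullCount * 4 // one Int (4 bytes) per null

fun intIndicesAreSmaller(rowCount: Int, nullCount: Int): Boolean =
    indexBytes(nullCount) < maskBytes(rowCount)

fun main() {
    // 10M rows, 1% nulls: indices need ~0.4MB, a mask ~10MB.
    println(intIndicesAreSmaller(10_000_000, 100_000))   // true
    // 10M rows, 50% nulls: indices need ~20MB, a mask still ~10MB.
    println(intIndicesAreSmaller(10_000_000, 5_000_000)) // false
}
```

Under this model the break-even point sits at roughly 25% nulls, which supports preferring int arrays for the common low-null case.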
Generated sources will be updated after merging this PR.
I've furthered my investigations. I wanted to see whether it would make a difference if the `ColumnDataHolder` could also be used to collect data that will make up new columns (what `(Typed)ColumnDataCollector` does). I made a wrapper around all primitive array lists (which, yes, (un)boxes the values it works with, but at least stores them primitively) that can be used as a normal mutable list. If given no type, the first element added dictates it. Null values are not allowed. `ColumnDataHolderImpl` now takes this *(…)*, and `ColumnDataCollector` now uses *(…)*. And now for the test results:

**5M rows, mutable ColumnDataHolder, also in ColumnDataCollector**

*(results table)*

and on master:

*(results table)*
Let's look at the sizes first, as that result is the easier to explain. Using *(…)*. In terms of size, storing the indices for lots of nulls alongside a primitive array can actually be larger than simply keeping a boxed list: 372MB versus 250MB. I guess keeping nulls in a list is cheaper than keeping a set of their indices. We'd need to find a sweet spot here.
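That sweet spot can be estimated with a very rough size model. Assuming compressed oops (4-byte references), ~16 bytes per boxed `java.lang.Double`, 8 bytes per primitive double slot, and 4 bytes per int index (all assumed constants, not measurements):

```kotlin
// Very rough size model for the boxed-vs-primitive trade-off described above.
const val REF_BYTES = 4               // assumed: compressed oops
const val BOXED_DOUBLE_BYTES = 16     // assumed: header + payload of java.lang.Double
const val PRIMITIVE_DOUBLE_BYTES = 8
const val INT_BYTES = 4

// Boxed storage: one reference per row, one Double object per non-null value
// (null entries only cost the reference).
fun boxedListBytes(rows: Long, nulls: Long): Long =
    rows * REF_BYTES + (rows - nulls) * BOXED_DOUBLE_BYTES

// Primitive storage: one double slot per row plus an int index per null.
fun primitiveWithIndicesBytes(rows: Long, nulls: Long): Long =
    rows * PRIMITIVE_DOUBLE_BYTES + nulls * INT_BYTES

fun main() {
    val rows = 5_000_000L
    // Few nulls: primitive arrays win comfortably.
    println(primitiveWithIndicesBytes(rows, 50_000) < boxedListBytes(rows, 50_000))
    // Mostly nulls: the boxed list becomes the smaller of the two, matching
    // the direction of the 372MB-vs-250MB observation above.
    println(primitiveWithIndicesBytes(rows, 4_500_000) < boxedListBytes(rows, 4_500_000))
}
```

Under this model the two representations break even around 60% nulls, so a holder could plausibly pick its representation based on the observed null ratio.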
Fixes #30, one of our oldest issues.

I introduced `ColumnDataHolder` to replace the `List` in `DataColumnImpl`. This interface can define how the data of columns is stored. `ColumnDataHolderImpl` was created as the default implementation, and it defaults to storing data in primitive arrays whenever possible. Other implementations might be possible in the future as well (to make DataFrame act on top of an existing DB, for instance).

Things to be done:

- Use `ColumnDataHolder`s directly wherever possible instead of `List`s.
- `DataColumnImpl.type` mismatches `DataColumnImpl.values`: ☂ `type: KType` in `DataColumnImpl` mismatches actual values sometimes #713