Alternative CSV reader #589

Closed
Jolanrensen opened this issue Feb 13, 2024 · 6 comments · Fixed by #903
Labels: csv (CSV / delim related issues), research (This requires a deeper dive to gather a better understanding)

Comments

@Jolanrensen
Collaborator

should be investigated: https://github.com/doyaaaaaken/kotlin-csv

@Jolanrensen Jolanrensen added the research This requires a deeper dive to gather a better understanding label Feb 13, 2024
@Jolanrensen Jolanrensen added this to the Backlog milestone Feb 13, 2024
@koperagen
Collaborator

koperagen commented Feb 13, 2024

I tried FastCSV and want to use it on the JVM: its performance is several times better than the existing reader, and it beats pandas too.
I assume you're aiming for KMP, so that's a different matter. Just a note to keep in mind.

@devcrocod
Contributor

Keep in mind that you can always write your own interface and hide the platform implementation later
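A minimal sketch of that idea, written in Java with only the JDK. The names `CsvRowSource` and `NaiveCsvRowSource` are hypothetical and do not come from DataFrame or any CSV library; the placeholder implementation splits on commas and ignores quoting, where a real backend would delegate to a library such as FastCSV.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical abstraction: callers depend only on this interface.
interface CsvRowSource {
    List<List<String>> readAll(Reader input);
}

// Placeholder backend using only the JDK. It splits on commas and does NOT
// handle quotes or escapes; a platform-specific backend could replace it later.
final class NaiveCsvRowSource implements CsvRowSource {
    @Override
    public List<List<String>> readAll(Reader input) {
        List<List<String>> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) continue; // skip blank lines
                rows.add(Arrays.asList(line.split(",", -1)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}

final class CsvRowSourceDemo {
    public static void main(String[] args) {
        // Only this line changes when the backend is swapped.
        CsvRowSource source = new NaiveCsvRowSource();
        System.out.println(source.readAll(new StringReader("a,b\n1,2\n")));
    }
}
```

Because calling code only sees `CsvRowSource`, the backend can be replaced per platform without touching the callers.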

@Jolanrensen Jolanrensen added the csv CSV / delim related issues label Aug 20, 2024
@Jolanrensen Jolanrensen mentioned this issue Aug 20, 2024
27 tasks
@Jolanrensen
Collaborator Author

Jolanrensen commented Sep 2, 2024

I've been experimenting with different implementations to find the fastest one in combination with DataFrame.

Each test has two versions of the implementation:

  • The default version first loads the entire CSV into memory. This is usually the fastest for smaller CSVs, since the right amount of memory for the columns can be allocated right away. However, it runs into memory issues more quickly for larger CSV files.
  • That's why each test is accompanied by a "sequential" version. This version uses data collectors to stream the CSV rows directly into separate string columns. The downside is that we don't know the right amount of memory up front, so the ArrayLists need to grow as rows arrive, but we never hold a full List<SomeCsvRowClass>, saving memory in the long run :)
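The difference between the two versions can be sketched roughly as follows. This is an illustration in plain Java, not the benchmark code: `CsvColumnLoading` and its methods are hypothetical names, the row split is a naive comma split without quoting support, every line is treated as data, and every row is assumed to have exactly `columnCount` cells.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class CsvColumnLoading {
    // Naive row split for illustration only: no quoting or escaping support.
    private static String[] splitRow(String line) {
        return line.split(",", -1);
    }

    // "Default" strategy: hold every row in memory first, then allocate each
    // column with exactly the right size.
    static List<String[]> readAllThenTranspose(Reader input, int columnCount) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) rows.add(splitRow(line));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String[]> columns = new ArrayList<>();
        for (int c = 0; c < columnCount; c++) {
            String[] column = new String[rows.size()]; // exact size is known here
            for (int r = 0; r < rows.size(); r++) column[r] = rows.get(r)[c];
            columns.add(column);
        }
        return columns;
    }

    // "Sequential" strategy: stream each row straight into growable per-column
    // lists, so the full list of rows never exists in memory.
    static List<List<String>> streamIntoColumns(Reader input, int columnCount) {
        List<List<String>> columns = new ArrayList<>();
        for (int c = 0; c < columnCount; c++) columns.add(new ArrayList<>());
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cells = splitRow(line);
                for (int c = 0; c < columnCount; c++) columns.get(c).add(cells[c]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return columns;
    }
}
```

Both methods produce the same column contents; they differ in peak memory (rows plus columns versus columns only) and in whether the column storage has to grow while reading.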

We test:

Small CSV: 65.4 kB
(ops/s: Higher score is better)
image

(s/op: Lower score is better)
image

Large CSV: 857.7 MB
(ops/s: Higher score is better)
image

(s/op: Lower score is better)
image

@Jolanrensen
Collaborator Author

I've now added Deephaven-csv:

(s/op: Lower is better)

Benchmark                                    Mode  Cnt   Score    Error  Units
CsvBenchmark.apacheCsvReader                   ss   10   0.007 ±  0.003   s/op
CsvBenchmark.apacheCsvReaderSequential         ss   10   0.008 ±  0.003   s/op
CsvBenchmark.deephavenCsvReader                ss   10   0.009 ±  0.011   s/op
CsvBenchmark.fastCsvReader                     ss   10   0.004 ±  0.001   s/op
CsvBenchmark.fastCsvReaderSequential           ss   10   0.004 ±  0.002   s/op
CsvBenchmark.kotlinCsvReader                   ss   10   0.008 ±  0.001   s/op
CsvBenchmark.kotlinCsvReaderSequential         ss   10   0.007 ±  0.001   s/op
LargeCsvBenchmark.apacheCsvReader              ss    5  72.809 ± 16.879   s/op
LargeCsvBenchmark.apacheCsvReaderSequential    ss    5  46.433 ± 39.409   s/op
LargeCsvBenchmark.deephavenCsvReader           ss    5  16.640 ±  6.664   s/op
LargeCsvBenchmark.fastCsvReader                ss    5  59.848 ± 22.986   s/op
LargeCsvBenchmark.fastCsvReaderSequential      ss    5  40.747 ±  4.598   s/op
LargeCsvBenchmark.kotlinCsvReader              ss    5  80.383 ± 15.870   s/op
LargeCsvBenchmark.kotlinCsvReaderSequential    ss    5  68.547 ± 20.748   s/op

Note: The deephaven integration might not be optimal yet:

  • It can parse values by type itself, but I haven't figured out how to write custom parsers for it yet, so parsing a string column currently requires two (or more) parsing passes.
  • Deephaven allows defining your own (typed and unboxed) data collector, which could give an immense boost in combination with "Research: ColumnDataHolder/primitive arrays" (#712).
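To illustrate why a typed, unboxed collector could help, here is a minimal sketch of a growable `double[]` column. `DoubleColumnCollector` is a hypothetical class, not part of Deephaven's API or of #712; it only shows the storage idea, and it has no way to represent nulls.

```java
import java.util.Arrays;

// Hypothetical collector: stores values in a primitive array instead of an
// ArrayList<Double>, so no Double object is allocated per cell.
final class DoubleColumnCollector {
    private double[] values = new double[16];
    private int size = 0;

    // Appends one value, doubling the backing array when it is full.
    void add(double value) {
        if (size == values.length) values = Arrays.copyOf(values, size * 2);
        values[size++] = value;
    }

    int size() {
        return size;
    }

    // Returns a copy trimmed to the number of values actually collected.
    double[] toArray() {
        return Arrays.copyOf(values, size);
    }
}

final class DoubleColumnCollectorDemo {
    public static void main(String[] args) {
        DoubleColumnCollector collector = new DoubleColumnCollector();
        for (int i = 0; i < 100; i++) collector.add(i * 0.5);
        System.out.println(collector.size() + " values collected");
    }
}
```

A parser that hands over already-typed primitives could feed a collector like this directly, avoiding both the intermediate string column and the boxing of each value.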

@Jolanrensen Jolanrensen modified the milestones: Backlog, 0.15.0 Sep 4, 2024
@Jolanrensen
Collaborator Author

Combining Deephaven with #712 is very promising.
Reading the large csv on the ColumnDataHolder branch with properly set-up deephaven reading yields the following results:
image
image

Doing the same on the master branch yields:
image
image

Both in terms of memory and performance, there's something to gain from using deephaven and primitive arrays, at least when it comes to reading csvs :)

@Jolanrensen
Collaborator Author

Deephaven with normal ArrayLists (that support nulls this time) and new parsers:

image

@Jolanrensen Jolanrensen self-assigned this Sep 30, 2024
@Jolanrensen Jolanrensen mentioned this issue Nov 1, 2024
2 tasks