Performance issue with large files (~200 MB) #9

Guillawme · 2021-04-29T13:18:03Z

Hello,

As mentioned in #8, trying to read a STAR file about 200 MB in size hung "forever" until I canceled the command (I waited a bit more than one hour). The Julia process doing this used up to 14 GB of RAM (out of 16) and was still slowly claiming more when I decided to cancel. This happened when I ran the commands below in a freshly opened Julia session (maybe I should have read a small file with a similar structure first, to get compilation out of the way before trying to read the large file?).

julia> using CrystalInfoFramework, DataFrames, FilePaths
julia> test = Cif(p"particles.star")

Here is this particles.star file (link valid for 5 days): https://drop.chapril.org/download/311b4a22f7b03565/#X9xYmmEtcD4A4WZQKjbxvg

I can share even larger star files (up to ~800 MB) if you want to really stress test the package.


jamesrhester · 2021-05-04T02:33:52Z

I think the issue here is very large memory usage as the parse tree is being constructed. One fix would be to allow extracting information as soon as a syntactic item is matched, so that only the interesting items can be preserved and the rest thrown away instead of building a massive parse tree first. The package Lerche, which does the parsing, doesn't allow this yet. I'll raise an issue on that package.
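To illustrate the idea of keeping only the interesting items as they are matched, here is a minimal sketch. It does not use Lerche or CrystalInfoFramework at all: `scan_items` and `extract_items` are hypothetical helpers, and the "grammar" is assumed to be nothing more than single-line `_name value` pairs (no loops, no quoted or multi-line values). The point is only that each item is handed to a callback as soon as it is recognised, so nothing else is retained in memory.

```julia
# Hypothetical helpers for illustration only; not part of Lerche or CrystalInfoFramework.

# Read `io` one line at a time and call `on_item(name, value)` for every
# line that looks like a simple "_name value" pair. Nothing is stored here.
function scan_items(on_item::Function, io::IO)
    for line in eachline(io)
        parts = split(line)
        if length(parts) == 2 && startswith(parts[1], "_")
            on_item(String(parts[1]), String(parts[2]))
        end
    end
end

# Keep only the data names listed in `wanted`; everything else is discarded
# as soon as it has been seen.
function extract_items(io::IO, wanted)
    kept = Dict{String, String}()
    scan_items(io) do name, value
        if name in wanted
            kept[name] = value
        end
    end
    return kept
end
```

With this shape, memory use would depend on how many items are kept rather than on the size of the file, which is the property a callback-on-match parser would provide.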

Guillawme · 2021-05-04T09:56:13Z

In case this helps, I noticed after reporting this issue that the BioStructures.jl package can also read mmCIF (and simple STAR) files, and that its readmultimmcif function only takes a few seconds to read the same large file. I have no idea how different or similar their parser is, though.

jamesrhester · 2021-05-05T00:18:10Z

Interesting. That parser works by splitting the file into whitespace-separated tokens (handling quoted strings), then working through these tokens to allocate them to data blocks and data names. That is a different paradigm from the general grammar-based one used here, and clearly much faster.
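For reference, the token-based approach described above can be sketched roughly as follows. This is not the BioStructures.jl code; `tokenize` and `parse_blocks` are hypothetical names, and the sketch makes simplifying assumptions: it skips `#` comments, ignores multi-line semicolon-delimited text fields and `save_` frames, and cannot tell a quoted value that starts with `_` from a data name.

```julia
# Hypothetical sketch, not the BioStructures.jl implementation.

# Split text into whitespace-separated tokens, keeping quoted strings whole
# and skipping '#' comments that start where a token would start.
function tokenize(text::AbstractString)
    tokens = String[]
    chars = Char[]            # characters of the token being built
    quotechar = nothing       # opening quote while inside a quoted string
    incomment = false
    for c in text
        if incomment
            incomment = (c != '\n')
        elseif quotechar !== nothing
            if c == quotechar
                push!(tokens, join(chars)); empty!(chars); quotechar = nothing
            else
                push!(chars, c)
            end
        elseif isspace(c)
            if !isempty(chars)
                push!(tokens, join(chars)); empty!(chars)
            end
        elseif isempty(chars) && (c == '\'' || c == '"')
            quotechar = c
        elseif isempty(chars) && c == '#'
            incomment = true
        else
            push!(chars, c)
        end
    end
    isempty(chars) || push!(tokens, join(chars))
    return tokens
end

is_name(tok) = startswith(tok, "_")
is_keyword(tok) = tok == "loop_" || startswith(tok, "data_")

# Walk the tokens and allocate them to data blocks and data names.
# Every data name maps to a vector of string values (length 1 outside loops).
function parse_blocks(tokens::Vector{String})
    blocks = Dict{String, Dict{String, Vector{String}}}()
    current = nothing
    i = 1
    while i <= length(tokens)
        tok = tokens[i]
        if startswith(tok, "data_")
            current = Dict{String, Vector{String}}()
            blocks[tok[6:end]] = current
            i += 1
        elseif current === nothing
            error("token found before any data_ block: $tok")
        elseif tok == "loop_"
            i += 1
            cols = String[]
            while i <= length(tokens) && is_name(tokens[i])
                push!(cols, tokens[i])
                current[tokens[i]] = String[]
                i += 1
            end
            isempty(cols) && error("loop_ without data names")
            k = 0                     # values are assigned to columns in turn
            while i <= length(tokens) && !is_name(tokens[i]) && !is_keyword(tokens[i])
                push!(current[cols[k % length(cols) + 1]], tokens[i])
                k += 1
                i += 1
            end
        elseif is_name(tok) && i < length(tokens)
            current[tok] = [tokens[i + 1]]
            i += 2
        else
            error("unexpected token: $tok")
        end
    end
    return blocks
end
```

Calling `parse_blocks(tokenize(read("particles.star", String)))` would return a dictionary keyed by block name. Because each step is a single linear pass with no intermediate parse tree, it is plausible that this style scales much better to files of several hundred megabytes, at the cost of less general syntax handling.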

