Performance issue with large files (~200 MB) #9

Open
Guillawme opened this issue Apr 29, 2021 · 3 comments

@Guillawme

Hello,

As mentioned in #8, trying to read a STAR file of about 200 MB hung "forever": I canceled the command after waiting a bit more than one hour. The Julia process used up to 14 GB of RAM (out of 16) and was still slowly claiming more when I canceled. This happened when I ran the commands below in a freshly opened Julia session (maybe I should have read a small file with a similar structure first, to get compilation out of the way before trying the large file?).

julia> using CrystalInfoFramework, DataFrames, FilePaths
julia> test = Cif(p"particles.star")

Here is this particles.star file (link valid for 5 days): https://drop.chapril.org/download/311b4a22f7b03565/#X9xYmmEtcD4A4WZQKjbxvg

I can share even larger star files (up to ~800 MB) if you want to really stress test the package.

@jamesrhester
Owner

I think the issue here is very large memory usage as the parse tree is being constructed. One fix would be to allow extracting information as soon as a syntactic item is matched, so that only the interesting items can be preserved and the rest thrown away instead of building a massive parse tree first. The package Lerche, which does the parsing, doesn't allow this yet. I'll raise an issue on that package.
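
To make the idea concrete, here is a rough sketch of what "extract as soon as an item is matched" could look like. It is not how Lerche or this package works: `scan_items` and the handler signature are made up for illustration, and it only understands single-line `_name value` items outside of loops (no quoted values, no multi-line text fields).

```julia
# Hypothetical event-style scanner: calls `handler(name, value)` for each
# single-line `_name value` item as soon as it is read, so the caller decides
# what to keep and nothing else is stored.
# Assumptions: everything after `loop_` is skipped until the next `data_`
# header; quoted values and multi-line text fields are not handled.
function scan_items(handler, io::IO)
    inloop = false
    for line in eachline(io)
        s = strip(line)
        if startswith(s, "data_")
            inloop = false          # a new data block ends any open loop
        elseif s == "loop_"
            inloop = true           # loops are ignored in this sketch
        elseif !inloop && startswith(s, "_")
            parts = split(s; limit = 2)
            if length(parts) == 2
                handler(String(parts[1]), String(strip(parts[2])))
            end
        end
    end
    return nothing
end

# Usage: keep only the data names we care about, drop the rest immediately.
wanted = Set(["_rlnVoltage"])
kept = Dict{String,String}()
sample = "data_optics\n_rlnVoltage 300.0\n_rlnSphericalAberration 2.7\n"
scan_items(IOBuffer(sample)) do name, value
    name in wanted && (kept[name] = value)
end
```

With this shape, memory use depends on what the handler keeps rather than on the size of the file.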

@Guillawme
Author

In case this helps, I noticed after reporting this issue that the BioStructures.jl package can also read mmCIF (and simple STAR) files, and that its readmultimmcif function only takes a few seconds to read the same large file. I have no idea how different or similar their parser is, though.

@jamesrhester
Owner

Interesting. That parser works by splitting the file into whitespace-separated tokens (handling quoted strings), then working through these tokens to allocate them to data blocks and data names. A different paradigm to the general one used here and clearly super fast.
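
For reference, the token-based approach described above can be sketched roughly as follows. This is not the BioStructures.jl code: `tokenize` and `parse_blocks` are hypothetical names, and the sketch makes simplifying assumptions (a quote closes at the next matching quote character, `#` starts a comment only at the start of a token, multi-line text fields are not handled, and quoted tokens are not distinguished from bare ones afterwards).

```julia
# Split text into whitespace-separated tokens, keeping quoted strings whole.
function tokenize(text::AbstractString)
    tokens = String[]
    i = firstindex(text)
    n = lastindex(text)
    while i <= n
        c = text[i]
        if isspace(c)
            i = nextind(text, i)
        elseif c == '#'
            # comment: skip to the end of the line
            while i <= n && text[i] != '\n'
                i = nextind(text, i)
            end
        elseif c == '\'' || c == '"'
            # quoted string: read up to the next matching quote character
            start = nextind(text, i)
            j = start
            while j <= n && text[j] != c
                j = nextind(text, j)
            end
            push!(tokens, text[start:prevind(text, j)])
            i = j <= n ? nextind(text, j) : j
        else
            # bare token: read up to the next whitespace
            j = i
            while j <= n && !isspace(text[j])
                j = nextind(text, j)
            end
            push!(tokens, text[i:prevind(text, j)])
            i = j
        end
    end
    return tokens
end

# Walk the tokens and allocate them to data blocks and data names.
# Result: block name => (data name => vector of string values).
function parse_blocks(tokens)
    blocks = Dict{String,Dict{String,Vector{String}}}()
    current = nothing
    isvalue(t) = !startswith(t, "_") && !startswith(t, "data_") && t != "loop_"
    i = 1
    while i <= length(tokens)
        tok = tokens[i]
        if startswith(tok, "data_")
            current = Dict{String,Vector{String}}()
            blocks[tok[6:end]] = current
            i += 1
        elseif tok == "loop_"
            # collect the column names, then deal values out to them in turn
            cols = String[]
            i += 1
            while i <= length(tokens) && startswith(tokens[i], "_")
                push!(cols, tokens[i])
                i += 1
            end
            for name in cols
                current[name] = String[]
            end
            col = 1
            while !isempty(cols) && i <= length(tokens) && isvalue(tokens[i])
                push!(current[cols[col]], tokens[i])
                col = col == length(cols) ? 1 : col + 1
                i += 1
            end
        elseif startswith(tok, "_") && i < length(tokens)
            current[tok] = [tokens[i + 1]]   # simple name/value pair
            i += 2
        else
            i += 1   # stray token: ignored in this sketch
        end
    end
    return blocks
end

sample = """
data_optics
_rlnVoltage 300.0
data_particles
loop_
_rlnCoordinateX #1
_rlnImageName #2
10.5 'img 1.mrc'
20.5 'img 2.mrc'
"""
blocks = parse_blocks(tokenize(sample))
```

Because each token is visited once and stored directly in its final container, no intermediate parse tree is built, which is presumably where the speed and memory difference comes from.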
