-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Naming and architecture suggestions #8
Comments
I've deliberately focused on CIF as that is widely used and recognised in both chemical crystallography and macromolecular crystallography. The package also includes a lot of CIF dictionary infrastructure. The RELION conventions look like they would be completely compatible with CIF, so a Can you provide a link to the 200MB file. Currently native speed is about 1MB/2s so a 200Mb file should be 400s, or 7 minutes. There may be a bottleneck in parsing that I can fix quickly. It may also be a memory consumption issue. Columns can only be typed if there is some knowledge of what their contents should be. CIF uses CIF dictionaries, so if you do associate a CIF dictionary to a CIF data block, correctly-typed columns are returned. In the case of STAR, there is no foolproof way of determining column contents (strings don't have to be delimited, I think) so you probably as a user need to indicate what types you want for columns and so it's probably just as simple to use DataFrame methods for this. |
I opened a dedicated issue and posted the large file with more info there. See #9. I mentioned typed columns because this is something |
Suggested enhancements:
|
Hello,
Since the README says
I suggest checking how much of the STAR format specification is supported by the package, and maybe renaming it to STAR.jl if enough of the specification is supported already and/or if complete (or "complete enough for most use cases") support is within the scope of this package. It seems to me that it is already compatible with STAR files following these conventions from RELION (I tried reading small files, it worked; trying to read a ~200 MB file with the native parser hanged for more than one hour until I cancelled the command; I can open a dedicated issue for that and provide the large file, if you want).
One other nice thing to have, which may not even need changing the architecture, would be a high-level function with which one could do something like
read_star("/path/to/file.star")
and get out a list ofDataFrame
s (one per table present in the file) with properly typed columns (most of the timeString
,Int64
orFloat64
; not sure anything else is ever present in these files). And also the complementarywrite_star(df::DataFrame, "/path/to/file.star")
(single table) andwrite_star(dfs::Vector{DataFrame}, "/path/to/file.star")
(multiple tables).The text was updated successfully, but these errors were encountered: