-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Register CSV, Feather, Excel and various stats formats #124
Conversation
Wait, why is this passing? I guess the failures in #123 might actually have been related? |
99c8c03
to
b5f7b29
Compare
@@ -16,6 +16,13 @@ end | |||
|
|||
add_format(format"RData", detect_rdata, [".rda", ".RData", ".rdata"], [:RData, LOAD]) | |||
|
|||
add_format(format"CSV", (), [".csv"], [:CSVFiles]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems controversial, there are many many different CSV reader options around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment, essentially the CSVFiles package exists because there are multiple potential target formats for a CSV file and there are multiple packages that do CSV reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And you're making this depend on your specific interpretation of how to enable interoperability between them. FileIO has so far been about fairly straightforward mappings of file formats to individual loader packages, not file formats to another intermediate abstraction layer. And there are multiple, not universally better or worse, abstraction layers around.
Why aren't Feather, Stata, and Excel formats done directly by the Feather.jl, ReadStat.jl, and ExcelReaders.jl loader packages? There's a lot of excess IterableTables dependencies (especially problematic ones, like Requires.jl) in these new *Files packages that FileIO shouldn't need to load a file. FileIO should be unopinionated, load the file for you and leave manipulations on the data to other packages. |
That is what this design enables. There is clearly no one data structure for tabular data in julia, instead there are about a gazillion ones. So One trait that the In general, this enables all sorts of cool syntax: load("file.csv") |> DataFrame # Construct a DataFrame, currently using iterable tables, which uses TextParse.jl for parsing
load("file.csv") |> DataTable # If DataTables wanted to use CSV.jl instead for loading stuff, it could easily opt to do that
load("file.csv") |> @query(i, begin
@where some_condition
@select {i.a, i.b}
end) |> save("output.feather")
load("file.csv") |> plot() # This should work with VegaLite soon And it all mixes and matches as you wish.
Because they are concrete implementations of a specific load algorithm. But different consuming functions might want to use different packages for loading, depending on e.g. their internal data structure etc. Having this one extra step in-between makes it feasible to not hard code one load implementation per file, but make the choice of load routine depend on the file type and the target function. |
And here you're enforcing IterableTables, and therefore Requires, NamedTuples, and more to just load a file! That's way way overkill. |
Well, you can't load a file without some dependencies :) Pretty much any format here registered in FileIO brings in way more dependencies than this. Also, there is a clear path to get rid of all dependencies in IterableTables.jl, i.e. in the julia 1.0 time frame IterableTables will have no dependencies at all and consist of about 50 lines code total. So, mid-term this will impose IterableTables (a 50 lines of code, harmless package) as a dependency, but it will not force the whole stack over to use iterable tables, this design here is explicitly set up so that other approaches can co-exist. If someone has an idea for a less opinionated design, great, but I really don't see how this is overkill at all. |
You don't need all of TextParse, NullableArrays, PooledArrays, WeakRefStrings, IterableTables, NamedTuples, Requires, DataValues, DataTables, CategoricalArrays, StatsBase, SortingAlgorithms, Reexport, and DataStreams to load a csv file. Some subset sure, depending on what you want, but you don't need all of it, and that's what this does.
How so, when
You're going to find a lot of disagreement on that one. |
I have a plan how to get rid of the
You don't have to use it, you can also just do |
I read through the other issues here now, and I think this debate is misguided. One can register multiple packages for a single format (see here). Merging this PR here does not prevent any other package from also registering for the CSV file format with FileIO, in fact there seems specific support for that in the package. It seems an unresolved issue is how one can deal with a situation where two packages are installed that can handle the same file format (#46). In my mind the solution to that should be one of the suggestions made in #46. The solution should not be to put all file format registrations on hold for which there might be multiple packages out there that can handle it. Certainly is seems that in the past packages were registered for formats as they came up, so I don't see why this PR here should be treated differently. @SimonDanisch Is that a fair characterization of how things were handled here in the past? |
Yes, this is exactly the reason why FileIO exists: to not care about how heavy weight an IO package is! If anyone else ports their CSV reader to support FileIO, they should just add it to the registry as well. |
How are priorities handled? |
The order of the IO library list! |
add_format(format"Feather", (), [".feather"], [:FeatherFiles]) | ||
add_format(format"Excel", (), [".xls", ".xlsx"], [:ExcelFiles, LOAD]) | ||
add_format(format"Stata", (), [".dta"], [:StatFiles, LOAD]) | ||
add_format(format"SPSS", (), [".sav", ".por"], [:StatFiles, LOAD]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a magic number on this? .sav is very general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to find one.
Thanks, @SimonDanisch! Could you also tag a new release in METADATA? |
After fixing #129 ? |
Yes, all four packages have unit tests that exercise the FileIO stuff added here and it all works. But, on merging, I might take a stab at #122 soonish, so maybe we wait with a tag here until I either have done that or given up? |
let me know when you're ready for a tag! |
I'm about to register those four packages in METADATA.jl. There is a bit of a chicken and egg situation here: the tests in those four packages can only pass once this here is merged and tagged, but this here obviously only works if those four packages are registered.
Maybe the following would work: could this be merged onto
master
now, but not yet tagged? I'll then modify the CI builds for the four packages so that they run off the FileIO master branch. That should make those tests pass, then we can tag a new version of FileIO?@SimonDanisch & @timholy again, thanks for helping me out with this!