Register CSV, Feather, Excel and various stats formats #124

davidanthoff · 2017-06-08T21:23:43Z

I'm about to register those four packages in METADATA.jl. There is a bit of a chicken and egg situation here: the tests in those four packages can only pass once this here is merged and tagged, but this here obviously only works if those four packages are registered.

Maybe the following would work: could this be merged onto master now, but not yet tagged? I'll then modify the CI builds for the four packages so that they run off the FileIO master branch. That should make those tests pass, then we can tag a new version of FileIO?

@SimonDanisch & @timholy again, thanks for helping me out with this!

SimonDanisch · 2017-06-08T22:53:44Z

Wait, why is this passing? I guess the failures in #123 might actually have been related?

davidanthoff · 2017-06-08T22:57:31Z

Uh, that is very suspicious... This PR here doesn't have the changes in #123... So maybe #123 actually did introduce bugs? I'll double check.

tkelman · 2017-06-09T22:44:38Z

src/registry.jl

@@ -16,6 +16,13 @@ end

 add_format(format"RData", detect_rdata, [".rda", ".RData", ".rdata"], [:RData, LOAD])

+add_format(format"CSV", (), [".csv"], [:CSVFiles])


This seems controversial; there are many, many different CSV reader options around.

davidanthoff · Jun 9, 2017

See my other comment. Essentially, the CSVFiles package exists because there are multiple potential target formats for a CSV file and there are multiple packages that do CSV reading.

tkelman · Jun 10, 2017

And you're making this depend on your specific interpretation of how to enable interoperability between them. FileIO has so far been about fairly straightforward mappings of file formats to individual loader packages, not file formats to another intermediate abstraction layer. And there are multiple, not universally better or worse, abstraction layers around.

tkelman · 2017-06-09T22:49:08Z

Why aren't Feather, Stata, and Excel formats done directly by the Feather.jl, ReadStat.jl, and ExcelReaders.jl loader packages? There's a lot of excess IterableTables dependencies (especially problematic ones, like Requires.jl) in these new *Files packages that FileIO shouldn't need to load a file. FileIO should be unopinionated, load the file for you and leave manipulations on the data to other packages.

davidanthoff · 2017-06-09T23:20:38Z

> FileIO should be unopinionated, load the file for you and leave manipulations on the data to other packages.

That is what this design enables. There is clearly no one data structure for tabular data in Julia; instead there are about a gazillion of them. So load("somefile.csv") can't return a concrete, loaded representation of the CSV content without being opinionated. This design takes a different approach: load("somefile.csv") returns an instance of CSVFile, which as a data structure just holds the filename of the corresponding CSV file and some config options specific to CSV reading. Data is only loaded lazily, once you pass this CSVFile to some other function, and how exactly the data is then loaded depends on what the consuming function wants to do.

One trait that CSVFile currently implements is the iterable table trait, i.e. you can iterate through the tabular data of the file. This opens up compatibility with all the sinks currently implemented in IterableTables.jl (11 as of right now). But this is not exclusive: any consuming function can also decide to use a completely different load routine, and CSVFile could easily implement other traits that allow tabular data extraction as well. So this design does not hard-code iterable tables as the only option for these tabular files loaded via FileIO, but it does provide it as one of potentially many options.
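To make the lazy-handle idea concrete, here is a minimal sketch, assuming a current Julia 1.x. The names `LazyCSVFile`, `lazy_load` and `rows` are hypothetical stand-ins for illustration and are not the actual CSVFiles.jl or FileIO code; the parsing is deliberately naive (no quoting, no type inference, no handling of empty files).

```julia
# Hypothetical stand-in for the CSVFile type described above (not the CSVFiles.jl source).
struct LazyCSVFile
    filename::String
    delim::Char          # example of a CSV-specific config option
end

# Toy counterpart of `load`: records the path and options, performs no I/O.
lazy_load(filename::AbstractString; delim::Char=',') = LazyCSVFile(filename, delim)

# The file is only read when a consumer asks for rows.
# Each row becomes a NamedTuple of strings keyed by the header names.
function rows(f::LazyCSVFile)
    lines = readlines(f.filename)
    cols = Tuple(Symbol.(split(lines[1], f.delim)))
    return [NamedTuple{cols}(Tuple(String.(split(line, f.delim)))) for line in lines[2:end]]
end
```

Calling `lazy_load` on a path never touches the file; only `rows` does, which is the point where a different consumer could substitute its own reader.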

In general, this enables all sorts of cool syntax:

load("file.csv") |> DataFrame # Construct a DataFrame, currently using iterable tables, which uses TextParse.jl for parsing
load("file.csv") |> DataTable # If DataTables wanted to use CSV.jl instead for loading stuff, it could easily opt to do that
load("file.csv") |> @query(i, begin
    @where some_condition
    @select {i.a, i.b}
end) |> save("output.feather")
load("file.csv") |> plot() # This should work with VegaLite soon

And it all mixes and matches as you wish.

> Why aren't Feather and Stata formats done directly by Feather.jl and ReadStat.jl?

Because they are concrete implementations of a specific load algorithm. But different consuming functions might want to use different packages for loading, depending on e.g. their internal data structure etc. Having this one extra step in-between makes it feasible to not hard-code one load implementation per file, and instead make the choice of load routine depend on the file type and the target function.

tkelman · 2017-06-09T23:32:32Z

And here you're enforcing IterableTables, and therefore Requires, NamedTuples, and more to just load a file! That's way way overkill.

davidanthoff · 2017-06-09T23:53:26Z

> And here you're enforcing IterableTables, and therefore Requires, NamedTuples, and more to just load a file! That's way way overkill.

Well, you can't load a file without some dependencies :) Pretty much any format registered here in FileIO brings in way more dependencies than this. Also, there is a clear path to get rid of all dependencies in IterableTables.jl, i.e. in the Julia 1.0 time frame IterableTables will have no dependencies at all and consist of about 50 lines of code in total.

So, mid-term this will impose IterableTables (a 50 lines of code, harmless package) as a dependency, but it will not force the whole stack over to use iterable tables, this design here is explicitly set up so that other approaches can co-exist.

If someone has an idea for a less opinionated design, great, but I really don't see how this is overkill at all.

tkelman · 2017-06-10T00:19:49Z

> Well, you can't load a file without some dependencies

You don't need all of TextParse, NullableArrays, PooledArrays, WeakRefStrings, IterableTables, NamedTuples, Requires, DataValues, DataTables, CategoricalArrays, StatsBase, SortingAlgorithms, Reexport, and DataStreams to load a csv file. Some subset sure, depending on what you want, but you don't need all of it, and that's what this does.

> (a 50 lines of code, harmless package) as a dependency, but it will not force the whole stack over to use iterable tables, this design here is explicitly set up so that other approaches can co-exist.

How so, when load("foo.csv") now dispatches to your specific stack of how formats should be implemented and converted?

> cool syntax

You're going to find a lot of disagreement on that one.

davidanthoff · 2017-06-10T00:42:57Z

> You don't need all of TextParse, NullableArrays, PooledArrays, WeakRefStrings, IterableTables, NamedTuples, Requires, DataValues, DataTables, CategoricalArrays, StatsBase, SortingAlgorithms, Reexport, and DataStreams to load a csv file. Some subset sure, depending on what you want, but you don't need all of it, and that's what this does.

I have a plan for getting rid of the DataTables dependency; using it is more of a shortcut right now because I want to have something ready to show for JuliaCon. Dropping it should get rid of a whole bunch of these dependencies (CategoricalArrays, StatsBase, SortingAlgorithms, Reexport and DataStreams). TextParse currently brings in NullableArrays (but might drop that dependency), PooledArrays and WeakRefStrings, so there is nothing I can do about those. NamedTuples and Requires will go away over time. So this is on a path where the requirements of CSVFiles will be whatever TextParse brings along, DataValues (which might also go away if someone finds a solution to the problems I outlined with the Union{T,Null} approach) and IterableTables (with no dependencies).

> How so, when load("foo.csv") now dispatches to your specific stack of how formats should be implemented and converted?

load("foo.csv") returns a CSVFile type. You can do whatever you want with that: if you have a function foo(a::CSVFile) that doesn't want to use the iterable tables approach, you can do that, no problem. You get the filename and the csv parsing options in that type, and then you can do whatever you want. In that case, not a single line of iterable tables code will ever run.
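As a rough illustration of that last point, a consumer can dispatch on the handle type and bring its own reader. This is only a sketch under stated assumptions: `CSVHandle`, `count_rows` and `column` are hypothetical names, not part of CSVFiles.jl, and the parsing ignores quoting and missing values.

```julia
# Hypothetical handle type: just the filename plus parsing options.
struct CSVHandle
    filename::String
    delim::Char
end

# Consumer 1: counts data rows (excluding the header) without parsing any fields.
count_rows(f::CSVHandle) = max(countlines(f.filename) - 1, 0)

# Consumer 2: reads a single named column as Float64 with its own hand-written loop.
function column(f::CSVHandle, name::AbstractString)
    open(f.filename) do io
        header = split(readline(io), f.delim)
        idx = findfirst(==(name), header)
        idx === nothing && throw(ArgumentError("no column named $name"))
        # Remaining lines are data rows; pick out field `idx` from each.
        return [parse(Float64, split(line, f.delim)[idx]) for line in eachline(io)]
    end
end
```

Neither function iterates the file as a table of rows; each uses only the filename and the delimiter stored in the handle.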

> cool syntax
>
> You're going to find a lot of disagreement on that one.

You don't have to use it; you can also just do DataFrame(load("file.csv")), plot(load("file.csv")) or save("file.feather", load("file.csv")) or any other combination. I do believe that having a piping-like, dplyr-like thing for Julia is something that a critical mass of people would like to see, though. That is where all of this is going.

davidanthoff · 2017-06-10T04:25:16Z

I have now read through the other issues here, and I think this debate is misguided. One can register multiple packages for a single format (see here). Merging this PR does not prevent any other package from also registering for the CSV file format with FileIO; in fact, there seems to be specific support for that in the package.

It seems an unresolved issue is how one can deal with a situation where two packages are installed that can handle the same file format (#46). In my mind the solution to that should be one of the suggestions made in #46. The solution should not be to put all file format registrations on hold for which there might be multiple packages out there that can handle it. Certainly it seems that in the past packages were registered for formats as they came up, so I don't see why this PR here should be treated differently.

@SimonDanisch Is that a fair characterization of how things were handled here in the past?

SimonDanisch · 2017-06-10T12:07:33Z

Yes, this is exactly the reason why FileIO exists: to not care about how heavy weight an IO package is! If anyone else ports their CSV reader to support FileIO, they should just add it to the registry as well.
We can then discuss which one is the best and which one should take priority (FileIO supports preferring one IO package as the standard loader).

tkelman · 2017-06-10T12:22:02Z

How are priorities handled?

SimonDanisch · 2017-06-10T12:22:46Z

The order of the IO library list!
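A toy model of "priority by list order" might look like the following. It is not FileIO's internal implementation: `LOADERS`, `register!`, `pick_loader` and the package name `:OtherCSVReader` are all made up for illustration, and the `installed` set stands in for whatever availability check the real package performs.

```julia
# Toy registry: format name => candidate loader packages, in priority order.
const LOADERS = Dict{Symbol,Vector{Symbol}}()

# Later registrations are appended, so earlier ones keep higher priority.
register!(fmt::Symbol, pkgs::Symbol...) = append!(get!(LOADERS, fmt, Symbol[]), pkgs)

# Return the first registered package that is available, or `nothing` if none is.
function pick_loader(fmt::Symbol, installed::Set{Symbol})
    for pkg in get(LOADERS, fmt, Symbol[])
        pkg in installed && return pkg
    end
    return nothing
end
```

With `register!(:CSV, :CSVFiles)` followed by `register!(:CSV, :OtherCSVReader)`, the first entry wins whenever both are available, and the second is used only as a fallback.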

tkelman · 2017-06-10T13:35:52Z

src/registry.jl

+add_format(format"Feather", (), [".feather"], [:FeatherFiles])
+add_format(format"Excel", (), [".xls", ".xlsx"], [:ExcelFiles, LOAD])
+add_format(format"Stata", (), [".dta"], [:StatFiles, LOAD])
+add_format(format"SPSS", (), [".sav", ".por"], [:StatFiles, LOAD])


is there a magic number on this? .sav is very general

davidanthoff · Jun 10, 2017

I'll try to find one.
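A detection function for .sav could be sketched as follows. It assumes that SPSS system files begin with the four ASCII bytes `$FL2` (or `$FL3` for the zlib-compressed variant), as described in the PSPP documentation of the format. `looks_like_sav` is a hypothetical name, not something defined in FileIO or StatFiles, and .por files are not covered.

```julia
# Assumed magic numbers: ASCII "$FL2" and "$FL3" at offset 0 of an SPSS system file.
const SAV_MAGIC_V2 = UInt8[0x24, 0x46, 0x4c, 0x32]   # "$FL2"
const SAV_MAGIC_V3 = UInt8[0x24, 0x46, 0x4c, 0x33]   # "$FL3"

# Peek at the next four bytes of a seekable stream and restore its position afterwards.
function looks_like_sav(io::IO)
    pos = position(io)
    header = read(io, 4)      # returns fewer than 4 bytes if the stream is shorter
    seek(io, pos)
    return header == SAV_MAGIC_V2 || header == SAV_MAGIC_V3
end
```

A function of this shape could take the place of the empty `()` magic argument in the registration above, so that a generic `.sav` extension alone is not enough to claim the file.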

davidanthoff · 2017-06-10T15:19:40Z

Thanks, @SimonDanisch! Could you also tag a new release in METADATA?

SimonDanisch · 2017-06-10T15:34:23Z

After fixing #129 ?
Is there a test suite that works with all this new functionality and did you run it?
Just trying to keep the tagging churn low ;)

davidanthoff · 2017-06-11T04:57:25Z

> Is there a test suite that works with all this new functionality and did you run it?

Yes, all four packages have unit tests that exercise the FileIO stuff added here and it all works.

But, on merging, I might take a stab at #122 soonish, so maybe we wait with a tag here until I either have done that or given up?

SimonDanisch · 2017-06-11T15:35:17Z

let me know when you're ready for a tag!

Register CSV, Feather, Excel and various stats formats

b5f7b29

davidanthoff force-pushed the tabular-formats branch from 99c8c03 to b5f7b29 Compare June 9, 2017 22:06

tkelman reviewed Jun 9, 2017

SimonDanisch merged commit 1578195 into JuliaIO:master Jun 10, 2017

tkelman reviewed Jun 10, 2017

davidanthoff deleted the tabular-formats branch June 10, 2017 15:19

plut mentioned this pull request Nov 17, 2021

multiple packages using a type of file? #46

Open
