Register CSV, Feather, Excel and various stats formats #124

Merged
SimonDanisch merged 1 commit into JuliaIO:master from the tabular-formats branch
Jun 10, 2017

davidanthoff
Contributor

I'm about to register those four packages in METADATA.jl. There is a bit of a chicken and egg situation here: the tests in those four packages can only pass once this here is merged and tagged, but this here obviously only works if those four packages are registered.

Maybe the following would work: could this be merged onto master now, but not yet tagged? I'll then modify the CI builds for the four packages so that they run off the FileIO master branch. That should make those tests pass, then we can tag a new version of FileIO?

@SimonDanisch & @timholy again, thanks for helping me out with this!

@SimonDanisch
Member

Wait, why is this passing? I guess the failures in #123 might actually have been related?

@davidanthoff
Contributor Author

Uh, that is very suspicious... This PR here doesn't have the changes in #123... So maybe #123 actually did introduce bugs? I'll double check.

@@ -16,6 +16,13 @@ end

add_format(format"RData", detect_rdata, [".rda", ".RData", ".rdata"], [:RData, LOAD])

add_format(format"CSV", (), [".csv"], [:CSVFiles])
Contributor

This seems controversial, there are many many different CSV reader options around

Contributor Author

See my other comment, essentially the CSVFiles package exists because there are multiple potential target formats for a CSV file and there are multiple packages that do CSV reading.

Contributor

And you're making this depend on your specific interpretation of how to enable interoperability between them. FileIO has so far been about fairly straightforward mappings of file formats to individual loader packages, not file formats to another intermediate abstraction layer. And there are multiple, not universally better or worse, abstraction layers around.

@tkelman
Contributor

tkelman commented Jun 9, 2017

Why aren't Feather, Stata, and Excel formats done directly by the Feather.jl, ReadStat.jl, and ExcelReaders.jl loader packages? There's a lot of excess IterableTables dependencies (especially problematic ones, like Requires.jl) in these new *Files packages that FileIO shouldn't need to load a file. FileIO should be unopinionated, load the file for you and leave manipulations on the data to other packages.

@davidanthoff
Contributor Author

> FileIO should be unopinionated, load the file for you and leave manipulations on the data to other packages.

That is what this design enables. There is clearly no one data structure for tabular data in julia, instead there are about a gazillion ones. So load("somefile.csv") can't return a concrete, loaded representation of the CSV content without being opinionated. This design takes a different approach: load("somefile.csv") returns an instance of CSVFile, which as a data structure just holds the filename of the corresponding CSV file and some config options specific to CSV reading. Data will only be loaded lazily once you pass this CSVFile to some other function, and how exactly the data is then loaded depends on what the consuming function wants to do.

One trait that CSVFile currently implements is the iterable table trait, i.e. you can iterate through the tabular data of the file. This opens up compatibility with all the sinks implemented in IterableTables.jl (11 as of right now). But this is not exclusive: any consuming function can also decide to use a completely different load routine, and CSVFile could easily implement other traits that allow tabular data extraction. So this design does not hard-code iterable tables as the only option for these tabular files loaded via FileIO, but it does provide it as one of potentially many options.
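A minimal, dependency-free sketch of the lazy-handle idea described above, for illustration only. `LazyCSVFile`, `lazy_load`, `rows` and `line_count` are hypothetical names, not the CSVFiles.jl or FileIO API; the point is only that the handle stores a filename plus options, and each consumer picks its own way of reading the file.

```julia
# Hypothetical stand-in for the CSVFile type described above;
# this is not the CSVFiles.jl implementation.
struct LazyCSVFile
    filename::String
    delim::Char      # one example of a CSV-specific option
end

# Stand-in for load("somefile.csv"): only records the name and options,
# the file is not opened here.
lazy_load(filename::AbstractString; delim::Char=',') = LazyCSVFile(String(filename), delim)

# Consumer 1: parse every non-empty line into fields, on demand.
rows(f::LazyCSVFile) = [split(line, f.delim) for line in eachline(f.filename) if !isempty(line)]

# Consumer 2: a different routine that ignores the row parser entirely.
line_count(f::LazyCSVFile) = countlines(f.filename)

# Usage with a temporary file
path, io = mktemp()
write(io, "a,b\n1,2\n")
close(io)
f = lazy_load(path)      # nothing is read yet
parsed = rows(f)         # the file is read here
n = line_count(f)        # and read again, by an unrelated routine
```

Because `lazy_load` never touches the disk, it also succeeds for a file that does not exist; the error would only surface once a consumer tries to read.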

In general, this enables all sorts of cool syntax:

load("file.csv") |> DataFrame # Construct a DataFrame, currently using iterable tables, which uses TextParse.jl for parsing
load("file.csv") |> DataTable # If DataTables wanted to use CSV.jl instead for loading stuff, it could easily opt to do that
load("file.csv") |> @query(i, begin
    @where some_condition
    @select {i.a, i.b}
end) |> save("output.feather")
load("file.csv") |> plot() # This should work with VegaLite soon

And it all mixes and matches as you wish.

> Why aren't Feather and Stata formats done directly by Feather.jl and ReadStat.jl?

Because they are concrete implementations of a specific load algorithm. But different consuming functions might want to use different packages for loading, depending on e.g. their internal data structure etc. Having this one extra step in-between makes it feasible to not hard code one load implementation per file, but make the choice of load routine depend on the file type and the target function.

@tkelman
Contributor

tkelman commented Jun 9, 2017

And here you're enforcing IterableTables, and therefore Requires, NamedTuples, and more to just load a file! That's way way overkill.

@davidanthoff
Contributor Author

> And here you're enforcing IterableTables, and therefore Requires, NamedTuples, and more to just load a file! That's way way overkill.

Well, you can't load a file without some dependencies :) Pretty much any format registered here in FileIO brings in way more dependencies than this. Also, there is a clear path to get rid of all dependencies in IterableTables.jl, i.e. in the Julia 1.0 time frame IterableTables will have no dependencies at all and consist of about 50 lines of code total.

So, mid-term this will impose IterableTables (a 50 lines of code, harmless package) as a dependency, but it will not force the whole stack over to use iterable tables, this design here is explicitly set up so that other approaches can co-exist.

If someone has an idea for a less opinionated design, great, but I really don't see how this is overkill at all.

@tkelman
Contributor

tkelman commented Jun 10, 2017

> Well, you can't load a file without some dependencies

You don't need all of TextParse, NullableArrays, PooledArrays, WeakRefStrings, IterableTables, NamedTuples, Requires, DataValues, DataTables, CategoricalArrays, StatsBase, SortingAlgorithms, Reexport, and DataStreams to load a csv file. Some subset sure, depending on what you want, but you don't need all of it, and that's what this does.

> (a 50 lines of code, harmless package) as a dependency, but it will not force the whole stack over to use iterable tables, this design here is explicitly set up so that other approaches can co-exist.

How so, when load("foo.csv") now dispatches to your specific stack of how formats should be implemented and converted?

> cool syntax

You're going to find a lot of disagreement on that one.

@davidanthoff
Contributor Author

> You don't need all of TextParse, NullableArrays, PooledArrays, WeakRefStrings, IterableTables, NamedTuples, Requires, DataValues, DataTables, CategoricalArrays, StatsBase, SortingAlgorithms, Reexport, and DataStreams to load a csv file. Some subset sure, depending on what you want, but you don't need all of it, and that's what this does.

I have a plan for getting rid of the DataTables dependency; using it is more of a shortcut right now because I want to have something ready to show for JuliaCon. Dropping it should get rid of a whole bunch of these dependencies (CategoricalArrays, StatsBase, SortingAlgorithms, Reexport and DataStreams). TextParse currently brings in NullableArrays (but might drop that dependency), PooledArrays and WeakRefStrings, so there is nothing I can do about those. NamedTuples and Requires will go away over time. So this is on a path where the requirements of CSVFiles will be whatever TextParse brings along, DataValues (but if someone finds a solution to the problems I outlined with the Union{T,Null} approach, this might also go away) and then IterableTables (with no dependencies).

> How so, when load("foo.csv") now dispatches to your specific stack of how formats should be implemented and converted?

load("foo.csv") returns a CSVFile type. You can do whatever you want with that, if you have a function foo(a::CSVFile) that doesn't want to use the iterable tables approach, you can do that, no problem. You get the filename and the csv parsing options in that type, and then you can do whatever you want. In that case, not a single line of iterable tables code will ever run.

> > cool syntax
>
> You're going to find a lot of disagreement on that one.

You don't have to use it; you can also just do DataFrame(load("file.csv")), plot(load("file.csv")), save("file.feather", load("file.csv")), or any other combination. I do believe that having a piping-like, dplyr-like thing for Julia is something that a critical mass of people would like to see, though. That is where all of this is going.
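To illustrate why the piped and the nested spellings are interchangeable: in Julia, `x |> f` is just `f(x)`. The sketch below uses hypothetical stand-ins (`Handle`, `fake_load`, `fake_save`), not the FileIO functions, and it assumes that a one-argument save returns a closure, which is one way a call like save("output.feather") could sit at the end of a pipe.

```julia
# Hypothetical stand-ins for illustration; these are not FileIO's load/save.
struct Handle
    name::String
end

fake_load(name::String) = Handle(name)

# Two-argument form: performs the "save" (here it only returns a description).
fake_save(target::String, h::Handle) = "$(h.name) -> $target"

# One-argument form (an assumption): returns a closure that waits for the data,
# so it can be used as the right-hand side of |>.
fake_save(target::String) = h -> fake_save(target, h)

piped  = fake_load("file.csv") |> fake_save("file.feather")
direct = fake_save("file.feather", fake_load("file.csv"))
same   = piped == direct
```

Both spellings go through the same two-argument method, so choosing one over the other is purely a matter of style.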

@davidanthoff
Contributor Author

I read through the other issues here now, and I think this debate is misguided. One can register multiple packages for a single format (see here). Merging this PR here does not prevent any other package from also registering for the CSV file format with FileIO; in fact, there seems to be specific support for that in the package.

It seems an unresolved issue is how one can deal with a situation where two installed packages can handle the same file format (#46). In my mind the solution to that should be one of the suggestions made in #46. The solution should not be to put on hold all file format registrations for which there might be multiple packages out there that can handle the format. Certainly it seems that in the past packages were registered for formats as they came up, so I don't see why this PR here should be treated differently.

@SimonDanisch Is that a fair characterization of how things were handled here in the past?

@SimonDanisch
Member

Yes, this is exactly the reason why FileIO exists: to not care about how heavy weight an IO package is! If anyone else ports their CSV reader to support FileIO, they should just add it to the registry as well.
We can then discuss which one is the best and which one should take priority (FileIO supports preferring one IO package as the standard loader).

@tkelman
Contributor

tkelman commented Jun 10, 2017

How are priorities handled?

@SimonDanisch
Member

The order of the IO library list!
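A rough sketch of what "the order of the IO library list" could mean in practice, assuming the first registered package that is available wins. `REGISTRY`, `register!` and `pick_loader` are hypothetical names for illustration; FileIO's actual lookup may work differently.

```julia
# Hypothetical registry: format name => loader packages in priority order.
const REGISTRY = Dict{Symbol,Vector{Symbol}}()

# Append a loader package to the list for a format; earlier entries rank higher.
register!(fmt::Symbol, pkg::Symbol) = push!(get!(REGISTRY, fmt, Symbol[]), pkg)

# Return the first registered loader that is reported as installed,
# or nothing when no registered loader is available.
function pick_loader(fmt::Symbol, installed)
    for pkg in get(REGISTRY, fmt, Symbol[])
        pkg in installed && return pkg
    end
    return nothing
end

# Usage: :CSVFiles is registered first, :OtherCSV is a made-up second package.
register!(:CSV, :CSVFiles)
register!(:CSV, :OtherCSV)
both_installed = pick_loader(:CSV, [:OtherCSV, :CSVFiles])
only_second    = pick_loader(:CSV, [:OtherCSV])
none_installed = pick_loader(:CSV, Symbol[])
```

Under this assumption, changing which package is preferred only requires reordering the list for that format.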

@SimonDanisch SimonDanisch merged commit 1578195 into JuliaIO:master Jun 10, 2017
add_format(format"Feather", (), [".feather"], [:FeatherFiles])
add_format(format"Excel", (), [".xls", ".xlsx"], [:ExcelFiles, LOAD])
add_format(format"Stata", (), [".dta"], [:StatFiles, LOAD])
add_format(format"SPSS", (), [".sav", ".por"], [:StatFiles, LOAD])
Contributor

is there a magic number on this? .sav is very general

Contributor Author

I'll try to find one.
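For reference, a hedged sketch of what a magic-number check for .sav could look like. It assumes that SPSS system files start with the four ASCII bytes "$FL2" (this is how the PSPP documentation describes the file header; compressed variants may use a different tag), and `looks_like_sav` is a hypothetical helper, not something registered in this PR.

```julia
# Assumption: an SPSS system file begins with the ASCII bytes '$', 'F', 'L', '2'.
const SAV_MAGIC = UInt8[0x24, 0x46, 0x4c, 0x32]

# Hypothetical detection function: true when the stream starts with the magic bytes.
function looks_like_sav(io::IO)
    header = read(io, length(SAV_MAGIC))   # returns at most 4 bytes
    return header == SAV_MAGIC
end

# Usage on in-memory data
spss_like = looks_like_sav(IOBuffer(UInt8[0x24, 0x46, 0x4c, 0x32, 0x40]))
csv_like  = looks_like_sav(IOBuffer("a,b\n1,2\n"))
```

A check like this would let a generic extension such as .sav be claimed only when the content matches, instead of relying on the file name alone.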

@davidanthoff davidanthoff deleted the tabular-formats branch June 10, 2017 15:19
@davidanthoff
Contributor Author

Thanks, @SimonDanisch! Could you also tag a new release in METADATA?

@SimonDanisch
Member

After fixing #129 ?
Is there a test suite that works with all this new functionality and did you run it?
Just trying to keep the tagging churn low ;)

@davidanthoff
Contributor Author

> Is there a test suite that works with all this new functionality and did you run it?

Yes, all four packages have unit tests that exercise the FileIO stuff added here and it all works.

But, on merging, I might take a stab at #122 soonish, so maybe we wait with a tag here until I either have done that or given up?

@SimonDanisch
Member

let me know when you're ready for a tag!

3 participants