Implement C Data integration #178

Open: wants to merge 1 commit into base: main

quinnj (Member) commented Apr 16, 2021

This starts work towards supporting the C data interface for the Arrow
format, as documented
[here](https://arrow.apache.org/docs/format/CDataInterface.html#).

Currently, this PR includes struct definitions and basic
methods for getting a pointer to an `ArrowSchema`/`ArrowArray`
C-compatible struct that can then be populated by another
implementation. For example, with this PR, you can do:

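```julia
using Arrow, PyCall
pd = pyimport("pandas")
pa = pyimport("pyarrow")
# Build a pandas DataFrame from a Python dict literal, then convert it to a pyarrow RecordBatch
df = pd.DataFrame(py"""{'a': [1, 2, 3, 4, 5], 'b': ['a', 'b', 'c', 'd', 'e']}"""o)
rb = pa.record_batch(df)
# pyarrow populates the C struct at the address passed to it as an integer
sch = Arrow.CData.getschema() do ptr
    rb.schema._export_to_c(Int(ptr))
end
arr = Arrow.CData.getarray() do ptr
    rb._export_to_c(Int(ptr))
end
```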
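For readers unfamiliar with the C data interface, here is a rough sketch of how the two C structs from the spec could be mirrored in Julia. The field order follows the `struct ArrowSchema` and `struct ArrowArray` definitions in the linked spec. The type names `CArrowSchemaSketch` and `CArrowArraySketch` are hypothetical and are not the definitions in this PR, which may differ.

```julia
# Hypothetical Julia mirror of `struct ArrowSchema` from the C data interface spec.
# Every field is a plain integer or pointer, so the layout matches the C struct.
struct CArrowSchemaSketch
    format::Cstring                          # type format string, e.g. "l" for int64
    name::Cstring                            # field name (may be NULL)
    metadata::Ptr{UInt8}                     # binary-encoded metadata, not a C string
    flags::Int64
    n_children::Int64
    children::Ptr{Ptr{CArrowSchemaSketch}}
    dictionary::Ptr{CArrowSchemaSketch}
    release::Ptr{Cvoid}                      # release callback set by the producer
    private_data::Ptr{Cvoid}
end

# Hypothetical Julia mirror of `struct ArrowArray` from the same spec.
struct CArrowArraySketch
    length::Int64
    null_count::Int64
    offset::Int64
    n_buffers::Int64
    n_children::Int64
    buffers::Ptr{Ptr{Cvoid}}                 # validity/offset/data buffers, per layout
    children::Ptr{Ptr{CArrowArraySketch}}
    dictionary::Ptr{CArrowArraySketch}
    release::Ptr{Cvoid}
    private_data::Ptr{Cvoid}
end
```

Because both types are `isbits`, a `Ref` to one of them can be handed to another implementation as a raw address, which is the general idea behind passing `Int(ptr)` to pyarrow above.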

Currently, these `ArrowSchema`/`ArrowArray` structs are pretty bare-bones,
but they at least lay some groundwork for integration. Things we
still need/want to make all this nicer to use/work with:

  * Type format string parsing/converting: we need to parse the type
  format strings as outlined
  [here](https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings)
  to figure out what type of data we'll get in the arrays. It'd
  probably be best to add a `type` field to the `ArrowSchema` struct that
  we'd populate when converting from `CArrowSchema` -> `ArrowSchema`.
  * Add a method like `Arrow.ArrowVector(::ArrowSchema, ::ArrowArray)`
  that produces a concrete `ArrowVector` subtype, like
  `Arrow.Primitive`, `Arrow.List`, etc. This will be a bit tricky,
  because we have to follow all the same columnar layout trickery that we
  currently handle for IPC in the table.jl `build` methods. Perhaps we
  can refactor all that so we can re-use some code? Otherwise, we might
  just need to reimplement a bunch of that logic specific to converting
  `ArrowArray`s.
  * That should give a robust consuming story; for producing, we
  probably need a definition like
  `Arrow.ArrowSchema(a::Arrow.ArrowVector)` that produces a valid
  `ArrowSchema`, and then overloads per `ArrowVector` subtype like
  `Arrow.ArrowArray(x::Arrow.Primitive)` that produce the right
  `ArrowArray` for a concrete arrow array.
  * Then the last piece we need is figuring out the right mechanics
  for providing a pointer to the `CArrowSchema`, `CArrowArray` structs
  once they're populated.
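
To make the first item more concrete, here is a minimal sketch of what format string parsing might look like. It covers only a handful of the format strings listed in the spec (primitives, strings/binary, lists, structs, and the two fixed-width forms). The names `parseformat`, `PRIMITIVE_FORMATS`, and `SIMPLE_FORMATS` are hypothetical and are not part of this PR or of Arrow.jl.

```julia
# Hypothetical lookup: single-character primitive format strings -> Julia types.
const PRIMITIVE_FORMATS = Dict{String,DataType}(
    "n" => Missing,                  # null type
    "b" => Bool,                     # note: Arrow booleans are bit-packed
    "c" => Int8,    "C" => UInt8,
    "s" => Int16,   "S" => UInt16,
    "i" => Int32,   "I" => UInt32,
    "l" => Int64,   "L" => UInt64,
    "e" => Float16, "f" => Float32, "g" => Float64,
)

# Hypothetical lookup: parameterless non-primitive format strings -> a kind tag.
const SIMPLE_FORMATS = Dict{String,Symbol}(
    "u" => :utf8,   "U" => :large_utf8,
    "z" => :binary, "Z" => :large_binary,
    "+l" => :list,  "+L" => :large_list,
    "+s" => :struct,
)

# Parse a format string into a (kind, param) named tuple.
# `param` is the Julia element type for primitives, the byte width or list
# size for the fixed-width forms, and `nothing` otherwise.
function parseformat(fmt::AbstractString)
    if haskey(PRIMITIVE_FORMATS, fmt)
        return (kind = :primitive, param = PRIMITIVE_FORMATS[fmt])
    elseif haskey(SIMPLE_FORMATS, fmt)
        return (kind = SIMPLE_FORMATS[fmt], param = nothing)
    elseif startswith(fmt, "+w:")
        # fixed-size list, e.g. "+w:3"
        return (kind = :fixed_size_list, param = parse(Int, fmt[4:end]))
    elseif startswith(fmt, "w:")
        # fixed-width binary, e.g. "w:16"
        return (kind = :fixed_size_binary, param = parse(Int, fmt[3:end]))
    else
        throw(ArgumentError("unsupported format string: $fmt"))
    end
end
```

The result of something like `parseformat` is the kind of value that could populate the proposed `type` field, and dispatching on `kind` would be one way to pick between `Arrow.Primitive`, `Arrow.List`, and so on. Decimal, temporal, union, and map formats are left out here and would need their own branches.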

If anyone would like to help out, I'm happy to provide as much guidance
as possible so others can get their feet wet in some arrow spec
nitty-gritty.

quinnj (Member, Author) commented Apr 16, 2021

cc: @sa-, @Moelf

codecov bot commented Apr 16, 2021

Codecov Report

Merging #178 (005c946) into main (bdd0e54) will decrease coverage by 2.19%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main     #178      +/-   ##
==========================================
- Coverage   81.34%   79.15%   -2.20%     
==========================================
  Files          25       26       +1     
  Lines        3034     3118      +84     
==========================================
  Hits         2468     2468              
- Misses        566      650      +84     

| Impacted Files | Coverage Δ |
|---|---|
| src/Arrow.jl | 54.54% <ø> (ø) |
| src/cinterface.jl | 0.00% <0.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

sa- mentioned this pull request Apr 17, 2021