WIP: Column-based faster reader of unnested columns #79

Draft: wants to merge 75 commits into base: master
Changes from 1 commit
Commits
ba51050
before merge with remote master
xiaodaigh May 13, 2020
baf5f51
finished merging with remote master
xiaodaigh May 13, 2020
39323df
adding tests
xiaodaigh May 13, 2020
a31cbea
Merge branch 'xiaodaigh/missing_value_fix' into xiaodaigh/parquet-writer
xiaodaigh May 13, 2020
e41a113
added tests for wrtier
xiaodaigh May 13, 2020
be7fb94
added readme for test write and used tempname()
xiaodaigh May 13, 2020
866323e
fixed project.toml adding random
xiaodaigh May 13, 2020
f1e70c8
added version to writer
xiaodaigh May 13, 2020
36ec5a9
Merge remote-tracking branch 'upstream/master' into xiaodaigh/parquet…
xiaodaigh May 13, 2020
40cbfef
added missing for Julia 1.0.5
xiaodaigh May 13, 2020
be5e64c
removed progress meter
xiaodaigh May 14, 2020
c97a067
typo
xiaodaigh May 14, 2020
c97b56b
fixed julia fail bug
xiaodaigh May 14, 2020
34ed20a
Update src/writer.jl
xiaodaigh May 16, 2020
781ff7d
minor refactor
xiaodaigh May 16, 2020
c7b829d
Merge branch 'xiaodaigh/parquet-writer' of https://github.com/xiaodai…
xiaodaigh May 16, 2020
620b0f9
created a write encoded data and write definition functions
xiaodaigh May 16, 2020
53382e2
minor bug fix
xiaodaigh May 16, 2020
1bc2add
fixed Julia 1.0.5 issue
xiaodaigh May 16, 2020
5ebe142
minor bug fix
xiaodaigh May 16, 2020
1f02847
removed minor
xiaodaigh May 16, 2020
7652a87
most general form of write_encoded_data
xiaodaigh May 16, 2020
a4e3ffe
refactored into internal methods
xiaodaigh May 16, 2020
06fb699
minor for clarity
xiaodaigh May 16, 2020
c0bd4d0
minor update
xiaodaigh May 16, 2020
2cdcbc9
Merge remote-tracking branch 'upstream/master' into xiaodaigh/parquet…
xiaodaigh May 18, 2020
774bb4c
fixed all comments
xiaodaigh May 18, 2020
36fdd32
Update writer.jl
xiaodaigh May 18, 2020
ba78cb8
made version number of package a constant
xiaodaigh May 18, 2020
656d503
fixed bug of not writing DataFrame properly
xiaodaigh May 18, 2020
57515d6
Merge remote-tracking branch 'upstream/master' into xiaodaigh/parquet…
xiaodaigh May 18, 2020
7eda104
updated parquet
xiaodaigh May 18, 2020
04cca78
removed protobuf
xiaodaigh May 18, 2020
2a9ff1d
upped version to 0.5.1
xiaodaigh May 18, 2020
a6f2a8a
performace improvements, few fixes
tanmaykm May 18, 2020
0a822ae
more performance fixes
tanmaykm May 19, 2020
b16e9ce
Merge remote-tracking branch 'upstream/tan/misc' into xiaodaigh/parqu…
xiaodaigh May 19, 2020
d4f8a94
minor
xiaodaigh May 19, 2020
8617390
Merge pull request #1 from xiaodaigh/faster-column-reader
xiaodaigh May 19, 2020
8432d5c
Update README.md
xiaodaigh May 19, 2020
ef79cd9
Merge remote-tracking branch 'upstream/master'
xiaodaigh May 19, 2020
2d4ed73
sync with master
xiaodaigh May 21, 2020
fb2b3c2
tries to accomodate master
xiaodaigh May 21, 2020
2be1b18
merged with master
xiaodaigh May 21, 2020
6b7bd64
Update test/test_writer.jl
xiaodaigh May 21, 2020
ec800d2
Merge branch 'master' into xiaodaigh/parquet-writer
xiaodaigh May 22, 2020
275e7f2
added little endian writes
xiaodaigh May 22, 2020
c855cb1
Merge remote-tracking branch 'upstream/master' into xiaodaigh/parquet…
xiaodaigh May 22, 2020
bd23d7d
Merge remote-tracking branch 'upstream/master'
xiaodaigh May 22, 2020
dda544c
minor
xiaodaigh May 22, 2020
bcd9a5c
merged with master
xiaodaigh May 22, 2020
1930cc7
fixed test
xiaodaigh May 22, 2020
58e7920
Update src/Parquet.jl
xiaodaigh May 22, 2020
21d645f
Update test/test_writer.jl
xiaodaigh May 22, 2020
54c5f0c
minor fix
xiaodaigh May 22, 2020
05d0ae9
Merge branch 'xiaodaigh/parquet-writer' of https://github.com/xiaodai…
xiaodaigh May 22, 2020
f38e55f
Merge remote-tracking branch 'upstream/master' into xiaodaigh/parquet…
xiaodaigh May 23, 2020
6a94305
minor:
xiaodaigh May 23, 2020
b8c3700
merged master
xiaodaigh May 23, 2020
7046f92
so i dont lose it
xiaodaigh May 23, 2020
6f445a8
Merge pull request #66 from xiaodaigh/xiaodaigh/parquet-writer
tanmaykm May 23, 2020
3c829b7
Merge remote-tracking branch 'upstream/master'
xiaodaigh May 23, 2020
2331e99
got a copy based reader working
xiaodaigh May 24, 2020
8716041
minor
xiaodaigh May 24, 2020
0e550a7
merged with upstream
xiaodaigh May 25, 2020
327c66e
fixed most of the non dictionary value reads
xiaodaigh May 26, 2020
02836c2
more updates
xiaodaigh May 27, 2020
070988e
fixed all bugs
xiaodaigh May 27, 2020
d7a3928
Merge remote-tracking branch 'upstream/master'
xiaodaigh May 27, 2020
08a961b
fixed memory bug
xiaodaigh May 27, 2020
dc619e3
fixed bug with parquet reader
xiaodaigh May 27, 2020
9f50dad
minor bug fix
xiaodaigh May 27, 2020
f9d7822
Merge remote-tracking branch 'upstream/master'
xiaodaigh May 27, 2020
0c81da9
before operating on misssing bytes
xiaodaigh May 29, 2020
f6d2309
before major operation on cutting down on memory usage for missing
xiaodaigh May 30, 2020
most general form of write_encoded_data
xiaodaigh committed May 16, 2020
commit 7652a87ddb61754887f77c9e6d4cabc50b0ddca1
13 changes: 13 additions & 0 deletions src/writer.jl
@@ -168,6 +168,7 @@ write_encoded_data(data_to_compress_io, colvals::AbstractVector{Union{Missing, T
write_encoded_data(data_to_compress_io, skipmissing(colvals))

function write_encoded_data(data_to_compress_io, colvals::Union{AbstractVector{String}, SkipMissing{S}}) where S <: AbstractVector{Union{Missing, String}}
    """ Write encoded data for String type """
    # write the values
    for val in colvals
        # for string it needs to be stored as BYTE_ARRAY which needs the length
@@ -179,6 +180,7 @@ function write_encoded_data(data_to_compress_io, colvals::Union{AbstractVector{S
end
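
For reference, a minimal sketch of the length-prefixed layout that the BYTE_ARRAY comment in the String method refers to. It assumes Parquet's PLAIN encoding for BYTE_ARRAY, which is a 4-byte little-endian length followed by the raw bytes. `write_byte_array` is a hypothetical helper name used only for illustration; it is not a function from this PR, and the elided body of the real method may differ.

```julia
# Hypothetical helper (not part of this PR): write one String as a
# PLAIN-encoded BYTE_ARRAY, i.e. a 4-byte little-endian length + raw bytes.
function write_byte_array(io::IO, val::String)
    # sizeof gives the number of bytes (not characters) in the UTF-8 string;
    # htol converts the length to little-endian on any host
    write(io, htol(UInt32(sizeof(val))))
    # write(io, ::String) emits the string's raw UTF-8 bytes
    write(io, val)
end

buf = IOBuffer()
write_byte_array(buf, "hé")   # 'é' takes two bytes in UTF-8
bytes = take!(buf)
```

The length prefix counts bytes, so a string with multi-byte characters gets a prefix larger than its character count.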

function write_encoded_data(data_to_compress_io, colvals::Union{AbstractVector{Bool}, SkipMissing{S}}) where S <: AbstractVector{Union{Missing, Bool}}
    """ Write encoded data for Bool type """
    # write the bitpacked bits
    # writing a BitArray seems to write 8 bytes at a time
    # so write to a tmpio first
@@ -192,16 +194,27 @@ function write_encoded_data(data_to_compress_io, colvals::Union{AbstractVector{B
end
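
The body of the Bool method is elided by the hunk, so the following is only a sketch of one way to produce bit-packed output without going through a temporary IO. It assumes Parquet's PLAIN encoding for booleans, which packs one bit per value starting from the least significant bit of each byte. `pack_bools` is a hypothetical name and this is not the approach taken in the PR.

```julia
# Hypothetical helper (not part of this PR): pack Bools into bytes,
# one bit per value, least significant bit first.
function pack_bools(vals::AbstractVector{Bool})
    # cld rounds up, so a partial final byte is still allocated
    out = zeros(UInt8, cld(length(vals), 8))
    for (i, v) in enumerate(vals)
        if v
            byte_index = (i - 1) ÷ 8 + 1   # which output byte
            bit_index = (i - 1) % 8        # which bit inside that byte
            out[byte_index] |= UInt8(1) << bit_index
        end
    end
    return out
end

packed_small = pack_bools([true, false, true])   # bits 0 and 2 set
packed_nine = pack_bools(fill(true, 9))          # spills into a second byte
```

Packing by hand writes exactly `cld(n, 8)` bytes, which avoids the 8-byte granularity mentioned in the comment about writing a bit array directly.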

function write_encoded_data(data_to_compress_io, colvals::AbstractArray)
    """ Efficient write of encoded data for `isbits` types"""
    @assert isbitstype(eltype(colvals))
    write(data_to_compress_io, colvals)
end
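
A small sketch of why the `isbitstype` assertion matters for the array method: `write(io, array)` copies the raw memory of an array of `isbits` elements in one call, in the machine's native byte order. The variable names are illustrative only.

```julia
# write(io, array) on an array of isbits elements copies the raw memory,
# so each element lands in the machine's native byte order.
vals = Int32[1, 2, 3]
@assert isbitstype(eltype(vals))

buf = IOBuffer()
nbytes = write(buf, vals)     # number of bytes written
raw = take!(buf)

# Reinterpreting the bytes recovers the original values on the same machine.
roundtrip = collect(reinterpret(Int32, raw))

# Element types that are not isbits would be rejected by the assertion.
string_is_bits = isbitstype(String)
union_is_bits = isbitstype(Union{Missing, Int32})
```

Native order matches Parquet's little-endian layout on common hardware such as x86; the later commit 275e7f2 ("added little endian writes") in the list above appears to address this explicitly.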

function write_encoded_data(data_to_compress_io, colvals::SkipMissing)
    """ Write of encoded data for skipped missing types"""
    for val in colvals
        write(data_to_compress_io, val)
    end
end

function write_encoded_data(data_to_compress_io, colvals)
    """ Write of encoded data for the most general type.
    The only requirement is that colvals has to be iterable
    """
    for val in skipmissing(colvals)
        write(data_to_compress_io, val)
    end
end
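
To illustrate how the three methods added in this commit are selected, here is a toy sketch of the same dispatch pattern. `demo_write` and its return symbols are hypothetical stand-ins for illustration, not code from this PR; the sketch only mirrors the argument types of the three signatures.

```julia
# Toy stand-ins for the three methods in this hunk (names are hypothetical).
# Each writes to io and returns a Symbol naming the path that was taken.

# Arrays of isbits elements: one bulk write.
function demo_write(io, vals::AbstractArray)
    write(io, vals)
    return :isbits_array
end

# Already wrapped in skipmissing: write element by element.
function demo_write(io, vals::Base.SkipMissing)
    for v in vals
        write(io, v)
    end
    return :skipmissing
end

# Anything else that is iterable: drop missings, then write each element.
function demo_write(io, vals)
    for v in skipmissing(vals)
        write(io, v)
    end
    return :generic
end

path_array = demo_write(IOBuffer(), Int64[1, 2])
path_skip = demo_write(IOBuffer(), skipmissing([1, missing, 3]))

buf = IOBuffer()
path_generic = demo_write(buf, (Int64(1), missing, Int64(3)))   # a Tuple
generic_bytes = length(take!(buf))
```

Julia picks the most specific applicable method, so the untyped fallback is only reached for iterables that are neither arrays nor `SkipMissing` wrappers, such as tuples or generators.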

# TODO set the encoding code into a dictionary
function write_col_chunk(fileio, colvals::AbstractArray, codec, ::Val{PAR2.Encoding.PLAIN})
    """