This repository has been archived by the owner on Aug 13, 2020. It is now read-only.

dramatic loss of timestamp accuracy! #11

Open
randomgambit opened this issue Mar 4, 2020 · 2 comments

Comments

@randomgambit

Hi @hannesmuehleisen, I think I have found a fairly severe bug.
Consider this example in Python:

import pyarrow as pa
import pandas as pd
import numpy as np

mydf = pd.DataFrame({'mytime' : [pd.to_datetime('2020-01-01 10:10:10.123456'),
                                pd.to_datetime('2020-01-01 10:10:10.234567')],
                     'value' : [1,2]})

mydf.head()
Out[137]: 
                      mytime  value
0 2020-01-01 10:10:10.123456      1
1 2020-01-01 10:10:10.234567      2

# now write the DataFrame to a Parquet file
mydf.to_parquet('testfile_spark.pq', engine='pyarrow', flavor='spark')

Reading the file back in Python works fine:

one = pd.read_parquet('testfile_spark.pq')

one.head()
Out[134]: 
                      mytime  value
0 2020-01-01 10:10:10.123456      1
1 2020-01-01 10:10:10.234567      2

Unfortunately, reading the same file in R with miniparquet floors the timestamps to whole seconds, so the microseconds are lost:

> mymini <- read_parquet('testfile_spark.pq')
> mymini
               mytime value
1 2020-01-01 10:10:10     1
2 2020-01-01 10:10:10     2
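
To make the symptom concrete, here is a minimal standard-library sketch of one way such flooring could arise. It assumes (my understanding of pyarrow, not something confirmed in this thread) that `flavor='spark'` stores timestamps as 12-byte INT96 values: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day, both little-endian. The helpers `encode_int96`, `decode_int96` and `to_datetime` are hypothetical names for illustration only; they are not part of miniparquet or pyarrow, and whether miniparquet actually drops the fraction at this step is a guess.

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
NANOS_PER_SECOND = 1_000_000_000
SECONDS_PER_DAY = 86_400
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_int96(ts):
    """Pack a UTC datetime into the assumed INT96 layout (hypothetical helper)."""
    delta = ts - UNIX_EPOCH
    julian_day = JULIAN_DAY_OF_UNIX_EPOCH + delta.days
    nanos_of_day = delta.seconds * NANOS_PER_SECOND + delta.microseconds * 1000
    # "<qi" = little-endian int64 followed by int32, 12 bytes with no padding
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96(raw):
    """Return nanoseconds since the Unix epoch for a 12-byte INT96 value."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days * SECONDS_PER_DAY * NANOS_PER_SECOND + nanos_of_day


def to_datetime(epoch_nanos, keep_fraction):
    """Convert epoch nanoseconds to a datetime, optionally dropping the fraction."""
    seconds, nanos = divmod(epoch_nanos, NANOS_PER_SECOND)
    # Integer division by 1e9 alone (keep_fraction=False) floors to whole seconds
    micros = nanos // 1000 if keep_fraction else 0
    return UNIX_EPOCH + timedelta(seconds=seconds, microseconds=micros)


original = datetime(2020, 1, 1, 10, 10, 10, 123456, tzinfo=timezone.utc)
raw = encode_int96(original)
epoch_nanos = decode_int96(raw)

print(to_datetime(epoch_nanos, keep_fraction=True).isoformat())
print(to_datetime(epoch_nanos, keep_fraction=False).isoformat())
```

With `keep_fraction=True` the microseconds survive the round trip; with `keep_fraction=False` the result matches the floored values in the R output above. A reader that converts to seconds as a floating-point number, rather than by integer division, would keep the sub-second part.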

What do you think?

Thanks!

@randomgambit
Author

@hannesmuehleisen are you there? Please let me know if you are no longer interested in maintaining this great package. Thanks!

@hannes
Owner

hannes commented Mar 17, 2020

Yes, I'm here, and we will circle back to miniparquet. I am currently working on support for nested tables in DuckDB, which will then also come to miniparquet. I'm also happy to review a PR.
