Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto Inferencing date(time) columns #697

Closed
aborruso opened this issue Dec 27, 2022 · 7 comments
Closed

Auto Inferencing date(time) columns #697

aborruso opened this issue Dec 27, 2022 · 7 comments

Comments

@aborruso
Copy link
Contributor

Hi,
if I import in duckdb this input CSV file

field1,date1
4500233821,2018-11-29 12:15:06.353
4500245641,2018-11-29 12:15:06.353
4500242807,2018-11-29 12:55:03.097
4500233852,2018-11-29 12:55:03.097
4500234511,2018-11-29 13:09:25.697
4500242786,2018-11-29 13:17:48.977
4500242786,2018-11-29 13:17:48.977
4500253356,2018-11-29 13:28:45.723

I get this schema, by automatic inferencing:

CREATE TABLE test(field1 BIGINT, date1 TIMESTAMP);

And if I apply datefmt to this input CSV I have a right datetime field

field1,date1
4500233821,2018-11-29T12:15:06.353+00:00
4500245641,2018-11-29T12:15:06.353+00:00
4500242807,2018-11-29T12:55:03.097+00:00
4500233852,2018-11-29T12:55:03.097+00:00
4500234511,2018-11-29T13:09:25.697+00:00
4500242786,2018-11-29T13:17:48.977+00:00
4500242786,2018-11-29T13:17:48.977+00:00
4500253356,2018-11-29T13:28:45.723+00:00

But if I use schema or to (to create a parquet file) to the original file, the date1 field is mapped as string field.

It would be great to have auto inferencing of datetime fields, also when they are not written perfectly (in my input I do not have the T and I have a space).

Thank you

@jqnatividad jqnatividad added enhancement New feature or request. Once marked with this label, its in the backlog. and removed enhancement New feature or request. Once marked with this label, its in the backlog. labels Dec 27, 2022
@jqnatividad
Copy link
Collaborator

Hi @aborruso , for the schema command, you need to specify the --strict-dates option.

We went for a "relaxed" default setting when inferring date/date-time when generating JSONschema - see this discussion for more context.

Regardless, I expanded schema's usage text to make it clearer - 3d22829.

As for to date inferencing, I'll defer to @kindly...

@kindly
Copy link
Contributor

kindly commented Dec 29, 2022

For the parquet conversation we are using the arrow csv library and its date handling is not great. It only allows one date format per file and expects all dates in that file to be that format. The to commands does do flexible date detection but I did not want to risk any errors so defaulted to string for parquet, till I found a better solution. However I could just default to string for cases that deviate from the most common format in the file but was hoping for propper solution.

@jqnatividad
Copy link
Collaborator

jqnatividad commented Dec 30, 2022

Perhaps, we can just cite the arrow csv date format limitation and suggest the user normalize the dates first with the apply datefmt command - which can pretty much convert all kinds of date formats to rfc3339 format.

In addition, if you're planning to do this in a pipeline, create a JSONschema with --strict-dates, and validate them first, and possibly do an apply datefmt if required, before converting the CSV with to.

That is, until the fixdata command is released :)

@kindly
Copy link
Contributor

kindly commented Jan 21, 2023

@aborruso @jqnatividad I finally worked out exactly what date formats the arrow library will accept by default and for those cases the date type for parquet should work. They are

"rfc3339"
"%Y-%m-%d %H:%M:%S%.f%:z"
"%Y-%m-%dT%H:%M:%S%.f"
"%Y-%m-%dT%H:%M:%S"
"%Y-%m-%d %H:%M:%S%.f"
"%Y-%m-%d %H:%M:%S"

This will be fixed when #737 is merged.

@aborruso
Copy link
Contributor Author

thank you very much @kindly

@jqnatividad
Copy link
Collaborator

Closing this now that #736 has been merged.

Did some quick tests and am currently creating tests.

@kindly
Copy link
Contributor

kindly commented Jan 21, 2023

@jqnatividad this test file may be useful for dateformats https://github.com/kindly/csvs_convert/blob/main/fixtures/parquet_date.csv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants