-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSV with long rows matched as plain txt #138
Comments
Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138
Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138
Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138
Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138
Hi, this is tricky. For 1: just cutting the input at 3072 will sometimes result in an invalid csv because the last line can be truncated. For 2: a single line means a text file containing
will be considered csv. #120 has a similar issue with truncated inputs. Is |
I've decided to do a little rework on how I handle files. Long story short, we support data import from different file types where I relied primarily on this library to determine what decoder to use. Since I have some additional metadata when preparing the import (headers when the API is used; file info when the CLI is used), I rely primarily on that and use this package as a fallback. Regarding the suggested Instead of giving the underlying type detector the first chunk of n bytes to work with, we could iterate through the source's chunks until the type is detected. For the vast majority, the first chunk would result in a correct mime type, and for others (such as Office documents), the m-th chunk would result in a correct mime type. Another interesting idea is instead of returning a single mime type; multiple mime types are returned, ordered by how specific the type is. What do you think? |
Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138
Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138
Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138
Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138
Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138
@zinedine-be try increasing the number of bytes used for detection with: mimetype.SetLimit(10 000)
// or
mimetype.SetLimit(0) // disable it completely
// and then call detection
mimetype.DetectFile("file.csv") |
Here is a gist with two versions; the first .csv is not detected correctly, while the second one is
https://gist.github.com/tjerman/adbb8f975c9a64c279834544d61a19a0
Expected mime type
text/csv
Returned mime type
text/plain; charset=utf-8
Version of the library you are using
v1.1.2
Output of
go version
go version go1.14.6 linux/amd64
Additional context
I did some digging in the source, and this is what I came up with.
In short, the detection fails when the first two lines match/exceed the
matchers.ReadLimit
. The first .csv in my gist has3072
characters, while the second one has3071
.Keeping that in mind, the following should make sense. The call goes through here where the input is limited to x bytes; then it eventually gets to here where because of the input length the condition passes and the second line is removed, keeping the first line only.
Finally, this bit checks for the number of lines and it fails as there was only one line left.
As for the solution, I'm not entirely sure...
>= 1
-- the csv can have a single line.I can find some time if you need help with this, but I'd like an opinion or two on how you wish to go about fixing this one.
The text was updated successfully, but these errors were encountered: