Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV with long rows matched as plain txt #138

Open
tjerman opened this issue Jan 15, 2021 · 4 comments
Open

CSV with long rows matched as plain txt #138

tjerman opened this issue Jan 15, 2021 · 4 comments

Comments

@tjerman
Copy link

tjerman commented Jan 15, 2021

Here is a gist with two versions; the first .csv is not detected correctly, while the second one is

https://gist.github.com/tjerman/adbb8f975c9a64c279834544d61a19a0

Expected mime type
text/csv

Returned mime type
text/plain; charset=utf-8

Version of the library you are using
v1.1.2

Output of go version
go version go1.14.6 linux/amd64

Additional context
I did some digging in the source, and this is what I came up with.
In short, the detection fails when the first two lines match/exceed the matchers.ReadLimit. The first .csv in my gist has 3072 characters, while the second one has 3071.

Keeping that in mind, the following should make sense. The call goes through here where the input is limited to x bytes; then it eventually gets to here where because of the input length the condition passes and the second line is removed, keeping the first line only.

Finally, this bit checks for the number of lines and it fails as there was only one line left.

As for the solution, I'm not entirely sure...

  1. Should we just remove this if block as the input is already limited by the caller.
  2. Should the "number of lines" check here be changed to >= 1 -- the csv can have a single line.

I can find some time if you need help with this, but I'd like an opinion or two on how you wish to go about fixing this one.

tjerman added a commit to cortezaproject/corteza that referenced this issue Jan 15, 2021
Some CSV files failed to detect as text/csv so the import failed.
ref: gabriel-vasile/mimetype#138
tjerman added a commit to cortezaproject/corteza that referenced this issue Jan 15, 2021
Some CSV files failed to detect as text/csv so the import failed.
ref: gabriel-vasile/mimetype#138
tjerman added a commit to cortezaproject/corteza that referenced this issue Jan 15, 2021
Some CSV files failed to detect as text/csv so the import failed.
ref: gabriel-vasile/mimetype#138
tjerman added a commit to cortezaproject/corteza that referenced this issue Jan 15, 2021
Some CSV files failed to detect as text/csv so the import failed.
ref: gabriel-vasile/mimetype#138
@gabriel-vasile
Copy link
Owner

gabriel-vasile commented Jan 18, 2021

Hi, this is tricky.

For 1: just cutting the input at 3072 will sometimes result in an invalid csv because the last line can be truncated.
For example, a csv with 3 columns is truncated at 3072 resulting in the last line having 2 elements instead of 3.
An error will occur here.

For 2: a single line means a text file containing

well, hello there

will be considered csv.

#120 has a similar issue with truncated inputs. Is SetReadLimit a good solution for your use case? If so I will just add it, else just tell if you see any other approach to the problem.

@tjerman
Copy link
Author

tjerman commented Jan 19, 2021

I've decided to do a little rework on how I handle files.

Long story short, we support data import from different file types where I relied primarily on this library to determine what decoder to use. Since I have some additional metadata when preparing the import (headers when the API is used; file info when the CLI is used), I rely primarily on that and use this package as a fallback.
This way, this bug is no longer game-breaking.

Regarding the suggested SetReadLimit; from the linked issue and some other issue that I've stumbled upon (could be the same issue, I can't recall); I don't think it's such a good idea. Sure increasing the cut-off point will make it work for more cases, but it can still fail if it's a fixed number. You could adjust the number based on the file size, but that could impact performance when working with large files. I did some thinking and came up with a cute little idea:

Instead of giving the underlying type detector the first chunk of n bytes to work with, we could iterate through the source's chunks until the type is detected. For the vast majority, the first chunk would result in a correct mime type, and for others (such as Office documents), the m-th chunk would result in a correct mime type.

Another interesting idea is instead of returning a single mime type; multiple mime types are returned, ordered by how specific the type is.
This would help with above CSV ambiguity; instead of being a text/csv or a text/plain, it would be a [text/csv, text/plain] the user decides what type they want to go with. This would probably be useful for other cases, but currently, no real cases come to mind.

What do you think?

darh pushed a commit to cortezaproject/corteza that referenced this issue Jan 22, 2021
Some CSV files failed to detect as text/csv so the import failed.
ref: gabriel-vasile/mimetype#138
darh pushed a commit to cortezaproject/corteza that referenced this issue Jan 22, 2021
Some CSV files failed to detect as text/csv so the import failed.
ref: gabriel-vasile/mimetype#138
darh pushed a commit to cortezaproject/corteza that referenced this issue Jan 25, 2021
Some CSV files failed to detect as text/csv so the import failed.
ref: gabriel-vasile/mimetype#138
darh pushed a commit to cortezaproject/corteza that referenced this issue Jan 25, 2021
Some CSV files failed to detect as text/csv so the import failed.
ref: gabriel-vasile/mimetype#138
darh pushed a commit to cortezaproject/corteza that referenced this issue Jan 25, 2021
Some CSV files failed to detect as text/csv so the import failed.
ref: gabriel-vasile/mimetype#138
@gabriel-vasile
Copy link
Owner

@zinedine-be try increasing the number of bytes used for detection with:

mimetype.SetLimit(10 000) 
// or 
mimetype.SetLimit(0) // disable it completely

// and then call detection
mimetype.DetectFile("file.csv")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants
@gabriel-vasile @tjerman and others