CSV with long rows matched as plain txt #138

tjerman · 2021-01-15T10:21:45Z

Here is a gist with two versions; the first .csv is not detected correctly, while the second one is

https://gist.github.com/tjerman/adbb8f975c9a64c279834544d61a19a0

Expected mime type
text/csv

Returned mime type
text/plain; charset=utf-8

Version of the library you are using
v1.1.2

Output of go version
go version go1.14.6 linux/amd64

Additional context
I did some digging in the source, and this is what I came up with.
In short, the detection fails when the first two lines match/exceed the matchers.ReadLimit. The first .csv in my gist has 3072 characters, while the second one has 3071.

Keeping that in mind, the following should make sense. The call goes through here where the input is limited to x bytes; then it eventually gets to here where because of the input length the condition passes and the second line is removed, keeping the first line only.

Finally, this bit checks for the number of lines and it fails as there was only one line left.

As for the solution, I'm not entirely sure...

Should we just remove this if block as the input is already limited by the caller.
Should the "number of lines" check here be changed to >= 1 -- the csv can have a single line.

I can find some time if you need help with this, but I'd like an opinion or two on how you wish to go about fixing this one.

The text was updated successfully, but these errors were encountered:

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

gabriel-vasile · 2021-01-18T13:54:30Z

Hi, this is tricky.

For 1: just cutting the input at 3072 will sometimes result in an invalid csv because the last line can be truncated.
For example, a csv with 3 columns is truncated at 3072 resulting in the last line having 2 elements instead of 3.
An error will occur here.

For 2: a single line means a text file containing

well, hello there

will be considered csv.

#120 has a similar issue with truncated inputs. Is SetReadLimit a good solution for your use case? If so I will just add it, else just tell if you see any other approach to the problem.

tjerman · 2021-01-19T10:23:31Z

I've decided to do a little rework on how I handle files.

Long story short, we support data import from different file types where I relied primarily on this library to determine what decoder to use. Since I have some additional metadata when preparing the import (headers when the API is used; file info when the CLI is used), I rely primarily on that and use this package as a fallback.
This way, this bug is no longer game-breaking.

Regarding the suggested SetReadLimit; from the linked issue and some other issue that I've stumbled upon (could be the same issue, I can't recall); I don't think it's such a good idea. Sure increasing the cut-off point will make it work for more cases, but it can still fail if it's a fixed number. You could adjust the number based on the file size, but that could impact performance when working with large files. I did some thinking and came up with a cute little idea:

Instead of giving the underlying type detector the first chunk of n bytes to work with, we could iterate through the source's chunks until the type is detected. For the vast majority, the first chunk would result in a correct mime type, and for others (such as Office documents), the m-th chunk would result in a correct mime type.

Another interesting idea is instead of returning a single mime type; multiple mime types are returned, ordered by how specific the type is.
This would help with above CSV ambiguity; instead of being a text/csv or a text/plain, it would be a [text/csv, text/plain] the user decides what type they want to go with. This would probably be useful for other cases, but currently, no real cases come to mind.

What do you think?

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

gabriel-vasile · 2022-09-05T14:13:08Z

@zinedine-be try increasing the number of bytes used for detection with:

mimetype.SetLimit(10 000) 
// or 
mimetype.SetLimit(0) // disable it completely

// and then call detection
mimetype.DetectFile("file.csv")

tjerman added a commit to cortezaproject/corteza that referenced this issue Jan 15, 2021

Fix record import via some CSV files

3e80a0c

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

tjerman added a commit to cortezaproject/corteza that referenced this issue Jan 15, 2021

Fix record import via some CSV files

5601e88

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

tjerman added a commit to cortezaproject/corteza that referenced this issue Jan 15, 2021

Fix record import via some CSV files

ce21db2

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

tjerman added a commit to cortezaproject/corteza that referenced this issue Jan 15, 2021

Fix record import via some CSV files

dbe995e

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

darh pushed a commit to cortezaproject/corteza that referenced this issue Jan 22, 2021

Fix record import via some CSV files

df74027

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

darh pushed a commit to cortezaproject/corteza that referenced this issue Jan 22, 2021

Fix record import via some CSV files

0870337

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

darh pushed a commit to cortezaproject/corteza that referenced this issue Jan 25, 2021

Fix record import via some CSV files

e32db50

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

darh pushed a commit to cortezaproject/corteza that referenced this issue Jan 25, 2021

Fix record import via some CSV files

e33bb21

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

darh pushed a commit to cortezaproject/corteza that referenced this issue Jan 25, 2021

Fix record import via some CSV files

738cb85

Some CSV files failed to detect as text/csv so the import failed. ref: gabriel-vasile/mimetype#138

sergolius mentioned this issue Oct 31, 2021

incorrect tag sem version #196

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSV with long rows matched as plain txt #138

CSV with long rows matched as plain txt #138

tjerman commented Jan 15, 2021

gabriel-vasile commented Jan 18, 2021 •

edited

Loading

tjerman commented Jan 19, 2021

gabriel-vasile commented Sep 5, 2022

CSV with long rows matched as plain txt #138

CSV with long rows matched as plain txt #138

Comments

tjerman commented Jan 15, 2021

gabriel-vasile commented Jan 18, 2021 • edited Loading

tjerman commented Jan 19, 2021

gabriel-vasile commented Sep 5, 2022

gabriel-vasile commented Jan 18, 2021 •

edited

Loading