-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Quoted CRs in CRLF file incorrectly result in validation errors #198
Comments
By default, the type of line endings used are auto-detected: …but the CSV is parsed row-by-row: So in the case described above, auto-detect gets it wrong: irb(main):001:0> require 'csv'
=> true
irb(main):002:0> # load the data above
irb(main):003:0* contents = ["Foo,Bsr,Baz\r\n", "Biff,Baff,\"Boff\n", "\"\r\n", "Qux,Teaspoon,Doge\r\n"]
=> ["Foo,Bsr,Baz\r\n", "Biff,Baff,\"Boff\n", "\"\r\n", "Qux,Teaspoon,Doge\r\n"]
irb(main):004:0>
irb(main):005:0* # try to parse the second row, auto-detecting line endings
irb(main):006:0* CSV.parse_line(contents[1] + contents[2], {:row_sep => :auto})
CSV::MalformedCSVError: Unclosed quoted field on line 1.
from …/csv.rb:1872:in `block in shift'
from …/csv.rb:1779:in `loop'
from …/csv.rb:1779:in `shift'
from …/csv.rb:1310:in `parse_line'
from (irb):6
from …/irb:11:in `<main>' …But if you manually specify the line endings, everything works correctly: irb(main):007:0* # try to parse the second row, manually specifying '\r\n' line endings
irb(main):008:0* CSV.parse_line(contents[1] + contents[2], {:row_sep => '\r\n'})
=> ["Biff", "Baff", "Boff\n"] |
I guess the solution would be to set |
Ah yes, @JeniT and I were discussing this the other day, and we wondered if it was because of mixed line breaks as you've found. I think your solution will probably work; I need to check what the line-break behaviour inside quotes should be as well. |
I realised I should be explicit that parsing the whole thing with auto-detect line endings switched on works fine: irb(main):009:0* CSV.parse(contents.join, {:row_sep => :auto})
=> [["Foo", "Bsr", "Baz"], ["Biff", "Baff", "Boff\n"], ["Qux", "Teaspoon", "Doge"]] So this is only an issue when parsing row-by-row. |
The contributing guidelines say: "Before you start coding, please reach out to us either on our gitter channel or by tagging a repository administrator on the issue ticket you are interested in contributing towards to indicate your interest in helping." I am not finding a list of the repository admins, and apparently I need push access to a repo to get that info from GitHub API, so I'm making assumptions based on who has merged all recent PRs and tagging @Floppy here to indicate that I'd like to attempt a fix for this issue. Rationale/use case/background: We use csvlint in a batch data import application to validate user-generated CSVs before allowing import of data in the files. Using csvlint to prevent the import of CSVs with issues like ragged rows, invalid encoding, and stray/unclosed quotes is important to prevent clients from inadvertently loading malformed data. This issue is currently blocking some of our clients from importing valid CSVs with mixed EOL characters inside quoted values (an artifact of original data source, and what OS they used to prepare the data, apparently). If you export data from our application as CSV on a Windows machine for roundtripping via our import tool, you get mixed EOL characters and can't re-ingest the data because csvlint flags the CSV as invalid. Our users are primarily collection managers in GLAM institutions, not data wrangling/tech savvy people; they are going to tune out/give up if anyone starts explaining CR, LF, vs CRLF to them. Mixed EOL characters are handled fine by the standard Ruby CSV library, and are not prohibited by RFC 4180 according to my reading, so csvlint flagging these as invalid needs to be fixed. |
Hey, I solved this in two different ways, which can be examined here: https://github.com/kspurgin/csvlint.rb/pulls Everything in the I prefer PR2 as it feels more expressive of the actual problem, and less hacky than the other. From what I can tell, the root of the problem is: once we are at PR2 says "Ok, whether the correct EOL char (according to stated dialect) was used, or whether there were mixed EOL chars is all handled elsewhere. If If you let me know which of these is acceptable (if either is), I'll do a real PR with better commit messages. |
So e.g.
…results in:
According to the RFC 4180, though, I think this should be fine.
The text was updated successfully, but these errors were encountered: