Data encoding problem: text is stringified-bytes like "b'JEANNE D\xe2...'" #17

Open
Yaakov-Belch opened this issue May 10, 2024 · 0 comments

The scraping process for the credit card agreements apparently produced binary data that ended up stored as text strings such as "b'JEANNE D\xe2\x80\x99ARC CREDIT UNION\n...'".

Note that this is a str containing the Python representation (repr) of a bytes object, not the binary data itself (binary data cannot be stored directly in JSON).
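
For illustration, here is a minimal sketch of how such a value could be turned back into readable text. It assumes the strings are valid Python bytes literals and that the underlying bytes are UTF-8 (the `\xe2\x80\x99` sequence in the example is the UTF-8 encoding of a right single quotation mark). The helper name `recover_text` is hypothetical and not part of this repository.

```python
import ast


def recover_text(value: str) -> str:
    """Convert a stringified bytes literal such as "b'...'" back into text.

    Strings that do not look like a bytes literal are returned unchanged.
    """
    looks_like_bytes_repr = (
        len(value) >= 3
        and value[0] == "b"
        and value[1] in "'\""
        and value[-1] == value[1]
    )
    if not looks_like_bytes_repr:
        return value
    try:
        # literal_eval only parses Python literals; it does not execute code.
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value
    if isinstance(parsed, bytes):
        # Assumption: the scraped bytes are UTF-8; undecodable bytes are replaced.
        return parsed.decode("utf-8", errors="replace")
    return value


# The sample mirrors the string quoted above (backslashes are literal characters).
sample = "b'JEANNE D\\xe2\\x80\\x99ARC CREDIT UNION\\n'"
recovered = recover_text(sample)
```

With this sketch, `recovered` holds the institution name with a real apostrophe character and a real trailing newline instead of escape sequences.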

I noticed this in the first ten lines of the file data/train.cfpb_cc.jsonl.xz and suspect that it affects every record in that file.
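
To check that suspicion across the whole file rather than the first ten lines, something like the following could be used. This is only a sketch: it assumes each line is a JSON object and flags a record if any top-level string value looks like a bytes literal, since the actual field names are not stated here. The function names `count_stringified_bytes` and `scan_file` are hypothetical.

```python
import json
import lzma
import re

# Matches a whole string of the form b'...' or b"...".
BYTES_REPR = re.compile(r"""^b(['"]).*\1$""", re.DOTALL)


def count_stringified_bytes(lines):
    """Return (affected, total) for an iterable of JSON lines."""
    total = 0
    affected = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        total += 1
        record = json.loads(line)
        # Assumption: records are JSON objects; fall back to the bare value otherwise.
        values = record.values() if isinstance(record, dict) else [record]
        if any(isinstance(v, str) and BYTES_REPR.match(v) for v in values):
            affected += 1
    return affected, total


def scan_file(path):
    """Count affected records in an xz-compressed JSONL file."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return count_stringified_bytes(handle)
```

Calling `scan_file("data/train.cfpb_cc.jsonl.xz")` from the repository root would return the number of affected records and the total record count.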
