chore: xlsx improvements #52
@hewliyang thanks for the PR. cc @afourney |
If the objective is beautification, what if we added a |
thanks for reviewing @gagb 🤗 well, the right answer is that it is non-trivial, because we would have to curate a QA test set and run experiments on it. but what I do know for sure is having
sounds good! |
For empty values, "" is definitely better than NaN. Although I'm not sure if we want to remove empty columns, that would be transforming the structure of the data. - maybe this can be a flag, such as |
sounds good. my idea of when we would want to keep fully null rows/cols is when there are subtables in a sheet & we want to maintain some spatial separation (or not). added two flags; also am now forwarding |
Arguably, if the xlsx cell is empty, the conversion should be empty. If the Excel value is "#N/A", then I believe it should be interpreted as NaN (though it's perhaps not exactly the same). In any case, I agree with dropping completely empty columns (including empty headers), and with encoding empty cells with "". I'm not sure what to do about column headers yet. Let me think on this more, and test with #N/A values. |
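To make the behaviour under discussion concrete, here is a minimal pandas sketch of dropping fully empty rows/columns and rendering the remaining empty cells as "". The function name `clean_sheet` and the parameters `drop_empty_rows` / `drop_empty_cols` are hypothetical and chosen only for illustration; they are not the flag names added in this PR, and this is not necessarily how the converter implements it.

```python
import numpy as np
import pandas as pd


def clean_sheet(df, drop_empty_rows=True, drop_empty_cols=True):
    """Hypothetical helper: optionally drop all-empty rows/columns, then blank out NaN."""
    if drop_empty_cols:
        # Drop columns where every cell is missing.
        df = df.dropna(axis=1, how="all")
    if drop_empty_rows:
        # Drop rows where every remaining cell is missing.
        df = df.dropna(axis=0, how="all")
    # Cast to object so numeric columns can hold "" without dtype errors,
    # then replace every remaining missing value with an empty string.
    obj = df.astype(object)
    return obj.where(obj.notna(), "")


# Illustrative data: column "b" and row 1 are entirely empty.
df = pd.DataFrame(
    {
        "a": [1, np.nan, 3],
        "b": [np.nan, np.nan, np.nan],
        "c": ["x", np.nan, np.nan],
    }
)
cleaned = clean_sheet(df)
kept = clean_sheet(df, drop_empty_rows=False, drop_empty_cols=False)
```

With both options disabled, the empty row and column are kept (preserving spatial separation between subtables) and only the NaN cells are blanked.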
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
I've pushed a change to:
Additionally:
Regarding
Not sure how verbose you would want the tests to be, because I think for the latter we will need to reparse the string representation. Also I see a related PR #169. If we intend to merge this then another todo would be to wrap all the |
Thanks for adding the --drop flags!
Excel supports ~1.048m rows max, though in my experience it starts to get sluggish with a sheet over 100k rows, but that's another story. On the topic of pandas using too much memory: I generally advocate for projects to replace pandas with polars or duckdb whenever possible, especially when working with large datasets. I'm actually planning to open a PR myself in January to replace all pandas functions with polars wherever possible. Since I expect this library will be used to convert vast amounts of data, we should make this as efficient as possible. |
a couple simple heuristics that should be safe to apply in any general case:
Unnamed_{i}
and I think we can scrub this out
Examples
here is what I have expanded the test.xlsx to look like (changes annotated in red):
which would produce:
Sheet1
09060124-b5e7-4717-9d07-3c046eb
with the changes:
Sheet1
09060124-b5e7-4717-9d07-3c046eb