Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't parse_skip() for very many components #714

Closed
wants to merge 1 commit into from

Conversation

mgovers
Copy link
Member

@mgovers mgovers commented Sep 6, 2024

Fixes very slow row-based deserialization cfr. #708 (comment)

NOTE: columnar deserialization is still potentially slow due to very many parse_skip calls when a certain column is not present

Issue was introduced in #708

A couple issues are not yet explained and also need further investigation but are out of scope of the immediate regression mitigation:

  • why does the python deserializer end up in this edge case? does it call parse multiple times on different components?
  • how can we improve the efficiency of the deserializer when the columnar dataset contains few columns?

@mgovers mgovers added the bug Something isn't working label Sep 6, 2024
@mgovers mgovers self-assigned this Sep 6, 2024
Copy link

sonarqubecloud bot commented Sep 6, 2024

@TonyXiang8787 TonyXiang8787 marked this pull request as draft September 6, 2024 12:18
@TonyXiang8787
Copy link
Member

@mgovers

I put this PR on-hold. After running benchmark on my side in Python, I do not see a significant performance difference between 1.9.35 and 1.9.34. The benchmark is using your script in WSL2.

1.9.35

done generating
serializing: tmp/data/all_random_compact.msgpack
done serializing: tmp/data/all_random_compact.msgpack
timed msgpack_serialize_to_file(tmp/data/all_random_compact.msgpack): 0.5882223630032968 s
deserializing: tmp/data/all_random_compact.msgpack
done deserializing: tmp/data/all_random_compact.msgpack
timed msgpack_deserialize_from_file(tmp/data/all_random_compact.msgpack): 1.1101793640118558 s
serializing: tmp/data/all_random.msgpack
done serializing: tmp/data/all_random.msgpack
timed msgpack_serialize_to_file(tmp/data/all_random.msgpack): 1.023680468002567 s
deserializing: tmp/data/all_random.msgpack
done deserializing: tmp/data/all_random.msgpack
timed msgpack_deserialize_from_file(tmp/data/all_random.msgpack): 2.2592216260090936 s
serializing: tmp/data/all_random_compact.json
done serializing: tmp/data/all_random_compact.json
timed json_serialize_to_file(tmp/data/all_random_compact.json): 10.663685843988787 s
deserializing: tmp/data/all_random_compact.json
done deserializing: tmp/data/all_random_compact.json
timed json_deserialize_from_file(tmp/data/all_random_compact.json): 8.332203558995388 s
serializing: tmp/data/all_random.json
done serializing: tmp/data/all_random.json
timed json_serialize_to_file(tmp/data/all_random.json): 13.262728639005218 s
deserializing: tmp/data/all_random.json
done deserializing: tmp/data/all_random.json
timed json_deserialize_from_file(tmp/data/all_random.json): 13.017963435995625 s

1.9.34

done generating
serializing: tmp/data/all_random_compact.msgpack
done serializing: tmp/data/all_random_compact.msgpack
timed msgpack_serialize_to_file(tmp/data/all_random_compact.msgpack): 0.6074205159966368 s
deserializing: tmp/data/all_random_compact.msgpack
done deserializing: tmp/data/all_random_compact.msgpack
timed msgpack_deserialize_from_file(tmp/data/all_random_compact.msgpack): 1.1239508260041475 s
serializing: tmp/data/all_random.msgpack
done serializing: tmp/data/all_random.msgpack
timed msgpack_serialize_to_file(tmp/data/all_random.msgpack): 1.10275221397751 s
deserializing: tmp/data/all_random.msgpack
done deserializing: tmp/data/all_random.msgpack
timed msgpack_deserialize_from_file(tmp/data/all_random.msgpack): 2.401370646985015 s
serializing: tmp/data/all_random_compact.json
done serializing: tmp/data/all_random_compact.json
timed json_serialize_to_file(tmp/data/all_random_compact.json): 10.434441459015943 s
deserializing: tmp/data/all_random_compact.json
done deserializing: tmp/data/all_random_compact.json
timed json_deserialize_from_file(tmp/data/all_random_compact.json): 8.65973894900526 s
serializing: tmp/data/all_random.json
done serializing: tmp/data/all_random.json
timed json_serialize_to_file(tmp/data/all_random.json): 13.544286515010754 s
deserializing: tmp/data/all_random.json
done deserializing: tmp/data/all_random.json
timed json_deserialize_from_file(tmp/data/all_random.json): 12.847745289996965 s

@TonyXiang8787
Copy link
Member

TonyXiang8787 commented Sep 6, 2024

@mgovers I also run the script in Windows. There is no significant performance difference.

1.9.35

done generating
serializing: tmp\data\all_random_compact.msgpack
done serializing: tmp\data\all_random_compact.msgpack
timed msgpack_serialize_to_file(tmp\data\all_random_compact.msgpack): 0.8647662999574095 s
deserializing: tmp\data\all_random_compact.msgpack
done deserializing: tmp\data\all_random_compact.msgpack
timed msgpack_deserialize_from_file(tmp\data\all_random_compact.msgpack): 2.770424999995157 s
serializing: tmp\data\all_random.msgpack
done serializing: tmp\data\all_random.msgpack
timed msgpack_serialize_to_file(tmp\data\all_random.msgpack): 1.6302193999290466 s
deserializing: tmp\data\all_random.msgpack
done deserializing: tmp\data\all_random.msgpack
timed msgpack_deserialize_from_file(tmp\data\all_random.msgpack): 5.407391499960795 s
serializing: tmp\data\all_random_compact.json
done serializing: tmp\data\all_random_compact.json
timed json_serialize_to_file(tmp\data\all_random_compact.json): 29.098121999995783 s
deserializing: tmp\data\all_random_compact.json
done deserializing: tmp\data\all_random_compact.json
timed json_deserialize_from_file(tmp\data\all_random_compact.json): 15.278849299997091 s
serializing: tmp\data\all_random.json
done serializing: tmp\data\all_random.json
timed json_serialize_to_file(tmp\data\all_random.json): 37.63965840009041 s
deserializing: tmp\data\all_random.json
done deserializing: tmp\data\all_random.json
timed json_deserialize_from_file(tmp\data\all_random.json): 21.915195000125095 s

1.9.34

done generating
serializing: tmp\data\all_random_compact.msgpack
done serializing: tmp\data\all_random_compact.msgpack
timed msgpack_serialize_to_file(tmp\data\all_random_compact.msgpack): 0.9519388000480831 s
deserializing: tmp\data\all_random_compact.msgpack
done deserializing: tmp\data\all_random_compact.msgpack
timed msgpack_deserialize_from_file(tmp\data\all_random_compact.msgpack): 2.7140329000540078 s
serializing: tmp\data\all_random.msgpack
done serializing: tmp\data\all_random.msgpack
timed msgpack_serialize_to_file(tmp\data\all_random.msgpack): 1.7673208001069725 s
deserializing: tmp\data\all_random.msgpack
done deserializing: tmp\data\all_random.msgpack
timed msgpack_deserialize_from_file(tmp\data\all_random.msgpack): 5.330735600087792 s
serializing: tmp\data\all_random_compact.json
done serializing: tmp\data\all_random_compact.json
timed json_serialize_to_file(tmp\data\all_random_compact.json): 44.06123959994875 s
deserializing: tmp\data\all_random_compact.json
done deserializing: tmp\data\all_random_compact.json
timed json_deserialize_from_file(tmp\data\all_random_compact.json): 18.021055599907413 s
serializing: tmp\data\all_random.json
done serializing: tmp\data\all_random.json
timed json_serialize_to_file(tmp\data\all_random.json): 56.36484510009177 s
deserializing: tmp\data\all_random.json
done deserializing: tmp\data\all_random.json
timed json_deserialize_from_file(tmp\data\all_random.json): 36.006855400046334 s

@mgovers
Copy link
Member Author

mgovers commented Sep 6, 2024

Did you pip install for each version?

@mgovers
Copy link
Member Author

mgovers commented Sep 9, 2024

I have no idea why but I also am not able to reproduce the performance regression in the Python package I found on Friday anymore; neither in a custom built package using editable mode, nor on the package pulled from PyPI.

I am, however, still able to reproduce the problem in which otherwise skipped components are terribly inefficient to parse.

@TonyXiang8787
Copy link
Member

@mgovers Since it does not affect current row-based deserialization. I hereby close this PR. You can continue to investigate the issue with parse_skip and columnar deserialization.

@TonyXiang8787 TonyXiang8787 deleted the feature/temporary-fix-deserializer branch September 9, 2024 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants