don't parse_skip() for very many components #714

mgovers · 2024-09-06T11:58:53Z

Fixes very slow row-based deserialization; cf. #708 (comment).

NOTE: columnar deserialization is still potentially slow because of the very many parse_skip calls made when a certain column is not present.

The issue was introduced in #708.

A couple of issues are not yet explained and need further investigation, but they are out of scope for the immediate regression mitigation:

Why does the Python deserializer end up in this edge case? Does it call parse multiple times on different components?
How can we improve the efficiency of the deserializer when the columnar dataset contains few columns?

Signed-off-by: Martijn Govers <[email protected]>

sonarqubecloud · 2024-09-06T12:12:04Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

TonyXiang8787 · 2024-09-06T12:20:33Z

@mgovers

I have put this PR on hold. After running the benchmark on my side in Python, I do not see a significant performance difference between 1.9.35 and 1.9.34. The benchmark uses your script in WSL2.

1.9.35

done generating
serializing: tmp/data/all_random_compact.msgpack
done serializing: tmp/data/all_random_compact.msgpack
timed msgpack_serialize_to_file(tmp/data/all_random_compact.msgpack): 0.5882223630032968 s
deserializing: tmp/data/all_random_compact.msgpack
done deserializing: tmp/data/all_random_compact.msgpack
timed msgpack_deserialize_from_file(tmp/data/all_random_compact.msgpack): 1.1101793640118558 s
serializing: tmp/data/all_random.msgpack
done serializing: tmp/data/all_random.msgpack
timed msgpack_serialize_to_file(tmp/data/all_random.msgpack): 1.023680468002567 s
deserializing: tmp/data/all_random.msgpack
done deserializing: tmp/data/all_random.msgpack
timed msgpack_deserialize_from_file(tmp/data/all_random.msgpack): 2.2592216260090936 s
serializing: tmp/data/all_random_compact.json
done serializing: tmp/data/all_random_compact.json
timed json_serialize_to_file(tmp/data/all_random_compact.json): 10.663685843988787 s
deserializing: tmp/data/all_random_compact.json
done deserializing: tmp/data/all_random_compact.json
timed json_deserialize_from_file(tmp/data/all_random_compact.json): 8.332203558995388 s
serializing: tmp/data/all_random.json
done serializing: tmp/data/all_random.json
timed json_serialize_to_file(tmp/data/all_random.json): 13.262728639005218 s
deserializing: tmp/data/all_random.json
done deserializing: tmp/data/all_random.json
timed json_deserialize_from_file(tmp/data/all_random.json): 13.017963435995625 s

1.9.34

done generating
serializing: tmp/data/all_random_compact.msgpack
done serializing: tmp/data/all_random_compact.msgpack
timed msgpack_serialize_to_file(tmp/data/all_random_compact.msgpack): 0.6074205159966368 s
deserializing: tmp/data/all_random_compact.msgpack
done deserializing: tmp/data/all_random_compact.msgpack
timed msgpack_deserialize_from_file(tmp/data/all_random_compact.msgpack): 1.1239508260041475 s
serializing: tmp/data/all_random.msgpack
done serializing: tmp/data/all_random.msgpack
timed msgpack_serialize_to_file(tmp/data/all_random.msgpack): 1.10275221397751 s
deserializing: tmp/data/all_random.msgpack
done deserializing: tmp/data/all_random.msgpack
timed msgpack_deserialize_from_file(tmp/data/all_random.msgpack): 2.401370646985015 s
serializing: tmp/data/all_random_compact.json
done serializing: tmp/data/all_random_compact.json
timed json_serialize_to_file(tmp/data/all_random_compact.json): 10.434441459015943 s
deserializing: tmp/data/all_random_compact.json
done deserializing: tmp/data/all_random_compact.json
timed json_deserialize_from_file(tmp/data/all_random_compact.json): 8.65973894900526 s
serializing: tmp/data/all_random.json
done serializing: tmp/data/all_random.json
timed json_serialize_to_file(tmp/data/all_random.json): 13.544286515010754 s
deserializing: tmp/data/all_random.json
done deserializing: tmp/data/all_random.json
timed json_deserialize_from_file(tmp/data/all_random.json): 12.847745289996965 s
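
For reference, the timings above can be compared mechanically. The sketch below is not part of the original benchmark script; it assumes only the `timed name(path): seconds s` line format visible in the logs, and the helper names `parse_timings` and `ratios` are hypothetical. It uses two deserialization lines per version, copied from the WSL2 output.

```python
import re

# Matches lines such as "timed json_deserialize_from_file(tmp/data/x.json): 1.23 s".
TIMED_LINE = re.compile(r"^timed (\w+)\((.+)\): ([0-9.]+) s$")


def parse_timings(log):
    """Map (function name, file path) to the reported number of seconds."""
    timings = {}
    for line in log.splitlines():
        match = TIMED_LINE.match(line.strip())
        if match:
            name, path, seconds = match.groups()
            timings[(name, path)] = float(seconds)
    return timings


def ratios(new, old):
    """Return new/old per benchmark; values above 1 mean the new version is slower."""
    return {key: new[key] / old[key] for key in new if key in old}


# Two lines per version, copied from the WSL2 logs in this comment.
LOG_1_9_35 = """\
timed msgpack_deserialize_from_file(tmp/data/all_random_compact.msgpack): 1.1101793640118558 s
timed json_deserialize_from_file(tmp/data/all_random.json): 13.017963435995625 s
"""
LOG_1_9_34 = """\
timed msgpack_deserialize_from_file(tmp/data/all_random_compact.msgpack): 1.1239508260041475 s
timed json_deserialize_from_file(tmp/data/all_random.json): 12.847745289996965 s
"""

comparison = ratios(parse_timings(LOG_1_9_35), parse_timings(LOG_1_9_34))
for (name, path), ratio in comparison.items():
    print(f"{name}({path}): {ratio:.3f}")
```

Ratios close to 1 for both entries are consistent with the observation that the two versions perform about the same in this run.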

TonyXiang8787 · 2024-09-06T12:33:14Z

@mgovers I also ran the script on Windows. There is no significant performance difference.

1.9.35

done generating
serializing: tmp\data\all_random_compact.msgpack
done serializing: tmp\data\all_random_compact.msgpack
timed msgpack_serialize_to_file(tmp\data\all_random_compact.msgpack): 0.8647662999574095 s
deserializing: tmp\data\all_random_compact.msgpack
done deserializing: tmp\data\all_random_compact.msgpack
timed msgpack_deserialize_from_file(tmp\data\all_random_compact.msgpack): 2.770424999995157 s
serializing: tmp\data\all_random.msgpack
done serializing: tmp\data\all_random.msgpack
timed msgpack_serialize_to_file(tmp\data\all_random.msgpack): 1.6302193999290466 s
deserializing: tmp\data\all_random.msgpack
done deserializing: tmp\data\all_random.msgpack
timed msgpack_deserialize_from_file(tmp\data\all_random.msgpack): 5.407391499960795 s
serializing: tmp\data\all_random_compact.json
done serializing: tmp\data\all_random_compact.json
timed json_serialize_to_file(tmp\data\all_random_compact.json): 29.098121999995783 s
deserializing: tmp\data\all_random_compact.json
done deserializing: tmp\data\all_random_compact.json
timed json_deserialize_from_file(tmp\data\all_random_compact.json): 15.278849299997091 s
serializing: tmp\data\all_random.json
done serializing: tmp\data\all_random.json
timed json_serialize_to_file(tmp\data\all_random.json): 37.63965840009041 s
deserializing: tmp\data\all_random.json
done deserializing: tmp\data\all_random.json
timed json_deserialize_from_file(tmp\data\all_random.json): 21.915195000125095 s

1.9.34

done generating
serializing: tmp\data\all_random_compact.msgpack
done serializing: tmp\data\all_random_compact.msgpack
timed msgpack_serialize_to_file(tmp\data\all_random_compact.msgpack): 0.9519388000480831 s
deserializing: tmp\data\all_random_compact.msgpack
done deserializing: tmp\data\all_random_compact.msgpack
timed msgpack_deserialize_from_file(tmp\data\all_random_compact.msgpack): 2.7140329000540078 s
serializing: tmp\data\all_random.msgpack
done serializing: tmp\data\all_random.msgpack
timed msgpack_serialize_to_file(tmp\data\all_random.msgpack): 1.7673208001069725 s
deserializing: tmp\data\all_random.msgpack
done deserializing: tmp\data\all_random.msgpack
timed msgpack_deserialize_from_file(tmp\data\all_random.msgpack): 5.330735600087792 s
serializing: tmp\data\all_random_compact.json
done serializing: tmp\data\all_random_compact.json
timed json_serialize_to_file(tmp\data\all_random_compact.json): 44.06123959994875 s
deserializing: tmp\data\all_random_compact.json
done deserializing: tmp\data\all_random_compact.json
timed json_deserialize_from_file(tmp\data\all_random_compact.json): 18.021055599907413 s
serializing: tmp\data\all_random.json
done serializing: tmp\data\all_random.json
timed json_serialize_to_file(tmp\data\all_random.json): 56.36484510009177 s
deserializing: tmp\data\all_random.json
done deserializing: tmp\data\all_random.json
timed json_deserialize_from_file(tmp\data\all_random.json): 36.006855400046334 s

mgovers · 2024-09-06T13:21:14Z

Did you pip install for each version?
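
One way to rule this out is to print the installed version at the start of each benchmark run. This is a minimal sketch using only the standard library; the distribution name `power-grid-model` is an assumption about how the package is published, and `installed_version` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Assumed distribution name; adjust it if the package is installed under another name.
print(installed_version("power-grid-model"))
```

If the printed version does not match the one under test, the benchmark is measuring a different build than intended.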

mgovers · 2024-09-09T06:49:15Z

I have no idea why, but I am also no longer able to reproduce the performance regression in the Python package that I found on Friday, neither in a custom-built package using editable mode nor in the package pulled from PyPI.

I am, however, still able to reproduce the problem in which otherwise skipped components are terribly inefficient to parse.

TonyXiang8787 · 2024-09-09T06:59:32Z

@mgovers Since it does not affect the current row-based deserialization, I hereby close this PR. You can continue to investigate the issue with parse_skip and columnar deserialization.

don't parse_skip() for very many components

bead18c

Signed-off-by: Martijn Govers <[email protected]>

mgovers requested a review from TonyXiang8787 September 6, 2024 11:59

mgovers added the bug label Sep 6, 2024

mgovers self-assigned this Sep 6, 2024

TonyXiang8787 approved these changes Sep 6, 2024

TonyXiang8787 enabled auto-merge September 6, 2024 12:03

TonyXiang8787 disabled auto-merge September 6, 2024 12:18

TonyXiang8787 marked this pull request as draft September 6, 2024 12:18

TonyXiang8787 closed this Sep 9, 2024

TonyXiang8787 deleted the feature/temporary-fix-deserializer branch September 9, 2024 06:59

mgovers mentioned this pull request Sep 9, 2024

more efficient columnar deserialization #716

Merged
