Fix performance issue for large WACZ/WARC files on a remote location #22

Closed
wants to merge 1 commit into from


leewesleyv
Collaborator

@leewesleyv leewesleyv commented Nov 18, 2024

Resolves #21

  • Make sure the tmpfile is removed at the end

Extract the WARC data into a local tmp file instead of streaming it, since streaming was a bottleneck for remote WACZ files. I ran this on a 2.5 GB WACZ file with a spider that makes 20266 requests, and the elapsed time came out at ~591 seconds. With the current implementation the same run would take around 20266 * 110 seconds (roughly 2.2 million seconds, or about 26 days). This does introduce some delay at the start, since the WARC information first has to be extracted into the local tmp file.
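
To illustrate the general idea (not the code in this PR), here is a minimal sketch using only the standard library. It copies a slow, sequential stream into a temporary file once, serves later lookups with cheap local seeks, and removes the tmp file at the end. The names `local_copy` and `read_record` and the in-memory `remote` stream are hypothetical stand-ins, not part of the project's API.

```python
import io
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def local_copy(remote_stream, chunk_size=1024 * 1024):
    """Copy a sequential stream to a temp file and yield a seekable handle.

    The temp file is deleted when the context exits, even if an error occurs.
    """
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        # One-time cost up front: pull the whole stream to local disk.
        shutil.copyfileobj(remote_stream, tmp, chunk_size)
        tmp.flush()
        tmp.seek(0)
        yield tmp
    finally:
        tmp.close()
        os.remove(tmp.name)


def read_record(local_file, offset, length):
    """Read `length` bytes starting at `offset` from the local copy."""
    local_file.seek(offset)
    return local_file.read(length)


# Hypothetical stand-in for a remote WARC stream.
remote = io.BytesIO(b"record-one|record-two|record-three")

with local_copy(remote) as warc:
    tmp_path = warc.name
    first = read_record(warc, 0, 10)
    second = read_record(warc, 11, 10)

# The tmp file no longer exists once the context has exited.
tmp_removed = not os.path.exists(tmp_path)
```

The trade-off matches the one described above: the copy adds a delay before the first lookup, but every later lookup is a local seek rather than a remote read.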

@leewesleyv leewesleyv changed the title from "(#21) Fix performance issue for large WACZ/WARC files on a remote location" to "Fix performance issue for large WACZ/WARC files on a remote location" Nov 18, 2024
@wvengen
Member

wvengen commented Nov 18, 2024

Super, great idea to just cache the index 👍
I've experimented in the past with downloading the whole file first, but only the index seems much more sensible!

@leewesleyv leewesleyv closed this Nov 26, 2024
@leewesleyv leewesleyv deleted the fix/21-warc-lookup-performance-remote-wacz branch January 10, 2025 09:25
Successfully merging this pull request may close these issues.

Improve performance of get_warc_from_cdxj_record
2 participants