Chunk file reading for large files #15
Comments
In order to tackle this, we might need to rethink how we find duplicates.
#17 reduces our maximum memory usage to 16 bytes per unique URL plus some constant amount. I don't think we can do much better, except perhaps in the constant part.
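For readers following along, here is a minimal, hypothetical sketch of that idea, not the code from #17: stream URLs from stdin and keep only a 128-bit fingerprint of each unique URL in a set, so memory grows with the number of unique URLs (roughly 16 bytes each, plus container overhead) rather than with file size. The fingerprint function and constants below are purely illustrative.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_set>
#include <utility>

// Illustrative 128-bit FNV-1a-style fingerprint split into two 64-bit halves;
// the actual hash used in #17 may differ.
static std::pair<std::uint64_t, std::uint64_t> fingerprint(const std::string &s) {
    std::uint64_t h1 = 14695981039346656037ULL;  // FNV-1a offset basis
    std::uint64_t h2 = 0x9e3779b97f4a7c15ULL;    // arbitrary second seed
    for (unsigned char c : s) {
        h1 = (h1 ^ c) * 1099511628211ULL;        // FNV-1a prime
        h2 = (h2 ^ c) * 1099511628211ULL;
    }
    return {h1, h2};
}

struct FingerprintHash {
    std::size_t operator()(const std::pair<std::uint64_t, std::uint64_t> &p) const {
        return static_cast<std::size_t>(p.first ^ (p.second * 0x9e3779b97f4a7c15ULL));
    }
};

int main() {
    std::unordered_set<std::pair<std::uint64_t, std::uint64_t>, FingerprintHash> seen;
    std::string url;
    while (std::getline(std::cin, url)) {
        if (seen.insert(fingerprint(url)).second)  // fingerprint not seen before
            std::cout << url << '\n';
    }
}
```

Deduplicating by fingerprint rather than by the full URL accepts a tiny false-positive risk from hash collisions; that trade-off is what bounds memory at about 16 bytes per unique URL.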
Hello @larskraemer @ameenmaali I'm not good at coding, but I really want this enhancement, so I'm dropping an idea. Wouldn't it be a good idea to split the file into smaller files, run urldedupe on each file, and then merge the results? I'm actually doing that in a bash script to deal with this problem, but it would be awesome if urldedupe itself could solve it instead.
That wouldn't help after #17 is merged, I don't think, since at the end we need to keep all unique URLs in memory at once (or an identifier based on them). It would be nice to know if that commit helps in your case; could you build that branch and give it a try?
@larskraemer What if you sort the file first and then process it in parts, so that only at the very end does it load everything into memory? By the way, how do I build that branch? Do I have to delete the current version of urldedupe?
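For what it's worth, here is a minimal sketch of the sort-first idea: once the input is sorted (e.g. with `sort urls.txt`), exact duplicates sit on adjacent lines, so deduplication only ever needs the previous line in memory, regardless of file size.

```cpp
#include <iostream>
#include <string>

// Assumes the input on stdin is already sorted, so duplicate lines are
// adjacent and only the previous line needs to be remembered.
int main() {
    std::string prev, line;
    bool have_prev = false;
    while (std::getline(std::cin, line)) {
        if (!have_prev || line != prev) {
            std::cout << line << '\n';
            prev = line;
            have_prev = true;
        }
    }
}
```

Note this is essentially what `sort -u` already does, and it only covers exact-duplicate removal; urldedupe's own dedup rules go beyond plain line comparisons, which is presumably why the hash-set approach above is preferred.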
@marcelo321 Been a while since I looked at the code, but that version shouldn't have the memory issue, i.e. it uses only as much memory as is needed for the hashes of all unique URLs. If you have 8 GB of RAM, you should be able to handle about 500 million unique URLs. If you're still having issues with that version, please report back so I can debug your issue.
@larskraemer Oh nice. I just installed it and am testing it. I will let you know if I encounter any new problems.
There are some use cases that have been brought up for deduping large files (> 10 GB). This will currently result in a crash if the system does not have enough RAM, since the entire file is loaded into memory at this point. We will need to chunk the file into smaller buffers when loading in order to prevent this. It may also make sense to parallelize reading with the URL deduplication process, as large files otherwise take longer than necessary while waiting for the entire file to be loaded. @larskraemer, any thoughts on an approach for solving this issue?
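For illustration, a minimal sketch of the chunked-reading idea, not urldedupe's actual loader: read the file in fixed-size buffers, carry any partial trailing line over to the next chunk, and hand complete lines to a callback instead of holding the whole file in memory. The function and callback names here are placeholders.

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Read `path` in fixed-size chunks and invoke `handle_url` for every complete
// line; memory use is bounded by the chunk size plus one partial line.
void for_each_line_chunked(const std::string &path,
                           const std::function<void(const std::string &)> &handle_url,
                           std::size_t chunk_size = 1 << 20 /* 1 MiB */) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(chunk_size);
    std::string carry;  // partial line left over from the previous chunk
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize n = in.gcount();
        if (n <= 0) break;
        std::size_t start = 0;
        for (std::size_t i = 0; i < static_cast<std::size_t>(n); ++i) {
            if (buf[i] == '\n') {
                carry.append(buf.data() + start, i - start);
                handle_url(carry);
                carry.clear();
                start = i + 1;
            }
        }
        carry.append(buf.data() + start, static_cast<std::size_t>(n) - start);
    }
    if (!carry.empty()) handle_url(carry);  // final line without a trailing newline
}

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    for_each_line_chunked(argv[1], [](const std::string &url) {
        std::cout << url << '\n';  // placeholder: feed into the dedup logic instead
    });
}
```

Parallelizing would then amount to handing each chunk's complete lines to worker threads, with the shared set of seen URLs either guarded by a lock or sharded by hash.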