Chunk file reading for large files #15

Open
ameenmaali opened this issue Jun 10, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@ameenmaali
Owner

There are some use cases that have been brought up for deduping large files (>10 GB). This will result in a crash if the system does not have enough RAM, as the entire file is currently loaded into memory. We will need to chunk the file into smaller buffers when loading in order to prevent this. It may also make sense to parallelize the loading with the URL deduplication itself, since with large files we currently wait for the entire file to be loaded before deduplication starts. @larskraemer, any thoughts on an approach for solving this issue?
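
As a rough sketch of the chunked/streaming idea (not the current urldedupe code; names here are just for illustration), the input can be read one line at a time so that only a small buffer is resident rather than the whole file:

```cpp
// Sketch only: stream the input instead of loading the whole file.
// std::ifstream reads through an internal buffer, so only a small chunk
// of the file is resident at any time.
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main(int argc, char **argv) {
    if (argc < 2) {
        std::cerr << "usage: dedupe <urls-file>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::unordered_set<std::string> seen;  // note: still grows with the number of unique URLs
    std::string line;
    while (std::getline(in, line)) {       // one URL at a time, not the whole file
        if (seen.insert(line).second) {
            std::cout << line << '\n';     // print only the first occurrence
        }
    }
    return 0;
}
```

Note that this only addresses the file-loading side; the set of unique URLs still grows with the input, which is what the comments below go on to discuss.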

ameenmaali added the enhancement label on Jun 10, 2020
@larskraemer
Contributor

In order to tackle this, we might need to rethink how we find duplicates.
Currently, we need to store all of the unique URLs, since Url stores the whole string.
Even if the url_string field is cleared after printing, the map in main holds a copy of the url_key, which we have to assume is similar in size to the original.
In order to be able to deal with really large files, we probably have to read one URL at a time, produce a unique hash of it and then free/reuse the memory. The hash could then be used to check for duplicates.
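
A minimal sketch of that hash-based approach, assuming a plain streaming loop and a 64-bit std::hash purely for illustration (a 64-bit hash can collide, so two distinct URLs could wrongly be treated as duplicates; a wider hash shrinks that risk):

```cpp
// Sketch: keep only a hash per unique URL instead of the URL string itself.
#include <functional>
#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::size_t> seen_hashes;  // 8 bytes per unique URL in this sketch
    std::string url;
    while (std::getline(std::cin, url)) {
        std::size_t key = std::hash<std::string>{}(url);  // hash of the (possibly normalized) URL
        if (seen_hashes.insert(key).second) {
            std::cout << url << '\n';  // first occurrence: print, then the buffer is reused
        }
    }
}
```

Reading from standard input here is just to keep the sketch short; the point is that only the hashes, not the URL strings, are kept around.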

@larskraemer
Contributor

#17 reduces our maximum memory usage to 16 bytes per unique URL, plus some constant amount. I don't think we can do much better, except perhaps for the constant part.
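
For context on the 16-byte figure: it corresponds to keeping a 128-bit digest per unique URL instead of the URL string itself. A hypothetical shape of such a key (UrlKey and UrlKeyHash are made-up names, not the actual #17 code):

```cpp
// Hypothetical 16-byte dedup key: a 128-bit hash split into two 64-bit halves.
#include <cstddef>
#include <cstdint>
#include <unordered_set>

struct UrlKey {
    std::uint64_t hi;
    std::uint64_t lo;
    bool operator==(const UrlKey &other) const {
        return hi == other.hi && lo == other.lo;
    }
};
static_assert(sizeof(UrlKey) == 16, "16 bytes per unique URL");

struct UrlKeyHash {
    std::size_t operator()(const UrlKey &k) const {
        // Cheap mix of the two halves for bucketing; the quality of the
        // underlying 128-bit hash is what keeps collisions unlikely.
        return k.hi ^ (k.lo * 0x9e3779b97f4a7c15ULL);
    }
};

// ~16 bytes per unique URL, plus whatever per-element overhead the container adds.
using SeenKeys = std::unordered_set<UrlKey, UrlKeyHash>;
```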

@marcelo321

Hello @larskraemer @ameenmaali, I'm not good at coding but I really want this enhancement, so I'm dropping an idea. Wouldn't it be a good idea to split the file into smaller files, run urldedupe on each file, and then merge the results? I'm actually doing that in a bash script to deal with this problem, but it would be awesome if urldedupe itself could solve it instead.

@larskraemer
Contributor

I don't think that would help after #17 is merged, since at the end we still need to keep all unique URLs (or an identifier based on them) in memory at once. It would be nice to know whether that commit helps in your case; could you build that branch and give it a try?

@marcelo321

@larskraemer What if you sort the file first and then process it in parts? And only at the end load everything into memory?

Btw, how do I build that branch? Do I have to delete the current version of urldedupe?

@larskraemer
Contributor

@marcelo321
git clone https://github.com/larskraemer/urldedupe.git
cd urldedupe
git checkout store_hashes
Then build as usual

It's been a while since I looked at the code, but that version shouldn't have the memory issue, i.e. it uses only as much memory as is needed for the hashes of all unique URLs. If you have 8 GB of RAM, you should be able to handle about 500 million unique URLs.
Sorting probably makes it worse: sorting the number of URLs it takes to cause memory issues would likely take many times longer than just throwing them at urldedupe.

If you're still having issues with that version, please report back so I can debug your issue.

@marcelo321

@larskraemer Oh nice. I just installed it and am testing it now. I will let you know if I encounter any new problems.
