Chunk file reading for large files #15

Open
ameenmaali opened this issue Jun 10, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@ameenmaali
Owner

There are some use cases that have been brought up for deduping large files (>10 GB). This will result in a crash if the system does not have enough RAM, as the entire file is currently loaded into memory. We will need to chunk the file into smaller buffers when loading in order to prevent this. It may also make sense to parallelize the loading with the URL deduplication itself, since with large files we currently wait for the entire file to be loaded before deduplication starts. @larskraemer, any thoughts on an approach for solving this issue?
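
As a rough sketch of the chunked/streaming idea (not the current urldedupe code; names here are just for illustration), the input can be read one line at a time so that only a small buffer is resident rather than the whole file:

```cpp
// Sketch only: stream the input instead of loading the whole file.
// std::ifstream reads through an internal buffer, so only a small chunk
// of the file is resident at any time.
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main(int argc, char **argv) {
    if (argc < 2) {
        std::cerr << "usage: dedupe <urls-file>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::unordered_set<std::string> seen;  // note: still grows with the number of unique URLs
    std::string line;
    while (std::getline(in, line)) {       // one URL at a time, not the whole file
        if (seen.insert(line).second) {
            std::cout << line << '\n';     // print only the first occurrence
        }
    }
    return 0;
}
```

Note that this only addresses the file-loading side; the set of unique URLs still grows with the input, which is what the comments below go on to discuss.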

ameenmaali added the enhancement label on Jun 10, 2020
@larskraemer
Contributor

In order to tackle this, we might need to rethink how we find duplicates.
Currently, we need to store all of the unique URLs, since Url stores the whole string.
Even if the url_string field is cleared after printing, the map in main holds a copy of the url_key, which we have to assume is similar in size to the original.
In order to be able to deal with really large files, we probably have to read one URL at a time, produce a unique hash of it and then free/reuse the memory. The hash could then be used to check for duplicates.
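
A minimal sketch of that hash-based approach, assuming a plain streaming loop and a 64-bit std::hash purely for illustration (a 64-bit hash can collide, so two distinct URLs could wrongly be treated as duplicates; a wider hash shrinks that risk):

```cpp
// Sketch: keep only a hash per unique URL instead of the URL string itself.
#include <functional>
#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::size_t> seen_hashes;  // 8 bytes per unique URL in this sketch
    std::string url;
    while (std::getline(std::cin, url)) {
        std::size_t key = std::hash<std::string>{}(url);  // hash of the (possibly normalized) URL
        if (seen_hashes.insert(key).second) {
            std::cout << url << '\n';  // first occurrence: print, then the buffer is reused
        }
    }
}
```

Reading from standard input here is just to keep the sketch short; the point is that only the hashes, not the URL strings, are kept around.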

@larskraemer
Contributor

#17 reduces our maximum memory usage to 16 bytes per unique URL, plus some constant amount. I don't think we can do much better, except perhaps for the constant part.
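
For context on the 16-byte figure: it corresponds to keeping a 128-bit digest per unique URL instead of the URL string itself. A hypothetical shape of such a key (UrlKey and UrlKeyHash are made-up names, not the actual #17 code):

```cpp
// Hypothetical 16-byte dedup key: a 128-bit hash split into two 64-bit halves.
#include <cstddef>
#include <cstdint>
#include <unordered_set>

struct UrlKey {
    std::uint64_t hi;
    std::uint64_t lo;
    bool operator==(const UrlKey &other) const {
        return hi == other.hi && lo == other.lo;
    }
};
static_assert(sizeof(UrlKey) == 16, "16 bytes per unique URL");

struct UrlKeyHash {
    std::size_t operator()(const UrlKey &k) const {
        // Cheap mix of the two halves for bucketing; the quality of the
        // underlying 128-bit hash is what keeps collisions unlikely.
        return k.hi ^ (k.lo * 0x9e3779b97f4a7c15ULL);
    }
};

// ~16 bytes per unique URL, plus whatever per-element overhead the container adds.
using SeenKeys = std::unordered_set<UrlKey, UrlKeyHash>;
```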

@marcelo321

Hello @larskraemer @ameenmaali, I'm not good at coding but I really want this enhancement, so I'm dropping an idea. Wouldn't it be a good idea to split the file into smaller files, run urldedupe on each file, and then merge the results? I'm actually doing that in a bash script to deal with this problem, but it would be awesome if urldedupe itself could solve it instead.

@larskraemer
Contributor

I don't think that would help after #17 is merged, since at the end we still need to keep all unique URLs (or an identifier based on them) in memory at once. It would be nice to know whether that commit helps in your case; could you build that branch and give it a try?

@marcelo321

@larskraemer What if you sort the file first and then process it in parts? And only at the end load everything into memory?

Btw, how do I build that branch? Do I have to delete the current version of urldedupe?

@larskraemer
Contributor

@marcelo321
git clone https://github.com/larskraemer/urldedupe.git
cd urldedupe
git checkout store_hashes
Then build as usual

It's been a while since I looked at the code, but that version shouldn't have the memory issue, i.e. it uses only as much memory as is needed for the hashes of all unique URLs. If you have 8 GB of RAM, you should be able to handle about 500 million unique URLs.
Sorting probably makes it worse: sorting the number of URLs it takes to cause memory issues would likely take many times longer than just throwing them at urldedupe.

If you're still having issues with that version, please report back so I can debug your issue.

@marcelo321

@larskraemer Oh nice. I just installed it and am testing it now. I will let you know if I encounter any new problems.
