
File deduplication


Is the MD5 hash beneficial?

Rationale

Calculating the MD5 hash seems to keep a processor core at or near 100% utilization while the program runs, and it is possible that using the hash does not actually reduce execution time. Points to consider include the following (a short sketch illustrating the trade-off follows the list):

  • Hash calculation requires reading the entire file.
  • Byte-by-byte comparison may require reading only a few bytes of each file.
  • The hash need only be calculated once per file.
  • Hash calculation requires CPU time.
  • Byte-by-byte comparison requires reading both files in their entirety when they match (and it is always required when hashes match).
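
To make the "calculate once" point concrete, here is a minimal Python sketch (the function names and chunk size are illustrative, not the program's actual code) that reads each file exactly once to compute its MD5 digest and then groups candidate files by digest:

    import hashlib

    def md5_of(path, chunk_size=1 << 20):
        """Read the whole file once and return its MD5 digest."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def group_by_digest(paths):
        """Group candidate files by digest; each file is read exactly once."""
        groups = {}
        for p in paths:
            groups.setdefault(md5_of(p), []).append(p)
        # Files sharing a digest are probable duplicates; a byte-by-byte
        # comparison is still needed to confirm a true match.
        return groups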

Test

Data sets

Images

Testing was performed on a range of systems using two data sets. One data set is a directory tree where camera images are stored. My normal processing regime is to import pictures by date (e.g. .../2018/05/09/) and then copy the images to a project directory (e.g. .../2018-05-09_Project_Name/). This results in at least two copies of each picture file. Further sorting and manipulation may result in additional copies.

Backups

A second data set is from my backups. My backup strategy is to copy to .../user/machine/date, where user is my (or my wife's) user name and machine is the host being backed up. The date component is either nn, the day of the month, or, for the first day of the month, yyyy-mm-dd. In other words, the first day of the month is preserved forever (or until manually deleted) and other days of the month are overwritten. User directories such as ~/Documents and ~/bin are copied under these daily directories once each day. Backups are performed using rsync and, except on the first day of the month, are incremental relative to the first day of the month via the --link-dest option. This results in many hard links in the backup data set. However, if the backup data set is copied to another location using rsync, the hard links are replaced with duplicate files. (There is an rsync option to preserve hard links, but it is not feasible to use on large data sets.) The backup data set included files with many more duplicates than the image data set, and its disk usage shrank more as a result.

Both data sets were copied to the test location using rsync without the option to preserve hard links.
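
As an aside, the effect of losing hard links can be illustrated with a small Python sketch (a hypothetical helper, not part of the deduplication program): hard-linked files share a (device, inode) pair, while files duplicated by an rsync copy made without -H/--hard-links do not.

    import os
    from collections import defaultdict

    def group_by_inode(root):
        """Group files under root by (device, inode).

        Hard-linked files share an inode and occupy the space of a single
        copy; after an rsync copy made without -H/--hard-links, each path
        gets its own inode and the data is duplicated on disk.
        """
        groups = defaultdict(list)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                groups[(st.st_dev, st.st_ino)].append(path)
        return groups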

Test systems

Several test systems were employed, ranging from low to high capability in processor, RAM, and storage subsystems. In general, the three capabilities matched; in other words, a test system with a low-end processor also had less RAM and a slower hard drive.

tenere (small server, no GUI)

  • D410 Atom CPU
  • 4 GB RAM
  • 200 GB Seagate Barracuda (7200.7) (EXT4)
  • Debian GNU/Linux 9

grandidier (server with GUI, usually operated headless)

  • AMD Phenom II X4 820
  • 8 GB RAM
  • 10.9 TB S/W RAID of 6 x 2 TB drives (ZFS RAIDZ2)
  • Ubuntu 16.04.4

yggdrasil (Lenovo laptop)

  • Intel i7-4710HQ
  • 16 GB RAM
  • 1 TB Samsung EVO 850 (EXT4)
  • Debian GNU/Linux 9

olive (desktop)

  • Intel i7-4770K
  • 32 GB RAM
  • 2 TB H/W RAID (4 x 500 GB Samsung EVO 850 in RAID0 on LSI 9266-8i card, EXT4)
  • Debian GNU/Linux 9

Test procedure

The git repo was branched and a command line option added to skip the calculation of the MD5 hash. In other words, whenever two files matched in length, the byte-by-byte comparison was performed directly. This comparison reads both files from disk and returns 'unmatched' as soon as the first non-matching byte is found, or 'matched' if all bytes match; a sketch of such a comparison appears below. Tests were timed using the time command, with and without the hash calculation, and were generally run twice for each data set. Some test results seemed questionable and others were accidentally not performed.
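
For reference, a minimal sketch of such an early-exit comparison, assuming Python (this is illustrative only, not the code used in the test branch):

    def files_match(path_a, path_b, chunk_size=1 << 20):
        """Compare two files byte by byte, stopping at the first mismatch."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                chunk_a = fa.read(chunk_size)
                chunk_b = fb.read(chunk_size)
                if chunk_a != chunk_b:
                    return False      # first differing chunk: 'unmatched'
                if not chunk_a:       # both files exhausted together
                    return True       # all bytes matched

Python's standard library offers comparable behavior via filecmp.cmp(a, b, shallow=False).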

Results

The table below shows the percent change from dropping the hash calculation for several tests. Tests whose times increased when the hash was dropped are highlighted in red (except for those that seemed questionable). What the table does not show is the difference in absolute run time between the lowest powered host and the highest powered host: for the backup data set (which was much larger than the image data set), run times ranged from over ten hours down to less than half an hour.

Tests where skipping the MD5 sum reduced performance are highlighted.

[Image: test result numbers]

It seems evident that the lower resource systems benefited more from using the MD5 hash. On those systems, the only measurement that improved without the hash was user (CPU) time; real (wall clock) time was lower with the hash despite the added CPU load of calculating it. The higher powered systems benefited across the board from not calculating the hash.

Conclusion

Making the MD5 hash optional will be merged into the mainline code, because eliminating it is not universally beneficial. The flag controlling the hash will be changed so that the hash is not calculated by default, since that appears to be beneficial in more situations than calculating it. When running on low resource systems, the hash can be enabled for the resulting benefit.
