Question
I'm writing a program that needs to test if a file is equivalent to one or more other files. To accomplish this, every time we see a new file we stat the file and get the size.
We use the size as a key, and if we've never seen the size before then we know that's a new file.
If we have seen that size before, we were planning to MD5 the first 4k and last 4k of the target file and check whether that hash has been seen in the list associated with the size key.
I don't want to hash the whole file, since these files can be rather large (the largest I've seen so far is 90 GB).
The goal is to avoid spending a lot of time on redundant file comparisons. Will this algorithm work, and can it be improved?
More details on my particular problem: I'm attempting to deduplicate a decently sized set of data (2 PB) that contains a large number of time-stamped files (about 40%) created from syslog on FreeBSD machines. Before I start chewing through that many files line by line, I wanted to make sure that the file I was looking at hadn't been seen before.
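Here's a minimal Python sketch of what I have in mind; the helper and variable names are just illustrative:

    import hashlib
    import os
    from collections import defaultdict

    CHUNK = 4096  # first/last 4 KiB

    def partial_md5(path, size):
        """MD5 of the first and last 4 KiB (or just the head if the file is small)."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            h.update(f.read(CHUNK))
            if size > 2 * CHUNK:          # only seek to the tail if it doesn't overlap the head
                f.seek(-CHUNK, os.SEEK_END)
                h.update(f.read(CHUNK))
        return h.hexdigest()

    # size -> set of partial hashes already seen for that size
    seen = defaultdict(set)

    def probably_duplicate(path):
        size = os.stat(path).st_size
        if size not in seen:
            # Never seen this size: definitely a new file. (The partial hash is
            # computed eagerly here for simplicity; it could be deferred until a
            # size collision actually occurs.)
            seen[size].add(partial_md5(path, size))
            return False
        digest = partial_md5(path, size)
        if digest in seen[size]:
            return True                   # size and partial hash match: candidate duplicate
        seen[size].add(digest)
        return False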
Explanation / Answer
It sounds like you are trying to optimize the file comparison because that can be a potentially expensive operation:
If two files have different sizes, they must be different files.
If the first and last 4K of two files hash to different values, they must be different files. The first portion catches differences in things like a format identifier commonly stored in a file's first few bytes, while the last portion helps catch cases where one file has been appended to (e.g. a log file), similar to checking the output of tail.
Next you would have to compare the entire file just to be sure. At this point it might make sense to store multiple hash values (MD5 if security is not a concern, SHA-2, etc.). You should be able to find a way to read the file once and feed it to multiple hash algorithms. Then you store a data structure with all of the hashes, which you can very quickly compare with the data structures for other files.
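For example, something along these lines (a rough Python sketch; the function name, chunk size, and choice of algorithms are just placeholders) reads each file once and feeds every block to several hash objects:

    import hashlib

    def multi_hash(path, chunk_size=1 << 20):
        """Read the file once and update every hash object with each block."""
        hashes = [hashlib.md5(), hashlib.sha256()]
        with open(path, "rb") as f:
            while True:
                block = f.read(chunk_size)
                if not block:
                    break
                for h in hashes:
                    h.update(block)
        # A tuple of digests is cheap to store and to compare against other files.
        return tuple(h.hexdigest() for h in hashes)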
If all of those tests pass (files likely but not necessarily equal) then you might need to perform a full file comparison.
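The full comparison doesn't need to be hand-rolled; in Python, for instance, filecmp in the standard library already does a chunked byte-for-byte check:

    import filecmp

    def definitely_equal(path_a, path_b):
        """Full content comparison, only reached after the size and hash checks match."""
        return filecmp.cmp(path_a, path_b, shallow=False)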
I think your algorithm sounds reasonable, and my minor additions above should help.
Based on the clarifications in the question, I think your approach will work. The odds that log files are the same size and the first and last portions hash the same is extremely low. If you use a strong hash algorithm with a large output size (compared to MD5), the chance is even lower: with SHA-512, the internal state and output range are ginormous. Given we are talking about log files that are very likely to have date/time stamps at the start of each line, the input should have enough entropy to make this a non-issue.