I have a system with roughly 100 million documents, and I'd like to keep track of their modifications between mirrors. To exchange information about modifications efficiently, I want to send it per day rather than per document. Something like this:
[ 2012/03/26, cs26],
[ 2012/03/25, cs25],
[ 2012/03/24, cs24],
...
where each cs is the checksum of timestamps of all documents created on a particular day.
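To make the format concrete, here is a rough sketch of how such a per-day list could be built from scratch. It is not my actual code: the function name `daily_checksums`, the use of integer Unix timestamps in UTC, and the truncated SHA-256 are all assumptions for illustration. It also shows the problem: a checksum like this can only be recomputed from all of a day's timestamps, not updated when one document goes away.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timezone


def daily_checksums(timestamps):
    """Map each UTC day ("YYYY/MM/DD") to a checksum of that day's timestamps.

    Assumes timestamps are non-negative integer Unix seconds (hypothetical format).
    """
    by_day = defaultdict(list)
    for ts in timestamps:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y/%m/%d")
        by_day[day].append(ts)

    result = {}
    for day, stamps in by_day.items():
        h = hashlib.sha256()
        # Sort so that every mirror hashes the timestamps in the same order.
        for ts in sorted(stamps):
            h.update(ts.to_bytes(8, "big"))
        # Truncated digest, just to keep the exchanged list short.
        result[day] = h.hexdigest()[:16]
    return result


# Two documents on 2012/03/26 and one on 2012/03/25 (UTC).
checksums = daily_checksums([1332720000, 1332720001, 1332633600])
for day in sorted(checksums, reverse=True):
    print([day, checksums[day]])
```

Deleting a single document here means re-reading every timestamp for that day and hashing again, which is exactly the overhead I'd like to avoid.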
Now, the problem I'm running into is that I don't know of an algorithm that could "subtract" data from the checksum when a document is being deleted. None of the cryptographic hashes fit the need, for obvious reasons, and I couldn't find any algorithms for CRC that would do this.
One option I considered was to have deletes add extra information to the hash, but this would lead to even more problems: nodes can receive delete requests in different orders, and when a node restarts it re-reads all the timestamps from the documents, so the information about the deletes would be lost.
I'd also rather not use a hash tree with all document hashes in memory, as that would take roughly 8 GB of memory, which I think is overkill for just this need.
For now the best option seems to be regenerating these hashes completely from time to time in the background, but that is also a lot of needless overhead, and it wouldn't provide immediate information on changes.
So, do you guys know of a checksum algorithm that would let me "remove" some data from the checksum? I need the algorithm to be reasonably fast, and the checksum has to reliably reflect even the smallest change (that's why I can't really use plain XOR).
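To illustrate both the "remove" property I'm after and why plain XOR is too weak, here is a minimal sketch. The function name `xor_checksum` and the sample timestamps are made up for the example; it XORs raw integer timestamps together and nothing more.

```python
def xor_checksum(timestamps):
    """XOR all integer timestamps together (order does not matter)."""
    acc = 0
    for ts in timestamps:
        acc ^= ts
    return acc


docs = [1332720000, 1332720005, 1332720009]
cs = xor_checksum(docs)

# Removal is easy: XOR-ing a value in a second time cancels it out,
# so this equals the checksum of the two remaining documents.
cs_after_delete = cs ^ 1332720005
print(cs_after_delete == xor_checksum([1332720000, 1332720009]))

# Weakness 1: two documents with the same timestamp cancel each other,
# so the day looks identical to an empty day.
print(xor_checksum([1332720000, 1332720000]) == xor_checksum([]))

# Weakness 2: unrelated sets collide trivially, e.g. 1 ^ 2 == 3.
print(xor_checksum([1, 2]) == xor_checksum([3]))
```

So XOR gives me the order independence and cheap deletes I want, but not the sensitivity to small changes.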
Or maybe you have better ideas about the whole design?