I am trying to find a collision between two messages that will lead to the same CRC hash. Considering I am using CRC32, is there any way I can shorten the list of possible messages I have to try when doing a brute force attack?
Any links to websites with hints on this will be helpful. I already have a brute force algorithm that will do this but it simply increment integers and sees if it will match other hashes.
spoof does exactly that. It does not require brute force.
It depends entirely on what you mean by "message". If you can append four bytes of gibberish to one of the messages. (I.E. four bytes that have no meaning within the context of the message.) Then it becomes trivial in the truest sense of the word.
Thinking in terms of bits moving through the CRC32 state machine.
CRC32 is based on a galois feedback shift register, each bit in its state will be replaced with the induction of 32 bits from the payload data. At the induction of each bit, the positions indicated by the polynomial will be exclusive ored with the sequence observed from the end of the Shift register. This sequence is not influenced by the input data until the shift register has been filled.
As an example, imagine we have a shift register filled with initial state 10101110, polynomial 10000011, and filling with unknown bits, X.
The feedback isn't in terms of X until the SR has been filled! So in order to generate a message with a predetermined checksum, you take your new message, generate it's CRC and work out it's next 32 bits of feedback. This you can do in 32 steps of the CRC function. You then need to calculate the effect this feedback has on the contents of the shift register.
A shortcut for doing this is to pad your message with four zero bytes and then look at the checksum. (Checksum is the state of the SR at the end, which if padded with four zero bytes is the influence of the feedback and the empty bytes.)
Exclusive OR that influence with the checksum value you want, replace the four byte trailer with that computed value and regenerate the checksum. You could do this with any program that generates CRC32, a hex editor, and a calculator that can handle hex.
If you want to generate two messages that both make complete sense and don't contain trailing garbage, things get a little harder. Identify a number of sections that you can write plausible alternatives, with exactly the same length.
Using english prose as an example. "I think that this can work" and "I believe in this approach" Have broadly similar meanings, and exactly the same length.
Identifying enough examples in your message is the tricky bit (Unless you want to cheat with whitespace!) CRC 32 is linear, provided the data has the correct offset within the message. So CRC([messagea][padding])^CRC([padding][messageb])=CRC([messagea][messageb]) There are some caveats with word alignment that you'll need to cope with, as a general hint, you want to extend the passages out into the "fixed" parts of the message. As a general rule you want to have alternatives for n*1.5 passages, where n is the size of the CRC.
You can now calculate the CRC that the skeletal message has, the impression that each alternative passage would have on it, and then draw up a table comparing the influence that each alternative for each passage would have. You then need to select alternatives that will modify the skeletal CRC to match the CRC you want. That problem is actually quite fun to solve, First off find any alternatives that uniquely modify a bit, if that bit needs to change for your CRC, select that alternative and fold it's influence into the CRC, then go round again. That should reduce the solution space that you then need to search.
That's quite a tough thing to code up, but it would generate your collisions in a very short time span.
Short of a flaw with my calculus, the probability of not having found one collision after N trials is approximated in the following table:
In other words, the probability of having to compute more than 200,000 CRC32 values before finding a duplicate is less than 1%, or, the probability of finding a duplicate before 102,000 attempts is 70.2%
BTW this is remarkable because the probability of finding one collision on, say, the very 200,000th attempt is still in the order of 1/1000th of 1% ((4M - 200,0000) / 4M), but the probably to have found one collision before the 200,000th attempt is a quasi certainty (well, above 99% anyway). This shows the interest of keeping a database of CRCs computed so far.
We could certainly spend some time studying the CRC32 algorithm and its underlying mathematics, in an attempt to find messages more likely to produce a CRC32 collisions, but the relatively small number of truly random attempts required to find at least one collision with quasi certainty, makes this kind of cryptanalysis approach hardly worth the effort. For example assuming that we could discover a way to select messages that are 10 times more likely to collide with one another, we'd still have to try in the order of 63,000 times before reaching the 99% chances of having at least one collision (better than 200,000 but, still requiring roughly the same type of application.)
The only thing we may want to consider, in this area, is to avoid messages shorter than 4 bytes in length (I read somewhere that CRC32 was bijective in this message space), and to avoid messages that are too similar (i.e. only differing by one or two characters), since after CRC32's original purpose is to detect (and possibly auto-correct) such small differences in messages.
Therefore, it appears that the difficulty of the assignment is not so much that of finding ways of computing CRC32s at sizzling speed (though we shouldn't be too slow with this either), but rather to manage a quickly-searcheable database of up to 200,000 Messages (or message "key", more on this below) and their associated CRC32 value.
A few ideas to implement all this
Just yesterday there was this question here on SO, a couple of the pointers mentioned there may help you.
Brute force you need about sqrt(6N) random length messages for a hash of size N to get a 95% probability for collision. E.g. CRC32 , N = 2^32 , you need about 160 000 messages
I will assume that you mean "message" instead of "key".
If you are allowed to choose both "keys", then brute-force would be rather quick anyway because of the birthday paradox. Pick random messages, compute their CRC, remember all of them and the associated CRC, and each new one has more and more chances to collide with an existing one as they accumulate. Frankly, I expect this approach to be faster on a modern computer than looking up known approaches to make CRC32 collide.