I have made a lookup table that allows you to blend two single-byte channels (256 colors per channel) using a single-byte alpha channel using no floating point values (hence no float to int conversions). Each index in the lookup table corresponds to the value of 256ths of a channel, as related to an alpha value.
In all, to fully calculate a 3-channel RGB blend, it would require two lookups into the array per channel, plus an addition. This is a total of 6 lookups and 3 additions. In the example below, I split the colors into separate values for ease of demonstration. This example shows how to blend three channels, R G and B by an alpha value ranging from 0 to 256.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value
BYTE rem = 255 - av; // Remaining fraction
rDest = _lookup[r1][rem] + _lookup[r2][av];
gDest = _lookup[g1][rem] + _lookup[g2][av];
bDest = _lookup[b1][rem] + _lookup[b2][av];
It works great. Precise as you can get using 256 color channels. In fact, you would get the same exact values using the actual floating point calculations. The lookup table was calculated using doubles to begin with. The lookup table is too big to fit in this post (65536 bytes). (If you would like a copy of it, email me at ten.turtle.toes@gmail.com, but don't expect a reply until tomorrow because I am going to sleep now.)
So... what do you think? Is it worth it or not?
Mixing colors is computationally trivial. I would be surprised if this yielded a significant benefit, but as always, you must first prove that this calculation is a bottleneck in performance in the first place. And I would suggest that a better solution is to use hardware accelerated blending.
I would be interested in seeing some benchmarks.
There is an algorithm that can do perfect alpha blending without any floating point calculations or lookup tables. You can find more info in the following document (the algorithm and code is described at the end)
I also did an SSE implementation of this long ago, if you are interested...
Today's processors can do a lot of calculation in the time it takes to get one value from memory, especially if it's not in cache. That makes it especially important to benchmark possible solutions, because you can't easily reason what the result will be.
I don't know why you're concerned about floating point conversions, this can all be done as integers.
If you want to get really clever, you can replace the divide with a multiply followed by a right-shift.
Edit: Here's the version using a right-shift. Adding a constant might reduce any truncation errors, but I'll leave that as an exercise for the reader.