Naively, conditionally executed instructions seem like a great idea to me.
As I read more about ARM (and ARM-like) instruction sets (Thumb2, Unicore, AArch64) I find that they all lack the bits for conditional execution.
Why is conditional execution missing from each of these?
Was conditional execution a mistake at the time, or have subsequent changes made it an expensive waste of instruction bits?
Conditional execution is a good fit for many auxiliary or bit-twiddling routines, such as sorting, list or tree manipulation, number-to-string conversion, square root, or long division. UART drivers and bit-field extraction in routers belong on the same list. Such code has a high ratio of branches to non-branch instructions, and the branches are fairly unpredictable.
However, once you get beyond the lowest level of services (or raise the abstraction level by using a higher-level language), the code looks completely different: the blocks inside the different branches of a condition consist mostly of moving data and calling subroutines. Here the benefit of those extra 4 bits rapidly fades away. This is not only a matter of personal development but also a cultural one: programming has grown from unstructured (Basic, Fortran, assembler) towards structured. Different instruction set architectures also support different programming paradigms better.
A technological compromise could have been to compress the five-bit 'cond.S' field to four or three bits covering the most frequently used combinations.
It's somewhat misleading to say that conditional execution is not present in ARMv8. The issue is to understand why you don't want to execute some instructions. Perhaps in the early ARM days the actual non-execution of instructions mattered (for power or whatever), but today the significance of this feature is that it lets you avoid branches for small dumb jumps, for example code like a = (b > 0 ? 1 : 2). This sort of thing is more common than you might imagine: conceptually it covers things like MAX/MIN or ABS (though some CPUs may have dedicated instructions for these particular tasks).
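To make the "small dumb jumps" concrete, here is a minimal C sketch of the patterns mentioned above. The function names are made up for illustration. Whether a given compiler turns each of these into a branch-free sequence depends on the compiler, the target and the optimization level, so treat this as an illustration of the source-level pattern rather than a guarantee about the generated code.

```c
/* Hypothetical helper names; each body is a single short conditional
 * with no side effects in either arm. */

/* The example from the text: a = (b > 0 ? 1 : 2). */
int select_1_or_2(int b)
{
    return b > 0 ? 1 : 2;
}

/* MAX and MIN as plain conditional selections. */
int max_int(int a, int b)
{
    return a > b ? a : b;
}

int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* ABS as a conditional negate (undefined for INT_MIN, as with abs()). */
int abs_int(int x)
{
    return x < 0 ? -x : x;
}
```

Both arms of each conditional are cheap and side-effect free, which is what makes it safe to compute the result without a jump.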
In ARMv8, while there are no general conditionally executed instructions, there are a few instructions that perform the specific task I am describing, namely letting you avoid branching for short dumb jumps. CSEL is the most obvious example, though there are others (e.g. conditional setting of the condition flags) that handle other common patterns (in that case, C short-circuit expression evaluation).
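The short-circuit pattern referred to here looks like the following in C. This is only a sketch with a hypothetical function name. An AArch64 compiler may evaluate such an expression with a compare followed by a conditional compare instead of a branch, but that is a code generation choice and not something the C source guarantees.

```c
/* Hypothetical example of C short-circuit evaluation: the second
 * comparison only matters when the first one is true. Neither operand
 * has side effects, so evaluating both is harmless. */
int in_range(int x, int lo, int hi)
{
    return x >= lo && x <= hi;
}
```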
IMHO what ARM has done here is what makes the most sense. They've extracted the feature of conditional execution that remains valuable on modern CPUs (avoid many branches) while changing the details of the implementation to match the micro-architecture of modern CPUs.
On the old ARMv4, conditional instructions only saved time if there was a high probability that they would end up being executed or, if the probability was about 50%, if there were just 2 to 4 of them in a row. If they were not executed, cycles were wasted fetching past them, compared with the overhead of a branch to skip them. If they were executed, the branch version would still fetch the branch without taking it.
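The trade-off can be sketched as a rough cycle model. The cycle counts below are assumptions chosen to resemble a simple three-stage ARMv4 core (1 cycle for any single-cycle instruction, executed or skipped, and 3 cycles for a taken branch); they are not measurements, and the function names are hypothetical.

```c
/* Assumed cycle costs for a simple in-order ARMv4-class pipeline. */
#define INSTRUCTION_CYCLES    1.0  /* executed or condition-failed */
#define TAKEN_BRANCH_CYCLES   3.0  /* branch taken: pipeline refill */
#define UNTAKEN_BRANCH_CYCLES 1.0  /* branch fetched but not taken */

/* n conditional instructions: every one is fetched, executed or not. */
double cost_conditional(int n)
{
    return n * INSTRUCTION_CYCLES;
}

/* A branch around n instructions, where p is the probability that the
 * instructions run (so the branch is taken with probability 1 - p). */
double cost_branch(int n, double p)
{
    double skipped  = TAKEN_BRANCH_CYCLES;
    double executed = UNTAKEN_BRANCH_CYCLES + n * INSTRUCTION_CYCLES;
    return (1.0 - p) * skipped + p * executed;
}
```

Under these assumptions the two versions break even at 4 instructions when p is 0.5, shorter runs favour conditional execution, and longer runs favour the branch, which is consistent with the "2 to 4 in a row" rule of thumb above.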
A minor nuisance is that when debugging, a breakpoint placed on a conditional instruction always triggered on that instruction, regardless of the condition (unless there's some really smart debugger that my company didn't have).
Just as the delay slot in MIPS was a trick (at the time), conditional execution in ARM was a trick (at the time), as is the PC being two instructions ahead. Now, down the road, how much effect do they have? Will ARM's branch predictor actually make that much difference, or is the real answer that they needed more bits in a 32-bit instruction word and, as with Thumb, the first and easiest thing to get rid of was the condition bits?
It is not too difficult to run some performance tests to see how good or bad the branch predictor really is. I tried it with unconditional branches on an ARM11; granted, that is an old architecture now, but it is still in wide use. It was difficult at best to get the branch prediction to show any improvement, and in no way, shape, or form could it compete with conditional execution. I have not repeated these experiments on anything in the Cortex-A family.
The general claim is that modern systems have better branch predictors and compilers are much more advanced, so the cost of conditional execution in instruction encoding space is no longer justified.
This is from ARMv8 Instruction Set Overview
And it continues
Another paper titled Trading Conditional Execution for More Registers on ARM Processors claims:
One of the reasons is the encoding.
In Thumb, you cannot squeeze 4 more bits into the tight 16-bit space when there isn't even enough room for the high bit of each of the 3 register fields, so the registers must be reduced to a subset of only 8. Note that Thumb-2 has a separate IT(E) instruction that selects the conditions for up to the next 4 instructions. You can't store the condition in the instruction itself, though, for the reason stated above.
For AArch64 the number of registers has been doubled compared to 32-bit ARM, but again there are no remaining bits for the 3 new high bits of the register fields. To keep the old encoding you would have to "borrow" them either from the narrow 12-bit immediate or from the 4-bit condition. A 12-bit immediate is already small compared to other RISC architectures such as MIPS, and shrinking it would make everything worse, so removing the condition is the better choice. Because branch prediction has become more and more advanced, this is not much of a problem.
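The bit budget behind this argument can be written down as simple arithmetic. The field widths below follow the classic 32-bit ARM data-processing layout; the enum and function names are hypothetical, and the real AArch64 encodings do not reuse this layout, so this is only a sketch of the reasoning.

```c
/* Field widths of a classic 32-bit ARM data-processing instruction. */
enum {
    COND_BITS     = 4,  /* condition code */
    CLASS_BITS    = 2,  /* instruction class */
    IMM_FLAG_BITS = 1,  /* operand 2 is an immediate */
    OPCODE_BITS   = 4,
    S_FLAG_BITS   = 1,  /* update the condition flags */
    RN_BITS       = 4,  /* first source register */
    RD_BITS       = 4,  /* destination register */
    OPERAND2_BITS = 12  /* immediate or shifted register */
};

/* Total width of the layout above. */
int a32_data_processing_bits(void)
{
    return COND_BITS + CLASS_BITS + IMM_FLAG_BITS + OPCODE_BITS +
           S_FLAG_BITS + RN_BITS + RD_BITS + OPERAND2_BITS;
}

/* Extra bits needed when each register field grows from old_width to
 * new_width bits, for the given number of register operands. */
int extra_register_bits(int old_width, int new_width, int operands)
{
    return (new_width - old_width) * operands;
}
```

With three register operands going from 4 to 5 bits each, 3 extra bits are needed in a word that is already full, and the 4-bit condition field is the only field large enough to supply them without shrinking the immediate.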