Why are conditionally executed instructions not pr

2019-03-18 01:17发布

问题:

Naively, conditionally executed instructions seem like a great idea to me.

As I read more about ARM (and ARM-like) instruction sets (Thumb2, Unicore, AArch64) I find that they all lack the bits for conditional execution.

Why is conditional execution missing from each of these?

Was conditional execution a mistake at the time, or have subsequent changes made it an expensive waste of instruction bits?

回答1:

General claim is modern systems have better branch predictors and compilers are much more advanced so their cost on instruction encoding space is not justified.

This is from ARMv8 Instruction Set Overview

The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.

And it continues

A very small set of “conditional data processing” instructions are provided. These instructions are unconditionally executed but use the condition flags as an extra input to the instruction. This set has been shown to be beneficial in situations where conditional branches predict poorly, or are otherwise inefficient.

Another paper titled Trading Conditional Execution for More Registers on ARM Processors claims:

... conditional execution takes up precious instruction space as conditions are encoded into a 4-bit condition code selector on every 32-bit ARM instruction. Besides, only small percentages of instructions are actually conditionalized in modern embedded applications, and conditional execution might not even lead to performance improvement on modern embedded processors.



回答2:

One of the reasons is that because of encoding.

In thumb, you cannot squeeze more 4 bits into the tight 16-bit space while there isn't even enough room for the 3 high bit of the registers and they must be reduced to a subset of only 8 registers. Note that in thumb2 you have a separate IT(E) instruction for selecting the conditions for the next 4 instructions. You can't store the condition in the same instruction though, because of the reason stated above.

For AArch64 the number of registers has been doubled compared to 32-bit ARM, but again you don't have any remaining bits for the new 3 high bits of the registers. If you want to use the old encoding then you must "borrow" either from the narrow 12-bit immediate or the 4-bit condition. 12-bit is too small compared to other RISC architectures such as MIPS and reducing it making everything worse, so removing the condition is a better choice. Because branch prediction has become more and more advanced, it won't be much a problem



回答3:

Conditional execution is a good choice in implementation of many auxiliary or bit-twiddling routines, such as sorting, list or tree manipulation, number to string conversion, sqrt or long division. We could add UART drivers and extracting bit fields in routers. Those have a high branch to non-branch ratio with somewhat high unpredictability too.

However, once you get beyond the lowest level of services (or increase the abstraction level by using a higher level language), the code looks completely different: code blocks inside different branches of conditions consists more of moving data and calling sub-routines. Here the benefits of those extra 4 bits rapidly fade away. It's not only personal development but cultural: Culturally programming has grown from unstructured (Basic, Fortran, Assembler) towards structural. Different programming paradigms are supported better also in different instruction set architectures.

A technological compromise could have been the possibility to compress the five bit 'cond.S' field to four or three most frequently used combinations.



回答4:

It's somewhat misleading to say that conditional execution is not present in ARMv8. The issue is to understand why you don't want to execute some instructions. Perhaps in the early ARM days, the actual non-execution of instructions mattered (for power or whatever) but today the significance of this feature is that it allows you to avoid branches for small dumb jumps, for example code like a=(b>0? 1: 2). This sort of thing is more common than you might imagine --- conceptually it's things like MAX/MIN or ABS (though for some CPUs there may be instructions to do these particular tasks).

In ARMv8, while there are not general conditionally executed instructions there are a few instructions that perform the specific task I am describing, namely allowing you to avoid branching for short dumb jumps; CSEL is the most obvious example, though there are other cases (e.g. conditional setting of conditions) to handle other common patterns (in that case the pattern of C short-circuited expression evaluation).

IMHO what ARM has done here is what makes the most sense. They've extracted the feature of conditional execution that remains valuable on modern CPUs (avoid many branches) while changing the details of the implementation to match the micro-architecture of modern CPUs.



回答5:

On the old ARM v4, the conditional instructions only saved time if there was a high probability that they would end up getting executed, or if the probability was about 50%, then if there were just 2 to 4 of them in a row. If they weren't getting executed, then it was wasting cycles to have to fetch past them, versus the overhead of using a branch to get past them. If they were being executed, the branch would be fetched but not executed.

A minor nuisance is that when debugging, placing a break on a conditional instruction always resulted in taking a break on that instruction, regardless of the condition (unless there's some really smart debugger that my company didn't have).



回答6:

"Why are conditionally executed instructions not present ..." "Was conditional execution a mistake at the time, or have subsequent changes made it an expensive waste of instruction bits?"

[Biased Reply, Author associated with the idea (and a proponent of it)]

Design choice. Add features and you expand your possibilities; remove them and you can implement workarounds, or do without.

Cost, TDP, and Patents, even the technical skill level necessary to develope a competing product all come into play.

There are billions on the table, when you roll those die you don't want to roll the whole Business too, so they choose not to.

In my opinion it is an error to omit elison ability in a Server Processor getting pinned to a Fabric, but I am biased.

It is not a mistake (to have it, well implemented), it is expensive, it's not a waste (the bits are smart and mind their own business).

It is like any CPU Extension, or adding a GPU; if you can make skillful use of your Tools then your good to go, otherwise pack light.

Wikipedia Quote: "According to different benchmarks, TSX can provide around 40% faster applications execution in specific workloads, and 4–5 times more database transactions per second (TPS).".

It's 'costly' (for some situations) but important for the current style of programming, or more pessimistically a means to score far higher in synthetic Benchmarks.

Someday it will be as easy as Lego, and you will be able to ask 'it' to assemble itself and do your biddingask; until then the Processor MUST support Programmer (and Compiler writers) laziness - thus the rarity of Programs that can run mostly on the GPU (but we are getting there).

Therefore, the removal of (great) features that are thought unwanted or were not implemented in a cost effective and competitive manner.

Thus, TSX rules currently; but ARM CPUs need fancy Threads for their Fabric too.

URL References:

AMD: https://en.wikipedia.org/wiki/Advanced_Synchronization_Facility

Intel: https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions

.



回答7:

Just like, the defer slot in mips being a trick (at the time), conditional execution in arm is a trick (at the time), as is the pc being two instructions ahead. Now down the road how much affect do they have? Will ARMs branch predictor actually make that much difference or is the real answer they needed more bits in a 32 bit instruction word and like thumb the first and easiest thing to get rid of is the condition bits.

it is not too difficult to do some performance tests to see how good or back the branch predictor really is, I tried it with unconditional branches on an arm11, granted that is an old architecture now but still in wide use. It was difficult at best to get the branch prediction to show any improvement, and in no way, shape, or form could it compete with the conditional execution. I have not repeated these experience on anything in the cortex-a family.