I have a basic question about assembly.
Why do we bother doing arithmetic operations only on registers if they can work on memory as well?
For example both of the following cause (essentially) the same value to be calculated as an answer:
Snippet 1
.data
var dd 00000400h
.code
Start:
add var,0000000Bh
mov eax,var
;breakpoint: var = 00000B04
End Start
Snippet 2
.code
Start:
mov eax,00000400h
add eax,0000000bh
;breakpoint: eax = 0000040B
End Start
From what I can see most texts and tutorials do arithmetic operations mostly on registers. Is it just faster to work with registers?
Edit: That was fast :)
A few great answers were given; best answer was chosen based on the first good answer.
Registers are much faster and also the operations that you can perform directly on memory are far more limited.
If you look at computer architectures, you find a series of levels of memory. Those that are close to the CPU are the fast, expensive (per a bit), and therefore small, while at the other end you have big, slow and cheap memory devices. In a modern computer, these are typically something like:
CPU registers (slightly complicated, but in the order of 1KB per a core - there
are different types of registers. You might have 16 64 bit
general purpose registers plus a bunch of registers for special
purposes)
L1 cache (64KB per core)
L2 cache (256KB per core)
L3 cache (8MB)
Main memory (8GB)
HDD (1TB)
The internet (big)
Over time, more and more levels of cache have been added - I can remember a time when CPUs didn't have any onboard caches, and I'm not even old! These days, HDDs come with onboard caches, and the internet is cached in any number of places: in memory, on the HDD, and maybe on caching proxy servers.
There is a dramatic (often orders of magnitude) decrease in bandwidth and increase in latency in each step away from the CPU. For example, a HDD might be able to be read at 100MB/s with a latency of 5ms (these numbers may not be exactly correct), while your main memory can read at 6.4GB/s with a latency of 9ns (six orders of magnitude!). Latency is a very important factor, as you don't want to keep the CPU waiting any longer than it has to (this is especially true for architectures with deep pipelines, but that's a discussion for another day).
The idea is that you will often be reusing the same data over and over again, so it makes sense to put it in a small fast cache for subsequent operations. This is referred to as temporal locality. Another important principle of locality is spatial locality, which says that memory locations near each other will likely be read at about the same time. It is for this reason that reading from RAM will cause a much larger block of RAM to be read and put into on-CPU cache. If it wasn't for these principles of locality, then any location in memory would have an equally likely chance of being read at any one time, so there would be no way to predict what will be accessed next, and all the levels of cache in the world will not improve speed. You might as well just use a hard drive, but I'm sure you know what it's like to have the computer come to a grinding halt when paging (which is basically using the HDD as an extension to RAM). It is conceptually possible to have no memory except for a hard drive (and many small devices have a single memory), but this would be painfully slow compared to what we're familiar with.
One other advantage of having registers (and only a small number of registers) is that it lets you have shorter instructions. If you have instructions that contain two (or more) 64 bit addresses, you are going to have some long instructions!
Registers are accessed way faster than RAM memory, since you don't have to access the "slow" memory bus!
x86, like pretty much every other "normal" CPU you might learn assembly for, is a "register machine". There are other ways to design something that you can program (e.g. a Turing machine that moves along a logical "tape" in memory), but register machines have proven to be basically the only way to go for high-performance.
Since x86 was designed to use registers, you can't really avoid them entirely, even if you wanted to and didn't care about performance.
Current x86 CPUs can read/write many more registers per clock cycle than memory locations.
For example, Intel Skylake can do two loads and one store from/to its 32kiB 8-way associative L1D cache per cycle (best case), but can read upwards of 10 registers per clock, and write 3 or 4 (plus EFLAGS).
Building an L1D cache with as many read/write ports as the register file would be prohibitively expensive (in transistor count/area and power usage), especially if you wanted to keep it as large as it is. It's probably just not physically possible to build something that can use memory the way x86 uses registers with the same performance.
Also, writing a register and then reading it again has essentially zero latency because the CPU detects this and forwards the result directly from the output of one execution unit to the input of another, bypassing the write-back stage. (See https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing).
These result-forwarding connections between execution units are called the "bypass network" or "forwarding network", and it's much easier for the CPU to do this for a register design than if everything had to go into memory and back out. The CPU only has to check a 3 to 5 bit register number, instead of an 32-bit or 64-bit address, to detect cases where the output of one instruction is needed right away as the input for another operation. (And those register numbers are hard-coded into the machine-code, so they're available right away.)
As others have mentioned, 3 or 4 bits to address a register make the machine-code format much more compact than if every instruction had absolute addresses.
See also https://en.wikipedia.org/wiki/Memory_hierarchy: you can think of registers as a small fast fixed-size memory space separate from main memory, where only direct absolute addressing is supported. (You can't "index" a register: given an integer N
in one register, you can't get the contents of the N
th register with one insn.)
Registers are also private to a single CPU core, so out-of-order execution can do whatever it wants with them. With memory, it has to worry about what order things become visible to other CPU cores.
Having a fixed number of registers is part of what lets CPUs do register-renaming for out-of-order execution. Having the register-number available right away when an instruction is decoded also makes this easier: there's never a read or write to a not-yet-known register.
See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for an explanation of register renaming, and a specific example (the later edits to the question / later parts of my answer showing the speedup from unrolling with multiple accumulators to hide FMA latency even though it reuses the same architectural register repeatedly).
The store buffer with store forwarding does basically give you "memory renaming". A store/reload to a memory location is independent of earlier stores and load to that location from within this core.
Repeated function calls with a stack-args calling convention, and/or returning a value by reference, are cases where the same bytes of stack memory can be reused multiple times.
The seconds store/reload can execute even if the first store is still waiting for its inputs. (I've tested this on Skylake, but IDK if I ever posted the results in an answer anywhere.)
We use registers because they are fast. Usually, they operate at CPU's speed.
Registers and CPU cache are made with different technology / fabrics and
they are expensive. RAM on the other hand is cheap and 100 times slower.
Generally speaking register arithmetic is much faster and much preferred. However there are some cases where the direct memory arithmetic is useful.
If all you want to do is increment a number in memory (and nothing else at least for a few million instructions) then a single direct memory arithmetic instruction is usually slightly faster than load/add/store.
Also if you are doing complex array operations you generally need a lot of registers to keep track of where you are and where your arrays end. On older architectures you could run out of register really quickly so the option of adding two bits of memory together without zapping any of your current registers was really useful.
Yes, it's much much much faster to use registers. Even if you only consider the physical distance from processor to register compared to proc to memory, you save a lot of time by not sending electrons so far, and that means you can run at a higher clock rate.
Yes - also you can typically push/pop registers easily for calling procedures, handling interrupts, etc
It's just that the instruction set will not allow you to do such complex operations:
add [0x40001234],[0x40002234]
You have to go through the registers.