I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?
How many cycles does an L1 cache hit take? How does it compare to register access?
Here's a great article on the subject:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1
To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)
PS:
The specifics will vary, but this link has some good ballpark figures:
Approximate cost to access various caches and main memory?
Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles remote
L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns
PPS:
These figures represent much older, slower CPUs, but the ratios basically hold:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2
Level Access Time Typical Size Technology Managed By
----- ----------- ------------ --------- -----------
Registers 1-3 ns ?1 KB Custom CMOS Compiler
Level 1 Cache (on-chip) 2-8 ns 8 KB-128 KB SRAM Hardware
Level 2 Cache (off-chip) 5-12 ns 0.5 MB - 8 MB SRAM Hardware
Main Memory 10-60 ns 64 MB - 1 GB DRAM Operating System
Hard Disk 3M - 10M ns 20 - 100 GB Magnetic Operating System/User
For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.
Intel Haswell's L1 load-use latency is 4 cycles, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax]
can run in a loop, with a pointer that points to itself. (Or to a small closed-loop linked list).
Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.
Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, not L1 cache).
As harold commented, register access is 0 cycles. So, for example:
inc eax
has 1 cycle latency (just the ALU operation)
inc dword [mem]
has 6 cycle latency until a load from dword [mem]
will be ready. (ALU + store-forwarding). e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles.
mov rax, [rsi]
has 4 cycle latency from rsi
being ready to rax
being ready on an L1 hit (L1 load-use latency.)
http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).
Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).
- L1 Data cache = 32 KB, 64 B/line, 8-WAY.
- L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.
- L2 cache = 256 KB, 64 B/line, 8-WAY
L3 cache = 8 MB, 64 B/line
L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax]
)
- L1 Data Cache Latency = 5 cycles for access with complex address calculation (
mov rax, [rsi + rax*8]
).
- L2 Cache Latency = 12 cycles
- L3 Cache Latency = 36 cycles
- RAM Latency = 36 cycles + 57 ns
The top-level benchmark page is http://www.7-cpu.com/utils.html, but still doesn't really explain what the different test-sizes mean, but the code is available. The test results include Skylake, which is nearly the same as Haswell in this test.
@paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.
If I remember correctly it's about 1-2 clock cycles but this is an estimate and newer caches may be faster. This is out of a Computer Architecture book I have and this is information for AMD so Intel may be slightly different but I would bound it between 5 and 15 clock cycles which seems like a good estimate to me.
EDIT: Whoops L2 is 10 cycles with TAG access, L1 takes 1 to two cycles, my mistake :\
Actually the cost of the L1 cache hit is almost the same as a cost of register access. It was surprising for me, but this is true, at least for my processor (Athlon 64). Some time ago I written a simple test application to benchmark efficiency of access to the shared data in a multiprocessor system. The application body is a simple memory variable incrementing during the predefined period of time. To make a comapison, I benchmarked non-shared variable at first. And during that activity I captured the result, but then during application disassembling I found that compiler was deceived my expectations and apply unwanted optimisation to my code. It just put variable in the CPU register and increment it iterativetly in the register without memory access. But real surprise was achived after I force compliler to use in-memory variable instead of register variable. On updated application I achived almost the same benchmarking results. Performance degradation was really negligeble (~1-2%) and looks like related to some side effect.
As result:
1) I think you can consider L1 cache as an unmanaged processor registers pool.
2) There is no any sence to apply brutal assambly optimization by forcing compiler store frequently accesing data in processor registers. If they are really frequently accessed, they will live in the L1 cache, and due to this will have same access cost as the processor register.