可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?

How many cycles does an L1 cache hit take? How does it compare to register access?

回答1:

Here's a great article on the subject:

http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1

To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)

PS:

The specifics will vary, but this link has some good ballpark figures:

Approximate cost to access various caches and main memory?

Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles remote
L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns

PPS:

These figures represent much older, slower CPUs, but the ratios basically hold:

http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2

Level                    Access Time  Typical Size  Technology    Managed By
-----                    -----------  ------------  ---------     -----------
Registers                1-3 ns       ?1 KB          Custom CMOS  Compiler
Level 1 Cache (on-chip)  2-8 ns       8 KB-128 KB    SRAM         Hardware
Level 2 Cache (off-chip) 5-12 ns      0.5 MB - 8 MB  SRAM         Hardware
Main Memory              10-60 ns     64 MB - 1 GB   DRAM         Operating System
Hard Disk                3M - 10M ns  20 - 100 GB    Magnetic     Operating System/User

回答2:

For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.

Intel Haswell's L1 load-use latency is 4 cycles, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax] can run in a loop, with a pointer that points to itself. (Or to a small closed-loop linked list).

Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.

Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, not L1 cache).

As harold commented, register access is 0 cycles. So, for example:

inc eax has 1 cycle latency (just the ALU operation)
inc dword [mem] has 6 cycle latency until a load from dword [mem] will be ready. (ALU + store-forwarding). e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles.
mov rax, [rsi] has 4 cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency.)

http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).

Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).

L1 Data cache = 32 KB, 64 B/line, 8-WAY.

L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.

L2 cache = 256 KB, 64 B/line, 8-WAY

L3 cache = 8 MB, 64 B/line

L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])

L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).

L2 Cache Latency = 12 cycles

L3 Cache Latency = 36 cycles

RAM Latency = 36 cycles + 57 ns

The top-level benchmark page is http://www.7-cpu.com/utils.html, but still doesn't really explain what the different test-sizes mean, but the code is available. The test results include Skylake, which is nearly the same as Haswell in this test.

@paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.

回答3:

If I remember correctly it's about 1-2 clock cycles but this is an estimate and newer caches may be faster. This is out of a Computer Architecture book I have and this is information for AMD so Intel may be slightly different but I would bound it between 5 and 15 clock cycles which seems like a good estimate to me.

EDIT: Whoops L2 is 10 cycles with TAG access, L1 takes 1 to two cycles, my mistake :\

回答4:

Actually the cost of the L1 cache hit is almost the same as a cost of register access. It was surprising for me, but this is true, at least for my processor (Athlon 64). Some time ago I written a simple test application to benchmark efficiency of access to the shared data in a multiprocessor system. The application body is a simple memory variable incrementing during the predefined period of time. To make a comapison, I benchmarked non-shared variable at first. And during that activity I captured the result, but then during application disassembling I found that compiler was deceived my expectations and apply unwanted optimisation to my code. It just put variable in the CPU register and increment it iterativetly in the register without memory access. But real surprise was achived after I force compliler to use in-memory variable instead of register variable. On updated application I achived almost the same benchmarking results. Performance degradation was really negligeble (~1-2%) and looks like related to some side effect.

As result:

1) I think you can consider L1 cache as an unmanaged processor registers pool.

2) There is no any sence to apply brutal assambly optimization by forcing compiler store frequently accesing data in processor registers. If they are really frequently accessed, they will live in the L1 cache, and due to this will have same access cost as the processor register.