Can anyone give me the approximate time (in nanoseconds) to access L1, L2 and L3 caches, as well as main memory on Intel i7 processors?
While this isn\'t specifically a programming question, knowing these kinds of speed details is neccessary for some low-latency programming challenges.
Here is a Performance Analysis Guide for the i7 and Xeon range of processors. I should stress, this has what you need and more (for example, check page 22 for some timings & cycles for example).
Additionally, this page has some details on clock cycles etc. The second link served the following numbers:
Core i7 Xeon 5500 Series Data Source Latency (approximate) [Pg. 22]
local L1 CACHE hit, ~4 cycles ( 2.1 - 1.2 ns )
local L2 CACHE hit, ~10 cycles ( 5.3 - 3.0 ns )
local L3 CACHE hit, line unshared ~40 cycles ( 21.4 - 12.0 ns )
local L3 CACHE hit, shared line in another core ~65 cycles ( 34.8 - 19.5 ns )
local L3 CACHE hit, modified in another core ~75 cycles ( 40.2 - 22.5 ns )
remote L3 CACHE (Ref: Fig.1 [Pg. 5]) ~100-300 cycles ( 160.7 - 30.0 ns )
local DRAM ~60 ns
remote DRAM ~100 ns
EDIT2
:
The most important is the notice under the cited table, saying:
\"NOTE: THESE VALUES ARE ROUGH APPROXIMATIONS. THEY DEPEND ON
CORE AND UNCORE FREQUENCIES, MEMORY SPEEDS, BIOS SETTINGS,
NUMBERS OF DIMMS, ETC,ETC..YOUR MILEAGE MAY VARY.\"
EDIT: I should highlight that, as well as timing/cycle information, the above intel document addresses much more (extremely) useful details of the i7 and Xeon range of processors (from a performance point of view).
Numbers everyone should know
0.5 ns - CPU L1 dCACHE reference
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1K bytes with Zippy PROCESS
20,000 ns - Send 2K bytes over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
| | | |
| | | ns|
| | us|
| ms|
From:
Originally by Peter Norvig:
- http://norvig.com/21-days.html#answers
- http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/,
- http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine
Cost to access various memories in a pretty page
- See this page presenting the memory latency decrease from 1990 to 2020.
Summary
Values having decreased but are stabilized since 2005
1 ns L1 cache
3 ns Branch mispredict
4 ns L2 cache
17 ns Mutex lock/unlock
100 ns Main memory (RAM)
2 000 ns (2µs) 1KB Zippy-compress
Still some improvements, prediction for 2020
16 000 ns (16µs) SSD random read (olibre\'s note: should be less)
500 000 ns (½ms) Round trip in datacenter
2 000 000 ns (2ms) HDD random read (seek)
See also other sources
- What every programmer should know about memory from Ulrich Drepper (2007)
Old but still an excellent deep explanation about memory hardware and software interaction.
- Full PDF (114 pages)
- Comments on LWN about PDF version
- Another ones
- Seven posts on LWN + Comments
- Part 1 - Introduction
- Part 2 - Cache
- Part 3 - Virtual Memory
- Part 4 - NUMA support
- Part 5 - What programmers can do
- Part 6 - More things programmers can do
- Part 7 - Memory performance tools
- Post The Infinite Space Between Words in codinghorror.com based on book Systems Performance: Enterprise and the Cloud
- Click to each processor listed on http://www.7-cpu.com/ to see the L1/L2/L3/RAM/... latencies (e.g. Haswell i7-4770 has L1=1ns, L2=3ns, L3=10ns, RAM=67ns, BranchMisprediction=4ns)
- http://idarkside.org/posts/numbers-you-should-know/
See also
For further understanding, I recommend the excellent presentation of modern cache architectures (June 2014) from Gerhard Wellein, Hannes Hofmann and Dietmar Fey at University Erlangen-Nürnberg.
French speaking people may appreciate an article by SpaceFox comparing a processor with a developer both waiting for information required to continue to work.
Just for a sake of 2015\'s review of the predictions for 2020:
Still some improvements, prediction for 2020 (Ref. olibre\'s answer below)
-------------------------------------------------------------------------
16 000 ns ( 16 µs) SSD random read (olibre\'s note: should be less)
500 000 ns ( ½ ms) Round trip in datacenter
2 000 000 ns ( 2 ms) HDD random read (seek)
In 2015 there are currently available:
========================================================================
820 ns ( 0.8µs) random read from a SSD-DataPlane
1 200 ns ( 1.2µs) Round trip in datacenter
1 200 ns ( 1.2µs) random read from a HDD-DataPlane
Just for a sake of CPU and GPU latency landscape comparison:
Not an easy task to compare even the simplest CPU / cache / DRAM lineups ( even in a uniform memory access model ), where DRAM-speed is a factor in determining latency, and loaded latency (saturated system), where the latter rules and is something the enterprise applications will experience more than an idle fully unloaded system.
+----------------------------------- 5,6,7,8,9,..12,15,16
| +--- 1066,1333,..2800..3300
v v
First word = ( ( CAS latency * 2 ) + ( 1 - 1 ) ) / Data Rate
Fourth word = ( ( CAS latency * 2 ) + ( 4 - 1 ) ) / Data Rate
Eighth word = ( ( CAS latency * 2 ) + ( 8 - 1 ) ) / Data Rate
^----------------------- 7x .. difference
********************************
So:
===
resulting DDR3-side latencies are between _____________
3.03 ns ^
|
36.58 ns ___v_ based on DDR3 HW facts
GPU-engines have received a lot of technical marketing, while deep internal dependencies are keys to understand both the real strengths and also the real weaknesses these architectures experience in practice ( typically much different than the aggressive marketing whistled-up expectations ).
1 ns _________ LETS SETUP A TIME/DISTANCE SCALE FIRST:
° ^
|\\ |a 1 ft-distance a foton travels in vacuum ( less in dark-fibre )
| \\ |
| \\ |
__|___\\__v____________________________________________________
| |
|<-->| a 1 ns TimeDOMAIN \"distance\", before a foton arrived
| |
^ v
DATA | |DATA
RQST\'d| |RECV\'d ( DATA XFER/FETCH latency )
25 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor REGISTER access
35 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor L1-onHit-[--8kB]CACHE
70 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor SHARED-MEM access
230 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor texL1-onHit-[--5kB]CACHE
320 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor texL2-onHit-[256kB]CACHE
350 ns
700 ns @ 1147 MHz FERMI: GPU Streaming Multiprocessor GLOBAL-MEM access
- - - - -
Understanding internalities is thus much more important, than in other fields, where architectures are published and numerous benchmarks freely available. Many thanks to GPU-micro-testers, who \'ve spent their time and creativity to unleash the truth of the real schemes of work inside the black-box approach tested GPU devices.
+====================| + 11-12 [usec] XFER-LATENCY-up HostToDevice ~~~ same as Intel X48 / nForce 790i
| |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
| |||||||||||||||||| ~ 5.5 GB/sec XFER-BW-up ~~~ same as DDR2/DDR3 throughput
| |||||||||||||||||| ~ 5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
|
| Host-side
| cudaHostRegister( void *ptr, size_t size, unsigned int flags )
| | +-------------- cudaHostRegisterPortable -- marks memory as PINNED MEMORY for all CUDA Contexts, not just the one, current, when the allocation was performed
| ___HostAllocWriteCombined_MEM / cudaHostFree() +---------------- cudaHostRegisterMapped -- maps memory allocation into the CUDA address space ( the Device pointer can be obtained by a call to cudaHostGetDevicePointer( void **pDevice, void *pHost, unsigned int flags=0 ); )
| ___HostRegisterPORTABLE___MEM / cudaHostUnregister( void *ptr )
| ||||||||||||||||||
| ||||||||||||||||||
| | PCIe-2.0 ( 4x) | ~ 4 GB/s over 4-Lanes ( PORT #2 )
| | PCIe-2.0 ( 8x) | ~16 GB/s over 8-Lanes
| | PCIe-2.0 (16x) | ~32 GB/s over 16-Lanes ( mode 16x )
|
| + PCIe-3.0 25-port 97-lanes non-blocking SwitchFabric ... +over copper/fiber
| ~~~ The latest PCIe specification, Gen 3, runs at 8Gbps per serial lane, enabling a 48-lane switch to handle a whopping 96 GBytes/sec. of full duplex peer to peer traffic. [I:]
|
| ~810 [ns] + InRam-\"Network\" / many-to-many parallel CPU/Memory \"message\" passing with less than 810 ns latency any-to-any
|
| ||||||||||||||||||
| ||||||||||||||||||
+====================|
|.pci............HOST|
My apology for a \"bigger-picture\", but latency-demasking has also cardinal limits imposed from on-chip smREG/L1/L2-capacities and hit/miss-rates.
|.pci............GPU.|
| | FERMI [GPU-CLK] ~ 0.9 [ns] but THE I/O LATENCIES PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
| ^^^^^^^^|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [!!]
| smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
| +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^
| | ^^^^^^^^
| +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
| | ^^^^^^^^
| ~ +20 [ns] @1147 MHz FERMI ^^^^^^^^
| SM-REGISTERs/thread: max 63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
| max 63 for CC-3.0 - about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
| max 128 for CC-1.x PAR -- ||||||||~~~|
| max 255 for CC-3.5 PAR -- ||||||||||||||||||~~~~~~|
|
| smREGs___BW ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE << -Xptxas -v || nvcc -maxrregcount ( w|w/o spillover(s) )
| with about 8.0 TB/s BW [C:Pg.46]
| 1.3 TB/s BW shaMEM___ 4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
| 0.1 TB/s BW gloMEM___
| ________________________________________________________________________________________________________________________________________________________________________________________________________________________
+========| DEVICE:3 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+======| DEVICE:2 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+====| DEVICE:1 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+==| DEVICE:0 PERSISTENT gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
! | |\\ + |
o | texMEM___|_\\___________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
| |\\ \\ |\\ + |\\ |
| texL2cache_| \\ \\ .| \\_ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | \\ 256_KB|
| | \\ \\ | \\ + |\\ ^ \\ |
| | \\ \\ | \\ + | \\ ^ \\ |
| | \\ \\ | \\ + | \\ ^ \\ |
| texL1cache_| \\ \\ .| \\_ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ | \\_ _ _ _ _^ \\ 5_KB|
| | \\ \\ | \\ + ^\\ ^ \\ ^\\ \\ |
| shaMEM + conL3cache_| \\ \\ | \\ _ _ _ _ conL3cache +220 [GPU_CLKs] ^ \\ ^ \\ ^ \\ \\ 32_KB|
| | \\ \\ | \\ ^\\ + ^ \\ ^ \\ ^ \\ \\ |
| | \\ \\ | \\ ^ \\ + ^ \\ ^ \\ ^ \\ \\ |
| ______________________|__________\\_\\_______________________|__________\\_____^__\\________+__________________________________________\\_________\\_____\\________________________________|
| +220 [GPU-CLKs]_| |_ _ _ ___|\\ \\ \\_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \\ _ _ _ _\\_ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
| L2-on-re-use-only +80 [GPU-CLKs]_| 64 KB L2_|_ _ _ __|\\\\ \\ \\_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \\ _ _ _ _\\_ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
| L1-on-re-use-only +40 [GPU-CLKs]_| 8 KB L1_|_ _ _ _|\\\\\\ \\_\\__________________________________\\________\\_____+ 40 [GPU_CLKs]_____________________________________________________________________________|
| L1-on-re-use-only + 8 [GPU-CLKs]_| 2 KB L1_|__________|\\\\\\\\__________\\_\\__________________________________\\________\\____+ 8 [GPU_CLKs]_________________________________________________________conL1cache 2_KB|
| on-chip|smREG +22 [GPU-CLKs]_| |t[0_______^:~~~~~~~~~~~~~~~~\\:________]
|CC- MAX |_|_|_|_|_|_|_|_|_|_|_| |t[1_______^ :________]
|2.x 63 |_|_|_|_|_|_|_|_|_|_|_| |t[2_______^ :________]
|1.x 128 |_|_|_|_|_|_|_|_|_|_|_| |t[3_______^ :________]
|3.5 255 REGISTERs|_|_|_|_|_|_|_|_| |t[4_______^ :________]
| per|_|_|_|_|_|_|_|_|_|_|_| |t[5_______^ :________]
| Thread_|_|_|_|_|_|_|_|_|_| |t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| |t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| W0..|t[ F_______^____________WARP__:________]_____________
| |_|_|_|_|_|_|_|_|_|_|_| ..............
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[0_______^:~~~~~~~~~~~~~~~\\:________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[1_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[2_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[3_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[4_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[5_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| W1..............|t[ F_______^___________WARP__:________]_____________
| |_|_|_|_|_|_|_|_|_|_|_| ....................................................
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[0_______^:~~~~~~~~~~~~~~~\\:________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[1_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[2_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[3_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[4_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[5_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________
|
| ________________ °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
| / \\ CC-2.0|||||||||||||||||||||||||| ~masked ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| / \\ 1.hW ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
| / \\ 2.hW |^|^|^|^|^|^|^|^|^|^|^|^|^ |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
|_______________/ \\______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
|~~~~~~~~~~~~~~/ SM:0.warpScheduler /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
| \\ | //
| \\ RR-mode //
| \\ GREEDY-mode //
| \\________________//
| \\______________/SM:0__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:1__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:2__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:3__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:4__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:5__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:6__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:7__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:8__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:9__________________________________________________________________________________
| ..|SM:A |t[ F_______^___________WARP__:________]_______
| ..|SM:B |t[ F_______^___________WARP__:________]_______
| ..|SM:C |t[ F_______^___________WARP__:________]_______
| ..|SM:D |t[ F_______^___________WARP__:________]_______
| |_______________________________________________________________________________________
*/
The bottom line?
Any low-latency motivated design has to rather reverse-engineer the \"I/O-hydraulics\" ( as 0 1-XFERs are incompressible by the nature ) and the resulting latencies rule the performance envelope for any GPGPU solution be it computationally intensive ( read: where processing costs are forgiving a bit more a poor latency XFERs ... ) or not ( read: where ( might be to someone\'s surprise ) CPU-s are faster in end-to-end processing, than GPU fabrics [citations available] ).
Look at this \"staircase\" plot, perfectly illustrating different access times (in terms of clock tics). Notice the red CPU having an additional \"step\", probably because it has L4 (while others don\'t).
Taken from this Extremetech article.
In computer science this is called \"I/O complexity\".