Approximate cost to access various caches and main

2019-01-01 08:01发布

Can anyone give me the approximate time (in nanoseconds) to access L1, L2 and L3 caches, as well as main memory on Intel i7 processors?

While this isn't specifically a programming question, knowing these kinds of speed details is neccessary for some low-latency programming challenges.

109条回答
长期被迫恋爱
2楼-- · 2019-01-01 08:20
骚的不知所云
3楼-- · 2019-01-01 08:21

Just for a sake of 2015's review of the predictions for 2020:

Still some improvements, prediction for 2020 (Ref. olibre's answer below)
-------------------------------------------------------------------------
   16 000 ns ( 16 µs) SSD random read (olibre's note: should be less)
  500 000 ns (  ½ ms) Round trip in datacenter
2 000 000 ns (  2 ms) HDD random read (seek)

In 2015 there are currently available:
========================================================================
      820 ns ( 0.8µs)     random read from a SSD-DataPlane
    1 200 ns ( 1.2µs) Round trip in datacenter
    1 200 ns ( 1.2µs)     random read from a HDD-DataPlane

Just for a sake of CPU and GPU latency landscape comparison:

Not an easy task to compare even the simplest CPU / cache / DRAM lineups ( even in a uniform memory access model ), where DRAM-speed is a factor in determining latency, and loaded latency (saturated system), where the latter rules and is something the enterprise applications will experience more than an idle fully unloaded system.

                    +----------------------------------- 5,6,7,8,9,..12,15,16 
                    |                               +--- 1066,1333,..2800..3300
                    v                               v
First  word = ( ( CAS latency * 2 ) + ( 1 - 1 ) ) / Data Rate  
Fourth word = ( ( CAS latency * 2 ) + ( 4 - 1 ) ) / Data Rate
Eighth word = ( ( CAS latency * 2 ) + ( 8 - 1 ) ) / Data Rate
                                        ^----------------------- 7x .. difference
******************************** 
So:
===

resulting DDR3-side latencies are between _____________
                                          3.03 ns    ^
                                                     |
                                         36.58 ns ___v_ based on DDR3 HW facts

Uniform Memory Access

GPU-engines have received a lot of technical marketing, while deep internal dependencies are keys to understand both the real strengths and also the real weaknesses these architectures experience in practice ( typically much different than the aggressive marketing whistled-up expectations ).

   1 ns _________ LETS SETUP A TIME/DISTANCE SCALE FIRST:
          °      ^
          |\     |a 1 ft-distance a foton travels in vacuum ( less in dark-fibre )
          | \    |
          |  \   |
        __|___\__v____________________________________________________
          |    |
          |<-->|  a 1 ns TimeDOMAIN "distance", before a foton arrived
          |    |
          ^    v 
    DATA  |    |DATA
    RQST'd|    |RECV'd ( DATA XFER/FETCH latency )

  25 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor REGISTER access
  35 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor    L1-onHit-[--8kB]CACHE

  70 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor SHARED-MEM access

 230 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor texL1-onHit-[--5kB]CACHE
 320 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor texL2-onHit-[256kB]CACHE

 350 ns
 700 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor GLOBAL-MEM access
 - - - - -

Understanding internalities is thus much more important, than in other fields, where architectures are published and numerous benchmarks freely available. Many thanks to GPU-micro-testers, who 've spent their time and creativity to unleash the truth of the real schemes of work inside the black-box approach tested GPU devices.

    +====================| + 11-12 [usec] XFER-LATENCY-up   HostToDevice    ~~~ same as Intel X48 / nForce 790i
    |                                                                       
查看更多
倾城一夜雪
4楼-- · 2019-01-01 08:21
浮光初槿花落
5楼-- · 2019-01-01 08:21
笑指拈花
6楼-- · 2019-01-01 08:22

Numbers everyone should know

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|

From: Originally by Peter Norvig:
- http://norvig.com/21-days.html#answers
- http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/,
- http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine

a visual comparison

查看更多
若你有天会懂
7楼-- · 2019-01-01 08:22
登录 后发表回答