What do you mean by low latency access of data?
I am actually confused about the definition of the term "LATENCY".
Can anyone please elaborate the term "Latency".
The classic example:
A wagon full of backup tapes is high latency, high bandwidth. There's a lot of information in those backup tapes, but it takes a long time for a wagon to get anywhere.
Low-latency networks are important for streaming services. Voice streaming needs very little bandwidth (around 4 kb/s for telephone quality, AFAIR) but needs the packets to arrive quickly. A voice call on a high-latency network produces a noticeable lag between the speakers, even if there is plenty of bandwidth.

Latency also matters for other interactive applications, such as online gaming and remote desktop sessions.
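The wagon-vs-network trade-off above can be made concrete: total delivery time is first-bit latency plus volume divided by bandwidth. A minimal sketch, where all the numbers (drive time, payload size, link speed) are illustrative assumptions, not measurements:

```python
def delivery_time_s(latency_s: float, volume_bits: float, bandwidth_bps: float) -> float:
    """Time until the WHOLE payload has arrived: first-bit delay + streaming time."""
    return latency_s + volume_bits / bandwidth_bps

TB = 8e12  # bits per terabyte

# A wagon of tapes: ~1 day of latency, but the whole payload arrives at once.
wagon = delivery_time_s(latency_s=24 * 3600, volume_bits=100 * TB,
                        bandwidth_bps=float("inf"))

# A 100 Mb/s link: ~20 ms first-bit latency, then a very long streaming phase.
link = delivery_time_s(latency_s=0.020, volume_bits=100 * TB,
                       bandwidth_bps=100e6)

print(f"wagon: {wagon / 3600:.1f} h, link: {link / 3600:.1f} h")
```

For 100 TB the wagon wins by a huge margin despite its day-long latency; for a 4 kb/s voice call the ordering flips, because the wagon's latency makes conversation impossible.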
LATENCY: the amount of time it takes to get a response [µs]

BANDWIDTH: the amount of data-flow volume per unit of time [GB/s]

Marketing papers are fabulous at mystifying LATENCY figures. The term is easy to misread if you do not take into account the whole context of the transaction life-cycle: the participating line segments { amplification | retiming | switching | MUX/MAP-ing | routing | EnDec processing (not speaking about cryptography) | statistical (de)compression }, the data-flow duration, and the framing / line-code-protective add-ons / (optional protocol encapsulation and re-framing) surplus overheads, which continually increase latency while also increasing the data VOLUME.

Just as an example, take any GPU-engine marketing. The huge numbers about GigaBytes of DDR5 memory and the GHz timing thereof are communicated in bold; what they silently omit to tell you is that, with all those zillions of things, each of your SIMT many-cores (yes, all the cores) has to pay a cruel latency penalty and wait more than +400-800 [GPU-clk]s just to receive the first byte from the over-hyped, GigaHertz-fast, DDRx, ECC-protected bank of GPU memory.

Yes, your super-engine's GFLOPs/TFLOPs have to wait! ... because of (hidden) LATENCY.

And you wait with the whole parallel circus ... because of LATENCY.

( ... and no marketing bell or whistle can help, believe it or not. Forget about cache promises too: a cache cannot know what the hell will be in a far / late / distant memory cell, so it cannot feed you a single-bit copy of such a latency-"far" enigma from its shallow local pockets. )
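To put the quoted clock-count into perspective, here is a rough conversion to wall-clock time. The 1.5 GHz core clock is an assumed, illustrative figure (it is not stated above); only the 400-800 [GPU-clk] window comes from the text:

```python
# Convert the quoted +400-800 [GPU-clk] first-byte penalty into nanoseconds.
gpu_clock_hz = 1.5e9          # assumption: a 1.5 GHz GPU core clock
stall_clocks = (400, 800)     # the latency window quoted above

for clks in stall_clocks:
    stall_ns = clks / gpu_clock_hz * 1e9
    print(f"{clks} clk ~ {stall_ns:.0f} ns waiting for the first byte")
```

Several hundred nanoseconds per first byte is exactly why GPUs oversubscribe each core with many resident warps: while one warp stalls on memory, others keep the ALUs busy, hiding (but never removing) the latency.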
LATENCY (and taxes) cannot be avoided. Highly professional HPC designs only help you pay a smaller penalty; beyond some smart re-arrangement principles, they still cannot avoid the LATENCY (as taxes) penalty.

Yes, telephone! Why not?
A cool point to remember: 8 kHz, 8-bit sampling on 64 kb/s circuit switching, as used inside the E1/T1 TELCO hierarchy.
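The quoted rates can be checked with simple arithmetic; the 32-timeslot E1 layout (30 voice channels plus framing and signalling slots) is the standard framing, stated here as background rather than taken from the text above:

```python
# Where the 64 kb/s timeslot and the E1 line rate come from.
sample_rate_hz = 8_000   # 8 kHz voice sampling (covers ~3.4 kHz speech)
bits_per_sample = 8      # 8-bit companded PCM

timeslot_bps = sample_rate_hz * bits_per_sample
print(timeslot_bps)      # one 64 kb/s voice channel (a DS0)

# An E1 frame carries 32 such timeslots (30 voice + framing + signalling):
e1_bps = 32 * timeslot_bps
print(e1_bps)            # the 2.048 Mb/s E1 line rate
```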
A POTS telephone service used to be based on synchronous, fixed-latency switching. (The late '70s merged the global, otherwise un-synchronisable Plesiochronous Digital Hierarchy networks: the Japanese PDH standard, the continental PDH E3 inter-carrier standard, and the US PDH T3 carrier services. That finally removed many headaches with international carrier-service jitter / slippage / (re)synchronisation storms and drop-outs.) SDH / SONET STM-1 / 4 / 16 followed, carried on 155 / 622 / 2488 [Mb/s] BANDWIDTH SyncMUX circuits.

The cool idea behind SDH was the globally enforced, fixed structure of time-aligned framing, which was both deterministic and stable. This allowed lower-order container-datastream components to simply be memory-mapped (cross-connect switched): copied from an incoming STMx onto outgoing STMx/PDHy payloads at the SDH cross-connects (remember, this was as far back as the late '70s, so CPU performance and DRAMs were decades away from handling GHz speeds). Such a box-inside-a-box-inside-a-box payload mapping provided low switching overheads in hardware, and also provided some means for re-alignment in the time domain (there were some bit gaps at the box-in-box boundaries, so as to provide some elasticity, well under a standard-given maximum skew in time).

While it may be hard to explain the beauty of this concept in a few words, AT&T and the other major global operators greatly enjoyed the SDH synchronicity and the beauty of the globally synchronous SDH network and the local-side Add-Drop-MUX mappings.
Having said this, a latency-controlled design takes care of:

- ACCESS LATENCY: how long it takes for the first ever bit to arrive: [s]
- TRANSPORT BANDWIDTH: how many bits it can transfer/deliver in each next unit of time: [b/s]
- VOLUME OF DATA: how many bits of data there are in total to transport: [b]
- TRANSPORT DURATION: how many units of time it takes to move/deliver the whole VOLUME OF DATA to whoever asked for it: [s]
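These four quantities combine into a single end-to-end figure: TRANSPORT DURATION = ACCESS LATENCY + VOLUME OF DATA / TRANSPORT BANDWIDTH. A minimal sketch showing which term dominates in each regime (the payload sizes and link figures are illustrative assumptions):

```python
def transport_duration_s(access_latency_s: float,
                         volume_bits: float,
                         bandwidth_bps: float) -> float:
    """TRANSPORT DURATION = ACCESS LATENCY + VOLUME OF DATA / BANDWIDTH."""
    return access_latency_s + volume_bits / bandwidth_bps

# Small payload on a fast link: the latency term dominates
# (the interactive / voice-traffic regime).
print(transport_duration_s(0.100, 8_000, 100e6))    # ~0.10008 s

# Huge payload on the same link: the bandwidth term dominates
# (the bulk-transfer / tape-wagon regime).
print(transport_duration_s(0.100, 8e12, 100e6))     # ~80000.1 s
```

This is why no single number answers "how fast is this link?": interactive workloads are bought with low ACCESS LATENCY, bulk workloads with high TRANSPORT BANDWIDTH.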
Epilogue: