What do you mean by low-latency access of data?
I am actually confused about the definition of the term "latency".
Can anyone please elaborate on the term "latency"?
The classic example:
A wagon full of backup tapes is high latency, high bandwidth. There's a lot of information in those backup tapes, but it takes a long time for a wagon to get anywhere.
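The tape-wagon trade-off can be put into numbers. A minimal sketch, with purely illustrative figures (1000 tapes of 1 TB each, a one-day drive, a 1 Gb/s link with 10 ms latency; all assumptions, not measurements):

```python
# Total delivery time = access latency + volume / bandwidth.
# All figures below are illustrative assumptions, not measurements.

def delivery_time_s(latency_s: float, volume_B: float, bandwidth_Bps: float) -> float:
    return latency_s + volume_B / bandwidth_Bps

TB = 1e12  # bytes

# The wagon: 1000 x 1 TB tapes all arrive together after a one-day drive.
wagon_s = delivery_time_s(latency_s=24 * 3600, volume_B=1000 * TB,
                          bandwidth_Bps=float("inf"))

# A 1 Gb/s network link with 10 ms latency moving the same volume.
link_s = delivery_time_s(latency_s=0.010, volume_B=1000 * TB,
                         bandwidth_Bps=1e9 / 8)

print(f"wagon: {wagon_s / 3600:.0f} h, link: {link_s / 86400:.0f} days")
```

With these assumptions the wagon wins on total delivery (about a day versus about three months), yet only the link can put the *first* byte in your hands within 10 ms; that is the low-latency property.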
Low-latency networks are important for streaming services. Voice streaming needs very little bandwidth (4 kbps for telephone quality, AFAIR) but needs the packets to arrive fast. A voice call on a high-latency network results in a time lag between the speakers, even if there is enough bandwidth.
Other applications where latency is important: online gaming, video conferencing, remote desktops, and high-frequency trading.
LATENCY -
the amount of time it takes to get a response [µs]
BANDWIDTH -
the amount of data-flow volume per unit of time [GB/s]
The term LATENCY can be confusing if the whole context of the transaction life-cycle is not considered carefully: the participating line-segments { amplification | retiming | switching | MUX/MAP-ing | routing | EnDec-processing ( not speaking about cryptography ) | statistical-(de)compressing }, the data-flow duration, and the framing / line-code-protective add-ons / ( opt. protocol, if present, encapsulation and re-framing ) additional surplus overheads all continually increase the latency, but also increase the data-VOLUME.
Just as an example, take any GPU-engine marketing. The huge numbers about GigaBytes of GDDR5 and the GHz timing thereof are communicated in bold; what they omit to tell you is that, with all those zillions of things, each of your SIMT many-cores, yes, all the cores, has to pay a cruel latency-penalty and wait more than +400-800 [GPU-clk]s just to receive the first byte from the GPU-over-hyped-GigaHertz-Fast-DDRx-ECC-protected bank of memory.
Yes, your Super-Engine's GFLOPs/TFLOPs have to wait! ... because of (hidden) LATENCY.
And you wait with the full parallel-circus ... because of LATENCY.
( ... and no marketing bell or whistle can help, believe it or not ( forget about cache promises too; caches do not know what the hell will be in the far / late / distant memory cell, so they cannot feed you a single-bit copy of such a latency-"far" enigma from their shallow local-pockets ) )
LATENCY ( and taxes ) cannot be avoided. Highly professional HPC-designs only help to pay a smaller penalty, while they still cannot avoid the LATENCY (tax) penalty beyond some smart re-arrangement principles.
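The latency-hiding arithmetic behind that SIMT design can be sketched with a rough, Little's-law style estimate; both numbers below are illustrative assumptions in the spirit of the +400-800 [GPU-clk] figure above, not measured values:

```python
import math

# A rough estimate of how many resident warps an SM scheduler needs so
# that an N-cycle memory latency stays hidden behind independent work.
# Both inputs below are illustrative assumptions, not measured values.

def warps_to_hide(mem_latency_clk: int, work_clk_per_warp: int) -> int:
    """Warps needed in flight so issue slots never starve on memory waits."""
    return math.ceil(mem_latency_clk / work_clk_per_warp)

# ~600 clk to the first byte, ~20 clk of independent arithmetic per warp:
print(warps_to_hide(600, 20))  # -> 30 resident warps per scheduler
```

This is exactly why GPUs keep dozens of warps resident per multiprocessor: the LATENCY itself never shrinks, the scheduler just finds other work to do while waiting.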
CUDA Device:0_ has <_compute capability_> == 2.0.
CUDA Device:0_ has [ Tesla M2050] .name
CUDA Device:0_ has [ 14] .multiProcessorCount [ Number of multiprocessors on device ]
CUDA Device:0_ has [ 2817982464] .totalGlobalMem [ __global__ memory available on device in Bytes [B] ]
CUDA Device:0_ has [ 65536] .totalConstMem [ __constant__ memory available on device in Bytes [B] ]
CUDA Device:0_ has [ 1147000] .clockRate [ GPU_CLK frequency in kilohertz [kHz] ]
CUDA Device:0_ has [ 32] .warpSize [ GPU WARP size in threads ]
CUDA Device:0_ has [ 1546000] .memoryClockRate [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
CUDA Device:0_ has [ 384] .memoryBusWidth [ GPU_DDR Global memory bus width in bits [b] ]
CUDA Device:0_ has [ 1024] .maxThreadsPerBlock [ MAX Threads per Block ]
CUDA Device:0_ has [ 32768] .regsPerBlock [ MAX number of 32-bit Registers available per Block ]
CUDA Device:0_ has [ 1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
CUDA Device:0_ has [ 786432] .l2CacheSize
CUDA Device:0_ has [ 49152] .sharedMemPerBlock [ __shared__ memory available per Block in Bytes [B] ]
CUDA Device:0_ has [ 2] .asyncEngineCount [ a number of asynchronous engines ]
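From the listing above, the theoretical peak memory bandwidth of this device follows from the standard DDR estimate (memory clock x 2 transfers per clock x bus width); note that this peak figure says nothing about the latency to the first byte:

```python
# Theoretical peak GPU_DDR bandwidth from the device listing above:
# memoryClockRate [kHz] * 2 (DDR: two transfers per clock) * memoryBusWidth [b] / 8.
# This peak says nothing about the 400-800 [GPU-clk] time-to-first-byte.

def peak_bandwidth_GBps(memory_clock_kHz: int, bus_width_b: int) -> float:
    return memory_clock_kHz * 1e3 * 2 * bus_width_b / 8 / 1e9

# Tesla M2050 values from the listing: 1546000 [kHz], 384 [b]
print(f"{peak_bandwidth_GBps(1546000, 384):.1f} GB/s")  # -> 148.4 GB/s
```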
A POTS telephone service used to be based on synchronous, fixed-latency switching ( the late 70s merged the global, otherwise un-synchronisable Plesiochronous Digital Hierarchy networks between the Japanese PDH standard, the continental PDH-E3 inter-carrier standard and the US PDH-T3 carrier services, which finally avoided many headaches with international carrier-service jitter / slippage / (re)-synchronisation storms and drop-outs ), later carried on SDH/SONET STM-1 / STM-4 / STM-16 SyncMUX-circuits at 155 / 622 / 2488 [Mb/s] of BANDWIDTH.
The cool idea of SDH was the globally enforced fixed structure of time-aligned framing, which was both deterministic and stable.
This allowed the lower-order container-datastream components to be simply memory-mapped ( cross-connect switched ) from an incoming STMx onto the outgoing STMx/PDHy payloads on the SDH cross-connects ( remember, that was as far back as the late 70s, so CPU performance and DRAMs were decades away from handling GHz clocks and single-ns timings ). Such a box-inside-a-box-inside-a-box payload mapping provided low switching overheads in hardware and also provided some means for re-alignment in the time-domain ( there were some bit-gaps between the box-in-box boundaries, so as to provide some elasticity, well under the standard's given maximum skew in time ).
While it may be hard to explain the beauty of this concept in a few words, AT&T and other major global operators greatly enjoyed the synchronicity of the globally-synchronous SDH network and the local-side Add-Drop-MUX mappings.
Having said this, a latency-controlled design takes care of:
- ACCESS-LATENCY : how long it takes for the first-ever bit to arrive : [s]
- TRANSPORT-BANDWIDTH : how many bits it can transfer/deliver each next unit of time : [b/s]
- VOLUME OF DATA : how many bits of data there are in total to transport : [b]
- TRANSPORT DURATION : how many units of time it takes to move/deliver the whole VOLUME OF DATA to whoever has asked for it : [s]
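These four quantities combine into a single delivery-time estimate. A minimal sketch with illustrative numbers (the 4 kbps telephone-quality voice figure from above; the 150 ms access latency is a hypothetical long-haul first-bit delay):

```python
# The four quantities above in one estimate (illustrative numbers):
#   TRANSPORT DURATION = VOLUME OF DATA / TRANSPORT-BANDWIDTH
#   total delivery     = ACCESS-LATENCY + TRANSPORT DURATION

ACCESS_LATENCY_s = 0.150        # [s]   assumed long-haul first-bit delay
BANDWIDTH_bps    = 4_000        # [b/s] telephone-quality voice, as above
VOLUME_b         = 8 * 1_000    # [b]   a 1 kB chunk of data

transport_duration_s = VOLUME_b / BANDWIDTH_bps      # 2.0  [s]
total_s = ACCESS_LATENCY_s + transport_duration_s    # 2.15 [s]
print(total_s)  # -> 2.15
```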
A very nice illustration of the principal independence of THROUGHPUT ( BANDWIDTH [GB/s] ) from LATENCY [ns] is Fig. 4 in a lovely arXiv paper on improving latency from Ericsson, testing how the manycore RISC-processor Epiphany-64 architecture from Adapteva may help in driving LATENCY down in signal processing.
Understanding Fig. 4, extended in the core-dimension, can also show the possible scenarios:
- how to increase BANDWIDTH [GB/s] by involving more cores in the accelerated / TDMux-ed [Stage-C] processing ( interleaved in time )
- and also that LATENCY [ns] can never be shorter than the sum of the principal SEQ-process durations == [Stage-A] + [Stage-B] + [Stage-C], independently of the number of available ( single/many ) cores the architecture permits to use.
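The two bullet points can be sketched with a toy three-stage pipeline; the stage durations below are assumptions for illustration, not figures from the paper:

```python
# Toy three-stage pipeline in the spirit of Fig. 4.
# The stage durations are assumptions, not figures from the paper.

STAGE_A_ns, STAGE_B_ns, STAGE_C_ns = 100, 250, 400

def latency_ns(cores: int) -> int:
    # Every single result still traverses [Stage-A]+[Stage-B]+[Stage-C]
    # sequentially, so the latency floor is independent of the core count.
    return STAGE_A_ns + STAGE_B_ns + STAGE_C_ns

def results_per_us(cores: int) -> float:
    # Cores interleave [Stage-C] in time (TDMux), so throughput scales.
    return cores * 1_000 / STAGE_C_ns

for cores in (1, 4, 16):
    print(cores, latency_ns(cores), results_per_us(cores))
```

With 1, 4 or 16 cores the latency stays pinned at 750 ns while the throughput grows linearly; exactly the independence the figure illustrates.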
Great thanks to Andreas Olofsson & the Ericsson guys. KEEP WALKING, BRAVE MEN!