How to determine CPE: Cycles Per Element

How do I determine the CPE of a program? For example, I have this assembly code for a loop:

# inner4: data_t = float
# udata in %rbx, vdata in %rax, limit in %rcx,
# i in %rdx, sum in %xmm1
1 .L87:                                   # loop:
2   movss  (%rbx,%rdx,4), %xmm0           #  Get udata[i]
3   mulss  (%rax,%rdx,4), %xmm0           #  Multiply by vdata[i]
4   addss  %xmm0, %xmm1                   #  Add to sum
5   addq  $1, %rdx                        #  Increment i
6   cmpq  %rcx, %rdx                      #  Compare i:limit
7   jl .L87                               #  If <, goto loop

I have to find the lower bound of the CPE determined by the critical path using the data type float. I believe that the critical path would refer to the slowest possible path, and would thus be the one where the program has to execute the mulss instruction because that takes up the longest number of clock cycles.

However, there doesn't seem to be any clear way to determine the CPE. If one instruction takes two clock cycles, and another takes one, can the latter start after the first clock cycle of the former? Any help would be appreciated. Thanks

标签： performance architecture assembly x86 sse

2条回答

Lonely孤独者°

2楼-- · 2019-07-19 07:38

If you want to know how long it needs, you should measure it. Execute the loop some about 10^10 times, take the time it needs and multiply by the clock frequency. You get the total count of cycles, divide by 10^10 to get the number of clock cycles per loop iteration.

A theoretical prediction of the execution time will almost never be correct (and most of time to low) because the are numerous effects which determine the speed:

Pipelining (there can be easily about 20 stages in the pipeline)
Superscalar execution (up to 5 instructions in parallel, cmp and jl may be fused)
Decoding to µOps and reordering
The latencies of Caches or Memory
The throughput of the instructions (are there enough executions ports free)
The latencies of the instructions
Bank conflicts, aliasing issues and more esoteric stuff

Depending on the CPU and provided the memory accesses all hit the L1 cache, I believe the loop should need at least 3 clock cycles per iteration, because the longest dependency chain is 3 elements long. On an older CPU with slower mulss or addss instruction the time needed increases.

If you are actually interested in speeding up the code and not only some theoretical observations you should vectorize it. You can increase the performance by a factor of 4-8 with something like

.L87:                               # loop:
vmovdqa (%rbx,%rdx,4), %ymm0        #  Get udata[i]..udata[i+7]
vmulps  (%rax,%rdx,4), %ymm0, %ymm0 #  Multiply by vdata[i]..vdata[i+7]
vaddps  %ymm0, %ymm1, %ymm1         #  Add to sum
addq    $8, %rdx                    #  Increment i
cmpq    %rcx, %rdx                  #  Compare i:limit
jl .L87                             #  If <, goto loop

You need to horizontal add all 8 elements after that and of course make sure alignment is 32 and loop counter divisible by 8.

0人赞添加讨论(0) 举报

虎瘦雄心在

3楼-- · 2019-07-19 07:59

If you're running an an Intel CPU, you can find some good documentation on instruction latency and throughput for various CPUs. Here's the link:

Intel® 64 and IA-32 Architectures Optimization Reference Manual

0人赞添加讨论(0) 举报

How to determine CPE: Cycles Per Element

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间