How to determine CPE: Cycles Per Element

2019-07-19 07:32发布

How do I determine the CPE of a program? For example, I have this assembly code for a loop:

# inner4: data_t = float
# udata in %rbx, vdata in %rax, limit in %rcx,
# i in %rdx, sum in %xmm1
1 .L87:                                   # loop:
2   movss  (%rbx,%rdx,4), %xmm0           #  Get udata[i]
3   mulss  (%rax,%rdx,4), %xmm0           #  Multiply by vdata[i]
4   addss  %xmm0, %xmm1                   #  Add to sum
5   addq  $1, %rdx                        #  Increment i
6   cmpq  %rcx, %rdx                      #  Compare i:limit
7   jl .L87                               #  If <, goto loop

I have to find the lower bound of the CPE determined by the critical path using the data type float. I believe that the critical path would refer to the slowest possible path, and would thus be the one where the program has to execute the mulss instruction because that takes up the longest number of clock cycles.

However, there doesn't seem to be any clear way to determine the CPE. If one instruction takes two clock cycles, and another takes one, can the latter start after the first clock cycle of the former? Any help would be appreciated. Thanks

2条回答
Lonely孤独者°
2楼-- · 2019-07-19 07:38

If you want to know how long it needs, you should measure it. Execute the loop some about 10^10 times, take the time it needs and multiply by the clock frequency. You get the total count of cycles, divide by 10^10 to get the number of clock cycles per loop iteration.

A theoretical prediction of the execution time will almost never be correct (and most of time to low) because the are numerous effects which determine the speed:

  • Pipelining (there can be easily about 20 stages in the pipeline)
  • Superscalar execution (up to 5 instructions in parallel, cmp and jl may be fused)
  • Decoding to µOps and reordering
  • The latencies of Caches or Memory
  • The throughput of the instructions (are there enough executions ports free)
  • The latencies of the instructions
  • Bank conflicts, aliasing issues and more esoteric stuff

Depending on the CPU and provided the memory accesses all hit the L1 cache, I believe the loop should need at least 3 clock cycles per iteration, because the longest dependency chain is 3 elements long. On an older CPU with slower mulss or addss instruction the time needed increases.

If you are actually interested in speeding up the code and not only some theoretical observations you should vectorize it. You can increase the performance by a factor of 4-8 with something like

.L87:                               # loop:
vmovdqa (%rbx,%rdx,4), %ymm0        #  Get udata[i]..udata[i+7]
vmulps  (%rax,%rdx,4), %ymm0, %ymm0 #  Multiply by vdata[i]..vdata[i+7]
vaddps  %ymm0, %ymm1, %ymm1         #  Add to sum
addq    $8, %rdx                    #  Increment i
cmpq    %rcx, %rdx                  #  Compare i:limit
jl .L87                             #  If <, goto loop

You need to horizontal add all 8 elements after that and of course make sure alignment is 32 and loop counter divisible by 8.

查看更多
虎瘦雄心在
3楼-- · 2019-07-19 07:59

If you're running an an Intel CPU, you can find some good documentation on instruction latency and throughput for various CPUs. Here's the link:

Intel® 64 and IA-32 Architectures Optimization Reference Manual

查看更多
登录 后发表回答