How do I calculate clock cycles in MIPS assembly?

Posted 2019-08-21 09:49

Question:

I've searched everywhere, and all I've gathered is that it has something to do with pipelines. I've checked other questions and it seems there are single-cycle and multi-cycle designs: Clock cycles of Single-cycle and multi-cycle MIPS

How do I tell which one applies to my code?

For example, how many clock cycles would this take:

 li   $a0, 0           # $a0 = 0
 li   $t0, 10          # initialize loop counter to 10
Loop:
 add  $a0, $a0, $t0
 addi $t0, $t0, -1     # decrement loop counter
 bgt  $t0, $zero, Loop # if ($t0 > 0) branch to Loop

My professor gave me this for an assignment: "Assume that loading and storing values in memory cost 100 cycles plus the cost of the instruction itself."

From what I've read, loading is 5 cycles, so why is my professor saying it's 100 cycles? Can I just make it whatever number I want? I'm very confused.

Answer 1:

This question doesn't make sense.

The standard multi-cycle RISC pipeline, as used in most educational MIPS implementations, is fundamentally based around the requirement that program and data memory can both be accessed simultaneously in a single cycle. To "assume that loading and storing values in memory cost 100 cycles" would require a completely different architecture.



Answer 2:

We have to distinguish between two cases:

Case 1: MIPS simulators

From what I've read, loading is 5 cycles, so why is my professor saying it's 100 cycles?

You are not working with a real CPU but with a simulated one, so arguing about how many cycles your program "really" needs doesn't make sense: the only meaningful number is whatever the simulation model charges.

One simulator may charge 5 cycles for each memory access; another may charge 10 cycles, or only 1.

This means you have to say which simulator (or which cost model) is being used whenever you talk about the number of simulated cycles. Your professor is telling you to assume a model in which each memory access costs 100 cycles.
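For example, under that cost model a short load-modify-store sequence could be counted like this (assuming, as a guess, that each instruction itself costs 1 cycle; your professor's model may charge the base instruction cost differently):

 lw   $t1, 0($a0)      # 1 (instruction) + 100 (memory access) = 101 cycles
 add  $t1, $t1, $t2    # 1 cycle, no memory access
 sw   $t1, 0($a0)      # 1 + 100 = 101 cycles
                       # total under this model: 101 + 1 + 101 = 203 cycles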

Case 2: Real MIPS CPUs

Can I make it whatever number I want?

In this case you'll have to check the CPU's manual to find the actual number of cycles each instruction needs.

However, the instruction set of real MIPS CPUs is not 100% identical to that of "MIPS" emulators. In your program, for example, bgt is a pseudo-instruction that would be handled differently on a real CPU.

This means we also can't say how many cycles your program would need on a real MIPS CPU, because we would have to modify it before we could run it there - and that could change the number of cycles needed.

If you want to know if the number 100 is plausible when using a real CPU:

As we know from the "Spectre" and "Meltdown" security vulnerabilities, the time required to read memory on real CPUs depends massively on the state of the CPU caches. If we assume the address being accessed belongs to a memory-mapped peripheral (which is never cached), 100 cycles may well be plausible.



Answer 3:

"Assume that loading and storing values in memory cost 100 cycles plus the cost of the instruction itself."

That makes little sense. Unless you're supposed to assume instruction fetch is that slow as well, this would be like having an instruction cache but no data cache.

(Or running your program with data memory mapped to an uncacheable memory region.)

As @Martin points out, you can make up any numbers you want and simulate accordingly, even if there's no plausible engineering reason that a CPU would be built that way.

If you're trying to model the performance of an emulator like MARS itself on the host CPU, it's still not particularly plausible that loads/stores would be that expensive either. Interpreter costs per instruction vary wildly with how well branch prediction works on the host CPU, and emulating the guest memory is a small part of what an interpreting emulator does.


Loads on modern x86 typically have 5-cycle latency for an L1d cache hit (from the address being ready to the data being ready), but they also have 2-per-clock throughput. Thus, even without any cache misses, up to 10 loads can be in flight at once just in the two pipelined load execution units of an Intel Sandybridge-family CPU, or AMD K8 / Bulldozer / Zen. (With load buffers to track cache-miss loads, far more loads can be in flight at once across the whole out-of-order back-end.)

You can't say that a load "costs" 5 cycles on such a CPU unless you're talking about a specific context of surrounding code, e.g. traversing a linked list where you bottleneck on load latency because the address for the next load depends on the result of the current load.
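A minimal sketch of such a latency-bound loop, assuming a hypothetical linked list whose nodes store the next-node pointer in their first word:

Walk:
 lw   $t0, 0($t0)       # p = p->next: this load's address depends on the previous load's result
 bne  $t0, $zero, Walk  # keep walking until the NULL terminator
 nop                    # branch-delay slot (nothing useful to put here)

Every load must finish before the next one can even start, so the loop's speed is set entirely by load latency, not load throughput.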


When traversing an array, you typically generate the next pointer with an add instruction (or MIPS addiu), which has 1-cycle latency. Even if loads had 5-cycle latency on a simple in-order pipeline, unrolling + software pipelining would let you sustain 1 load per clock cycle, as sketched below.
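For instance, a sketch of an array-summing loop unrolled by two (hypothetical code: $a0 = current pointer, $a1 = end pointer, $v0 = running total), scheduled so no load result is consumed right after the load:

Sum:                        # do {
 lw   $t1, 0($a0)           #   load two elements per iteration
 lw   $t2, 4($a0)
 addi $a0, $a0, 8           #   1-cycle add generates the next pointer
 add  $v0, $v0, $t1         #   $t1 has been ready for a while: no load-use stall
 bne  $a0, $a1, Sum         # } while (p != end);
 add  $v0, $v0, $t2         # branch-delay slot

With enough unrolling, each load's latency is overlapped with independent work instead of stalling the pipeline.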

On a pipelined CPU, performance is not one-dimensional; you can't just put cost numbers on instructions and add them up. You can see this even for ALU instructions on classic in-order MIPS with a multiply instruction: if you don't read mflo / mfhi right away, the multiply latency is hidden by whatever instructions you fill the gap with.
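A sketch of that effect (the exact multiply latency varies by MIPS implementation; the instructions between mult and mflo are just placeholder independent work):

 mult $t0, $t1          # start the multiply; it runs while later instructions execute
 addi $a1, $a1, 4       # independent work fills the gap ...
 addi $t2, $t2, 1       # ... so the multiply latency is hidden
 mflo $t3               # stalls only if the product isn't ready in LO yet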


As @duskwuff points out, a classic RISC 5-stage pipeline (fetch/decode/exec/mem/write-back) like first-gen MIPS assumes cache hits, with 1-per-clock memory throughput and single-stage latency for access to L1d itself. But the MEM stage leaves room for 2-cycle load latency (including address generation in the EX stage).
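That 2-cycle load latency shows up as the classic MIPS I load delay slot: the instruction immediately after a load must not use the loaded value. A sketch:

 lw   $t0, 0($a0)       # data arrives at the end of the MEM stage
 nop                    # load delay slot: $t0 is not usable here (fill with independent work if you can)
 add  $t1, $t0, $t2     # fine now: MEM-to-EX forwarding delivers $t0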

And I guess they didn't need a store buffer either. More complex CPUs use a store buffer to decouple execution from stores that can potentially miss in L1d cache, hiding store latency even for cache misses. That's very nice, but not required.

Early CPUs often used simple direct-mapped virtually addressed caches, enabling such low cache latency without crippling max clock speed. But on a cache miss, the pipeline stalls. https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Cache_miss_handling.

More complex in-order CPUs can scoreboard loads instead of stalling if any of them miss in cache, and only stall if a later instruction tries to actually read a register that was last written by a load that hasn't completed yet. This allows hit-under-miss, and multiple outstanding cache misses to create memory parallelism, allowing multiple 100-cycle memory accesses to be in flight at once.
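A sketch of the pattern such a CPU can overlap; both loads can be outstanding misses at the same time, and the stall, if any, happens only at the consuming add:

 lw   $t0, 0($a0)       # may miss in cache: scoreboarded, execution continues
 lw   $t1, 4($a0)       # second (possibly missing) load issues right away: miss-under-miss
 addi $a0, $a0, 8       # independent ALU work keeps flowing
 add  $t2, $t0, $t1     # stalls here only if either load hasn't written $t0/$t1 back yet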


But fortunately for you, your loop doesn't include any memory accesses in the first place. It's pure ALU + branch.

On a real MIPS with branch-delay slots, you might write it like this:

 li   $t0, 10                        # loop counter
 li   $a0, 10                        # total = counter    # peel the first iteration

Loop:                                # do {
 addi $t0, $t0, -1
 bgtz $t0, Loop                      # } while ($t0 > 0);
 add  $a0, $a0, $t0                  # branch-delay slot: executed whether the branch is taken or not

This is still just 10+9+8+...+0 = (10*11)/2 = 55, which would be better done with a multiply instead of a loop; but that's not the point, we're analyzing the loop. We do the same number of additions, but we do a += 0 at the end instead of 0 + 10 at the start.
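For reference, a sketch of the closed-form version with a real multiply (the sum 10+9+...+0 is n*(n+1)/2 with n = 10):

 li   $t0, 10           # n
 addi $t1, $t0, 1       # n + 1
 mult $t0, $t1          # n * (n+1); the low 32 bits land in LO
 mflo $a0
 srl  $a0, $a0, 1       # divide by 2: $a0 = 10*11/2 = 55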

Notice I used the real MIPS bgtz instruction, not the bgt pseudo-instruction with $zero. Hopefully an assembler would choose that for the special case of $zero, but it might just follow the normal pattern of using slt $at, $zero, $t0 / bne $at, $zero, target.
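To make that concrete, here is roughly what an assembler could emit for bgt $t0, $zero, Loop, depending on whether it special-cases $zero:

# general two-instruction expansion:
 slt  $at, $zero, $t0   # $at = ($zero < $t0) ? 1 : 0
 bne  $at, $zero, Loop  # taken when $t0 > 0

# special-case expansion a smarter assembler could pick:
 bgtz $t0, Loop         # a single real instruction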

Classic MIPS doesn't do branch prediction + speculative execution (it has a branch-delay slot to hide the bubble from the control dependency instead). But for this to work, it needs the branch input ready in the ID stage, so reading the result of the previous add (which produces a result at the end of EX) will cause a 1 cycle stall. (Or worse depending on whether forwarding to the ID stage is supported. https://courses.engr.illinois.edu/cs232/sp2009/exams/final/2002-spring-final-sol.pdf question 2 Part (a) has an example like this, but I think they under-count the stall cycles if you need to wait for add WB to complete before bne/bgtz ID can start.)

So anyway, this should run at best 1 iteration per 4 cycles on a scalar in-order MIPS I that can forward from EX to ID. 3 instructions + 1 stall cycle before each bgtz.
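A rough per-iteration picture under those assumptions (scalar 5-stage pipeline, EX-to-ID forwarding for the branch input):

Loop:
 addi $t0, $t0, -1      # produces $t0 at the end of its EX stage
                        # <- 1 stall cycle: bgtz waits in ID for the forwarded $t0
 bgtz $t0, Loop         # reads $t0 in ID (branch resolved early; no prediction needed)
 add  $a0, $a0, $t0     # branch-delay slot
                        # 3 instructions + 1 stall = 4 cycles per iteration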


We can optimize it by putting the add $a0, $a0, $t0 between the loop counter and the branch, filling that stall cycle with useful work.

 li    $a0, 10                        # total = counter    # peel the first iteration
 li    $t0, 10-1                      # loop counter, with the first counter-- peeled off

Loop:                                 # do{
                                          # counter--   (in the branch delay slot and peeled before first iteration)
 add   $a0, $a0, $t0                      # total += counter
 bgtz  $t0, Loop                      # } while($t0 > 0);
 addi  $t0, $t0, -1

This runs at 3 cycles / iteration, for 3 instructions with no stall cycles (again assuming forwarding from EX to ID).

Putting the counter-- in the branch delay slot puts it as far before the next execution of the loop branch as possible. A simple bne instead of bgtz would work, too; we know the loop counter starts out signed positive and decreases by 1 each iteration, so it's not critical that we keep checking for non-negative as well as non-zero.


I have no idea what performance model you're using. If it's not a classic 5-stage MIPS then the above is irrelevant.