可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I had coded the 80c51 architecture in VHDL using xilinx. In an attempt to increase the clock frequency, I had pipelined all the 80c51 instructions. The instructions were able to execute as desired, for eg. when the 1st instruction is being processed, the second instruction gets fetched.

However, I only get a slightly higher clock frequency of (around +/-10Hz) despite creating a pipeline depth of 3, from the synthesis report. I figured out that the bottleneck is due to one operation as specified by the synthesis report, but I could not understand synthesis report.

May I ask what is the data path from 'SEQ/decode_3 to SEQ/i_ram_addr_7' trying to do? (From my guess, i deduce that the use a case, when statement to check the 100+ relevant opcode but not sure if that is the bottleneck. But I am clueless)

Hence, my only 2 queries are:

Firstly, is it possible that pipelining does not increase the clock frequency and the testbench is the only way to explain the reduce in timing?

Secondly, how could I deduce which path in my code that is the bottleneck from 'SEQ/decode_3 to SEQ/i_ram_addr_7'.

Thank you for anyone who can help to explain my doubts!

Timing Summary:
---------------
Speed Grade: -4

   Minimum period: 12.542ns (Maximum Frequency: 79.730MHz)
   Minimum input arrival time before clock: 10.501ns
   Maximum output required time after clock: 5.698ns
   Maximum combinational path delay: No path found

Timing Detail:
--------------
All values displayed in nanoseconds (ns)

=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
  Clock period: 12.542ns (frequency: 79.730MHz)
  Total number of paths / destination ports: 113114 / 2670
-------------------------------------------------------------------------
Delay:               12.542ns (Levels of Logic = 10)
  Source:            SEQ/decode_3 (FF)
  Destination:       SEQ/i_ram_addr_7 (FF)
  Source Clock:      clk rising
  Destination Clock: clk rising

  Data Path: SEQ/decode_3 to SEQ/i_ram_addr_7
                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
    ----------------------------------------  ------------
     FDC:C->Q            102   0.591   1.364  SEQ/decode_3 (SEQ/decode_3)
     LUT4_D:I1->O         10   0.643   0.885  SEQ/de_state_cmp_eq002111 (N314)
     LUT4:I3->O            7   0.648   0.740  SEQ/de_state_cmp_eq00711 (SEQ/de_state_cmp_eq0071)
     LUT4:I2->O            3   0.648   0.534  SEQ/i_ram_addr_mux0000<0>11111 (N2301)
     LUT4:I3->O            1   0.648   0.000  SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0_F (N1284)
     MUXF5:I0->O           1   0.276   0.423  SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0 (N955)
     LUT4_D:I3->O          6   0.648   0.701  SEQ/i_ram_addr_mux0000<0>11270 (SEQ/i_ram_addr_mux0000<0>11270)
     LUT3_L:I2->LO         1   0.648   0.103  SEQ/i_ram_addr_mux0000<7>221_SW2_SW0 (N1208)
     LUT4:I3->O            1   0.648   0.423  SEQ/i_ram_addr_mux0000<7>351_SW1 (N1085)
     LUT4:I3->O            1   0.648   0.423  SEQ/i_ram_addr_mux0000<7>2 (SEQ/i_ram_addr_mux0000<7>2)
     LUT4:I3->O            1   0.648   0.000  SEQ/i_ram_addr_mux0000<7>167 (SEQ/i_ram_addr_mux0000<7>)
     FDE:D                     0.252          SEQ/i_ram_addr_7
    ----------------------------------------
    Total                     12.542ns (6.946ns logic, 5.596ns route)
                                       (55.4% logic, 44.6% route)

=========================================================================
Timing constraint: Default OFFSET IN BEFORE for Clock 'clk'
  Total number of paths / destination ports: 154 / 154
-------------------------------------------------------------------------
Offset:              8.946ns (Levels of Logic = 6)
  Source:            rst (PAD)
  Destination:       SEQ/i_ram_diByte_1 (FF)
  Destination Clock: clk rising

  Data Path: rst to SEQ/i_ram_diByte_1
                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
    ----------------------------------------  ------------
     IBUF:I->O           444   0.849   1.392  rst_IBUF (REG/ext_int/fd_out1_0__or0000)
     BUF:I->O            445   0.648   1.425  rst_IBUF_1 (rst_IBUF_1)
     LUT3:I2->O            4   0.648   0.730  ROM/data<1>1 (i_rom_data<1>)
     LUT4:I0->O            1   0.648   0.500  SEQ/i_ram_diByte_mux0000<1>17_SW0 (N1262)
     LUT4:I1->O            1   0.643   0.563  SEQ/i_ram_diByte_mux0000<1>32 (SEQ/i_ram_diByte_mux0000<1>32)
     LUT4:I0->O            1   0.648   0.000  SEQ/i_ram_diByte_mux0000<1>60 (SEQ/i_ram_diByte_mux0000<1>)
     FDE:D                     0.252          SEQ/i_ram_diByte_1
    ----------------------------------------
    Total                      8.946ns (4.336ns logic, 4.610ns route)
                                       (48.5% logic, 51.5% route)

=========================================================================

To allow me to be more specfic, I will give a snipplet of an example code in the decode phase of 1 opcode.

The following is 1 such case when decoding an opdcode, which is a mov instruction. There are about 100+ opcodes (100+ instructions), which means this case statements has over 100 when statements.

case OPCODE is

--MOV A, Rn
when "11101000" | "11101001" | "11101010" | "11101011" | "11101100" | "11101101" | "11101110" | "11101111" => case de_state is when E7 =>

              de_state <= E8;

          when E8 =>


              de_state <= E9;

          when E9 =>


              de_state <= E10;
          when E10 =>
              --Draw PSW
              i_ram_addr <= xD0;
              i_ram_rdByte <= '1';

              de_state <= E11;
          when E11 =>
              --Draw from Rn
              i_ram_addr <= "000" & i_ram_doByte(4 downto 3)& opcode(2 downto 0);
              i_ram_rdByte <= '1';

              de_state <= E12;

          when E12 =>
              --Place into EDR
              EDR <= i_ram_doByte;
              --close rdByte
              i_ram_rdByte <= '0';

          when others =>

          end case;

I hope you could have a better idea of my vhdl code. I would appreciate any form of help. Thank you!

回答1:

Since you're using Xilinx, I presume you also have access to PlanAhead? Try "Analyze Timing / Floorplan Design (PlanAhead)" (under "Implement Design" -> "Place & Route").

PlanAhead should open, and give you a view of your timing results in the bottom. Pick the critical path (the one with the least slack), right click it and choose "Schematic", which will bring up a graphical view of the involved primitives. You can then right-click the primitives and choose "Expand Cone" -> "To Flops" to get a view of the surrounding components too.

This should help you get a much better idea of what signals are involved. Try tracing the input and output signals to your VHDL code, and focus on that path for optimization.

回答2:

There will be no good answers from this information only; we can only guess what source code produced this hardware.

But it is clear that you need to examine the source, make a hypothesis why it is slow, take action to correct the problem, and test the solution.

And repeat until fast enough.

My guess, given your hint that there is a case statement to decode the opcodes...

one of the arms is something like:

when <some expression involving decode>  =>
   address <= <some address calculation>;

The problem is that often the two expressions are inter-related so that they are evaluated in the same cycle. An example solution would be to precompute the address expression (i.e. in the previous cycle) into a register, and rewrite the case arm as:

when <some expression involving decode>  =>
   address <= register;

If you guessed right, the result will be slightly faster and you have another (similar) bottleneck to fix. Repeat until fast enough...

But without the source AND the timing analysis, don't expect a more specific answer.

EDIT : having posted a fraction of source code, the picture is a little clearer : you have two nested Case statements, each quite large. You clearly need some simplification...

I note that only 2 of the inner case arms assign to i_ram_addr, yet the timing analysis shows a huge and complex mux on i_ram_addr; clearly there are a lot of other case arms that contribute terms to i_ram_addr...

I would suggest that you might have to treat i_ram_addr separately from the main Case statement and write the simplest machine you can to generate i_ram_addr alone. For example I would note that the OPCODE case arm is equivalent to:

if OPCODE(7 downto 3) = "11101" then ...

and ask how simple you can get a decoder for i_ram_addr alone. You may find that a lot of other case arms do very similar things with i_ram_addr (the original 8051 designers would have jumped at the chance to simplify logic!). Synthesis tools can be quite clever at simplifying logic, but when things get too complex they can miss opportunities.

(At this stage I would comment out the i_ram_addr assignments and leave the rest of the decoder alone)