I have coded the 80C51 architecture in VHDL using Xilinx ISE. In an attempt to increase the clock frequency, I pipelined all the 80C51 instructions. The instructions execute as desired; for example, while the first instruction is being processed, the second instruction is fetched.
However, according to the synthesis report I only get a slightly higher clock frequency (around +/-10 Hz), despite creating a pipeline depth of 3. I figured out that the bottleneck is one operation identified by the synthesis report, but I could not understand the report.
May I ask what the data path from 'SEQ/decode_3' to 'SEQ/i_ram_addr_7' is trying to do? (My guess is that it comes from the case/when statement that checks the 100+ relevant opcodes, but I am not sure whether that is the bottleneck.)
Hence, my two queries are:
Firstly, is it possible that pipelining does not increase the clock frequency, and that the testbench is the only way to explain the reduction in timing?
Secondly, how can I deduce which path in my code is the bottleneck between 'SEQ/decode_3' and 'SEQ/i_ram_addr_7'?
Thank you to anyone who can help clear up my doubts!
Timing Summary:
---------------
Speed Grade: -4
Minimum period: 12.542ns (Maximum Frequency: 79.730MHz)
Minimum input arrival time before clock: 10.501ns
Maximum output required time after clock: 5.698ns
Maximum combinational path delay: No path found
Timing Detail:
--------------
All values displayed in nanoseconds (ns)
=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
Clock period: 12.542ns (frequency: 79.730MHz)
Total number of paths / destination ports: 113114 / 2670
-------------------------------------------------------------------------
Delay: 12.542ns (Levels of Logic = 10)
Source: SEQ/decode_3 (FF)
Destination: SEQ/i_ram_addr_7 (FF)
Source Clock: clk rising
Destination Clock: clk rising
Data Path: SEQ/decode_3 to SEQ/i_ram_addr_7
Cell:in->out            fanout   Gate Delay   Net Delay   Logical Name (Net Name)
----------------------------------------------------------------------------------
FDC:C->Q 102 0.591 1.364 SEQ/decode_3 (SEQ/decode_3)
LUT4_D:I1->O 10 0.643 0.885 SEQ/de_state_cmp_eq002111 (N314)
LUT4:I3->O 7 0.648 0.740 SEQ/de_state_cmp_eq00711 (SEQ/de_state_cmp_eq0071)
LUT4:I2->O 3 0.648 0.534 SEQ/i_ram_addr_mux0000<0>11111 (N2301)
LUT4:I3->O 1 0.648 0.000 SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0_F (N1284)
MUXF5:I0->O 1 0.276 0.423 SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0 (N955)
LUT4_D:I3->O 6 0.648 0.701 SEQ/i_ram_addr_mux0000<0>11270 (SEQ/i_ram_addr_mux0000<0>11270)
LUT3_L:I2->LO 1 0.648 0.103 SEQ/i_ram_addr_mux0000<7>221_SW2_SW0 (N1208)
LUT4:I3->O 1 0.648 0.423 SEQ/i_ram_addr_mux0000<7>351_SW1 (N1085)
LUT4:I3->O 1 0.648 0.423 SEQ/i_ram_addr_mux0000<7>2 (SEQ/i_ram_addr_mux0000<7>2)
LUT4:I3->O 1 0.648 0.000 SEQ/i_ram_addr_mux0000<7>167 (SEQ/i_ram_addr_mux0000<7>)
FDE:D 0.252 SEQ/i_ram_addr_7
----------------------------------------
Total 12.542ns (6.946ns logic, 5.596ns route)
(55.4% logic, 44.6% route)
=========================================================================
Timing constraint: Default OFFSET IN BEFORE for Clock 'clk'
Total number of paths / destination ports: 154 / 154
-------------------------------------------------------------------------
Offset: 8.946ns (Levels of Logic = 6)
Source: rst (PAD)
Destination: SEQ/i_ram_diByte_1 (FF)
Destination Clock: clk rising
Data Path: rst to SEQ/i_ram_diByte_1
Cell:in->out            fanout   Gate Delay   Net Delay   Logical Name (Net Name)
----------------------------------------------------------------------------------
IBUF:I->O 444 0.849 1.392 rst_IBUF (REG/ext_int/fd_out1_0__or0000)
BUF:I->O 445 0.648 1.425 rst_IBUF_1 (rst_IBUF_1)
LUT3:I2->O 4 0.648 0.730 ROM/data<1>1 (i_rom_data<1>)
LUT4:I0->O 1 0.648 0.500 SEQ/i_ram_diByte_mux0000<1>17_SW0 (N1262)
LUT4:I1->O 1 0.643 0.563 SEQ/i_ram_diByte_mux0000<1>32 (SEQ/i_ram_diByte_mux0000<1>32)
LUT4:I0->O 1 0.648 0.000 SEQ/i_ram_diByte_mux0000<1>60 (SEQ/i_ram_diByte_mux0000<1>)
FDE:D 0.252 SEQ/i_ram_diByte_1
----------------------------------------
Total 8.946ns (4.336ns logic, 4.610ns route)
(48.5% logic, 51.5% route)
=========================================================================
To be more specific, here is a snippet of example code from the decode phase of one opcode.
The following is one such case arm, decoding the opcode of a MOV instruction. There are about 100+ opcodes (100+ instructions), which means this case statement has over 100 when clauses.
case OPCODE is
    --MOV A, Rn
    when "11101000" | "11101001" | "11101010" | "11101011" |
         "11101100" | "11101101" | "11101110" | "11101111" =>
        case de_state is
            when E7  => de_state <= E8;
            when E8  => de_state <= E9;
            when E9  => de_state <= E10;
            when E10 =>
                -- Draw PSW
                i_ram_addr   <= x"D0";
                i_ram_rdByte <= '1';
                de_state     <= E11;
            when E11 =>
                -- Draw from Rn
                i_ram_addr   <= "000" & i_ram_doByte(4 downto 3) & opcode(2 downto 0);
                i_ram_rdByte <= '1';
                de_state     <= E12;
            when E12 =>
                -- Place into EDR
                EDR <= i_ram_doByte;
                -- Close rdByte
                i_ram_rdByte <= '0';
            when others => null;
        end case;
I hope this gives you a better idea of my VHDL code. I would appreciate any form of help. Thank you!
Since you're using Xilinx, I presume you also have access to PlanAhead? Try "Analyze Timing / Floorplan Design (PlanAhead)" (under "Implement Design" -> "Place & Route").
PlanAhead should open and give you a view of your timing results at the bottom. Pick the critical path (the one with the least slack), right-click it and choose "Schematic", which will bring up a graphical view of the involved primitives. You can then right-click the primitives and choose "Expand Cone" -> "To Flops" to get a view of the surrounding components too.
This should help you get a much better idea of what signals are involved. Try tracing the input and output signals to your VHDL code, and focus on that path for optimization.
There will be no good answers from this information alone; we can only guess what source code produced this hardware.
But it is clear that you need to examine the source, make a hypothesis about why it is slow, take action to correct the problem, and test the solution.
And repeat until fast enough.
My guess, given your hint that there is a case statement to decode the opcodes, is that one of the arms is something like the sketch below.
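(The actual arm isn't shown, so this is only an illustrative sketch, reusing signal names similar to those in the question: an arm that decodes the state and opcode and computes the RAM address combinationally, all in the same cycle.)

    when E11 =>
        -- the address is computed from i_ram_doByte and the opcode in the same
        -- cycle in which the state and opcode are decoded
        i_ram_addr   <= "000" & i_ram_doByte(4 downto 3) & opcode(2 downto 0);
        i_ram_rdByte <= '1';
        de_state     <= E12;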
The problem is that the two expressions (the state/opcode decode and the address computation) are often inter-related, so they are evaluated in the same cycle. An example solution would be to precompute the address expression into a register (i.e. in the previous cycle) and rewrite the case arm as shown below.
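(Again only a sketch: rn_addr is a hypothetical new register, and the precompute is shown in a hypothetical extra state E10b so that the operands are guaranteed valid; how this fits into the existing E10/E11 sequence depends on when i_ram_doByte becomes valid.)

    -- extra state before E11: precompute the Rn address into a register
    when E10b =>
        rn_addr  <= "000" & i_ram_doByte(4 downto 3) & opcode(2 downto 0);
        de_state <= E11;
    -- the critical arm is now just a register-to-register copy
    when E11 =>
        i_ram_addr   <= rn_addr;
        i_ram_rdByte <= '1';
        de_state     <= E12;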
If the guess is right, the result will be slightly faster, and you will have another (similar) bottleneck to fix. Repeat until fast enough...
But without the source AND the timing analysis, don't expect a more specific answer.
EDIT: now that you have posted a fraction of the source code, the picture is a little clearer: you have two nested case statements, each quite large. You clearly need some simplification...
I note that only 2 of the inner case arms assign to i_ram_addr, yet the timing analysis shows a huge and complex mux on i_ram_addr; clearly there are a lot of other case arms that contribute terms to i_ram_addr...
I would suggest that you may have to treat i_ram_addr separately from the main case statement and write the simplest machine you can to generate i_ram_addr alone. For example, I would note that the OPCODE case arm is equivalent to something like the following:
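(A sketch of the equivalence: the eight MOV A, Rn opcodes differ only in their low three bits, so the eight-way choice collapses to a five-bit compare. The don't-care form needs VHDL-2008 matching support; otherwise std_match from ieee.numeric_std does the same job.)

    -- all eight MOV A, Rn opcodes share the prefix "11101"; only the low
    -- three bits (the register number) differ
    if OPCODE(7 downto 3) = "11101" then
        -- MOV A, Rn handling of i_ram_addr goes here
    end if;

    -- or, equivalently, using std_match:
    if std_match(OPCODE, "11101---") then
        -- MOV A, Rn handling of i_ram_addr goes here
    end if;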
Then ask how simple you can get a decoder for i_ram_addr alone. You may find that a lot of other case arms do very similar things with i_ram_addr (the original 8051 designers would have jumped at the chance to simplify logic!). Synthesis tools can be quite clever at simplifying logic, but when things get too complex they can miss opportunities.
(At this stage I would comment out the i_ram_addr assignments in the main decoder and leave the rest of it alone; see the sketch below.)
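(Purely illustrative: a hypothetical stand-alone process that drives only i_ram_addr, so the main decoder no longer contributes to that mux. The state and signal names are assumptions based on the snippet above.)

    -- hypothetical separate address decoder: drives i_ram_addr and nothing else
    addr_decode : process(clk)
    begin
        if rising_edge(clk) then
            if de_state = E10 and OPCODE(7 downto 3) = "11101" then
                i_ram_addr <= x"D0";    -- PSW (bank select bits)
            elsif de_state = E11 and OPCODE(7 downto 3) = "11101" then
                -- MOV A, Rn: register bank from PSW bits, register number from opcode
                i_ram_addr <= "000" & i_ram_doByte(4 downto 3) & OPCODE(2 downto 0);
            end if;
            -- further address sources would be added here, one simple term each
        end if;
    end process;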