Inferred state machines in VHDL (vs 2-process-machines of all things!)

A few weeks ago I read a blog post by the illustrious MS researcher Prof. Satnam Singh. He writes about his Kiwi project which he describes as “[trying] to civilise hardware design” – as compared to the explicit writing of state machines. His example is a Ethernet processor which simply swaps the source and destination MAC addresses over and retransmits them. He has code in C#, and it looks a lot like the inferred state machine style of VHDL I’ve been toying with for a while. So (finally) I’ve toyed…

Inferred State machines in VHDL

Have a look at the C# source on the page linked to above, and then come back to see how easily it translates to VHDL. Hardly in need of “civilisation” IMHO :)

architecture inferred_sm_simple of ethernet_echo is
begin  -- architecture inferred_sm 
    echoer : process is
        type t_buffer is array (natural range <>) of std_logic_vector(7 downto 0);
        variable buff        : t_buffer(0 to 1023);  -- buffer is a reserved word in VHDL
        variable start       : boolean;
        variable i, j        : integer;
        variable doneReading : boolean;
    begin  -- process echoer 
        tx_sof_n     <= '1'; -- We are not at the start of a frame 
        tx_src_rdy_n <= '1';
        tx_eof_n     <= '1'; -- We are not at the end of a frame
        start        := rx_sof_n = '0' and rx_src_rdy_n = '0';   -- The start condition 
        main : loop          -- Process packets indefinitely 
            -- Wait for SOF and SRC_RDY 
            while not start loop
                wait until rising_edge(clk);
                exit main when resetn_clk = '0';
                start := rx_sof_n = '0' and rx_src_rdy_n = '0';  -- Check for start of frame 
            end loop;
            -- Read in the entire frame 
            i           := 0;
            doneReading := false;

            -- Read the remaining bytes
            while not doneReading loop
                if rx_src_rdy_n = '0' then
                    buff(i) := rx_data;
                    i       := i+1;
                end if;
                doneReading := rx_eof_n = '0';
                wait until rising_edge(clk); exit main when resetn_clk = '0';
            end loop;

            tx_src_rdy_n <= '0';    -- We are not at the start of a frame
            -- Now send an Ethernet packet back to where it came from
            -- Swap source and destination MAC addresses
            tx_sof_n     <= '1';
            for j in 6 to 11 loop   -- Process a 6 byte MAC address
                tx_data  <= buff(j);
                tx_sof_n <= '0';
                if j /= 6 then
                    tx_sof_n <= '1';
                end if;
                wait until rising_edge(clk); exit main when resetn_clk = '0';
            end loop;
            for j in 0 to 5 loop    -- Process a 6 byte MAC address
                tx_data <= buff(j);
                wait until rising_edge(clk); exit main when resetn_clk = '0';
            end loop;
            -- Transmit the remaining bytes
            j := 12;
            while j < i loop
                tx_data <= buff(j);
                if j = i - 1 then
                    tx_eof_n <= '0';
                end if;
                j := j + 1;
                wait until rising_edge(clk); exit main when resetn_clk = '0';
            end loop;
            tx_src_rdy_n <= '1';
            tx_eof_n     <= '1';
            start        := false;  -- No longer at start of frame
            wait until rising_edge(clk); exit main when resetn_clk = '0';
            -- End of frame, ready for next frame
        end loop;
    end process echoer;
end architecture inferred_sm_simple;

It’s a very easy translation from one to the other as you’ll see if you put the two pieces of code side by side. And it comes in at 67 lines. Prof. Singh’s C version comes in at (if you neglect the equivalent of the VHDL entity as I did for that version) around 70 lines. So much for the verboseness of VHDL compared to C :) There is a horrendous (IMHO!) VHDL version also on the MSDN page, which Prof. Singh describes as “yuk!” and I quite agree. Personally, the two process style does nothing for me and obscures the design intent of the code. THe comparison is not direct as it uses a shift register to store the data in (in a more tradtional way), but it’s 89 lines long. If refactored to a single process, you’d save about 15 lines, bringing it to much the same length as the other two versions!

But does it synthesise? Yes… if you have the right tools.

Synplify Pro worked fine. XST doesn’t like the loops within an inferred state machine. And XST has gotten worse recently, as the new parser used for the -6 series of FPGAs doesn’t support inferred state machines at all. You can only use wait for rising_edge(clk) once at the start of a process. I’d be interested to know if Quartus can handle it. [ Update – Enrik informs us in the comments that Quartus doesn’t like it either) It comes out at about 6000LUTs, 8200 register – almost entirely for the buff storage array (8192 registers on it’s own!) which is read asynchronously and Synplify is not able to infer a Block RAM. I have inquired of Prof. Singh how large his C# implementation compiles to – that’ll be very interesting. If the large buffer array can be inferred to a blockram that’ll be a huge win for the C# approach! It has been pointed out that you wouldn’t really want to design this circuit this way (as you end up with a load of flipflops not a RAM block) – I was aware of this when I started, but it’s an exercise in comparing coding styles across languages, not in creating optimal hardware.

Why would you bother?

Well, it saves you having to think of names for your states. For some state machines this is a boon. The downside of this is that carefully chosen names for states can be self-documenting, which is good. In this example, comments like -- transmit the remaining bytes could probably be removed by having a state called transmit_remaining_bytes (although the lazy typist might abbreviate that to trb and then comment it anyway!) And it’s ~10% terser, and (I think) more readable in this case. Less code is usually good :) (yes, I know using one character variable names and squashing it all up might count as “unreadable less code”, but you have to apply a bit of common-sense as well!)

Downsides

It looks weird (but that’s mainly because there’s no code that look like it in mainstream circulation as far as I’m aware).
And you have to type a load of text for each clock tick to infer. wait until rising_edge(clk); exit main when resetn_clk = '0'; (This maybe one argument for a VHDL preprocessor, as it’s not “encapsulatable” in any other way I’ve figured out yet. Suggestions appreciated :) This is a pretty strong down-side IMHO, I loathe repeated code that ought to be encapsulated.
And we’ve yet to see how much worse it is resource-usage wise.
For very “branchy” state machines it may not work out much different in terms of readability – I haven’t got an example to play with.

Code

The code (should anyone want to have a play) is available from Github

10 comments

Tim says:

2014-10-21 at 19:40

I formalized some musings on implicit state machines in VHDL which I published in 2009. Exemplar’s Leonardo synthesis engine could always process these state machines but for some reason the style never took off. (Perhaps because Synopsys could not support them?)

Because the states are implicit it makes it very easy to cut and paste code around without the unnecessary overhead of explicit states. As you mentioned, the only real ugliness is the reset logic which is now distributed state-by-state rather than the typical factored-out version you see in the explicit state machine style. Debugging in simulation or hardware is a little trickier without the explicit state to track which may also be considered by some to be a negative.

Good article!

1. Martin says:
  
  2014-11-24 at 21:11
  
  Thanks for your comments Tim – I think you are spot on with the inference (heh!) that lack of Synopsys support held people back from using various features that Leonardo had.