Author Archives: Martin

FPGAs considered ARM-full

Xilinx and Altera have both announced FPGAs with hard ARM processors on them. Xilinx have even got a new product family name (Zynq) for them. The products are potential game-changers in some applications. The combination of a high-performance application processor (or two!) tightly coupled to a large array of customisable logic, memory and DSP elements hasn’t really been done like this before. Consider: a PCIe FPGA board is 100s of microseconds or even milliseconds away from a host processor currently. With these architectures, the FPGA logic can be as little as a few dozen clock ticks (maybe a single microsecond) away. Altera and Xilinx are claiming 800MHz for the processors’ clock. For intensive applications (image-processing for instance) the algorithms you can contemplate are different to those which make sense on an Intel processor and memory system. The logic can be tightly coupled to its own memory subsystem as well as the processor shared memory, and data can to-and-fro between them with very small latency, making “interactive” software/hardware acceleration a reality. So, is there any difference between them? I’ve trawled the publicly available information on both platforms (which is not overly detailed as yet) to see what I can glean.

Processor system

Both are using a dual-core Cortex-A9 with NEON extensions – a monster of an embedded processor. In raw clock terms it’ll be 5-10x faster (both vendors claim 800MHz) than a soft-core. There’s double-precision floating-point in the main core and a vectorised SIMD engine for DSP assistance. So, let’s go a bit deeper and compare some more gory details:

Memory

  • Altera: L1 2x32K per core, L2 512K shared, 64K RAM
  • Xilinx: L1 2x32K per core, L2 512K shared, 256K RAM

Very similar – but more tightly-coupled RAM for Xilinx. This feels very significant, particularly in data-intensive applications. And I can’t help feeling these chips are not right unless you have a data-intensive application in mind!

Hard peripherals

  • USB OTG: Both have 2 ports
  • Gigabit Ethernet: Both have 2 ports with support for IEEE1588 timestamping (Altera also claim a Low-Power Idle mode)
  • Controller Area Network (CAN): both sport 2 ports, useful for automotive and industrial networking.
  • IIC: Xilinx have 2 on Zynq, Altera have 4
  • SPI: 2 on each
  • UART: 2 on each
  • SD/SDIO is supported by both vendors – Altera state they can boot from SD.
  • NAND flash – 8 bit for Altera, Xilinx not specified as yet. Xilinx’s static memory interface also supports NOR flash, which Altera are not going to support. They say you can build your own in the fabric, which is fair comment – I’m not sure how much use parallel NOR flash will get when NAND and QSPI are there.
  • Quad-SPI is available on both devices.
  • both have an array of timers and GPIOs also.

Another significant difference: Xilinx also have an ADC on chip (XADC) which has been in their high-end families for system management for quite a long time. No evidence of Altera providing anything similar. This is quite a useful addition on Xilinx’s part as most reasonably critical systems will want to monitor temperature and power rails at the very least.

To shuffle all the data to and from these peripherals, both have 8 channel DMA engines; the high performance (USB, Ethernet etc) peripherals also have their own bus mastering capabilities. Xilinx have dedicated 4 DMA channels to the FPGA fabric. Altera says their DMA and fabric are connected, but nothing more specific. The peripherals on both vendors’ devices are wired to pins through a big multiplexer. The implication is that you can’t use all of the peripherals at the same time, although it looks like some of them can also be routed through the FPGA fabric to other IOs. The high-speed ports are more restricted on their pin options. On Altera’s version they have documented the various options – one obvious niggle is that one of the USB ports shares pins with an Ethernet port. I guess for most use cases it’s either two Ethernets or two USBs that are needed, rarely 2 of both. But I imagine that’s one of the ports that can’t be sent through the fabric, as ULPI is a bit special.

SDRAM controller

Both vendors have hard DDR controllers with built-in bandwidth management of the array of ports – something quite costly to build into a soft-core memory controller. Both vendors support DDR2, DDR3 and LPDDR2. Altera’s controller also supports LPDDR1. Altera claim support for ECC on both 16- and 32-bit widths. Xilinx aren’t saying at the moment. The SDRAM controllers have many connections to the FPGA fabric, as you’d hope: Altera have sufficient wiring to the FPGA logic to make 3 or 4 bidirectional ports (depending on bus interface) or 1 very wide port (256-bit!) in each direction, or various combinations in between. Xilinx simply have 4 64-bit ports to the FPGA fabric.

Booting

The phrase used in Xilinx’s white paper is “processor-centric”. The Zynq devices are definitely being positioned by Xilinx as completely different beasts to normal FPGAs – hence the family gets its own name. This is quite clearly a processor with an FPGA on the side. Zynq’s processor boots before the FPGA and then you use the processor to configure the FPGA. Altera are selling their family as more of a middle-ground “FPGA+processor on the same chip”, with boot-flexibility being part of their message. Either the processor or the FPGA part can boot first, with the first up configuring the other part. The processor can boot from QSPI, SD or NAND flash. The FPGA boots (well, OK, configures) with the usual traditional modes (parallel, serial) as well as PCIe – or presumably from the processor system. Personally, I like the Zynq approach – I want to forget the FPGA until I need it.

FPGA fabric

To get data to and from the fabric, Altera have 2 ports mastering to the FPGA (one fast, one slow) and 1 master from the FPGA. In addition there are the memory ports mentioned above. Also worth noting is that the larger devices have 1-3 more hard memory controllers connected directly to the FPGA fabric. Xilinx have 2 ports in each direction between the processor and FPGA and 4×64 bit ports from FPGA to memory. Any further memory interfaces will have to be built in the fabric, although even the low-end devices get the benefit of the Series 7 IOs, which means the PHY interface requires less heroic use of LUTs as delay lines to match DDR timings.

On the logic side, it looks like Altera are using the same adaptive logic module (ALM) – an 8-input fracturable LUT+4FFs+sundry carrychains and muxes – in both Cyclone V and Arria V: from 25K to 460K LEs across 5 family members. (Those must be marketing LEs, not actual ALMs!) Xilinx have a configurable logic block (CLB) – consisting of 8 6-input (somewhat-split-able) LUTs and 16FFs – again the same in both Artix-based and Kintex-based chips. They are claiming 30K to 235K LEs – again those must be marketing numbers. Those numbers are not directly comparable, but it looks like Altera’s biggest device may be significantly larger than Xilinx’s largest. Both vendors are offering ~200KB to ~2MB of FPGA-based memory (up to nearly 3MB at the top end of Altera’s offering). Yes, those are mega-bytes, not the usual megabits that FPGAs used to get built with!

Summary

I’ve summarised all this in another post!

Edit: and a follow-up here

Me: One, Tiny pieces of plastic: Nil

For want of a better place to log how to get at the dishwasher pump next time it gets stuck with a tiny piece of plastic blocking the impeller, I’m sticking it here.

In case it helps anyone else it’s a Bosch Classixx (no idea what model no, bought about 2004 IIRC). Here are the steps:

* Turn it on its side (right hand side looking at the door – the one nearest the programme dial)
* remove the bottom (remove two screws and need a flat screwdriver to prise it a bit) – it has an earth wire attached, so don’t yank it completely away
* Don’t lose the bit of polystyrene from the anti-flooding switch!
* Follow the drain hose (that goes to the u-bend under the sink) back to the pump
* Squeeze the clips and rotate the pump, it’ll come off fairly easily
* remove tiny piece of debris of some sort.
* Put it all back together again.

Is this the modern equivalent of going out hunting sabre-tooth tigers? Probably not :)

EEWeb electronic design site

I recently became aware of the EEWeb electronic design site (as one of their reps emailed me to see if I’d like my site to appear on their front page… we’ll see if my server can handle that!)

It’s sponsored by Digikey, much like RS’s DesignSpark and Farnell’s element14.

There’s an awful lot of content and I’ve barely scratched through a tiny bit of it – worth a trawl!

If you’re visiting from EEWeb, you might be interested in my FPGA or VHDL related writings, or my solderless Drawdio – or maybe something else!

libv has a home

Some of my “useful bits” of library code have lived in libv.vhd for a while – I’ve split it off and licensed it with a CC0 license (which means the author disclaims copyright and offers no warranty). It’s on github and I’ll add contributions from anyone who has any!

Either individual functions to add to libv.vhd or great big wodges of useful code (like Jim Lewis’ randomized testing libraries maybe….)

Should VHDL be extended to allow the use of Unicode

I’m contributing to the VASG group which is working on coming up with what the next revision of VHDL should be able to do.

On today’s conference call, the idea was mooted that VHDL could allow the use of Unicode identifiers (ie entity, signal, variable names etc.).

All of today’s participants were (as far as I recall) native English speakers without much call for accented characters, much less characters from entirely different writing systems. So I’m putting a call out to see if there’s any interest from the wider community in pushing forwards a requirement for VHDL compilers to support Unicode.

Feel free to mail me, comment below or @mention me in a tweet with your thoughts – I’ll summarise the results here in a few weeks

Variables and signals in VHDL – and when variables are better

VHDL has two types of what-would-normally-be-called-variables:

  • signals, which must be used for inter-process communication, and can be used for process-local storage
  • variables which are local to a process.

(there’s also variables of a protected type which we’ll ignore for now as they’re another thing altogether)

Now one of the big differences between signals and variables is the way they are updated. When a signal is assigned to within a process, its value does not update until the process suspends (at the next wait statement or, for a process with a sensitivity list, at the end of the process). For example (assume foo is 0 to start with):

process
begin
  wait until rising_edge(clk);
  foo <= 1; -- an update is scheduled
  sig <= foo + 1; -- but hasn't happened yet, so sig is assigned 1 (again, only scheduled)
end process; -- at this point, foo is updated with the value 1 and sig also gets 1

Compare with the situation where foo is a variable:

process
  variable foo : integer;
begin
  wait until rising_edge(clk);
  foo := 1; -- foo gets 1 immediately
  sig <= foo + 1; -- so sig has an update to 2 scheduled
end process; -- at this point, foo is already 1 and sig gets its scheduled update to 2

This can cause some interesting effects on coding style – take for example this question on StackOverflow.

A code example is provided there which shows what has to happen due to the update semantics of signals – deeply nested ifs.

Here’s an alternative. Let’s define a procedure first of all to increment a variable and wrap around if it is greater than some maximum value. Also return a flag to inform us if the variable wrapped:

procedure inc_wrap (i : inout integer; maximum : positive; wrapped : inout boolean) is
begin
  if i = maximum then
    wrapped := true;
    i := 0;
  else
    wrapped := false;
    i := i + 1;
  end if;
end procedure;

Then our update process looks like this:

if second_enable = '1' then
  inc_wrap(s_ls_int, 9, wrapped);
  if wrapped then
    inc_wrap(s_ms_int, 5, wrapped);
  end if;
  if wrapped then
    inc_wrap(m_ls_int, 9, wrapped);
  end if;
  if wrapped then
    inc_wrap(m_ms_int, 5, wrapped);
  end if;
  if wrapped then
    if h_ms_int < 2 then -- wrap at 9 for 0-19 hour
      inc_wrap(h_ls_int, 9, wrapped);
    else -- wrap at 3 if past 20 hours
      inc_wrap(h_ls_int, 3, wrapped);
    end if;
  end if;
  if wrapped then
    inc_wrap(h_ms_int, 2, wrapped);
  end if;
end if;
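If you want to convince yourself that the carry really does ripple through the chain of ifs, a tiny self-checking testbench along these lines should do it. This is only a sketch: the entity name inc_wrap_tb and the starting values are made up for illustration, and only the first three digits are exercised.

```vhdl
entity inc_wrap_tb is
end entity inc_wrap_tb;

architecture sim of inc_wrap_tb is
    -- Same procedure as above, repeated here so the file stands alone
    procedure inc_wrap (i : inout integer; maximum : positive; wrapped : inout boolean) is
    begin
        if i = maximum then
            wrapped := true;
            i := 0;
        else
            wrapped := false;
            i := i + 1;
        end if;
    end procedure;
begin
    check : process is
        -- Hypothetical starting point: 0 minutes, 59 seconds
        variable s_ls_int : integer := 9;
        variable s_ms_int : integer := 5;
        variable m_ls_int : integer := 0;
        variable wrapped  : boolean := false;
    begin
        -- One "tick" of the seconds counter; variables update immediately,
        -- so each test of wrapped sees the result of the previous call
        inc_wrap(s_ls_int, 9, wrapped);
        if wrapped then
            inc_wrap(s_ms_int, 5, wrapped);
        end if;
        if wrapped then
            inc_wrap(m_ls_int, 9, wrapped);
        end if;

        -- Expect 1 minute, 00 seconds, and no further carry pending
        assert s_ls_int = 0 and s_ms_int = 0 and m_ls_int = 1
            report "carry did not ripple as expected" severity failure;
        assert not wrapped
            report "minutes digit should not have wrapped" severity failure;
        report "inc_wrap carry chain OK";
        wait;  -- stop the process forever
    end process check;
end architecture sim;
```

The flat sequence of ifs only works because wrapped is a variable: with a signal, every test would see the stale value from before the clock edge.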

All the code from this posting is at Github

Tool switches

@boldport asked:

What are your #FPGA design space exploration techniques?

which he expands upon:


“Design space exploration” is the process of trying out different settings and design methods for achieving better performance. Sometimes the goals are met without any of it — the default settings of the tools are sufficient. When they’re not, what are your techniques to meet your performance goals?

Yet again, the 140 character constraint leaves me with things unspoken….

Working where I do in the automotive market means that it’s not good enough to miss timing by a few picoseconds and say “it’ll be fine, ship it”. If you miss timing, you /have/ to make it pass.

My experience with tool tweakery is that it gains you a 2-5% timing improvement – which can be enough to meet timing when you just missed.

The downside is that usually, when you go and change the design (due to the requirements changing yet again), you find yourself with a slightly different 10ps timing violation which maybe this time the tools can’t get around. Or maybe, with a change to one of the seed parameters, they will, after some trial runs.

So, I’ve given up on that approach as being too variable. It’s much harder to give estimates of when something will be ready when timing closure is a “tweak the knobs a number of times and see”.

What I do now is rework things until it meets timing easily. That way, it’s likely to stay that way.

Techniques include:

  • Pipelining – adding registers
  • Constraining unconstrained integers – occasionally, the synthesiser doesn’t figure out the range an integer variable or signal can take on, so needs telling. This is happening less and less as synthesis tools get cleverer.
  • Simplifying algorithms
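To make the first two of those concrete, here is a minimal sketch. The entity (mult_add_pipelined) and its ports are invented for illustration and are not from any real design of mine: the multiply and the add each get their own register stage, and the counter is declared with an explicit range so the synthesiser can size it as a 4-bit counter rather than assuming a full 32-bit integer.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical multiply-add, purely for illustration
entity mult_add_pipelined is
    port (
        clk    : in  std_logic;
        a, b   : in  unsigned(15 downto 0);
        c      : in  unsigned(31 downto 0);
        digit  : out integer range 0 to 9;
        result : out unsigned(31 downto 0)
    );
end entity mult_add_pipelined;

architecture rtl of mult_add_pipelined is
    signal product : unsigned(31 downto 0) := (others => '0');
    signal c_d     : unsigned(31 downto 0) := (others => '0');
    -- Constrained integer: the range tells the tools this needs only 4 bits
    signal count   : integer range 0 to 9 := 0;
begin
    digit <= count;

    process (clk) is
    begin
        if rising_edge(clk) then
            -- Stage 1: the multiply gets a clock period to itself
            product <= a * b;
            c_d     <= c;              -- delay c so it stays aligned with product
            -- Stage 2: the add works on last cycle's registered values
            result  <= product + c_d;

            -- Decade counter that never leaves its declared range
            if count = 9 then
                count <= 0;
            else
                count <= count + 1;
            end if;
        end if;
    end process;
end architecture rtl;
```

The price of the pipelining is latency: result appears two clock edges after the inputs are presented rather than one, so whatever consumes it has to be happy with that.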

This gives me a much more predictable build process. It’s seen me fine, even for a nearly full Spartan3 device with some logic running at 160+MHz DDR.

Of course, if you are right up against the limits of the device speed and you’ve pipelined and constrained and everything else, then tweaking tool parameters is all you have left – anyone in that position has my sympathies!

Version control for FPGAs

@boldport recently asked on Twitter what version control software people used on their FPGA designs. I replied that I use git at home and Subversion at work. The reasons why take a bit more than 140 characters, so I’ve written them here!

Subversion

Work first – we were using Microsoft’s Visual Sourcesafe quite happily. Until it started to lose data on us. Not great for a version control tool!

I reviewed a load of version control systems then, and I selected Subversion as our tool of choice for version control.

One of the reasons for this was the price (not surprisingly) – I wanted to encourage everyone to start using version control for all sorts of things, not just the “softies”. But no-one was going to pay for project managers to have licenses for a paid-for tool.

The TortoiseSVN client integrated nicely with Explorer, so those who like GUIs are well catered for.

It’s a great tool, and has got really wide usage (yes, even amongst project managers).

It works well for FPGA designs too (but then they’re pretty much just text source code anyway!) – I have a flow which can set me up a new FPGA design by pulling starting points from a library space within our repository very quickly. And I have scripted the release process so that I ensure that a TAG is created with the unique buildid of my FPGAs at the same point as the zipfile I release to other developers is created.

One downside to Subversion is that when library code is pulled in through the svn:externals property, the revision of that library code is not locked, so if that tag is pulled at a later date, you can find it pulling a later version of the external reference. There are things you can do about this, but you have to be proactive in doing them.

Merging has also been a pain – one of my FPGAs branched a lot at one stage, and Subversion at that stage had no knowledge of the previous branches. Since Subversion gained extra merging abilities, I haven’t had much opportunity to use them :(

If I were choosing again now, I would go with a distributed system – either Mercurial or Bazaar – both of which felt a bit Unixy (I’m in a minority in liking Unix-like systems :) and didn’t have Tortoise-like clients at the time we were making the decision.

Git

At home, I started using Subversion, but when git came around, I jumped on it. I was quite entertained by Linus Torvalds’ comparison of git and svn – he has a certain way with words :)

Git is certainly not for everyone – it works slightly weirdly compared to Subversion (and indeed Bazaar and Mercurial as far as I can tell).

Starting off is simple, just git init. The speed is brilliant. I love being able to switch between branches instantaneously. And the merging ability is superb.

Again, nothing FPGA specific, it’s just source code.

Inferred state machines in VHDL (vs 2-process-machines of all things!)

A few weeks ago I read a blog post by the illustrious MS researcher Prof. Satnam Singh. He writes about his Kiwi project which he describes as “[trying] to civilise hardware design” – as compared to the explicit writing of state machines. His example is an Ethernet processor which simply swaps the source and destination MAC addresses over and retransmits them. He has code in C#, and it looks a lot like the inferred state machine style of VHDL I’ve been toying with for a while. So (finally) I’ve toyed…

Inferred State machines in VHDL

Have a look at the C# source on the page linked to above, and then come back to see how easily it translates to VHDL. Hardly in need of “civilisation” IMHO :)

architecture inferred_sm_simple of ethernet_echo is
begin  -- architecture inferred_sm 
    echoer : process is
        type t_buffer is array (natural range <>) of std_logic_vector(7 downto 0);
        variable buff        : t_buffer(0 to 1023);  -- buffer is a reserved word in VHDL
        variable start       : boolean;
        variable i, j        : integer;
        variable doneReading : boolean;
    begin  -- process echoer 
        tx_sof_n     <= '1'; -- We are not at the start of a frame 
        tx_src_rdy_n <= '1';
        tx_eof_n     <= '1'; -- We are not at the end of a frame
        start        := rx_sof_n = '0' and rx_src_rdy_n = '0';   -- The start condition 
        main : loop          -- Process packets indefinitely 
            -- Wait for SOF and SRC_RDY 
            while not start loop
                wait until rising_edge(clk);
                exit main when resetn_clk = '0';
                start := rx_sof_n = '0' and rx_src_rdy_n = '0';  -- Check for start of frame 
            end loop;
            -- Read in the entire frame 
            i           := 0;
            doneReading := false;

            -- Read the remaining bytes
            while not doneReading loop
                if rx_src_rdy_n = '0' then
                    buff(i) := rx_data;
                    i       := i+1;
                end if;
                doneReading := rx_eof_n = '0';
                wait until rising_edge(clk); exit main when resetn_clk = '0';
            end loop;

            tx_src_rdy_n <= '0';    -- Source is ready: start transmitting
            -- Now send an Ethernet packet back to where it came from
            -- Swap source and destination MAC addresses
            tx_sof_n     <= '1';
            for j in 6 to 11 loop   -- Process a 6 byte MAC address
                tx_data  <= buff(j);
                tx_sof_n <= '0';
                if j /= 6 then
                    tx_sof_n <= '1';
                end if;
                wait until rising_edge(clk); exit main when resetn_clk = '0';
            end loop;
            for j in 0 to 5 loop    -- Process a 6 byte MAC address
                tx_data <= buff(j);
                wait until rising_edge(clk); exit main when resetn_clk = '0';
            end loop;
            -- Transmit the remaining bytes
            j := 12;
            while j < i loop
                tx_data <= buff(j);
                if j = i - 1 then
                    tx_eof_n <= '0';
                end if;
                j := j + 1;
                wait until rising_edge(clk); exit main when resetn_clk = '0';
            end loop;
            tx_src_rdy_n <= '1';
            tx_eof_n     <= '1';
            start        := false;  -- No longer at start of frame
            wait until rising_edge(clk); exit main when resetn_clk = '0';
            -- End of frame, ready for next frame
        end loop;
    end process echoer;
end architecture inferred_sm_simple;

It’s a very easy translation from one to the other, as you’ll see if you put the two pieces of code side by side. And it comes in at 67 lines. Prof. Singh’s C# version comes in at (if you neglect the equivalent of the VHDL entity, as I did for that version) around 70 lines. So much for the verboseness of VHDL compared to C# :)

There is a horrendous (IMHO!) VHDL version also on the MSDN page, which Prof. Singh describes as “yuk!” and I quite agree. Personally, the two-process style does nothing for me and obscures the design intent of the code. The comparison is not direct as it uses a shift register to store the data in (in a more traditional way), but it’s 89 lines long. If refactored to a single process, you’d save about 15 lines, bringing it to much the same length as the other two versions!

But does it synthesise? Yes… if you have the right tools.

Synplify Pro worked fine. XST doesn’t like the loops within an inferred state machine. And XST has gotten worse recently, as the new parser used for the -6 series of FPGAs doesn’t support inferred state machines at all. You can only use wait until rising_edge(clk) once at the start of a process. I’d be interested to know if Quartus can handle it. [Update – Enrik informs us in the comments that Quartus doesn’t like it either.]

It comes out at about 6000 LUTs and 8200 registers – almost entirely for the buff storage array (8192 registers on its own!), which is read asynchronously, so Synplify is not able to infer a Block RAM. I have inquired of Prof. Singh how large his C# implementation compiles to – that’ll be very interesting. If the large buffer array can be inferred to a blockram that’ll be a huge win for the C# approach!

It has been pointed out that you wouldn’t really want to design this circuit this way (as you end up with a load of flipflops, not a RAM block) – I was aware of this when I started, but it’s an exercise in comparing coding styles across languages, not in creating optimal hardware.
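For contrast, the shape of description that synthesis tools will generally map onto a block RAM has a registered (synchronous) read. The following is a sketch only: the entity name frame_buffer and its ports are my own invention, not part of the echo design, and whether a given tool actually infers a RAM from it depends on the tool and the target device.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-alone buffer, the same size as buff in the echoer
entity frame_buffer is
    port (
        clk     : in  std_logic;
        we      : in  std_logic;
        wr_addr : in  integer range 0 to 1023;
        wr_data : in  std_logic_vector(7 downto 0);
        rd_addr : in  integer range 0 to 1023;
        rd_data : out std_logic_vector(7 downto 0)
    );
end entity frame_buffer;

architecture rtl of frame_buffer is
    type t_buffer is array (0 to 1023) of std_logic_vector(7 downto 0);
    signal buff : t_buffer;
begin
    process (clk) is
    begin
        if rising_edge(clk) then
            if we = '1' then
                buff(wr_addr) <= wr_data;
            end if;
            -- Registered read: data appears one clock after the address,
            -- which is how the hard RAM blocks behave
            rd_data <= buff(rd_addr);
        end if;
    end process;
end architecture rtl;
```

Using something like this in the echoer would mean restructuring the transmit loops to present each address a cycle ahead of when the byte is needed, which is exactly the sort of detail the simple version above glosses over.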

Why would you bother?

Well, it saves you having to think of names for your states. For some state machines this is a boon. The downside of this is that carefully chosen names for states can be self-documenting, which is good. In this example, comments like -- transmit the remaining bytes could probably be removed by having a state called transmit_remaining_bytes (although the lazy typist might abbreviate that to trb and then comment it anyway!) And it’s ~10% terser, and (I think) more readable in this case. Less code is usually good :) (yes, I know using one character variable names and squashing it all up might count as “unreadable less code”, but you have to apply a bit of common-sense as well!)

Downsides

  • It looks weird (but that’s mainly because there’s no code that looks like it in mainstream circulation as far as I’m aware).
  • And you have to type a load of text for each clock tick to infer: wait until rising_edge(clk); exit main when resetn_clk = '0'; This may be one argument for a VHDL preprocessor, as it’s not “encapsulatable” in any other way I’ve figured out yet (suggestions appreciated :). This is a pretty strong down-side IMHO, I loathe repeated code that ought to be encapsulated.
  • And we’ve yet to see how much worse it is resource-usage wise.
  • For very “branchy” state machines it may not work out much different in terms of readability – I haven’t got an example to play with.

Code

The code (should anyone want to have a play) is available from Github