Why use an FPGA?

“Please help me do this on an FPGA”

The question you shouldn’t ask!

A common refrain on many of the internet’s finest help forums and newsgroups is “I’m trying to do x using an FPGA, help!” And very often “x” is a task which would be better (by many different measures!) performed in another way. But there is a common assumption that if a task is “intensive” then FPGAs are the answer. One recent example was asking how to implement face-detection in an FPGA. It quickly became apparent that the poster didn’t actually know how to perform face-detection at all, so adding FPGAs to the equation was not a great help!

For a quick answer to the question “why use an FPGA?”, I’ll reproduce this list that I used in a lecture to a bunch of undergraduates:

Use an FPGA if one or more of these apply:

  • you have hard real-time deadlines (measured in μs or ns)
  • you need more than one DSP (lots of arithmetic and parallelisable)
  • a suitable processor (or DSP) costs too much (money, weight, size, power)

And for students, there’s one more:

  • Because your assignment tells you to :) Although ideally it’ll be something that is at least representative of a reasonable FPGA task (not a traffic light controller or vending machine!)

What to do instead?

Software’s easy

Writing software, I’d hazard a guess that even amongst embedded software engineers (those who work at the really low level, not writing code for PCs), many don’t really know what their target processor looks like under the hood. They just click compile, wait maybe 10 seconds, and test. And that’s great – it makes for a very productive development environment. When you are sufficiently abstracted from the architecture, there’s enough performance from the tools and chips that you don’t (often) have to think hard about how to implement things; you can just get on with the interesting bit: creating your application.

FPGAs hurt

In comparison, FPGAs are painful to use. Don’t get me wrong – the software tools and the silicon architectures have improved massively over the last few years – but compared to writing software, it’s a completely different realm. You have to be much more aware of the architecture of your device, know much more about how the tools operate, wait ages for them to run, and think fundamentally differently about algorithms and implementations. FPGA code takes tens of minutes to compile, it’s much easier to push up against the performance limits, and then you have to mess around with your code to make it more recognisable to the tools you are using.

Choosing an implementation

My advice is always “Avoid using an FPGA unless you have to”. And I say this as a great advocate of FPGAs!

If you can do it in Octave or Matlab on a PC, do so. In fact, even if you end up somewhere else, start from there so you can understand the problem properly.

If you don’t have enough processing power, make use of a GPU.

If that solution costs too much (in money, power, size, weight terms) then you’ll have to get cleverer. Start thinking about microcontrollers. They’re well-tooled up and very powerful. You can have an 80MHz 32-bit ARM for a few pounds (or Euros. Or Dollars) these days, you can do an awful lot with that.

If you’re still struggling for processing power, think about a DSP. But be careful – analyse what you are trying to do very carefully. Figure out which bits will suit a DSP (lots of multiplying and adding in parallel with memory moves) – suddenly you have to know your architecture just to decide if it’s feasible. Be careful about memory bandwidth too: caches are not magic, and if your code requires data reads or writes that are randomly scattered about, expect to lose some performance.

The next stage on might be multiple DSPs… and once you start considering multiple DSPs, it might finally be time to think about an FPGA. The downside is you are responsible for so much more of the architecture. Floating-point maths is becoming a sensible option, but you’ll still want to look at the trade-off between development time using floating point and the device-size, cost and power savings that come from a fixed-point implementation. You can take advantage of your knowledge of data access patterns to tune the memory controller – in fact you’ll probably have to – yet more grovelling around in the details.

Add to this the fact that it’s a lot harder to hire good FPGA people than DSP people (and even harder than microcontroller people), and help on the internet can be harder to come by. Your development time will lengthen as you build simulation models of the hardware you are talking to and have to debug them. And the hour-long build times will try your patience.

But if you have good reason to, go for it!

FPGAs are really well suited to

  • many image and radar processing tasks especially when cost, power and space constrained. (Disclosure: I wrote the first article)
  • financial analysis (when time constrained)
  • seismic analysis (lots of money at stake, the faster you process, the more processing you can do, and the less risk to your drilling)
  • hard-real-time, low-latency deadlines – single-digit microsecond response times to stimulus. See the second page of this flyer – I’ve worked on this project too.

Flash memory through the ages

I was reading bunnie’s recent post on the manufacturing techniques used in USB flash-drives… bare die manipulated by hand with a stick!

Today I found an old (128MB!) SD card from my Palm Tungsten-T. Circa 2005 if I remember rightly. Very different technology, actual chips soldered down on the board. And it’s clear that the SD card form factor was very much defined by the physical size of the NAND flash chips available at the time!

The innards of an old SD card

More ARM FPGA

A while ago I compared Altera and Xilinx’s ARM-based FPGA combos. More information is now available publicly, so let’s see what we know now…

One thing that’s hard to miss is that Altera are making a big thing of their features to support applications with more taxing reliability and safety requirements.

Altera’s external DRAM interface supports error-checking and correction (ECC) on 32-bit wide memory, whereas Zynq can only do this on 16-bit wide memory, allowing Altera to keep a higher-performance system with ECC. The Altera SoCs also claim ECC on the large blocks of RAM within the processor subsystem (ie the L2 cache and peripheral memory buffers). It appears that Zynq only has parity (ie error checking, but not correction) on the cache and on-chip memory. In Xilinx’s favour, they have performed lots of failure testing (they always have – to a heroic degree!) and the entire processor subsystem has a silent data corruption rate of about 15 FIT. Not seen any FIT data for Altera yet.

Both vendors have memory protection within the microprocessor section to stop errant software processes stomping on each other’s data, but Altera appear to have additional protection within the DDR controller too, which presumably protects against accesses from the FPGA fabric going where they shouldn’t. Again, Zynq does not (as far as I can see) provide this feature.

Looking “mechanically”, Altera have devices which are pinout compatible with and without their many-gigabit-transceiver blocks, which would provide one of my applications with a useful development interface which could be dropped in production without a board respin.

Finally, Altera also have a single-core option. Of course, that only makes a difference if it makes the silicon cheap enough to matter in applications which can get away with a single core. Xilinx have clearly decided not… we’ll have to see!

Now running WordPress

This site is now running WordPress, rather than Drupal. Ultimately, I got fed up with the very tedious processes involved in managing a Drupal installation. On top of that, I somehow broke the Image plugin such that I couldn’t upload any more images, and despite much Googling, couldn’t fix it. And this was the second time that had happened – the first time I fixed it by restoring from a backup… but that’s not really a proper solution!

So, here we are in WordPress land – updates involve me clicking a button and waiting… much simpler. Bear with me while I find all the little bits of formatting which are no doubt broken, especially as the markdown had to be hacked on by hand, and the code formatting is not (so far) as well configured as my Drupal setup.

Server upgraded

Parallelpoints.com is now running Squeeze (or Debian 6.0 as it’s more formally known).

I’m not a full-time admin, so I greatly appreciated the Debian upgrade guide – it reminds you of all the stuff you have known in the past, but have “swapped-out”, and what tasks to do in what order. In particular, the kernel changes from Debian 5 to 6 were significant enough that a potentially unbootable system may have resulted.

MySQL broke during the upgrade – the dist-upgrade process appeared to install mysql-client rather than mysql-server, which meant there was no server after rebooting, so if you noticed an error page for a short while, apologies. (Not that I expect anyone noticed, as the background traffic is pretty low :)

And I allowed the upgrade to change more of my apache2 config than I should have – but that was a quick fix.

Stopping Laserjet 5 jams

Thanks to a kit of new rollers from Daytona plc and detailed service manual from HP for my Laserjet 5M, it’s now printing without jams again!

Should last another 100k pages now, I hope.

(Note to self for next time – replace the upper feed roller before the lower one as the sprockets will be easier to engage with the drive belt that way around.)

Aerial (or antenna!) wiring

We’ve been having a bunch of building work done, and today we finally moved the TV back downstairs to the new room! Built the new TV bench, hauled all the kit downstairs, plugged in… “No signal” said the TV. On fighting through the ivy to where the downlead comes down the house, I found that the electrician hadn’t connected it to the new aerial wiring. Apparently even these new-fangled digital signals don’t travel well from one piece of coax to another through several feet of air. My fix involved some bicycle toe-clip mounting brackets, an old mints tin (thanks Ben!) and (the only piece of actual electrical material) a piece of choc-block. You can see the results in the picture – quality bodge or what :)

Splicing coax in a mint tin


FPGAs and ARMs – a summary

Today, I compared the new combined ARM and FPGA devices from Xilinx and Altera. This post summarises that rather long post!

Summary

Well, there’s two interesting new series of devices.

Both chip families look awesome (that’s not a word I habitually use, unlike in some parts of the internet… consider it high praise :). I foresee all sorts of unforeseen applications (if you’ll forgive the Sir Humphrey-ism) enabled by the tight coupling of processor and logic.

Can you choose between them?

Well, Xilinx’s Zynq has more memory tightly coupled with the processors, maybe a little less on the FPGA side. Zynq also has the XADC, which shouldn’t be overlooked. A single-chip radar processor is feasible with the combination of XADC and large scratchpad.

Altera have a more flexible FPGA to processor-memory interface, but Xilinx’s looks eminently good enough.

Xilinx have far fewer details published as yet, so there’s no doubt more good stuff to learn from them, and Altera clearly still have things up their sleeves.

I’ll update here as more information becomes available.

FPGAs considered ARM-full

Xilinx and Altera have both announced FPGAs with hard ARM processors on them. Xilinx have even got a new product family name (Zynq) for them. The products are potential game-changers in some applications. The combination of a high-performance application processor (or two!) tightly coupled to a large array of customisable logic, memory and DSP elements hasn’t really been done like this before.

Consider: a PCIe FPGA board is currently hundreds of microseconds or even milliseconds away from a host processor. With these architectures, the FPGA logic can be as little as a few dozen clock ticks (maybe a single microsecond) away. Altera and Xilinx are claiming 800MHz for the processors’ clock. For intensive applications (image-processing, for instance) the algorithms you can contemplate are different to those which make sense on an Intel processor and memory system. The logic can be tightly coupled to its own memory subsystem as well as the processor’s shared memory, and data can to-and-fro between them with very small latency, making “interactive” software/hardware acceleration a reality.

So, is there any difference between them? I’ve trawled the publicly available information on both platforms (which is not overly detailed as yet) to see what I can glean.

Processor system

Both are using a dual-core Cortex-A9 with NEON extensions – a monster of an embedded processor. In raw clock terms it’ll be 5-10x faster (both vendors claim 800MHz) than a soft-core. There’s double-precision floating-point in the main core and a vectorised SIMD engine for DSP assistance. So, let’s go a bit deeper and compare some more gory details:

Memory

  • Altera: L1 2x32K per core, L2 512K shared, 64K RAM
  • Xilinx: L1 2x32K per core, L2 512K shared, 256K RAM

Very similar – but more tightly-coupled RAM for Xilinx. This feels very significant, particularly in data-intensive applications. And I can’t help feeling these chips are not right unless you have a data-intensive application in mind!

Hard peripherals

  • USB OTG: Both have 2 ports
  • Gigabit Ethernet: Both have 2 ports with support for IEEE1588 timestamping (Altera also claim a Low-Power Idle mode)
  • Controller Area Network (CAN): both sport 2 ports, useful for automotive and industrial networking.
  • IIC: Xilinx have 2 on Zynq, Altera have 4
  • SPI: 2 on each
  • UART: 2 on each
  • SD/SDIO is supported by both vendors – Altera state they can boot from SD.
  • NAND flash – 8 bit for Altera, Xilinx not specified as yet. Xilinx’s static memory interface also supports NOR flash, which Altera are not going to support. They say you can build your own in the fabric, which is fair comment – I’m not sure how much use parallel NOR flash will get when NAND and QSPI are there.
  • Quad-SPI is available on both devices.
  • Both have an array of timers and GPIOs also.

Another significant difference: Xilinx also have an ADC on chip (XADC) which has been in their high-end families for system management for quite a long time. No evidence of Altera providing anything similar. This is quite a useful addition on Xilinx’s part as most reasonably critical systems will want to monitor temperature and power rails at the very least.

To shuffle all the data to and from these peripherals, both have 8-channel DMA engines; the high-performance (USB, Ethernet etc) peripherals also have their own bus-mastering capabilities. Xilinx have dedicated 4 DMA channels to the FPGA fabric. Altera say their DMA and fabric are connected, but nothing more specific. The peripherals on both vendors’ devices are wired to pins through a big multiplexer. The implication is that you can’t use all of the peripherals at the same time, although it looks like some of them can also be routed through the FPGA fabric to other IOs. The high-speed ports are more restricted in their pin options. On Altera’s version they have documented the various options – one obvious niggle is that one of the USB ports shares pins with an Ethernet port. I guess for most use cases it’s either two Ethernets or two USBs that are needed, rarely two of both. But I imagine that’s one of the ports that can’t be sent through the fabric, as ULPI is a bit special.

SDRAM controller

Both vendors have hard DDR controllers with built-in bandwidth management of the array of ports – something quite costly to build into a soft-core memory controller. Both vendors support DDR2, DDR3 and LPDDR2. Altera’s controller also supports LPDDR1. Altera claim support for ECC on both 16- and 32-bit widths. Xilinx aren’t saying at the moment. The SDRAM controllers have many connections to the FPGA fabric, as you’d hope: Altera have sufficient wiring to the FPGA logic to make 3 or 4 bidirectional ports (depending on bus interface) or 1 very wide port (256-bit!) in each direction, or various combinations in between. Xilinx simply have 4 64-bit ports to the FPGA fabric.

Booting

The phrase used in Xilinx’s white paper is “processor-centric”. The Zynq devices are definitely being positioned by Xilinx as completely different beasts to normal FPGAs – hence the family gets its own name. This is quite clearly a processor with an FPGA on the side. Zynq’s processor boots before the FPGA, and then you use the processor to configure the FPGA. Altera are selling their family as more of a middle-ground “FPGA + processor on the same chip”, with boot-flexibility being part of their message. Either the processor or the FPGA part can boot first, with the first up configuring the other part. The processor can boot from QSPI, SD or NAND flash. The FPGA boots (well, OK, configures) with the usual traditional modes (parallel, serial) as well as PCIe – or presumably from the processor system. Personally, I like the Zynq approach – I want to forget the FPGA until I need it.

FPGA fabric

To get data to and from the fabric, Altera have 2 ports mastering to the FPGA (one fast, one slow) and 1 master from the FPGA. In addition there are the memory ports mentioned above. Also worth noting is that the larger devices have 1-3 more hard memory controllers connected directly to the FPGA fabric. Xilinx have 2 ports in each direction between the processor and FPGA, and 4×64-bit ports from FPGA to memory. Any further memory interfaces will have to be built in the fabric, although even the low-end devices get the benefit of the Series 7 IOs, which means the PHY interface requires less heroic use of LUTs as delay lines to match DDR timings.

On the logic side, it looks like Altera are using the same adaptive logic module (ALM) – an 8-input fracturable LUT + 4 FFs + sundry carry chains and muxes – in both Cyclone V and Arria V: from 25K to 460K LEs across 5 family members. (Those must be marketing LEs, not actual ALMs!) Xilinx have a configurable logic block (CLB) – consisting of 8 6-input (somewhat-split-able) LUTs and 16 FFs – again the same in both Artix-based and Kintex-based chips. They are claiming 30K to 235K LEs – again those must be marketing numbers. Those numbers are not directly comparable, but it looks like Altera’s biggest device may be significantly larger than Xilinx’s largest. Both vendors are offering ~200KB to ~2MB of FPGA-based memory (up to nearly 3MB at the top end of Altera’s offering). Yes, those are megabytes, not the usual megabits that FPGAs used to be built with!

Summary

I’ve summarised all this in another post!

Edit: And a follow-up here.