Maximizing parallelism in FPGAs – a discussion

I came across an interesting discussion on FPGAs and parallelism on LinkedIn.

The original poster, Karl Stevens, starts off by asking,

In the context of an FPGA, parallelism is an important item. Which kinds of parallelism apply and how much performance is gained? Compared to what?
Most kinds of parallelism apply to CPUs. FPGAs can do register transfers and ALU functions in parallel, that is, if there are multiple ALUs available. Is an FPGA a special-purpose processor?

and goes on to elaborate:

The kinds of parallelism that I have found:
bit level: simply make the data wider.
instruction level: involves pipelining if the process is done in stages.
data: same calculation on different sets of data, which implies parallel ALUs and data sources.
task: different calculations on the same or different sets of data; same data implies shared data.

These are conceptual and not precise.
Data width is a non-issue.
Pipelining usually implies an “instruction” kind of control where the execution/assign step depends on an operator code and the other stages get the data and dispose of the result.
On-chip memory is so fast that a dual-port RAM can read and write simultaneously in about a third of a typical clock interval. Staging may only apply when external memory is used.
Parallel data: use the same controls for parallel ALUs and memories.
Parallel tasks: repeat the above for the specific task.
The synthesis tools focus on extracting “parallelism” from an imperative program and so far are reported to be almost as good as a college student using classroom assignments as inputs.
When HLS, ESL, or whatever matures, what will be the measurable benefit? Is it the pie-in-the-sky dream that C programmers can design FPGAs? Or will it be performance? How much?
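
To put rough numbers on "how much performance is gained, and compared to what", here is a minimal cycle-count sketch in Python. It is not from the discussion: the function names and the stage and lane counts are illustrative assumptions, and the model is idealized (no memory stalls, no clock-rate differences, perfectly divisible work).

```python
import math

def cycles_sequential(n_items: int, n_stages: int) -> int:
    """Baseline: one unit, one item at a time, n_stages cycles per item."""
    return n_items * n_stages

def cycles_pipelined(n_items: int, n_stages: int) -> int:
    """One pipelined unit: a new item enters every cycle once the pipe is full."""
    if n_items == 0:
        return 0
    # n_stages cycles until the first result, then one result per cycle.
    return n_stages + n_items - 1

def cycles_data_parallel(n_items: int, n_stages: int, n_lanes: int) -> int:
    """n_lanes identical, unpipelined units working on separate items."""
    return math.ceil(n_items / n_lanes) * n_stages

if __name__ == "__main__":
    n, stages, lanes = 1000, 4, 4   # hypothetical workload
    print("sequential:   ", cycles_sequential(n, stages))
    print("pipelined:    ", cycles_pipelined(n, stages))
    print("data parallel:", cycles_data_parallel(n, stages, lanes))
```

Under these assumptions a 4-stage pipeline and four replicated lanes land at roughly the same cycle count. The difference is in cost: the pipeline reuses one set of stage logic, while the lanes replicate the whole unit. That is one way to frame the "compared to what" part of the question.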

In terms of architecture, FPGAs certainly seem well-suited for parallelism. They are like blank slates; you are free to design the logic, balance out the area and speed requirements plus the interfacing to external logic and you’re done.

Highlighting some comments:

  • “Okay, the marketing people tell me that I just throw as much logic at my problem as I like, especially for DSPs, however that depends 100% on how my design works with the rest of the system. So it’s never as simple as that, is it?”
  • “What are the trade-offs between pipelining and parallelism? For example, in a mixer design where you have data sampling followed by processing via an FFT, and then data output?”
  • “Do these questions apply similarly to software architects when designing their systems?”
  • “Does flexibility to place your logic kill performance?”
  • “Would an array of general purpose processors offer similar performance benefits as an array of FPGAs, given that processor technology has been pretty much commoditized?”

Varun Nagpal stated some ways to think about implementing algorithms:

Now algorithms have many associated dimensions which can help to decide whether one needs to make a dedicated circuit architecture on FPGA/ASIC or if a specific class of processor can be used:
- Decomposability: iterative, recursive.
- Control intensive (i.e. reacts to events), data intensive (i.e. transforms input to output), or a combination.
- Data storage requirements and data precision.
- Repetition or frequency of specific operations.
- Latency (ns, µs, ms, s) vs. throughput (MOPS, GOPS, TOPS or POPS).

Mike Field revealed some results that he had gotten for a Mandelbrot design:

I built a Mandelbrot fractal viewer running at 100 MHz on a Spartan-3E. It is inherently parallel, as each pixel can be computed in isolation. It is most probably perfectly suited for FPGA implementation.

It computed “deep” images (those with high average loop iterations per pixel) to within a few percent of the speed of a single 2 GHz AMD core. On a larger FPGA with more 18×18 multipliers the pipeline depth could be doubled or tripled, with a corresponding scaling of throughput. For “shallow” images the throughput is hampered by the limited RAM write bandwidth (it was also servicing a 1024×768 VGA display), so many completed calculations end up being sent through the pipeline for another round.

Speed comes from being able to pipeline the calculation. In software it is a loop of a dozen or so instructions: a complex multiply, something close to a complex abs() function, two comparisons, an 8-bit increment, and a jump. On the FPGA I implemented it as a 12-stage pipeline, which gives a performance of one loop iteration per clock. Careful selection of data representation also allows the best use of the FPGA hardware (36-bit fixed point with external sign), whereas on a CPU you only get the choice of “double” or “float”.
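
For readers who want to see the loop being described, here is a minimal Python sketch of a fixed-point Mandelbrot iteration. It is not Mike's code: `FRAC_BITS`, the helper names and the 255-iteration cap (chosen to match an 8-bit counter) are assumptions for illustration. Python integers are unbounded, so the 36-bit width and the external sign bit are not modeled.

```python
FRAC_BITS = 28            # assumed number of fractional bits (hypothetical)
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0
ESCAPE = 4 * ONE          # |z|^2 > 4 means the point has escaped

def to_fixed(x: float) -> int:
    """Convert a float to the assumed fixed-point format."""
    return int(round(x * ONE))

def fx_mul(a: int, b: int) -> int:
    """Fixed-point multiply; >> floors for negative values in Python."""
    return (a * b) >> FRAC_BITS

def mandelbrot_iterations(cr: int, ci: int, max_iter: int = 255) -> int:
    """Count iterations of z = z*z + c before |z|^2 exceeds 4."""
    zr = zi = 0
    count = 0
    while count < max_iter:                  # comparison 1: iteration limit
        zr2 = fx_mul(zr, zr)
        zi2 = fx_mul(zi, zi)
        if zr2 + zi2 > ESCAPE:               # comparison 2: squared magnitude
            break
        # Complex multiply and add: (zr + i*zi)^2 + (cr + i*ci)
        zr, zi = zr2 - zi2 + cr, 2 * fx_mul(zr, zi) + ci
        count += 1                           # the "8-bit increment"
    return count

if __name__ == "__main__":
    print(mandelbrot_iterations(to_fixed(-0.75), to_fixed(0.1)))
```

Each iteration depends on the result of the previous one, so a deep pipeline can presumably only sustain one iteration per clock if several independent pixels are in flight at the same time. That would be consistent with the remark about completed calculations going round the pipeline again.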

Depending on your point of view, this is either “Just as fast as an AMD CPU” (which is still pretty good), or 20 times faster (on a per clock cycle basis). It is also very green – it has superb “compute per watt” at less than 2.5W compared to a desktop PC.
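
The "20 times faster per clock" figure follows directly from the two clock rates quoted. A small Python sketch of the arithmetic, using only numbers from the post; the variable names are mine, and treating the two throughputs as exactly equal is a simplifying assumption (the post says "within a few %"):

```python
CPU_CLOCK_HZ = 2_000_000_000    # single 2 GHz AMD core
FPGA_CLOCK_HZ = 100_000_000     # 100 MHz Spartan-3E design
PIPELINE_STAGES = 12            # one loop iteration completes per clock

# Same throughput at 1/20th of the clock rate.
clock_ratio = CPU_CLOCK_HZ // FPGA_CLOCK_HZ

# Throughput: one iteration per FPGA clock.
iterations_per_second = FPGA_CLOCK_HZ

# Latency: time for one iteration to travel through all stages.
pipeline_latency_ns = PIPELINE_STAGES * 1_000_000_000 // FPGA_CLOCK_HZ

# If the CPU matches that throughput, it spends about this many
# clock cycles on each loop iteration (assumes exactly equal throughput).
cpu_cycles_per_iteration = CPU_CLOCK_HZ // iterations_per_second

if __name__ == "__main__":
    print(clock_ratio, iterations_per_second,
          pipeline_latency_ns, cpu_cycles_per_iteration)
```

This also separates latency from throughput: each iteration takes 120 ns to pass through the pipeline, yet 100 million iterations complete per second because twelve of them are in progress at once.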

And finally, a pearl of FPGA wisdom from a wise man that Mike met, “FPGA design is the art of balancing size, memory and speed” ; )

– Harnhua
