Of all the good things that OpenCL promises, the most attractive is that the different processors and cores in a multi-compute-core system can be put to full use through a single programming framework. The ability to combine processing modules of different capabilities to perform particular tasks is, of course, the heterogeneous computing concept that has been talked about for years.
Host: "Doesn't matter what device runs my C code,
as long as the most effective one does."
Being in FPGA land most of the time, we naturally dream about how software programmers can easily port relevant functions in C, C++, etc. to FPGAs, smoothly getting performance gains of 10x and more. As usage of FPGAs keeps expanding into new areas, we also want C-to-RTL tools to work really well so that good FPGA designs can propagate and show their value as efficient and fast processing engines. However, because of the formidable difficulties in producing good FPGA-ready RTL for every possible software function, there have been many false dawns over the years in terms of "Is this finally the go-to C-to-RTL tool?"
Certainly there have been glowing assessments of the business-readiness of OpenCL for FPGA flows - in this particular instance, for financial computation. But without going into what the best OpenCL for FPGA or High-Level Synthesis (HLS) tool is, let's take a look at one of the common issues facing a software developer attempting a C-to-RTL flow, an issue that arguably impairs the ease of use and adoption of OpenCL tools. Both software and hardware knowledge (in the traditional meaning of these words) are required in this type of porting effort.
Loop unrolling
In EE and CS classes, we've been exposed to loop unrolling, a technique that generally trades hardware resources for faster execution times. Loop unrolling requires an understanding of data dependencies and, it seems, of the underlying hardware resources. The latter is where confusion and some amount of hair-pulling can occur.
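To make the trade concrete, here is a minimal C sketch of unrolling by hand with a factor of 2. The function names `scale_rolled` and `scale_unrolled2` are made up for this illustration. In software the two versions compute the same result; in an HLS flow, each multiply in the unrolled body could be mapped to its own multiplier, which is where the extra hardware goes.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one multiply per iteration, n iterations. */
void scale_rolled(const float *v, const float *w, float *d, size_t n)
{
    assert(v && w && d);
    for (size_t i = 0; i < n; i++) {
        d[i] = v[i] * w[i];
    }
}

/* Unrolled by 2: two independent multiplies per iteration,
 * so roughly n/2 iterations. */
void scale_unrolled2(const float *v, const float *w, float *d, size_t n)
{
    assert(v && w && d);
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        d[i]     = v[i]     * w[i];
        d[i + 1] = v[i + 1] * w[i + 1];
    }
    /* Leftover element when n is odd (e.g. n = 15). */
    if (i < n) {
        d[i] = v[i] * w[i];
    }
}
```

The two multiplies in the unrolled body do not depend on each other, which is the data-dependency condition that lets them run in parallel.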
Take, for instance, this (dated but still relevant) post:
https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/Why-I-can-only-Unroll-2-times/m-p/461728/highlight/true#M1789
Scenario:
Vivado HLS was used to convert a software function into RTL:
void matpro(const float v[15], float d[15]) {
    char i, i111, i112;
    float b_StateVectors[15];  /* not initialised in the snippet as posted */
    for (i = 0; i < 15; i++) {
        d[i] = v[i] * b_StateVectors[i];
    }
}
There are no dependencies in the computation, and the OP felt that the tool should be able to use FPGA resources to unroll the loop and reduce the number of loop iterations from 15 to some smaller N, down to N = 1 if there are enough resources to do the whole calculation at once. However, the tool's resulting analysis showed that the loop was only unrolled by 2x, plus some pipelining.
Why only 2x?
With advice from another user, the OP found that the answer was to set an ARRAY_PARTITION directive so that the arrays use registers instead of dual-port memories. A dual-port memory allows at most two accesses per clock cycle, so the number of data transfers that can happen in each cycle is severely limited, and the tool had no reason to unroll beyond 2x.
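A sketch of how that advice might look in source form, using Vivado HLS pragmas. The function name `matpro_partitioned`, the `MATPRO_N` macro and the extra `w` argument (standing in for `b_StateVectors`) are assumptions made for this illustration, not the code from the forum thread, and the exact pragma options should be checked against the documentation for your tool version. A regular C compiler ignores the `#pragma HLS` lines (possibly with an unknown-pragma warning), so the function can still be tested in software.

```c
#define MATPRO_N 15

/* Hypothetical variant of matpro: the second operand is passed in as w
 * so that the example has no uninitialised data. */
void matpro_partitioned(const float v[MATPRO_N],
                        const float w[MATPRO_N],
                        float d[MATPRO_N])
{
    /* Ask the tool to split each array into individual registers, so
     * that accesses are no longer limited to two per cycle through a
     * dual-port memory. */
#pragma HLS ARRAY_PARTITION variable=v complete dim=1
#pragma HLS ARRAY_PARTITION variable=w complete dim=1
#pragma HLS ARRAY_PARTITION variable=d complete dim=1

    for (int i = 0; i < MATPRO_N; i++) {
        /* Request full unrolling of the loop body. */
#pragma HLS UNROLL
        d[i] = v[i] * w[i];
    }
}
```

Full partitioning costs registers and routing in proportion to the array size, so for larger arrays a partial partition with a matching unroll factor may be the more realistic choice.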
Implications
One takeaway is that user input is still required during the C-to-RTL conversion in order to achieve an optimal design. The tools are continuously improving, so over time they should become able to apply settings such as those mentioned above automatically. On the user's side of things, deeper knowledge of the FPGA devices and design tools will help bridge the gap, assuming the software engineer is interested in those details.
More to come as we explore this area of optimisation to make InTime tweak C-to-RTL flows.
References
- "Is Altera’s OpenCL SDK ready for business?", 2014, Gordon Inggs, Shane Fleming, David Thomas and Wayne Luk, Imperial College London
- "Altera SDK for OpenCL", https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html
- "Xilinx SDAccel Development Environment", https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html