Optimizing an FPGA HLS Design with FPGA Tool Settings

Generated RTL code from a C-to-RTL tool is not easy to understand. Here is how to increase design performance without changing any RTL.

Story

High-level design lets a design be captured concisely, resulting in fewer errors and easier debugging. However, the oft-repeated concern is the performance trade-off. Achieving high performance in a highly complex FPGA design usually requires manual optimization of the register transfer level (RTL) code, which is impractical for RTL generated by a C-to-RTL development environment. Even so, you can reduce this trade-off by optimizing the design through the FPGA tool settings.

Find the Right FPGA Tool Settings Efficiently

Although designers are aware that FPGA tool settings exist, the settings are often underutilized. Usually, designers turn to tool settings only when they have a timing problem. However, even designs that have already met their performance targets have considerable potential for an additional 10 to 50% performance improvement.

The challenge is selecting the right tool settings, since each FPGA tool presents between 30 and 70 settings for synthesis and place & route. There are too many possible combinations to try exhaustively. You can write scripts to create different runs and try the recommended standard directives or strategies. There are also tools that manage and run design exploration in an automated and disciplined way.
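To get a feel for how quickly the combinations grow, here is a minimal Python sketch. The setting names and values are hypothetical placeholders, not the option names of any real FPGA tool; the point is only to show how a script might enumerate combinations and pick a manageable subset to compile.

```python
import itertools
import random

# Hypothetical settings and values, for illustration only. Real tools expose
# 30 to 70 options under their own names.
SETTINGS = {
    "synth_effort": ["default", "high"],
    "retiming": ["off", "on"],
    "place_directive": ["default", "explore", "extra_timing_opt"],
    "route_directive": ["default", "explore"],
}


def all_combinations(settings):
    """Yield every combination of setting values as a dict."""
    names = list(settings)
    for values in itertools.product(*(settings[name] for name in names)):
        yield dict(zip(names, values))


def sample_runs(settings, count, seed=0):
    """Pick a reproducible random subset of combinations to compile."""
    rng = random.Random(seed)
    combos = list(all_combinations(settings))
    return rng.sample(combos, min(count, len(combos)))


total = len(list(all_combinations(SETTINGS)))
print(f"{total} combinations from only {len(SETTINGS)} settings")

# Even if each of 30 settings had just two values, the search space would be:
print(f"{2 ** 30} combinations for 30 two-valued settings")

for run in sample_runs(SETTINGS, 3):
    print(run)
```

Random sampling is just one simple way to cut the space down; the recommended directives and strategies mentioned above are another starting point.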

The last challenge is insufficient compute power. Typical embedded applications are designed on a single computer, and running multiple compilations requires more compute power. It is a trade-off with time: if you can run more compilations concurrently (for example, in the cloud), the turnaround time will be shorter.
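The trade-off between compute power and time can be estimated with some simple arithmetic. The sketch below assumes every compilation takes the same time and that runs are dispatched in batches; the one-hour compile time is an assumed figure for illustration, not a measurement.

```python
import math


def turnaround_hours(num_compilations, hours_per_compilation, concurrent_runs):
    """Estimate wall-clock time when compilations run in equal-sized batches."""
    batches = math.ceil(num_compilations / concurrent_runs)
    return batches * hours_per_compilation


# 15 compilations at an assumed 1 hour each, on 1, 5 and 15 machines.
for machines in (1, 5, 15):
    hours = turnaround_hours(15, 1.0, machines)
    print(f"{machines:2d} concurrent run(s): about {hours:.0f} hour(s)")
```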

How to Optimize a High-Level Design - Sobel Filter

Here is a reference design, commonly used in video processing, that implements a Sobel filter. It targets an FPGA with a dual ARM® Cortex®-A9 MPCore™.

We use the Xilinx HLS tool to open this design.

Figure 1: Reference Design — Sobel Filter Implementation

The design has a target clock period of 5.00 ns, which corresponds to 200 MHz. According to the timing estimates (see below), it is still missing timing by 506 ps, which translates to about 181 MHz, roughly 10% short of its target speed.

Figure 2: Current Timing Results
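The frequency figure follows directly from the clock period and the timing shortfall. As a quick check, this short Python sketch redoes the arithmetic; the function name is just an illustrative helper.

```python
def achieved_fmax_mhz(target_period_ns, slack_ns):
    """Return the frequency implied by a target period and its timing slack."""
    # Negative slack means the critical path is longer than the target period.
    achieved_period_ns = target_period_ns - slack_ns
    return 1000.0 / achieved_period_ns


target_mhz = 1000.0 / 5.00                # 5.00 ns period
fmax = achieved_fmax_mhz(5.00, -0.506)    # missing timing by 506 ps
shortfall_pct = 100.0 * (target_mhz - fmax) / target_mhz

print(f"target {target_mhz:.0f} MHz, estimated {fmax:.1f} MHz, "
      f"{shortfall_pct:.1f}% short")
```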

Export to an RTL Project

Without changing the C++ code, export the design into a Vivado project in RTL. Under "Solution", select "Export RTL".

Figure 3: Export to a Vivado project from HLS

This will execute Vivado in the background and generate a project file (XPR). It should also compile the design, and you should see the actual timing details in the console. Once it is done, locate the project file in the /solution/impl/verilog/ folder.

Figure 4: Locate the Vivado project file
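If you prefer to locate the project file from a script rather than a file browser, something like the following Python sketch could work. The directory name `solution1` is a hypothetical example; substitute the path of your own HLS solution.

```python
from pathlib import Path


def find_vivado_projects(solution_dir):
    """Return all .xpr files found anywhere under solution_dir, sorted."""
    root = Path(solution_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.xpr"))


# "solution1" is a placeholder name for the HLS solution directory.
for project in find_vivado_projects("solution1"):
    print(project)
```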

You will find an XPR file. You can verify it by opening it in Vivado, where you can see the generated RTL source.

Figure 5: Generated RTL from HLS

Time to Optimize

The next step is to use a design exploration tool called InTime. (Again, you are free to write a script yourself to try the standard directives or strategies available in the Vivado tools.) You can run InTime on-premise with a free evaluation license. Alternatively, register a Plunify cloud account, which comes with some free credits and pre-installed FPGA tools.

After starting InTime, open the project file. When prompted for the Vivado version to use, choose the same version as your HLS tool. For example, if you are using 2017.3 HLS, use Vivado 2017.3.

Select the "Hot Start" recipe. The "Hot Start" recipe is a recommended list of strategies based on previous experience with other designs.

Figure 6: Select the Hot Start recipe

Click "Start Recipe" to start the optimization. If you are running on the cloud, you should run multiple compilations concurrently to reduce turnaround time.

Optimization Process and Results

After the first round ("Hot Start" recipe), the best result is the "hotstart_1" strategy. However, it is still missing timing by 90 ps.

We then applied a second recipe called "Extra Opt Exploration" to the result from "HotStart_1". This recipe focuses on optimizing the critical paths. It is an iterative optimization that repeats as long as each iteration shows improvement. It stops automatically once it meets the timing target or fails to show improvement.
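The stopping rule described here (keep iterating while results improve, stop when timing is met or progress stalls) can be sketched as a simple loop. This is not InTime's actual implementation; `compile_round` is a hypothetical callback standing in for one compilation, and the slack values in the demo are invented purely to exercise the loop.

```python
from typing import Callable, List


def iterative_optimize(compile_round: Callable[[int], float],
                       start_slack_ns: float,
                       max_rounds: int = 20) -> List[float]:
    """Repeat compilations until timing is met or a round fails to improve."""
    history = [start_slack_ns]
    best = start_slack_ns
    for round_index in range(max_rounds):
        if best >= 0:
            break  # timing target met
        slack = compile_round(round_index)
        history.append(slack)
        if slack <= best:
            break  # no improvement, so stop exploring
        best = slack
    return history


# Invented slack values (in ns) for illustration only, not measured results.
fake_slacks = [-0.060, -0.025, 0.010]
history = iterative_optimize(lambda i: fake_slacks[i], start_slack_ns=-0.090)
print(history)
```

A real flow would replace the lambda with a function that launches a compilation and parses the worst slack from its timing report.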

Figure 7: Timing closure with just tool settings

After two rounds of optimization with a total of 15 compilations, the design met its performance target of 200 MHz. This was achieved without any changes to the RTL source code.

Next Level of Performance

Getting to the next level of performance requires optimization on all fronts: the architecture, the code and the tools. Tool settings exploration can overcome the performance trade-offs of higher-level design without giving up the productivity benefits it brings in the first place. It is a win-win for the high-level designer.
