Vivado Floorplanning – What fails and how to make it work for you.

tldr version:
- Vivado's default floorplan had bad timing.
- Customer ran implementation directives and created their own floorplans, but was still unable to meet the timing target.
- We improved customer's floorplan by generating new intra-die (SLR) floorplans.
- Finally met the timing target using our floorplan and implementation directives.

Figure 1: Floorplan evolution from default Vivado floorplan to Human+InTime generated floorplan

Why Floorplan?

Vivado's decision to place a certain instance or cell can sometimes be mind-boggling and/or detrimental to overall performance. As many of AMD's FPGA devices are multi-die, the die crossings have become the Rubicon of timing closure. Since the P&R decision-making process is not transparent to users, two years ago we decided to investigate and jump into the deep end of P&R: floorplanning. We developed our own tools to make the task easier. In this post, we describe a customer's floorplanning challenges and elaborate on how to solve them with these new analysis tools.

What is Wrong with Default Floorplans?

By default, Vivado automates global floorplanning during the "place_design" step. You can see it around "Phase 2.1" in the Vivado log file. During this phase, Vivado decides where major blocks of the design should sit. After that, it proceeds to perform detailed floorplanning and the rest of the placement process. The result of this phase has a major influence on your design's performance.

To demonstrate the problem, we will use a customer design on an Alveo XCU55C device. Figure 2 below shows the results of a default compilation. The design comprises four major blocks using AXI interfaces, highlighted in red, green, blue and yellow respectively. Let us name them blocks '0', '1', '2' and '3'. Each of them connects to a switch (in magenta).

At first glance, this looks like any other floorplan. However, there is one major issue.

Significant Penalties via the Clock Column

There are two sources of major delay penalties to be aware of for this device. One is the inter-SLR crossings (in blue in Figure 3 below), and the other is the "clock" column in the middle of each die (in red). Every time a signal crosses either of them, it takes a delay penalty of 0.5ns to 2ns or even higher, depending on congestion and other factors. These crossing penalties are exactly what hurt the customer's design.

Figure 3: Inter SLR crossings (blue) and clock columns (red) have massive delay penalties
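
To get a feel for how these penalties add up, here is a minimal sketch that counts the crossings on a route between two placement sites. The site model (an SLR index plus a left/right half) and the per-crossing delay values are our assumptions for illustration, picked from within the 0.5ns to 2ns range quoted above. Real penalties depend on congestion and routing, so treat the numbers as placeholders.

```python
from dataclasses import dataclass

# Assumed per-crossing penalties in ns, chosen from the 0.5-2 ns range
# mentioned in the text. Real values vary with congestion and routing.
SLR_CROSSING_NS = 1.0
CLOCK_COLUMN_CROSSING_NS = 0.5


@dataclass(frozen=True)
class Site:
    slr: int   # die index, 0 = bottom SLR
    half: str  # "L" or "R" of the clock column within that SLR


def crossing_penalty_ns(src: Site, dst: Site) -> float:
    """Estimate the extra delay from die and clock-column crossings."""
    slr_crossings = abs(src.slr - dst.slr)
    column_crossings = 1 if src.half != dst.half else 0
    return (slr_crossings * SLR_CROSSING_NS
            + column_crossings * CLOCK_COLUMN_CROSSING_NS)
```

With these assumed values, a route from the left half of SLR0 to the right half of SLR2 picks up two die crossings and one clock-column crossing, while a route that stays inside one half of one SLR picks up none.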

Looking at Vivado's default floorplan below (Figure 4), on the plus side, Vivado is smart enough to ensure that none of the major blocks is spread across SLR boundaries. However, some of these major blocks ("0" and "3") were spread across the middle of an SLR. This inevitably resulted in high delay penalties (3.7ns!) as some routes must cross the middle clock column. Since the placement decision making is opaque to users, we can only speculate that higher-priority factors, such as congestion trade-offs, are influencing this decision.

These crossings are not the only source of timing violations. Even on routes that do not cross, there is negative slack within each of the major blocks, likely due to high-fanout nets and localized congestion.

Here comes Human-powered Floorplanning. But ...

The customer intuitively understood these crossing issues. At a global level, they divided the four major blocks among halves of the different SLRs (see "Human Floorplan" in Figure 5). Blocks 0 and 1 are placed on the left halves of the SLRs since they communicate with HBM (high bandwidth memory) on the left side, while Blocks 2 and 3 are floorplanned on the right halves for the same reason.

The customer also had to painstakingly configure the AXI register slice parameters by hand to optimize timing for AXI blocks that span two or three SLRs. This Human Floorplan avoids almost all the clock column crossing issues, which brings them to the next challenge: congestion within the crossing-free regions.

Figure 5: Comparing Vivado versus Human floorplans. The Human wins with a 56% improvement in WNS.
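
A half-SLR assignment like the one described above is normally expressed as pblock constraints. The sketch below is a small Python helper that prints the XDC lines for one pblock, using the Vivado commands create_pblock, resize_pblock and add_cells_to_pblock. The instance name and the clock region range in the example are hypothetical placeholders, not the customer's actual constraints.

```python
def pblock_xdc(name, cell_pattern, region_range):
    """Return XDC lines that create a pblock and assign cells to it."""
    return [
        f"create_pblock {name}",
        f"resize_pblock [get_pblocks {name}] -add {{{region_range}}}",
        f"add_cells_to_pblock [get_pblocks {name}] [get_cells {cell_pattern}]",
    ]


if __name__ == "__main__":
    # Hypothetical example: one major block pinned to a 4x4 range of
    # clock regions. Adjust the names and ranges to your own device.
    for line in pblock_xdc(
        "pblock_block3",
        "design/block3",
        "CLOCKREGION_X4Y8:CLOCKREGION_X7Y11",
    ):
        print(line)
```

Generating the constraints from a small table of (block, region) pairs makes it easier to move a block to another half of an SLR and recompile than editing the XDC file by hand.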

Congestion within Major Blocks

The next step is to deal with all the timing violations within each of the four major blocks. Very often, the violations have to do with routes to/from RAM blocks. This task requires finer-grained floorplanning of the smaller instances within the major blocks. One simple heuristic is to determine how data flows through the connected blocks, then space out the blocks accordingly (see Figure 6). This is a very logical approach to take in our opinion.

Figure 6: Simple heuristic - Black arrows represent signals in/out of this block. White arrows signify internal data flow

All these improvements come at a significant cost in terms of time and effort. Remember that you have to do this for all four major blocks. In conclusion, the final result of the Human Floorplan is WNS = -430ps. With post route physical optimization, the design is able to achieve about -300ps.

Although this is a vast improvement, the result is still not acceptable, as the target WNS is -50ps or better. So how do we improve this floorplan? Or do we rip it out and start afresh?

A better Human Floorplan with InTime "PCO" Floorplanning

"PCO" is a set of detailed floorplanning features to help users make better decisions within an SLR or in a relatively uniform single-die region.

1. Decide Cuts For Main Blocks

The first step is to view the design netlist as a graph. This can be done using InTime Analysis. When viewing it for the first time, one may be taken aback by the complex blob of interconnected nodes and edges. This is what modern FPGA software looks at when it is deciding how to floorplan.

One of your first observations should be that the graph's structure resembles the hierarchy we described earlier. At this level, we start to discern groups of related logic (see Figure 7 below). We nicknamed this type of graph an "octopus graph" because of the presence of distinct octopus-like masses. (Maybe there is a proper technical term for such graph structures. If you know it, please let us know!)

The objective of this step is to partition the design into groups such that Vivado is able to place each group without (or with minimal) crossings - a problem that we highlighted from the beginning. In Vivado, this phase is called "Global Floorplanning". We can guess at the designer's intuition in splitting the design into 4 main blocks, each occupying half an SLR. This idea is corroborated by the graph in Figure 8 below. Each "octopus" is found to match one of the four blocks.

To keep track of each group of related logic, we enumerated them. While we could develop an overly complex set of algorithms to pick the groups, asking for the designer's opinion is much better and easier at this point. (In future versions, we may offer suggestions on what should be grouped.)

You don't have to be overly concerned about what gets included in each group. The accuracy is not critical as your role is to provide guidance. The priority at this stage is to avoid unnecessary crossing routes which incur large delay penalties.

In the above example, the instance "A/B/*" plus some other smaller instances are represented in group "3". With this in mind, we decided to specify "A/B/*" as one of the main instances for "PCO" floorplanning and to follow the user's early decision to assign it to the right of SLR2. We repeated this for the rest of the "octopus tentacles".
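
To make the "octopus" idea concrete, here is a minimal sketch of one way to pull groups out of a netlist graph: remove the central hub (the switch, in this design) and treat each remaining connected component as a candidate group. This is only an illustration under that assumption and is not how InTime selects groups. The node names in the usage note are made up.

```python
from collections import defaultdict, deque


def group_around_hub(edges, hub):
    """Split a graph into connected groups after removing the hub node."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if hub in (a, b):
            continue  # drop edges to the hub so the "tentacles" separate
        adjacency[a].add(b)
        adjacency[b].add(a)

    nodes = {n for edge in edges for n in edge if n != hub}
    groups, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        queue, members = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            members.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(sorted(members))
    return groups
```

For example, with edges switch-a0, a0-a1, switch-b0, b0-b1 and hub "switch", the function returns the two groups [a0, a1] and [b0, b1]. On a real netlist the groups are rarely this clean, which is why the designer's guidance still matters.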

2. Generate Detailed Floorplans

Each major block consists of about 10 smaller blocks. To each small block, we assigned a number based on its resource utilization, where smaller numbers represent bigger blocks. (Size is loosely represented by the total number of LUTs/RAM/DSP/FFs that a block contains.)
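
The numbering scheme can be sketched in a few lines. The version below assumes that "size" is simply the sum of the raw LUT, RAM, DSP and FF counts, which is a crude reading of the loose definition above. The block names and counts in the usage note are hypothetical.

```python
def rank_blocks_by_size(utilization):
    """Number blocks 1..N so that the largest block gets the smallest number."""

    def size(name):
        # Assumption: size is the plain sum of all resource counts.
        return sum(utilization[name].values())

    # Sort by descending size; break ties by name for a stable numbering.
    ordered = sorted(utilization, key=lambda name: (-size(name), name))
    return {name: rank for rank, name in enumerate(ordered, start=1)}
```

Given a block "mac" with 900 LUTs, 1200 FFs and 64 DSPs, a block "fifo" with 100 LUTs, 200 FFs and 2 RAMs, and a block "ctrl" with 50 LUTs and 80 FFs, the function numbers them 1, 2 and 3 respectively. A weighted sum would be a reasonable refinement, since one RAM or DSP site is much scarcer than one LUT.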

Within a major block, InTime picks the significant smaller blocks and floorplans them based on how strongly each block is connected to the others. The minimum region of assignment is a clock region. The objective is to leave adequate space between the blocks while packing them closely enough for optimal performance. As you can guess, the number of possible solutions is extremely high.

For example, if we select five significant smaller blocks to floorplan in a 4x4 grid (half SLR or 16 clock regions), you can easily generate many possible solutions. Two such solutions, 'A' and 'B', are shown below in Figure 9. Each block has a different shape and size, and a unique border color.

Figure 9: Two sample floorplans for five blocks of different sizes
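
A quick back-of-the-envelope count shows why the solution space gets large. The sketch below makes two simplifying assumptions of ours: first, that each of the five blocks occupies exactly one clock region of its own, and second, that block shapes are axis-aligned rectangles of clock regions. Real floorplans mix shapes and sizes, so the true count differs.

```python
import math

GRID_COLS, GRID_ROWS = 4, 4  # half an SLR = 16 clock regions, per the text
NUM_BLOCKS = 5


def single_region_placements(cols, rows, blocks):
    """Ways to give each block its own distinct clock region."""
    return math.perm(cols * rows, blocks)


def rectangle_count(cols, rows):
    """Number of axis-aligned rectangles of clock regions in the grid."""
    return math.comb(cols + 1, 2) * math.comb(rows + 1, 2)


if __name__ == "__main__":
    print(single_region_placements(GRID_COLS, GRID_ROWS, NUM_BLOCKS))
    print(rectangle_count(GRID_COLS, GRID_ROWS))
```

Even under the one-region-per-block simplification there are 16 x 15 x 14 x 13 x 12 = 524,160 placements, and each block could instead take any of 100 rectangular shapes in the 4x4 grid. Enumerating and judging these by hand is not practical.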

Since there are so many ways to arrange these blocks, we created a metric to rank the solutions: a "connectivity score". In a nutshell, the connectivity score measures the distance between each pair of groups, taking into account their resource usage and connectivity. If you pack the blocks too tightly, the region becomes over-congested. If they are too far apart, the timing will not be good.

The InTime software tool does the heavy lifting of generating and evaluating thousands of miniature floorplans, and shortlists the best solutions (there can be more than one).
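
The exact formula behind the connectivity score is not spelled out in this post, so the following is only a toy stand-in that captures the same trade-off. It assumes each block sits in a single clock region, charges connection weight times Manhattan distance for every pair of blocks, and adds a penalty when the blocks sharing a region exceed an assumed capacity. Lower is better in this sketch. All names, weights and numbers are hypothetical.

```python
from itertools import combinations, product


def score(placement, connections, usage, capacity, crowding_weight=10.0):
    """Lower is better: weighted distance plus an over-capacity penalty."""
    distance_cost = 0.0
    for a, b in combinations(sorted(placement), 2):
        weight = connections.get((a, b), 0) + connections.get((b, a), 0)
        (ax, ay), (bx, by) = placement[a], placement[b]
        distance_cost += weight * (abs(ax - bx) + abs(ay - by))

    # Sum the resource demand of all blocks that share a region.
    demand = {}
    for block, region in placement.items():
        demand[region] = demand.get(region, 0) + usage[block]
    overflow = sum(max(0, d - capacity) for d in demand.values())
    return distance_cost + crowding_weight * overflow


def best_placements(blocks, regions, connections, usage, capacity, keep=3):
    """Exhaustively score every block-to-region assignment; keep the best."""
    scored = []
    for chosen in product(regions, repeat=len(blocks)):
        placement = dict(zip(blocks, chosen))
        scored.append((score(placement, connections, usage, capacity), chosen))
    scored.sort()
    return scored[:keep]
```

Stacking every block into one region drives the distance term to zero but triggers the overflow penalty, while spreading blocks far apart does the opposite. Exhaustive search only works for tiny cases like this one. For realistic sizes some form of pruning or heuristic search would be needed.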

3. Final Floorplan

After iterating over the possible solutions, the final solution looks something like the one below. Compared to the Human Floorplan, it is 23% better in terms of the connectivity score that we defined earlier. There are several advantages over the human-generated floorplan.

Figure 10: Compared to the human floorplan, the InTime-generated floorplan's connectivity score is 23% better.
The lavender regions used to be in the same pblock in the human floorplan, but are now separated in the InTime floorplan.

A. Selection of Cells

Typically designers floorplan their designs using the design's hierarchy as a reference. For example, we will intuitively add cells belonging to the same hierarchy XXX/YYY/* to the same floorplanning region, where XXX/YYY are hierarchical instance names. Our human brains are constrained by how the design is structured and named, and find it very difficult cognitively to select different groups of cells from different hierarchy levels and then add them to a floorplanning region.

InTime automatically groups related logic using a graph. You have the flexibility of specifying hierarchical instance names to InTime, and InTime will automatically split and group them for you. Alternatively, you can specify the generated groups directly, based on the graph.

Compare the Human floorplan versus the InTime floorplan above in Figure 10. The purple regions are floorplanned by the user into a region that spans clock regions X1Y10 to X3Y12. However, InTime splits the same logic into three smaller blocks, mixed with cells from other instances since this is more optimal from the graph's point of view.

B. Solve Overlapping Regions

It is hard for designers to decide whether two regions should overlap, by how much, and whether they can overlap at all given resource constraints. These questions also relate to how closely you should pack your design. In newer FPGA devices, clock regions are less uniform. One common misconception among users is that the available resources are the same on the left and right sides of an SLR. InTime solves this problem automatically since it knows the exact utilization of the cells to be floorplanned versus the resource availability. It also takes into account the connectivity between groups of cells.
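
The resource side of this check is simple to sketch: add up what the chosen clock regions offer and compare it with what the cells need. The function below is our own illustration, not InTime's implementation, and the region names and per-region counts in the usage note are invented to show a left/right imbalance.

```python
def overflow_by_resource(demand, regions, availability):
    """Return each resource whose demand exceeds the regions' total supply."""
    total = {}
    for region in regions:
        for resource, count in availability[region].items():
            total[resource] = total.get(resource, 0) + count
    # Keep only the resources that fall short, with the size of the shortfall.
    return {
        resource: needed - total.get(resource, 0)
        for resource, needed in demand.items()
        if needed > total.get(resource, 0)
    }
```

Suppose a hypothetical left-side region offers 48 block RAMs and a right-side region only 24. A group that needs 40 block RAMs fits on the left (the function returns an empty result) but is 16 short on the right, even though both regions look the same size on the device view.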

4. Final Step - Run with directives

The InTime floorplan is able to produce post-place WNS that is 2x better than the customer's Human floorplan and post-place TNS that is 10x better. With the improved InTime floorplan, we then ran compilations that explored different implementation directives, and one of the results achieved the target timing.

Figure 11: PCO Floorplan with and without directives
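
A directive sweep of this kind can be scripted. The sketch below generates one set of Tcl implementation commands per place/route directive pair and then filters finished runs against the WNS target. The directive names are standard Vivado place_design and route_design directives, but this particular shortlist, the run naming and the result format are our assumptions, not the exact recipe used for the customer.

```python
from itertools import product

# A small, arbitrary shortlist of Vivado directives to explore.
PLACE_DIRECTIVES = ["Explore", "ExtraNetDelay_high", "SSI_SpreadLogic_high"]
ROUTE_DIRECTIVES = ["Explore", "NoTimingRelaxation"]


def sweep_commands(place_directives, route_directives):
    """Yield (run_name, tcl_lines) for every place/route directive pair."""
    for place, route in product(place_directives, route_directives):
        name = f"place_{place}__route_{route}"
        yield name, [
            f"place_design -directive {place}",
            "phys_opt_design",
            f"route_design -directive {route}",
        ]


def runs_meeting_target(wns_by_run, target_wns_ns):
    """Return the names of runs whose WNS meets the target, best first."""
    passing = [
        (wns, name) for name, wns in wns_by_run.items() if wns >= target_wns_ns
    ]
    return [name for wns, name in sorted(passing, reverse=True)]
```

With a target of -50ps, you would call runs_meeting_target with a target of -0.050 and the WNS values (in ns) parsed from each run's timing report.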

Challenges with Floorplanning

Vivado (and every FPGA tool) doesn't listen to you

Typically we do not floorplan every block in the design, only the blocks we wish to target. One assumption is that the non-floorplanned blocks will remain in the same regions as in your last compilation. This is not true. Very often, a random seed, a directive, or a property can jolt an entire block to a new position. This volatility has knock-on effects and can impact your WNS or TNS target.

Figure 12: Another design with vastly different regions for the non-floorplanned cells (indicated by white arrows)

Solution? Our advice for limiting volatility is to floorplan and fill up a whole SLR as much as you can. For example, if you are targeting a block for SLR0, try to floorplan the whole of SLR0. This leaves no room for other blocks to be inserted into SLR0, where they would cause unnecessary congestion and worsen the very timing violations that you are trying to solve. This is one of the reasons PCO was created.

To learn more or share with us your floorplanning experiences, contact us directly at tellus@plunify.com or approach one of our sales representatives.