Meeting performance and data throughput challenges with a new process technology

18-12-2015 | By Allan Davidson

Altera’s Allan Davidson discusses the HyperFlex architecture, based on Intel’s 14 nm Tri-Gate process, and its advantages.

Technology trends are accelerating. In the world of high-speed networking, 100 Gbps Ethernet is starting to take over from 40 Gbps, while the work of the IEEE’s task force establishing the 400 Gbps standard is well under way. According to Cisco, wired devices accounted for nearly 55 percent of IP traffic in 2011, while wireless data traffic is predicted to grow to 11.2 exabytes during 2017. Network speed remains the critical factor in this growth, but others, such as reducing power consumption, are having an increasing impact, driven by environmental concerns and energy cost savings. Across many industries, equipment manufacturers are increasingly looking to FPGAs to meet this need for speed.

Overall system performance is largely determined by data throughput, and for many devices, such as FPGAs, the most commonly used technique to increase throughput is to make on-chip buses wider and wider. It is not uncommon to use 512-bit, 1,024-bit, or even wider buses in FPGAs, but this approach increases resource utilisation and power dissipation, and it complicates high-speed comparator functions or checksums that must be performed across every bit of the bus.
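The pressure towards wide buses follows from simple arithmetic: throughput is roughly the bus width multiplied by the clock frequency. The short Python sketch below illustrates this; the clock frequencies are assumed values chosen for illustration, not Altera figures, and protocol overhead is ignored.

```python
import math

def required_bus_width(throughput_gbps: float, clock_mhz: float) -> int:
    """Minimum bus width in bits to carry throughput_gbps at clock_mhz,
    assuming one bus word moves per clock cycle and no protocol overhead."""
    bits_per_second = throughput_gbps * 1e9
    cycles_per_second = clock_mhz * 1e6
    return math.ceil(bits_per_second / cycles_per_second)

def next_power_of_two(n: int) -> int:
    """Round a width up to the next power of two, a common bus sizing choice."""
    return 1 << (n - 1).bit_length()

# Assumed core clock rates (MHz) for a 400 Gbps data path.
for clock_mhz in (200, 400, 800):
    raw = required_bus_width(400, clock_mhz)
    print(f"{clock_mhz} MHz: {raw} bits minimum, {next_power_of_two(raw)}-bit bus")
```

Under these assumptions, doubling the clock frequency halves the bus width needed for the same throughput, which is why raising core performance is an alternative to ever wider buses.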

As process geometries continue to shrink, performance increases, but interconnect delays start to dominate the FPGA’s total delay. A better process alone is therefore not enough: the device architecture also needs to change in order to address the interconnect issue.

An example of a new architecture that takes an innovative approach to the interconnect challenge is HyperFlex, used in Altera’s recently announced Stratix 10 FPGA. Based on Intel’s 14 nm Tri-Gate process, the new architecture uses a “registers everywhere” concept that adds bypass-able Hyper-Registers to every routing segment in the FPGA core and at all functional block inputs. Figure 1 shows a bypass-able Hyper-Register where the routing signal can bypass the register and go straight to the multiplexer, or go through the register first. The multiplexer is controlled by one bit of the FPGA configuration memory (CRAM).


Figure 1. Bypass-able Hyper-Register.
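The behaviour of a bypass-able register can be sketched in a few lines of Python. This is a simplified behavioural model written for illustration, not Altera's implementation; the class and method names are hypothetical, and `cram_bit` simply stands in for the configuration-memory bit that drives the multiplexer.

```python
class HyperRegister:
    """Illustrative behavioural model of one bypass-able register."""

    def __init__(self, cram_bit: bool):
        # cram_bit models the configuration bit on the multiplexer select:
        # True = route through the register, False = bypass it.
        self.cram_bit = cram_bit
        self.q = 0  # value currently held in the register

    def output(self, routing_in: int) -> int:
        """Multiplexer output for the current cycle."""
        return self.q if self.cram_bit else routing_in

    def clock_edge(self, routing_in: int) -> None:
        """On a clock edge the register captures the routing signal."""
        self.q = routing_in


def run(reg: HyperRegister, samples: list) -> list:
    """Feed one sample per clock cycle and collect the multiplexer output."""
    out = []
    for s in samples:
        out.append(reg.output(s))
        reg.clock_edge(s)
    return out


print(run(HyperRegister(cram_bit=False), [1, 0, 1, 1]))  # bypassed
print(run(HyperRegister(cram_bit=True), [1, 0, 1, 1]))   # registered
```

With the bit cleared the signal passes straight through; with the bit set the same signal appears one clock cycle later.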

Figure 2 shows a small section of the FPGA fabric with nine adaptive logic modules (ALMs) and the interconnect routing that connects them. The Hyper-Register locations are indicated by the squares at the intersection of each horizontal and vertical routing segment.


Figure 2. Registers Everywhere HyperFlex Architecture.

To maximise the performance of a design using this architecture, developers should follow a three-step process based on register retiming, pipelining and design optimisation. The Hyper-Registers allow familiar design techniques to push the performance of the design significantly past what is possible with conventional FPGA architectures. When Hyper-Registers are used in place of traditional registers, the techniques are called Hyper-Retiming, Hyper-Pipelining and Hyper-Optimisation. Table 1 shows the likely performance gain achieved by each of these steps.


Table 1. Three-Step Process to Maximise Performance Using the HyperFlex Architecture

Retiming the design using the Hyper-Registers in the interconnect routing requires very little developer effort, yet yields an average performance gain of 1.4x. This improvement is measured on Altera’s Stratix 10 devices against the previous generation of high-performance FPGA devices. To remove critical paths, Hyper-Retiming moves registers out of the ALMs and into the interconnect, balancing register-to-register delays and allowing the design to run at a faster clock frequency.

Because there are Hyper-Registers throughout the interconnect, the register location is fine-grained. Conventional retiming requires additional FPGA logic and routing resources and requires the design to be recompiled, refitted, and rerouted. In contrast, Hyper-Retiming does not use any additional FPGA resources and is performed after place-and-route, providing a significant core performance boost with little or no designer effort.
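The idea of balancing register-to-register delays can be illustrated with a toy model. In the sketch below, a path is a list of segment delays and a register may sit at any boundary between segments; the delay values are invented, and the exhaustive search is only a stand-in for illustration, since the article does not describe the algorithm the Hyper-Retimer actually uses.

```python
from itertools import combinations

def stage_delays(delays, cuts):
    """Delay of each register-to-register stage.
    A cut at index i means a register sits between delays[i-1] and delays[i]."""
    bounds = [0, *sorted(cuts), len(delays)]
    return [sum(delays[a:b]) for a, b in zip(bounds, bounds[1:])]

def critical_path(delays, cuts):
    """The slowest stage sets the minimum clock period."""
    return max(stage_delays(delays, cuts))

def retime(delays, n_registers):
    """Try every placement of n_registers and return the best-balanced one."""
    positions = range(1, len(delays))
    return min(combinations(positions, n_registers),
               key=lambda cuts: critical_path(delays, cuts))

# Invented segment delays in picoseconds along one path.
delays_ps = [1000, 500, 2000, 500, 1000, 1000]
original = (1, 2)                 # two registers bunched near the start
retimed = retime(delays_ps, 2)    # same two registers, moved

for name, cuts in (("original", original), ("retimed", retimed)):
    worst = critical_path(delays_ps, cuts)
    print(f"{name}: cuts={cuts}, critical path={worst} ps, "
          f"fmax={1e6 / worst:.0f} MHz")
```

The number of registers, and therefore the latency in clock cycles, is unchanged; only their positions move, which is what distinguishes retiming from pipelining.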

In the case of Hyper-Pipelining, the design is pipelined and retimed using the Hyper-Registers. This technique requires minor user effort and results in an average performance gain of 1.6x for Stratix 10 devices compared to previous generation high-performance FPGAs. Hyper-Pipelining eliminates long routing delays by adding pipeline stages in the interconnect between the ALMs, allowing the design to run at a faster clock frequency. Again, the Hyper-Registers located throughout the interconnect allow a fine-grained selection of the register location. As with Hyper-Retiming, Hyper-Pipelining does not use additional FPGA logic and routing resources, and it is done after place-and-route.
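A rough calculation shows why extra pipeline stages raise the clock frequency. The sketch below assumes, for simplicity, that a long route can be cut into equal pieces; the route delay and target frequencies are invented numbers, and the function names are hypothetical.

```python
import math

def stages_needed(route_delay_ps: int, target_mhz: float) -> int:
    """Pipeline registers to add so each piece of the route fits in one
    clock period. Assumes the route can be divided into equal pieces."""
    period_ps = 1e6 / target_mhz
    pieces = math.ceil(route_delay_ps / period_ps)
    return max(pieces - 1, 0)

def fmax_mhz(route_delay_ps: int, n_stages: int) -> float:
    """Clock limit imposed by the route once n_stages registers split it."""
    return 1e6 / (route_delay_ps / (n_stages + 1))

route_ps = 4000  # invented delay of one long routing path
for target in (250, 500, 800):
    n = stages_needed(route_ps, target)
    print(f"target {target} MHz: add {n} stage(s), "
          f"route then supports {fmax_mhz(route_ps, n):.0f} MHz")
```

Each added stage also adds one clock cycle of latency, so the surrounding logic must tolerate the extra delay; this is why pipelining, unlike retiming, involves a change to the design itself.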

After accelerating data paths with Hyper-Retiming and Hyper-Pipelining, some designs are limited by control logic such as long feedback loops and state machines. To achieve higher performance, it is necessary to restructure these logic sections to use functionally equivalent feed-forward or pre-compute paths instead of long combinatorial feedback paths. This Hyper-Optimisation method requires a bit more effort, depending on the design; however, it results in average performance gains of 2x or more in the case of Altera’s Stratix 10 devices compared to previous generation high-performance FPGAs. In a conventional architecture, this process is called design optimisation. In the HyperFlex architecture, it is called Hyper-Optimisation because the Hyper-Registers apply the benefits of Hyper-Retiming and Hyper-Pipelining to the feed-forward or pre-compute paths.
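As a generic illustration of a pre-compute restructuring (not an example taken from Altera), consider a wrapping counter. In the first version the comparison and the wrap multiplexer sit in the same feedback path; in the second the comparison is evaluated one cycle early and held in its own register, so the two no longer share a register-to-register path. The Python model below, with hypothetical function names, checks that both versions produce the same sequence.

```python
def counter_feedback(limit: int, cycles: int) -> list:
    """Comparator and wrap mux are evaluated back to back in the feedback loop."""
    count, out = 0, []
    for _ in range(cycles):
        out.append(count)
        wrap = (count == limit - 1)          # compare ...
        count = 0 if wrap else count + 1     # ... then select, same cycle
    return out

def counter_precompute(limit: int, cycles: int) -> list:
    """The wrap decision is computed a cycle ahead and stored in a register."""
    if limit < 2:
        raise ValueError("this sketch assumes limit >= 2")
    count, wrap, out = 0, False, []
    for _ in range(cycles):
        out.append(count)
        # Both registers update from the values held at the start of the cycle.
        next_count = 0 if wrap else count + 1
        next_wrap = (count == limit - 2)     # true one cycle before the wrap
        count, wrap = next_count, next_wrap
    return out

for limit in (2, 3, 5, 8):
    assert counter_feedback(limit, 20) == counter_precompute(limit, 20)
print(counter_precompute(3, 7))
```

The restructured version needs one extra register and a slightly different comparison constant, which reflects the additional design effort this step requires.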

To take full advantage of the performance improvements that the HyperFlex architecture and Hyper-Register innovation enable, it is essential to use a fully optimised tool chain. Altera has developed a powerful set of new tools integrated into its Quartus II design software. This Hyper-Aware software helps system designers take full advantage of the HyperFlex architecture and maximise the developer’s design productivity.


Figure 3. The Hyper-Aware Design Flow.

A new tool, “Fast Forward Compile”, guides the user through the performance optimisation process by identifying performance-limiting areas of the design, identifying where and how many pipeline stages could be used to boost performance, and highlighting critical control-path bottlenecks (such as long feedback loops). The tool also allows designers to better predict the performance of their new design compared with previous device generations.

The Hyper-Retimer step occurs near the end of design compilation. It performs post place-and-route performance optimisation using the Hyper-Registers for optimal fine-grained Hyper-Retiming. This step also allows the user to implement Hyper-Pipelining much more easily than conventional pipelining. The Fast Forward Compile report identifies which clock domains can benefit from pipeline stages and how many pipeline stages are needed. After the designer modifies the RTL and places the prescribed number of pipeline stages at the boundaries of each clock domain, the Hyper-Retimer automatically places the registers within the clock domain at the optimal locations to maximise the performance. This auto-placement along with the Fast Forward Compile report makes pipelining easier than ever.

The combination of a new FPGA architecture and the 14 nm Tri-Gate process has resulted in significant power and performance improvements compared to previous device generations. These improvements help system developers deliver products that meet the technology goals today’s markets demand.

