Using embedded FPGA as a reconfigurable accelerator

By Geoff Tate

CEO, Co-Founder & Board Member

Flex Logix Inc.


Embedded FPGA is a fairly new technology, but it is quickly finding a home in a range of applications. One use growing in popularity? Connecting it to a processor bus as a reconfigurable accelerator.

While embedded FPGA (eFPGA) is a fairly new technology, it is quickly finding a home in a wide range of applications. One increasingly popular use is to connect it to a processor bus as a reconfigurable accelerator. Chip designers are finding that this approach offers more flexibility than fixed-function accelerators and can deliver significantly higher performance than mainstream processors.

One of the key benefits of eFPGA is that it can be reprogrammed to accelerate multiple tasks. Every chip has one or more processors (ARM, ARC, MIPS, etc.) executing code. For tasks that occupy much of the processor bandwidth, a hardware accelerator can often process the task in much less time. The accelerator doesn't replace the processor; rather, it takes over the most work-intensive task. If the accelerator is reconfigurable, then more than one task can be accelerated, allowing it to handle more workload as required or as different customers and applications demand. Because eFPGA is reconfigurable, it is well suited to this role.
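The point that the accelerator speeds up only the most work-intensive task, while the processor keeps running everything else, can be made concrete with Amdahl's law. The sketch below is illustrative only: the function name and the workload fractions are assumptions, not figures from this article.

```python
def overall_speedup(accelerated_fraction: float, task_speedup: float) -> float:
    """Amdahl's law: system-level speedup when only part of the workload is accelerated.

    accelerated_fraction: share of total run time spent in the offloaded task (0..1).
    task_speedup: how many times faster the accelerator runs that task.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if task_speedup <= 0.0:
        raise ValueError("task_speedup must be positive")
    remaining = 1.0 - accelerated_fraction          # work still done by the processor
    offloaded = accelerated_fraction / task_speedup  # work done by the accelerator
    return 1.0 / (remaining + offloaded)


if __name__ == "__main__":
    # Hypothetical workload fractions, with a 40x task-level speedup.
    for fraction in (0.5, 0.8, 0.95):
        print(f"fraction={fraction:.2f} -> overall {overall_speedup(fraction, 40.0):.2f}x")
```

The larger the share of processor time a task consumes, the closer the system-level gain gets to the task-level gain, which is why the most work-intensive task is the natural one to offload.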

Below are several examples of how eFPGA can be used as an accelerator on an AXI/AHB bus, along with the potential performance improvement in each use case. eFPGA can accelerate many types of processors, not just ARM. We use ARM as the example because it is the most widely used processor today and its performance is easy to verify.

AES-128

In the example below, the AXI4-Stream bus for data movement and the APB bus for control logic are implemented in the embedded FPGA. Since this interface functionality won’t change, it can also be hardwired externally.

The RTL for this AES-128 accelerator requires 1,142 LUTs and fits in a single EFLX-2.5K IP core, which is available in multiple process nodes. In TSMC16FFC, the AES-128 accelerator runs at a worst-case frequency of 374 MHz (-40/125C, 0.72Vjunction, Slow-Slow corner).

This performance is 136 to 300 times faster than AES-128 software code running on an ARM Cortex-M4 in the same process, depending on the clock speed assumed for the Cortex-M4.
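Why the speedup is quoted as a range can be sketched with simple throughput arithmetic: for a fixed software cycle count, the speedup is inversely proportional to the processor clock. In the sketch below the function names and the cycle counts per block are hypothetical placeholders; the article does not state cycle counts or the Cortex-M4 clock speeds it assumed. Only the 374 MHz accelerator clock and the 136 and 300 figures come from the text.

```python
def speedup(accel_mhz: float, accel_cycles_per_block: float,
            cpu_mhz: float, cpu_cycles_per_block: float) -> float:
    """Ratio of accelerator throughput to software throughput (blocks per microsecond)."""
    accel_rate = accel_mhz / accel_cycles_per_block
    cpu_rate = cpu_mhz / cpu_cycles_per_block
    return accel_rate / cpu_rate


def implied_clock_ratio(low_speedup: float, high_speedup: float) -> float:
    """Ratio between the fastest and slowest CPU clock assumptions behind a speedup range."""
    return high_speedup / low_speedup


if __name__ == "__main__":
    # Hypothetical cycle counts, chosen only to show the trend.
    for cpu_mhz in (100.0, 200.0):
        print(f"CPU at {cpu_mhz:.0f} MHz -> {speedup(374.0, 10, cpu_mhz, 1000):.0f}x")
    print(f"clock ratio implied by 136x..300x: {implied_clock_ratio(136, 300):.2f}")
```

Doubling the assumed processor clock halves the speedup, so a 136 to 300 range corresponds to clock assumptions that differ by a factor of about 2.2.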

SHA-256

In the below example, the AXI4 slave RTL is external to the eFPGA and is used both for accelerator data movement and configuration of the accelerator registers. The AXI4 slave logic is external for lowest bus latency for data movement.

The RTL for this SHA-256 accelerator, operating on 64-byte data blocks, requires 1,634 LUTs and fits in a single EFLX-2.5K IP core, which is available in multiple process nodes. In TSMC16FFC, the SHA-256 accelerator runs at a worst-case frequency of 171 MHz (-40/125C, 0.72Vjunction, Slow-Slow corner).

This performance is approximately 40 times faster than SHA-256 software code running on an ARM Cortex-M4 in the same process.
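The 64-byte data blocks mentioned above come from the SHA-256 algorithm itself, which pads every message to a multiple of 512 bits before hashing. The following sketch shows that standard padding rule and how many blocks a message of a given length occupies. The function names are illustrative, and nothing here describes how the accelerator RTL is actually organized.

```python
BLOCK_BYTES = 64  # SHA-256 processes the message in 512-bit (64-byte) blocks


def sha256_pad(message: bytes) -> bytes:
    """Apply standard SHA-256 padding: 0x80, zero fill, then the 64-bit big-endian bit length."""
    bit_length = len(message) * 8
    padded = message + b"\x80"
    # Zero-fill so that, after the 8-byte length field, the total is a multiple of 64.
    padded += b"\x00" * ((-len(padded) - 8) % BLOCK_BYTES)
    padded += bit_length.to_bytes(8, "big")
    return padded


def sha256_block_count(message_len: int) -> int:
    """Number of 64-byte blocks to process for a message of message_len bytes."""
    # 1 byte for 0x80 plus 8 bytes for the length, rounded up to a whole block.
    return (message_len + 1 + 8 + BLOCK_BYTES - 1) // BLOCK_BYTES


if __name__ == "__main__":
    for n in (0, 55, 56, 64, 1000):
        print(f"{n} message bytes -> {sha256_block_count(n)} block(s)")
```

A message of up to 55 bytes fits in one block; at 56 bytes the padding spills into a second block, so short messages always cost at least one full block of work.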

JPEG Encoder

Below is a block diagram of an eFPGA configured as a JPEG encoder. In this example, the AXI4-stream and APB interface logic are shown implemented in the embedded FPGA itself, but this RTL can easily be put outside and hardwired as it won’t need to be reconfigured.

This RTL requires 11,364 LUTs and a significant amount of memory (2 x 256 Kbyte dual-port RAMs), which needs to be attached to the eFPGA. The number of signals required to attach the memory is very small compared to the I/Os available.

In TSMC16FFC (worst-case conditions), the encoder runs at 149 MHz. This is approximately 31 times the throughput of JPEG encoder software code running on an ARM Cortex-M4 in the same process.
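The 11,364-LUT figure is larger than the AES-128 and SHA-256 designs, each of which fits in a single EFLX-2.5K core. A rough lower bound on the number of cores a design needs can be sketched as follows. The capacity of 2,500 LUTs per core is an assumption inferred from the product name rather than a figure given in this article, and the utilization parameter is hypothetical; a real place-and-route run may need more headroom.

```python
import math

# Assumption: nominal capacity inferred from the "2.5K" in the core name, not from the article.
ASSUMED_LUTS_PER_CORE = 2500


def min_cores(design_luts: int,
              luts_per_core: int = ASSUMED_LUTS_PER_CORE,
              max_utilization: float = 1.0) -> int:
    """Lower bound on cores needed, given a LUT count and a target utilization (0..1]."""
    if design_luts < 0 or luts_per_core <= 0 or not 0.0 < max_utilization <= 1.0:
        raise ValueError("invalid LUT count, core capacity, or utilization")
    usable_luts = luts_per_core * max_utilization
    return math.ceil(design_luts / usable_luts)


if __name__ == "__main__":
    # LUT counts quoted in the article for each example design.
    for name, luts in (("AES-128", 1142), ("SHA-256", 1634),
                       ("JPEG encoder", 11364), ("256-point FFT", 8360)):
        print(f"{name}: {luts} LUTs -> at least {min_cores(luts)} core(s)")
```

Under that assumed capacity, the first two designs need one core each, which is consistent with the text, while the larger designs would need an array of several cores.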

256-Point FFT

The below example shows eFPGA configured as a 256-Point FFT accelerator as a Slave/Master on an AXI4-Stream Bus, with the AXI RTL implemented in the EFLX array.

The RTL for this requires 8,360 LUTs and 16 external RAMs (256 words each, dual-port). In this example, the RAM is attached inside the array for greater performance.

The worst-case performance for TSMC16FFC is 303 MHz. A benchmark versus an ARM processor is not available, but given the high degree of parallelism in the MACs and memory references, we expect the performance of this reconfigurable accelerator to be much higher than that of a typical processor core.
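The parallelism referred to above can be illustrated by counting the work in a 256-point FFT. The sketch below assumes a textbook radix-2 decomposition, which the article does not confirm is the architecture used, and it counts every twiddle multiplication as a full complex multiply. The function name is illustrative.

```python
def radix2_fft_work(n: int) -> dict:
    """Operation counts for a textbook radix-2 FFT of size n (n must be a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    stages = n.bit_length() - 1        # log2(n)
    butterflies_per_stage = n // 2     # independent of each other within a stage
    total = stages * butterflies_per_stage
    return {
        "stages": stages,
        "butterflies_per_stage": butterflies_per_stage,
        "total_butterflies": total,
        # One complex multiply = 4 real multiplies + 2 real adds;
        # the butterfly's complex add and subtract contribute 4 more real adds.
        "real_multiplies": 4 * total,
        "real_adds": 6 * total,
    }


if __name__ == "__main__":
    for key, value in radix2_fft_work(256).items():
        print(f"{key}: {value}")
```

Because the butterflies within a stage do not depend on one another, hardware can evaluate many of them in the same clock cycle, each with its own multipliers and memory ports, whereas a single processor core works through them largely one after another.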

Conclusion

Using eFPGA as a reconfigurable accelerator delivers significant benefits: in the examples above, performance is roughly 30 to 300 times that of software running on an ARM Cortex-M4. Its ability to be reconfigured enables eFPGA to accelerate more than one function and allows it to be upgraded at any time in response to changing standards or protocols. These advantages are extremely beneficial in processor-intensive applications such as data centers, networking, deep learning, artificial intelligence, aerospace, defense, and more.

Geoff Tate is the CEO of Flex Logix.