
FPGA co-processor based algorithms and bus connections

Area, power, and cost constraints often prevent today's design engineers from using GHz-class processors in embedded designs. In embedded systems, a relatively small number of algorithms usually account for most of the computing load. With design automation tools, these algorithms can be quickly converted into a hardware coprocessor, which can then be efficiently connected to the processor to deliver "GHz-class" performance.

  In this paper, we focus on code acceleration and on converting code to hardware coprocessors. We also examine the design trade-offs involved, using benchmark data from a real image rendering example built around the auxiliary processor unit (APU). The design uses an embedded PowerPC implemented in a platform FPGA.

What Is a Coprocessor?

  A coprocessor is a processing unit that is used in conjunction with a main processing unit to perform operations that would normally be performed by the main processing unit. Typically, coprocessor functions are implemented in hardware to replace several software instructions. Code acceleration is achieved by reducing multiple code instructions to a single instruction, and by implementing the instructions directly in hardware.

  The most commonly used coprocessor is the floating point unit (FPU), which is the only common coprocessor that is tightly integrated with the CPU. There is no common coprocessor library, and even if such a library existed, it would still be difficult to interface a coprocessor to a CPU such as the Pentium 4. Xilinx Virtex-4 FX FPGAs have one or two PowerPCs, each with an APU interface. With a processor embedded in the FPGA, it becomes possible to implement a complete processing system on a single chip.

  The PowerPCs with APU interfaces enable the implementation of a tightly coupled coprocessor in the FPGA. Frequency requirements and pin count limitations make an external coprocessor impractical. Instead, a dedicated, application-specific coprocessor can be created that connects directly to the PowerPC, greatly increasing software speed. Because the FPGA is programmable, you can quickly develop and test a coprocessor solution connected to the CPU.

Co-processor Connection Model

  There are three basic forms of coprocessors: those connected to the CPU bus, those connected to I/O, and those connected to the instruction pipeline. Hybrids of these forms also exist.

1. CPU Bus Connection

  Accelerators connected to the processor bus require the CPU to move data across the bus as well as send commands. Often, a single data transaction takes many processor clock cycles. Because of bus arbitration, and because the bus clock is a divided-down version of the processor clock, data processing is slowed down. A bus-connected accelerator can contain a direct memory access (DMA) engine. With some additional logic, the DMA engine allows the coprocessor to work on blocks of data located in bus-connected memory, independent of the CPU.

2. I/O Connection

  An I/O-connected accelerator is attached directly to a dedicated I/O port. Data and control are usually provided via GET or PUT functions. Because there is no arbitration, control is less complex, and fewer devices are connected, these interfaces are typically clocked faster than the processor bus. A good example of such an interface is the Xilinx Fast Simplex Link (FSL), a simple FIFO interface to a Xilinx MicroBlaze soft-core processor or Virtex-4 FX PowerPC. Data moved through the FSL has lower latency and a higher data rate than data moved over the processor bus interface.
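  The GET/PUT style of access described above can be modeled in plain C. The following is a minimal software sketch of a one-way FIFO link, assuming a fixed depth of 16 words. The names `fifo_link_t`, `link_init`, `link_put`, and `link_get` are hypothetical and are not the Xilinx FSL API; the sketch only illustrates the non-blocking queue semantics that such an interface typically exposes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINK_DEPTH 16u  /* assumed FIFO depth; real links vary */

/* Hypothetical software model of a one-way (simplex) FIFO link. */
typedef struct {
    uint32_t data[LINK_DEPTH];
    unsigned head;   /* index of the next word to read   */
    unsigned count;  /* number of words currently queued */
} fifo_link_t;

void link_init(fifo_link_t *l)
{
    assert(l != NULL);
    l->head = 0;
    l->count = 0;
}

/* Non-blocking PUT: returns false if the FIFO is full. */
bool link_put(fifo_link_t *l, uint32_t word)
{
    assert(l != NULL);
    if (l->count == LINK_DEPTH) {
        return false;
    }
    l->data[(l->head + l->count) % LINK_DEPTH] = word;
    l->count++;
    return true;
}

/* Non-blocking GET: returns false if the FIFO is empty. */
bool link_get(fifo_link_t *l, uint32_t *word)
{
    assert(l != NULL && word != NULL);
    if (l->count == 0) {
        return false;
    }
    *word = l->data[l->head];
    l->head = (l->head + 1) % LINK_DEPTH;
    l->count--;
    return true;
}
```

  In this model the processor side would call `link_put` to send operands and a second link instance would carry results back through `link_get`. There is no address decoding or arbitration in the path, which is the property the text credits for the lower latency.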

3. Instruction Pipeline Connection

  An instruction pipeline-connected accelerator attaches directly to the CPU's compute core. Because it is connected to the instruction pipeline, instructions that the CPU does not recognize can be executed by the coprocessor. Operands, results, and status are passed directly to and from the data execution pipeline. A single operation can process two operands and return both a result and a status.

  As a directly connected interface, an accelerator attached to the instruction pipeline can be clocked faster than the processor bus. Xilinx implements this coprocessor connection model through the APU interface, which for a typical two-operand instruction can reduce the clock cycles spent on control and data transfer by a factor of 10. The APU controller is also connected to the data cache controller, through which data load/store operations are performed. As a result, the APU interface can move hundreds of megabytes per second, approaching DMA speeds.

  I/O-connected or instruction pipeline-connected accelerators can be combined with bus-connected accelerators. With some additional logic, it is possible to create an accelerator that operates on a data block located in bus-connected memory while receiving commands and returning status through a fast, low-latency interface.

  The C-HDL toolset described in this document supports bus-connected and I/O-connected accelerators, as well as accelerators that connect to the APU interface of a PowerPC. Although the APU connection is instruction pipeline-based, the C-HDL toolset implements an I/O pipeline interface over it, with the performance typical of an I/O-connected accelerator.

FPGA/PowerPC/APU Interface

  FPGAs allow hardware design engineers to implement a complete computing system, with processors, decode logic, peripherals, and coprocessors, on a single chip. FPGAs contain thousands to hundreds of thousands of logic cells, from which a soft processor such as the Xilinx PicoBlaze or MicroBlaze can be built; some devices also include one or more hard processor cores, such as the PowerPC in the Virtex-4 FX. The large number of logic cells allows you to implement data processing units that work alongside the processor system and are controlled or monitored by the processor.

  Because the FPGA is reprogrammable, you can program and test it during the design process. If you find a design flaw, you can reprogram the device immediately. The FPGA also allows you to implement hardware computing functions that were previously too costly to implement. The tight integration between the CPU pipeline and the FPGA logic makes it possible to create high-performance software accelerators.

  The block diagram in Figure 1 shows the PowerPC, the integrated APU controller, and a coprocessor connected to it. Instructions from cache or memory are presented to both the CPU decoder and the APU controller. If the CPU recognizes an instruction, it executes it. Otherwise, the APU controller or a user-created coprocessor can respond to the instruction and execute it. One or two operands are passed to the coprocessor and a result or status is returned. The APU interface also supports sending a data unit with an instruction. The size of the data unit can range from one byte to four 32-bit words.

Figure 1: PowerPC, integrated APU controller and co-processors

  One or more coprocessors can be connected to the APU interface through a fabric coprocessor bus (FCB). Coprocessors connected to the bus range from existing cores (e.g., FPUs) to user-created coprocessors. A coprocessor can be connected to the FCB for control and status operations, and to a processor bus for direct access to blocks of memory and for DMA data transfer. A simplified connection scheme, such as FSL, can also be used between the FCB and the coprocessor to enable FIFO data and control communication, at the expense of some performance.

  To demonstrate the performance benefits of an instruction pipeline-connected accelerator, we first implemented a design using an FPU connected to the processor bus, and then one using an FPU connected through the APU/FCB. Table 1 summarizes the performance of a finite impulse response (FIR) filter for both implementations. As Table 1 shows, the instruction pipeline-connected FPU runs the filter about 30 times faster than software floating point, and the APU interface is nearly four times faster than the bus-connected FPU.

 Table 1: Unaccelerated vs. accelerated floating point performance
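  The source of the FIR benchmark is not reproduced in this article. As an illustration of the kind of floating-point kernel such a benchmark exercises, the following is a minimal direct-form FIR filter in C. The function name `fir_filter` and its signature are assumptions made for this sketch, not the code that was actually measured.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Direct-form FIR filter (illustrative sketch):
 *   y[i] = sum over k of h[k] * x[i - k],  k = 0 .. taps-1
 * Input samples before x[0] are treated as zero.
 */
void fir_filter(const float *x, float *y, size_t n,
                const float *h, size_t taps)
{
    assert(x != NULL && y != NULL && h != NULL);
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps && k <= i; k++) {
            acc += h[k] * x[i - k];  /* one floating-point multiply-accumulate per tap */
        }
        y[i] = acc;
    }
}
```

  Every tap costs one floating-point multiply and one add, so on a processor without an FPU each iteration of the inner loop turns into a software library call. That is why the way the FPU is connected dominates the run time of this kind of loop.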

C code to HDL conversion

  Converting C code to an HDL accelerator using a C to HDL conversion tool is an efficient way to create a hardware coprocessor. The process of C to HDL conversion is summarized in Figure 2 and in the steps detailed below.

Figure 2: C-HDL design flow

  1. Implement the application or algorithm using standard C tools. Develop a software test bench (test vectors) for baseline performance and correctness testing on a host or desktop. Use a profiler (e.g. gprof) to begin identifying critical functions.

  2. Determine whether floating-point to fixed-point conversion is appropriate. Use a library or macros to assist with this conversion, and use the baseline test vectors to analyze performance and accuracy. Use the profiler to re-evaluate the critical functions.

  3. Use a C to HDL conversion tool (e.g. Impulse C), iterating on each critical function, to: partition the algorithm into parallel processes; create hardware/software process interfaces (streams, shared memory, signals); automatically optimize and parallelize critical code segments (e.g. inner loops); and test and verify the resulting parallel algorithm using desktop simulation, cycle-accurate C simulation, and actual in-system testing.

  4. Use the C to HDL conversion tool to convert the critical code segments to HDL coprocessors.

  5. Connect the coprocessor to the APU interface for final testing.
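  The floating-point to fixed-point step in the flow above is often handled with a small set of helper functions or macros. The following is a minimal sketch assuming a signed Q16.16 format (16 integer bits, 16 fractional bits). The type `q16_16_t` and the helper names are hypothetical and do not come from any particular library; a real design would choose the format from the accuracy analysis against the baseline test vectors.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t q16_16_t;                 /* hypothetical Q16.16 fixed-point type */

#define Q_FRAC_BITS 16
#define Q_ONE       ((q16_16_t)1 << Q_FRAC_BITS)   /* fixed-point value of 1.0 */

/* Convert a double to Q16.16, rounding to the nearest representable value. */
q16_16_t q_from_double(double v)
{
    assert(v > -32768.0 && v < 32767.0);  /* kept inside the representable range */
    return (q16_16_t)(v * Q_ONE + (v >= 0.0 ? 0.5 : -0.5));
}

/* Convert Q16.16 back to double, e.g. to compare against the baseline. */
double q_to_double(q16_16_t q)
{
    return (double)q / Q_ONE;
}

/* Multiply two Q16.16 values using a 64-bit intermediate product. */
q16_16_t q_mul(q16_16_t a, q16_16_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;   /* Q32.32 intermediate */
    return (q16_16_t)(wide / Q_ONE);          /* rescale, truncating toward zero */
}
```

  Running the baseline test vectors through both the floating-point and the fixed-point version, and comparing the results with `q_to_double`, gives the accuracy figure that decides whether the conversion is acceptable. Note that `q_mul` does not guard against products that overflow the 32-bit result.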

Impulse C: C to HDL Conversion Tool

  Impulse C, shown in Figure 3, enables embedded system design engineers to create highly parallel, FPGA-accelerated applications by combining C-compatible library functions with the Impulse CoDeveloper C-to-hardware compiler. Impulse C provides automatic optimization of C code (e.g., loop pipelining, loop unrolling, and operator scheduling) and interactive tools that allow you to analyze hardware behavior cycle by cycle.

Figure 3: Impulse C

  Impulse C is designed for data-flow oriented applications, but it is also flexible enough to support other programming models, including the use of shared memory. This is important because different FPGA-based applications have different performance and data requirements. In some applications, it makes sense to transfer data between the embedded processor and the FPGA via block memory reads and writes; in others, a streaming data communication channel may provide higher performance. The ability to quickly model, compile, and evaluate alternative algorithms is important for achieving the best results for a given application.

  The Impulse C library adds only minimal extensions to the C language, in the form of new data types and predefined function calls. Using Impulse C function calls, you can define multiple parallel program segments (called processes) and describe their interconnection using streams, signals, and other mechanisms. The Impulse C compiler converts and optimizes these C processes either into lower-level HDL that can be synthesized into FPGAs, or into standard C (with associated library calls) that can be compiled for supported microprocessors using widely available C cross-compilers.

  The complete CoDeveloper development environment includes desktop emulation libraries compatible with standard C compilers and debuggers, including Microsoft's Visual Studio and GCC/GDB. Using these libraries, Impulse C programmers can compile and execute their applications for algorithm verification and debugging. They can also examine parallel processes, analyze data movement, and resolve process-to-process communication problems using the CoDeveloper Application Monitor.

  The output of compiling an Impulse C application is a set of hardware and software source files that serve as input to the FPGA synthesis tools. These files include:

  Automatically generated HDL files that describe the compiled hardware processes.

  Automatically generated HDL files that describe the stream, signal, and memory components required to connect the hardware processes to the system bus.

  Automatically generated software components (including runtime libraries) that establish the software side of any hardware/software stream connections.

  Additional files, including script files, for importing the generated application into the target FPGA place-and-route environment.

  The result of this compilation process is a complete application, including the required hardware/software interfaces, ready for implementation on an FPGA-based programmable platform.

Design Example

  The Mandelbrot image shown in Figure 4 is a classic fractal that is widely used in science and engineering to model chaotic phenomena, such as weather. Mandelbrot images are described as self-similar: zooming in on part of the image yields another image that resembles the whole.

Figure 4: Mandelbrot image

  The Mandelbrot application is ideal for hardware/software co-design because it has a single computationally intensive function. Moving this critical function into hardware makes it much faster and greatly increases the speed of the overall system. The Mandelbrot application also divides cleanly into hardware and software processes, which makes it easy to implement using C-HDL tools.
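  The single critical function in a Mandelbrot renderer is usually the per-pixel escape-time loop. The following is a minimal floating-point sketch of that loop. It is not the code from the design example; the function name `mandelbrot_iterations` and the escape test (squared magnitude greater than 4) are standard textbook choices assumed here for illustration.

```c
#include <assert.h>

/*
 * Escape-time count for the point c = cr + i*ci (illustrative sketch).
 * Iterates z = z*z + c from z = 0 and returns the number of iterations
 * performed before |z|^2 exceeds 4, or max_iter if it never does.
 */
int mandelbrot_iterations(double cr, double ci, int max_iter)
{
    assert(max_iter > 0);
    double zr = 0.0;
    double zi = 0.0;
    int iter = 0;

    while (iter < max_iter && (zr * zr + zi * zi) <= 4.0) {
        double next_zr = zr * zr - zi * zi + cr;  /* real part of z*z + c */
        zi = 2.0 * zr * zi + ci;                  /* imaginary part of z*z + c */
        zr = next_zr;
        iter++;
    }
    return iter;
}
```

  A software process would call this once per pixel and map the returned count to a color. Because every pixel is independent and the loop body is a handful of multiplies and adds, it is a natural candidate for the loop pipelining and fixed-point conversion described earlier.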

  This paper uses the CoDeveloper toolset as the C-HDL toolset for this design example. Only the software-only Mandelbrot C program had to be modified to make it compatible with the C-HDL tools. The changes were: splitting the software project into separate processes (independent units of sequential execution); converting the function interfaces between hardware and software into streams; and adding compiler directives to optimize the resulting hardware. We then used the CoDeveloper toolset to create the Pcore coprocessor, which we imported into Xilinx Platform Studio (XPS). Using XPS, we connected the Pcore to the PowerPC APU controller interface and tested the system.

  A full description of the design, along with the design files, is provided in Xilinx application note XAPP901, which is available for download. In addition, User Guide UG096 provides a step-by-step guide to implementing the design example.

  Performance improvements were measured for the Mandelbrot image texturing problem, an image filtering application, and triple DES encryption. The speedups range from 11x to 34x and are summarized in Table 2.

 Table 2: Comparison of algorithm acceleration achieved with coprocessor accelerators

Summary

  Constrained by power, size, and cost, you may have to choose a processor that is less than ideal, and often the chosen processor delivers less performance than you want. Coprocessor code accelerators become an attractive solution when software code does not run fast enough. You can design accelerators manually in HDL or use C-HDL tools to automatically convert C code to HDL.

  Using a C-HDL tool like Impulse C makes creating the accelerator faster and simpler. Virtex-4 FX FPGAs have one or two embedded PowerPCs that enable a tight connection between the processor instruction pipeline and the software accelerator. As shown above, critical software routines run roughly 10 to 30 times faster, allowing the 300 MHz PowerPC to provide performance equal to or better than that of a high-performance GHz-class processor. Each of the examples above took only a few days to produce, demonstrating the rapid design, implementation, and testing possible with the C-HDL flow.
