
RAM base block size based on the FPGA underlay

Use of Storage Resources in Design

Different users need different RAM capacities for their applications, so the size of the base RAM block at the bottom of the FPGA fabric is an interesting topic. If the base block is too large, it is not flexible enough for small-capacity applications. It is of course possible to implement a small memory directly in a large-capacity RAM, but this inevitably wastes a large amount of resources. If the base block is too small, medium and large RAM applications must be assembled from a large number of small RAMs, and because every small RAM carries the same signal interface for generality, this greatly consumes the FPGA's internal routing resources and can even cause routing congestion and timing problems.

Almost all designs built in FPGAs require internal memory resources of some size to store coefficients, buffer data, and serve a variety of other purposes. A typical system requires a combination of small, medium, and large memory arrays to meet all of its requirements, and the overall power consumption of the memory is therefore a major design concern.

When designing FPGAs, it is important to create devices that meet most customers' requirements. If an FPGA is built with the mix of small, medium, and large memory resources suited to one particular application, the solution will be optimal for some customers, while others who want to use the same part may have to make considerable trade-offs.

Users trying to get the best value out of their FPGAs may be concerned about resources wasted in high-capacity RAM blocks. However, building finer-grained, smaller RAM sub-blocks requires additional connectivity, which comes at a cost. This article explains the trade-off: why finer RAM base blocks typically cost more.

Figure 1 shows the theoretical distribution of small, medium, and large memory blocks in FPGAs (not drawn to any particular scale).

A design that needs exactly this combination of blocks makes perfect use of the available resources (see Figure 2).

However, imagine a scenario in which the user instead needs four more medium-sized memories.

One approach would be to build each medium-sized memory array from a large number of small blocks, which consumes many resources and adds the complexity of stitching them together. Another option is to use a large block as a medium one, leaving the remaining capacity of the large block unusable while it stays powered on and therefore consumes power (see Figure 3).

The challenge for FPGA manufacturers is to build devices with the most flexible mix of memory resources, allowing every user to fit the memory arrays they need into the device and achieve the required performance without wasting significant resources and power.

Memory Resources in FPGAs

This section covers BRAM resources; more details can be found in "Learning FPGAs from the Underlying Architecture: Block RAM (BRAM)".

Xilinx FPGAs use a variety of storage resources to provide the best combination of flexibility and low cost. All 7 Series FPGAs, including the Artix-7, Kintex-7, and Virtex-7 families, use the same memory blocks, allowing seamless migration from one 7 Series FPGA family to another.

Obviously, building memory resources that meet every user's needs is a daunting challenge. The solution implemented in Xilinx 7 Series FPGAs is to create base blocks called block RAMs (see Figure 4), which can be combined to form larger arrays or split to form smaller ones. Together with the ability to use the 6-input look-up tables (LUTs) in the FPGA logic as small memory arrays, this gives the user the most flexible resources for creating memory arrays of various sizes.

Block RAM

Each 7 Series FPGA has 135 to 1880 dual-port block RAMs, each capable of storing 36 Kb of data, with 32 Kb allocated to data storage and, in some memory configurations, an additional 4 Kb allocated to parity bits. Each Block RAM has two completely independent ports that share only the stored data.

Each port can be configured as:

32K x 1
16K x 2
8K x 4
4K x 9 (or 8)
2K x 18 (or 16)
1K x 36 (or 32)
512 x 72 (or 64)
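As a quick arithmetic check of this list, the following Python sketch (illustrative only, not Xilinx tooling) confirms that every configuration stores the same 32 Kb of data, with the x9, x18, x36, and x72 widths adding one parity bit per data byte, for 4 Kb of parity:

```python
# (depth, total port width, data width) for each 36 Kb port configuration
configs = [
    (32 * 1024, 1, 1),
    (16 * 1024, 2, 2),
    (8 * 1024, 4, 4),
    (4 * 1024, 9, 8),
    (2 * 1024, 18, 16),
    (1024, 36, 32),
    (512, 72, 64),
]

for depth, total_width, data_width in configs:
    data_bits = depth * data_width                    # usable data capacity
    parity_bits = depth * (total_width - data_width)  # 1 parity bit per byte
    assert data_bits == 32 * 1024                     # 32 Kb of data in every mode
    assert parity_bits in (0, 4 * 1024)               # 4 Kb of parity when present
    print(f"{depth:>6} x {total_width:<2}: {data_bits} data + {parity_bits} parity bits")
```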

Each block RAM can be divided into two completely independent 18 Kb block RAMs, each of which can be configured in any aspect ratio from 16K x 1 to 512 x 36. When a 36 Kb block RAM is divided in this way, each of the two independent halves behaves exactly like a 36 Kb block RAM at half the size.

Conversely, if the user requires a larger memory array, two adjacent 36 Kb block RAMs can be configured as a cascaded 64K x 1 dual-port RAM without any additional logic or resources. The block RAM components in Xilinx 7 Series FPGAs can be configured in single-port, simple dual-port, or true dual-port mode. In addition, data can be read from block RAM in one of three modes: READ_FIRST, WRITE_FIRST, or NO_CHANGE.

Splitting Block RAM

If the user needs only single-port memories rather than full true dual-port capability, a block RAM can be split into smaller memory arrays. With the block RAM in true dual-port mode (the default), connecting the most significant bit of the ADDRA address bus to VCC (high) creates two single-port block RAMs, and Port A and Port B can then be implemented as separate, independent delay-line memories, single-port memories, or ROMs.

For example, a RAMB36E1, the 36K block RAM primitive, can be divided into two 18K single-port memories, and a RAMB18E1 into two 9K single-port memories. Using this approach, four delay-line memories can be created per block RAM array in a 7 Series FPGA. To implement delay lines in block RAM in this manner, READ_FIRST (read-before-write) mode must be used.
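A small Python model may make the split concrete. This is an illustration rather than a hardware simulation, and the assumption that port B is confined to the opposite half of the array (mirroring the VCC tie-off on port A) is ours:

```python
class SplitBlockRam:
    """One physical array behaving as two independent single-port memories."""

    def __init__(self, depth=32 * 1024):
        self.mem = [0] * depth
        self.half = depth // 2

    def addr_a(self, addr):
        return self.half + (addr % self.half)  # MSB of ADDRA tied to VCC: upper half

    def addr_b(self, addr):
        return addr % self.half                # port B confined to the lower half

    def write_a(self, addr, data):
        self.mem[self.addr_a(addr)] = data

    def read_b(self, addr):
        return self.mem[self.addr_b(addr)]

ram = SplitBlockRam()
ram.write_a(0, 0xAB)          # port A writes into its own half of the array
assert ram.read_b(0) == 0     # port B's half is completely unaffected
```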

If a single-port memory is implemented, there is no such restriction on the allowed modes: any supported mode (READ_FIRST, WRITE_FIRST, or NO_CHANGE) can be used, and the different memories within one block RAM can have different port widths. The resulting memory performs one operation per port per clock cycle, so each block RAM performs four operations per clock cycle.

Synchronous Operations

Every access to the memory, whether read or write, is controlled by the clock. All inputs (data, address, clock enable, and write enable) are registered; nothing happens without the clock. Input addresses are always clocked, and the read data is held until the next operation. An optional output data pipeline register allows higher clock rates at the cost of an additional cycle of latency. During a write operation, the data output can reflect the previously stored data, the newly written data, or remain unchanged.
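The three behaviors during a write correspond to the READ_FIRST, WRITE_FIRST, and NO_CHANGE modes introduced earlier. A minimal behavioral sketch (an illustration, not Xilinx's simulation model):

```python
def clocked_write(mem, addr, din, dout_prev, mode):
    """Perform a clocked write; return the port's new output value."""
    old = mem[addr]
    mem[addr] = din                  # the write itself always happens
    if mode == "WRITE_FIRST":
        return din                   # output shows the newly written data
    if mode == "READ_FIRST":
        return old                   # output shows the previously stored data
    if mode == "NO_CHANGE":
        return dout_prev             # output register holds its old value
    raise ValueError(mode)

mem = {0: 5}
assert clocked_write(mem, 0, 9, dout_prev=7, mode="READ_FIRST") == 5
mem = {0: 5}
assert clocked_write(mem, 0, 9, dout_prev=7, mode="WRITE_FIRST") == 9
mem = {0: 5}
assert clocked_write(mem, 0, 9, dout_prev=7, mode="NO_CHANGE") == 7
```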

Byte Wide Write Enable

The byte write enable feature of block RAM provides the ability to write eight bits (one byte) of input data at a time. A true dual-port RAM has up to four independent byte-wide write enable inputs, each associated with one byte of input data and one parity bit. This feature is useful when using block RAMs to interface with a microprocessor.
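For illustration, the following sketch applies four byte-wide write enables to a 32-bit word (the per-byte parity bits present in hardware are omitted here):

```python
def byte_write(word, din, we):
    """Update only the bytes of `word` whose enable bit in `we` is set."""
    for byte in range(4):
        if we & (1 << byte):
            mask = 0xFF << (8 * byte)
            word = (word & ~mask) | (din & mask)
    return word

# Writing with only the lowest byte enabled leaves the other three untouched.
assert byte_write(0x11223344, 0xAABBCCDD, we=0b0001) == 0x112233DD
```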

Error Detection and Correction

Each 64-bit-wide block RAM in Xilinx 7 Series FPGAs can generate, store, and use eight additional Hamming-code bits, performing single-bit error correction and double-bit error detection (ECC) during reads. The ECC logic can also be used when writing to or reading from external 64-/72-bit-wide memories. This applies to simple dual-port mode and does not support read-during-write.
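The count of eight ECC bits follows from the standard Hamming bound. As a back-of-the-envelope check (not the ECC implementation itself), r check bits protect m data bits when 2^r >= m + r + 1, and one extra overall parity bit upgrades single-error correction to double-error detection:

```python
m = 64                         # data bits per ECC word
r = 1
while 2**r < m + r + 1:        # Hamming bound for single-error correction
    r += 1
print(r, "Hamming bits + 1 parity bit =", r + 1, "ECC bits")  # 7 + 1 = 8
```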

FIFO Controller

The built-in FIFO controller in Xilinx 7 Series FPGAs supports single-clock (synchronous) or dual-clock (asynchronous or multi-rate) operation, increments the internal addresses, and provides four handshake flags: full, empty, almost full, and almost empty. The almost full and almost empty flags are freely programmable. As with block RAM, the FIFO width and depth are programmable, but the write and read ports always have the same width. First-word fall-through mode presents the first written word on the data output even before the first read operation. After the first word has been read, this mode is indistinguishable from standard mode.
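A behavioral sketch of the four flags, assuming a single-clock FIFO with counter-based occupancy (an illustration, not the hardware controller):

```python
class SyncFifo:
    def __init__(self, depth, almost_full_at, almost_empty_at):
        self.buf, self.depth = [], depth
        self.af, self.ae = almost_full_at, almost_empty_at  # programmable thresholds

    def flags(self):
        n = len(self.buf)
        return {"full": n == self.depth, "empty": n == 0,
                "almost_full": n >= self.af, "almost_empty": n <= self.ae}

    def write(self, data):
        if not self.flags()["full"]:
            self.buf.append(data)

    def read(self):
        if not self.flags()["empty"]:
            return self.buf.pop(0)

fifo = SyncFifo(depth=512, almost_full_at=500, almost_empty_at=12)
fifo.write(0x55)
print(fifo.flags())  # no longer empty, but still almost_empty
```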

Distributed Memory (Distributed RAM)

The logic of Xilinx 7 Series FPGAs consists of elements such as 6-input LUTs, which are arranged in groups of four to form a slice. 7 Series FPGAs have two types of slice: SLICEM and SLICEL. SLICEMs account for 25-50% of the total number of slices in a 7 Series FPGA, and their LUTs can be used as synchronous RAM resources called distributed RAM elements. Each 6-input LUT can be configured as one 64 x 1-bit RAM or two 32 x 1-bit RAMs. The 6-input LUTs within a SLICEM can be cascaded to form larger elements, up to 64 x 3 bits in a simple dual-port configuration or up to 256 x 1 bits in a single-port configuration. See Figure 5.

Distributed RAM modules are synchronous (write) resources. A synchronous read can be achieved using a storage element, or flip-flop, in the same slice. Adding this flip-flop improves distributed RAM performance by reducing the clock-to-output delay, although it adds one clock cycle of latency. The distributed elements share the same clock input. For a write operation, the write enable (WE) input, driven by the CE or WE pin of the SLICEM, must be high.

Utilization in a Typical User Design

Table 1 shows the memory resource mapping of a typical memory-intensive design on the Kintex-7 XC7K410T FPGA. The data is based on a real-world example from a user concerned about wasted bits.

Note how many different memories take advantage of the ability to configure block RAM resources in different sizes.

The Cost of Ever-Finer Base Blocks

Having established that the Xilinx FPGA architecture offers many different memory depth/width granularities, it is important to understand the trade-offs involved in adding even finer base blocks to the architecture. For example, if the 36K block RAM in a 7 Series FPGA could be divided not only into two 18K blocks but further into four 9K true dual-port blocks, there would be a cost. Doubling the number of unique memories in a single block RAM means that the maximum number of signals going to and from each block RAM must also double, which in turn requires doubling the number of interconnect resources (blocks) to route roughly 400 signals.

For example, an 8K block RAM can be configured 16 bits wide and 512 entries deep, requiring a total of 25 input signals: 16 data lines and 9 address lines. Now divide the 8K block RAM into two 4K block RAMs, each configured 16 bits wide and 256 entries deep. This configuration requires 16 data lines and 8 address lines per block, for a total of 24 input signals per 4K block RAM, or 48 signals in all: roughly twice the 25 input signals required by the single 8K block RAM. Doubling the number of interconnect blocks associated with each block RAM increases the silicon area by 25%, and every block pays this penalty regardless of the configuration used. Thus, the ability to configure each block RAM as four 9K blocks means that each block grows from four area cells to five. On the positive side, it allows smaller memories to take advantage of smaller block sizes (see Table 2).
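The signal-count arithmetic generalizes to any configuration: each port needs its data lines plus log2(depth) address lines (control signals are ignored here for simplicity). A quick check in Python:

```python
from math import log2

def input_signals(depth, width):
    return width + int(log2(depth))      # data lines + address lines

one_8k = input_signals(512, 16)          # 16 + 9 = 25 signals
two_4k = 2 * input_signals(256, 16)      # 2 * (16 + 8) = 48 signals
print(one_8k, two_4k)                    # splitting nearly doubles the count
```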

Referring to the same design example, if the smaller 512 x 18 memories could be packed efficiently into adjacent 9K blocks, the result would be lower resource consumption. However, every block would now be 25% larger, adding a significant area penalty. Therefore, even if the block RAM in a Kintex-7 FPGA could be divided into four separate memories and the user design could exploit this finer granularity, this typical design would still occupy more area than with the current 36K/18K configuration. Moreover, because of the routing congestion involved in connecting so many signals to one block, a design cannot in practice fully utilize four independent 9K true dual-port memories packed into the same 36K block RAM. The area multiplier in Table 2 is therefore a best case; in reality, not every 36K block RAM would be populated with four 9K memories.

Impact on Device Resources

Increasing the size of the block RAM resource can affect overall device resources in one of three ways:

1. Keep the device the same size and lose block RAM bits.
2. Keep the number of block RAM bits the same and reduce the number of other resources, i.e., CLBs.
3. Keep all resources as they are and make the device larger.

Each of these choices has a clear penalty.

Losing block RAM bits is not ideal, because most users use most of the available memory. If every block RAM grows by 25% while the die stays the same size, the number of blocks in the Kintex-7 XC7K410T FPGA falls from 795 to 636, which equates to a reduction from 28,620 Kb to 22,896 Kb.

Trading other resources is no better. A 25% increase in block RAM size means the loss of one CLB column per block RAM column, and the Kintex-7 XC7K410T FPGA has 12 block RAM columns. The device would therefore lose 12 columns of CLBs, equating to 4,200 CLBs, or about 54,000 logic cells. This would reduce the Kintex-7 XC7K410T FPGA from 406,720 logic cells to 352,720 logic cells: a less capable device for the same cost.
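A quick check of these figures, using the device numbers quoted above:

```python
blocks = 795                        # 36 Kb block RAMs in the XC7K410T today
print(blocks * 36)                  # 28,620 Kb of block RAM

smaller = int(blocks / 1.25)        # 25% larger blocks on the same die area
print(smaller, smaller * 36)        # 636 blocks, 22,896 Kb

# Alternative: keep the bits and give up CLBs instead.
clbs_lost = 4200                    # 12 CLB columns, per the text
logic_cells_lost = 54000            # approximate logic-cell equivalent
print(406720 - logic_cells_lost)    # 352,720 logic cells remain
```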

There are also drawbacks to keeping all resources the same and allowing the silicon area to grow. The obvious one is that a larger die means a more expensive device, but physically larger silicon also has a significant impact on power consumption.

Power Optimization of Block RAM Arrays

The Importance of Power Consumption

Power consumption is critical in most modern applications, and reducing the power consumption within a system is a challenge for every designer. There are many ways to reduce power consumption, but many come with significant performance losses. With the 7 Series FPGAs, Xilinx has taken an innovative approach to reducing the power consumption associated with block RAM memory arrays. The two primary methods for reducing power are:

1. Identifying and acting on areas that are unnecessarily consuming power
2. Providing the ability to trade a slight reduction in maximum performance for lower power

Unused Block RAM

All block RAM components consume power once the device is powered up, whether or not they are used in the design. A unique feature of Xilinx 7 Series FPGAs enables the software to automatically identify unused block RAMs. When unused block RAMs are identified, they are automatically disabled and put into a zero-power state, significantly reducing the overall power consumption of the FPGA.

Using XST (Xilinx Synthesis Technology) to Infer RAM

XST infers a block RAM when the read address is registered within the block; conversely, if the RAM description uses asynchronous reads, it is inferred as distributed RAM. The ram_style attribute can be used to specify explicitly whether block RAM or distributed RAM is used. When a RAM is larger than a single base block, multiple block RAMs are used, and the default policy is to optimize for performance. The RAM can also be optimized for power or area; these options are described in the section on building RAM with the CORE Generator software. Unless the user specifies otherwise with the ram_style attribute, any memory less than 128 bits deep, or whose depth x width product is less than 512 bits, is implemented in distributed RAM. To ensure the most efficient use of block RAM and distributed RAM resources, XST implements the largest RAMs in block RAM first; if block RAM resources remain, smaller RAMs are also placed in block RAM.

Techniques to reduce block RAM power consumption can be enabled in XST. They are part of a larger set of optimizations controlled by the POWER synthesis option and can also be enabled specifically through the RAM_STYLE constraint. The RAM power optimization techniques in XST are primarily designed to reduce the number of block RAMs active on the device at the same time. They apply only to inferred memories that must be decomposed into multiple block RAM primitives, and they take advantage of the enable inputs of the block RAM resources: additional enable logic is created to ensure that only one of the block RAM primitives implementing the inferred memory is enabled at a time. Activating power reduction for an inferred memory that fits into a single block RAM primitive has no effect. When enabled, power reduction is pursued alongside the area and speed optimization targets. The RAM_STYLE constraint allows two optimization trade-offs.

Mode block_power1 achieves a modest power reduction with minimal impact on circuit performance. In this mode, the default performance-oriented block RAM decomposition algorithm is retained; XST simply adds block RAM enable logic. Depending on the memory characteristics, the effect on power may therefore be limited.

Mode block_power2 provides a more significant power reduction but may slightly impact performance, and additional slice logic may be introduced. This mode uses a different approach to block RAM decomposition: it first aims to minimize the number of block RAM primitives needed to implement the inferred memory, and then minimizes the number of active block RAMs by inserting block RAM enable logic. Multiplexing logic is also created to read data from the active block RAM. If the primary focus of the design is to reduce power consumption, with speed and area optimization secondary, Xilinx recommends the block_power2 mode.
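Conceptually, the decomposition works as sketched below (a Python illustration of the idea, not XST's actual algorithm): the upper address bits enable exactly one primitive per access and select its output, so all other primitives stay idle:

```python
class LowPowerRam:
    """A deep memory decomposed into several block-RAM-sized primitives."""

    def __init__(self, depth, block_depth=1024):
        self.block_depth = block_depth
        self.blocks = [[0] * block_depth for _ in range(depth // block_depth)]

    def access(self, addr, din=None):
        sel = addr // self.block_depth    # upper bits: which primitive to enable
        offset = addr % self.block_depth  # lower bits: address within it
        # Only blocks[sel] is enabled this cycle; the idle primitives are
        # where the dynamic power saving comes from.
        if din is not None:
            self.blocks[sel][offset] = din
        return self.blocks[sel][offset]   # output mux selects the enabled block

ram = LowPowerRam(depth=4096)             # decomposed into four primitives
ram.access(3000, din=0x42)                # enables only the third primitive
assert ram.access(3000) == 0x42
```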

ROM can be inferred from large case statements. XST can also implement finite state machines (FSMs) and other logic in block RAM to maximize the available logic resources.

Building RAM with CORE Generator

The CORE Generator tool has three algorithms for optimizing a block RAM network. The minimum-area scheme uses as few resources (block RAMs) as possible while also minimizing output multiplexing, maintaining maximum performance in the smallest area (see Figure 6). The low-power scheme may use more resources but ensures that the fewest blocks are enabled during each read and write operation. This can generate a small amount of additional decode logic on the enable signals, a minor area cost compared with the power saved. When running in this mode, the CORE Generator tool performs the same function as XST in RAM_STYLE = block_power2 mode. Note that if the block RAM network is very large, splitting it with the low-power scheme may require routing many additional signals, which can deplete routing resources in other parts of the design.

The third optimization option is Fixed Primitive, in which the user selects a specific primitive, such as 4K x 4, from which the RAM network is built. The CORE Generator tool also provides an option to register the output of the RAM network to improve performance. If multiple block RAMs are used in the network, the user can choose whether to register the output of each block RAM primitive or the output of the core.

Conclusion

Combining block RAM, which can be configured in various data width/depth combinations, with distributed RAM for smaller memory arrays provides the most flexible way to build memories of different sizes at the lowest cost. Adding functionality to a base block such as the block RAM may at first appear to be the ideal solution when it matches a design's exact needs. However, every added feature carries an area penalty in both the block and its associated interconnect, and that penalty in turn affects the device's overall resources. Years of experience building FPGA embedded memories has produced a range of efficient solutions for a wide variety of applications.
