Research on FPGA-Based Triple Modular Redundancy Fault-Tolerance Technology

Abstract: SRAM-based FPGAs are highly sensitive to particle radiation in space and are prone to soft faults, so FPGA-based electronic systems need fault-tolerance measures to guard against such failures. Applying triple modular redundancy (TMR) to sensitive circuits and exploiting the FPGA's dynamic reconfiguration capability can effectively improve the FPGA's tolerance of single-event effects and mitigate the soft faults caused by space particle radiation.

1 Introduction

With the continuous development of Field Programmable Gate Array (FPGA) technology, FPGAs have greatly improved the flexibility and versatility of electronic system design thanks to their short development cycle and low R&D cost. They are widely used in aerospace, communications, medical, industrial control and other fields.

SRAM-based FPGAs are particularly sensitive to charged-particle radiation, and more so with the high-density integrated chips of recent years: growing circuit capacity and falling operating voltage reduce their reliability in a radiation environment. Soft faults are the dominant fault type. They are transient faults caused by the interaction of particles with PN junctions, and they have a particularly severe impact on circuits implemented on SRAM-based FPGAs.

Triple Modular Redundancy (TMR) is a widely used fault-tolerance technique against single-event upsets (SEUs) on FPGAs and can significantly improve FPGA reliability under SEU. However, the additional modules and wiring consume a lot of hardware resources and power and also reduce operating speed, which limits the use of traditional TMR. With the development of electronic technology, especially partial reconfiguration, a variety of improved TMR techniques have emerged that address specific problems of the traditional method.

2 General TMR methods and existing problems

The principle of TMR is simple: the same circuit is instantiated three times, and a "majority vote" is taken over the three outputs, so that the value produced by at least two of the copies becomes the final output. TMR is very effective at mitigating SEUs. It fails only when a single particle has enough energy to upset two of the three copies at the same time, and the probability of this is low. TMR is therefore one of the most effective and most widely used fault-tolerance methods for protecting systems against radiation-induced SEUs.
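The voting rule can be illustrated with a minimal Python sketch. It models each module output as an integer and votes bit by bit; the function name `majority_vote` and the test values are illustrative assumptions, not part of any FPGA toolchain.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value
    that at least two of the three inputs agree on."""
    return (a & b) | (a & c) | (b & c)


golden = 0b1011_0010                 # output of a fault-free module
upset = golden ^ 0b0000_1000         # same output with one bit flipped by an SEU

# One upset module is outvoted by the two healthy ones.
voted_single = majority_vote(golden, upset, golden)

# Two modules upset in the same bit outvote the healthy one: plain TMR fails.
voted_double = majority_vote(upset, upset, golden)
```

In this sketch `voted_single` equals `golden`, while `voted_double` equals the corrupted value, which is the double-upset failure case described above.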

Figure 1 Basic structure of the conventional TMR method

The traditional TMR method can effectively improve the reliability of the design, but it also has many shortcomings:
1) It cannot repair an erroneous unit. When one of the three modules fails, the majority voter simply masks the error, but the faulty module remains in place. Moreover, ordinary TMR cannot detect and locate errors so that the system can repair them. If the error is not repaired in time, TMR fails when a second error occurs in another module.

2) Ordinary TMR has a large resource overhead and low resource utilization. It triplicates the entire design or large modules, so its granularity is coarse, and its resource usage rises to roughly three times that of the original circuit (about 300%). Applying TMR to the entire circuit or to whole modules wastes resources.

3) The power consumption is increased due to the multiplication of the circuit, and the speed is lowered due to the presence of the voter and some other extra wiring.

4) The voter itself may also be in error, and the general TMR voter has no self-checking capability and is not resistant to radiation.

5) When a TMR circuit drives a non-redundant circuit, a voter is required to combine the three signals into one. When a non-redundant circuit drives a TMR circuit, the single signal must be fanned out into three by additional wiring. Because both logic and routing resources are sensitive to SEU, these transitions can degrade system reliability.

3 Improved TMR method

1) Dynamic reconfiguration technology

TMR by itself cannot repair a faulty module. If only one module has an error, the system function is not affected, but if the faulty module is not repaired before a second module fails, the redundancy scheme fails. A failed module must therefore be repaired promptly once an error occurs, and FPGA partial dynamic reconfiguration makes this possible. Dynamic reconfiguration means changing the function of all or part of the logic resources of an SRAM-programmed FPGA while the system is running.

System reconfiguration can be divided into static and dynamic reconfiguration. Static reconfiguration reloads the logic function of the target system while it is not running: the function of the FPGA chip is controlled by external logic, and the chip's logic function is changed by downloading different target-system data stored in memory. A conventional SRAM-programmed FPGA supports only static reconfiguration. Dynamic reconfiguration applies to digital logic systems whose behavior changes over time: the sequential logic is produced not by combining different regions and logic resources within the chip, but by rapidly reconfiguring (or modifying) the local or global chip logic of an FPGA that has dedicated cache logic resources. In a dynamically reconfigurable FPGA, changes to the internal logic blocks and interconnect can be made directly by reading different SRAM bit data, often in a time on the order of nanoseconds, which helps realize dynamic reconfiguration of the FPGA system's logic function.

Soft faults such as SEUs have the most serious impact on space electronic systems, and they can be cleared by reconfiguration, so periodically refreshing the configuration memory repairs such errors.
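A periodic refresh can be pictured with the following Python sketch. It is a software model only: the configuration memory is a list of 32-bit words, `golden` stands for the reference bitstream held in radiation-tolerant storage, and the names `inject_upsets` and `scrub` are hypothetical.

```python
import random


def inject_upsets(config, count, rng):
    """Flip `count` randomly chosen bits in a list of 32-bit configuration words."""
    for _ in range(count):
        word = rng.randrange(len(config))
        bit = rng.randrange(32)
        config[word] ^= 1 << bit


def scrub(config, golden):
    """Rewrite every word from the golden copy.
    Returns how many words differed before the rewrite."""
    repaired = 0
    for index, good_word in enumerate(golden):
        if config[index] != good_word:
            repaired += 1
        config[index] = good_word
    return repaired


rng = random.Random(0)                        # fixed seed so the run is repeatable
golden = [rng.getrandbits(32) for _ in range(64)]
config = list(golden)                         # configuration memory as loaded

inject_upsets(config, 3, rng)                 # three simulated SEUs
repaired = scrub(config, golden)              # one refresh pass
```

After the pass, `config` matches `golden` again. The refresh rewrites every word whether or not it was corrupted, which is the time and power cost that motivates error-triggered repair.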

A TMR circuit can be designed with a voter that detects and locates errors. When a module fails, the voter's signal directly triggers reconfiguration, and only the faulty part of the circuit is dynamically reconfigured. This avoids the time and power cost of periodic refresh and prevents errors from accumulating.
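As a rough behavioral sketch of such a voter, the Python below returns both the voted value and the index of the disagreeing module, then calls a repair hook for that module only. The names `vote_and_locate`, `run_with_repair`, `reference_logic` and `upset_logic` are assumptions made for illustration; in a real design the repair hook would start partial reconfiguration of the affected region.

```python
def vote_and_locate(outputs):
    """Return (voted_value, faulty_index) for three module outputs.
    faulty_index is None when all three agree."""
    a, b, c = outputs
    if a == b == c:
        return a, None
    if a == b:
        return a, 2
    if a == c:
        return a, 1
    if b == c:
        return b, 0
    raise ValueError("no two modules agree; TMR cannot mask this fault")


def run_with_repair(modules, reconfigure, x):
    """Evaluate the three modules, vote, and repair only the faulty one."""
    value, faulty = vote_and_locate([module(x) for module in modules])
    if faulty is not None:
        modules[faulty] = reconfigure(faulty)   # stands in for partial reconfiguration
    return value, faulty


def reference_logic(x):
    return 2 * x + 1


def upset_logic(x):          # hypothetical behavior after a configuration upset
    return 2 * x


modules = [reference_logic, upset_logic, reference_logic]
first = run_with_repair(modules, lambda index: reference_logic, 4)
second = run_with_repair(modules, lambda index: reference_logic, 4)
```

The first call masks the error and reports module 1 as faulty; the second call finds all three modules in agreement because module 1 has been restored.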

To guard against voter errors, the voter can be implemented in radiation-insensitive devices rather than in SRAM-based logic, which improves its robustness. The improved voter no longer takes a majority vote over the outputs of the three redundant modules. Instead, each module's output passes through a tri-state buffer controlled by a minority voter and is driven out on one of three FPGA output pins; the three pins are then wired together into a single signal on the printed circuit board (PCB). Each minority voter determines whether its module's signal is the minority value. If it is, the corresponding buffer output goes to high impedance; if not, the signal is output normally.
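The minority-voter arrangement can be modeled for a single bit as follows. This is a behavioral sketch under simplifying assumptions: `HIGH_Z` is represented by `None`, the PCB net is modeled as the set of values driven onto it, and the function names are hypothetical.

```python
HIGH_Z = None   # stands for a tri-state buffer that is not driving the net


def is_minority(own, other_a, other_b):
    """True when this module's bit disagrees with both other modules."""
    return own != other_a and own != other_b


def tristate(bit, disable):
    """Drive `bit` onto the pin unless the buffer is disabled."""
    return HIGH_Z if disable else bit


def wired_output(bits):
    """Return (net_value, pin_states) for three single-bit module outputs."""
    a, b, c = bits
    pins = [
        tristate(a, is_minority(a, b, c)),
        tristate(b, is_minority(b, a, c)),
        tristate(c, is_minority(c, a, b)),
    ]
    driven = {pin for pin in pins if pin is not HIGH_Z}
    if len(driven) != 1:
        raise ValueError("bus contention or floating net")
    return driven.pop(), pins


net_value, pin_states = wired_output((1, 0, 1))   # module 1 is upset
```

Here the upset module's pin goes to high impedance and the two agreeing modules drive the net, so no single on-chip voter sits in the output path.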

Readback builds on dynamic reconfigurability. The configuration data read back from the device is compared with the original configuration data, and the affected region is reconfigured when an error is found; error-correcting codes are used to protect the configuration data. The data of each configuration frame is protected by a 12-bit SEC-DED (single-error-correcting, double-error-detecting) Hamming code, and each basic unit in the FPGA has a different identification code. After the configuration data is read back through the ICAP (Internal Configuration Access Port), the error-correcting code can give the location of the erroneous bit.
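How a SEC-DED Hamming code locates a flipped bit can be shown on a toy word. The sketch below is a textbook extended Hamming code over a short list of bits; it does not reproduce the vendor's frame layout or the actual 12-bit frame code, and the function names are assumptions.

```python
def encode_secded(data_bits):
    """Extended Hamming code. Index 0 holds the overall parity bit,
    power-of-two indices hold Hamming parity bits, the rest hold data."""
    parity_count = 0
    while (1 << parity_count) < len(data_bits) + parity_count + 1:
        parity_count += 1
    length = len(data_bits) + parity_count
    code = [0] * (length + 1)
    data = iter(data_bits)
    for pos in range(1, length + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(data)
    syndrome = 0
    for pos in range(1, length + 1):
        if code[pos]:
            syndrome ^= pos
    for i in range(parity_count):        # choose parity bits that zero the syndrome
        if (syndrome >> i) & 1:
            code[1 << i] = 1
    code[0] = sum(code[1:]) % 2          # overall parity enables double-error detection
    return code


def check_secded(code):
    """Return ("ok", None), ("single", bit_position) or ("double", None)."""
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    overall_parity = sum(code) % 2
    if syndrome == 0 and overall_parity == 0:
        return "ok", None
    if overall_parity == 1:
        return "single", syndrome        # syndrome is the index of the flipped bit
    return "double", None


frame = encode_secded([1, 0, 1, 1, 0, 0, 1, 0])

readback = list(frame)
readback[6] ^= 1                         # one simulated upset
status, position = check_secded(readback)
readback[position] ^= 1                  # repair the located bit

double = list(frame)
double[3] ^= 1
double[9] ^= 1                           # two upsets: detectable, not correctable
double_status = check_secded(double)
```

For a single upset the syndrome equals the index of the flipped bit, so only that bit (or the frame containing it) needs to be rewritten. A double upset is flagged but cannot be located, so the frame would have to be reloaded from the original configuration data.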

Dynamic reconfiguration can repair the functional errors that SEUs cause in LUTs, the routing matrix and CLBs without interrupting circuit operation, which effectively improves the single-event tolerance of FPGA circuits.

2) Local sensitive circuit TMR technology

With the advent of partial dynamic reconfiguration techniques, TMR can be applied to locally sensitive circuits only. TMR is implemented at a fine granularity, with careful placement and routing, to maximize reliability within the available resource budget. When resources are limited and global TMR is not possible, TMR for locally sensitive circuits is a good choice, because it improves system reliability with fewer resources. Since not every module needs redundancy, it is important to apply TMR to the modules that contribute most to system reliability. The number and placement of voters must also be considered: a TMR module needs extra wiring at its inputs and outputs, and because both logic and routing resources are sensitive to SEU, these transitions can reduce system reliability.

Figure 2 Schematic diagram of the local sensitive circuit TMR

To select the modules that need TMR and to place and route them properly, the errors that occur in the system are classified as persistent or non-persistent. Persistent errors are those in which an SEU changes the internal state of the circuit, so they remain after reconfiguration; non-persistent errors are those that FPGA reconfiguration eliminates.
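The distinction can be seen in a small simulation. The sketch below is a toy model, not a measurement: a combinational block and a counter are both disturbed during a fault window, and the fault is cleared (as if by reconfiguration) when the window ends. The function `run` and the specific fault behavior are assumptions chosen for illustration.

```python
def run(cycles, fault_window=None):
    """Simulate a combinational block and a counter for `cycles` clock cycles.
    During fault_window = (start, end) a configuration upset corrupts both;
    at `end` the configuration is assumed to be repaired."""
    counter = 0
    comb_trace, counter_trace = [], []
    for t in range(cycles):
        faulty = fault_window is not None and fault_window[0] <= t < fault_window[1]
        comb_out = t + 2 if faulty else t + 1     # output depends only on the input t
        counter += 2 if faulty else 1             # output depends on stored state
        comb_trace.append(comb_out)
        counter_trace.append(counter)
    return comb_trace, counter_trace


good_comb, good_counter = run(8)
bad_comb, bad_counter = run(8, fault_window=(2, 4))

# Compare the traces after the repair at cycle 4.
comb_recovered = bad_comb[4:] == good_comb[4:]
counter_recovered = bad_counter[4:] == good_counter[4:]
```

In this model the combinational output is correct again as soon as the configuration is repaired (a non-persistent error), while the counter keeps the wrong value it accumulated during the fault (a persistent error) until the state itself is reset.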

Based on the above analysis, the priority for applying partial TMR is as follows:
The first level is the part of the circuit that produces persistent errors.
The second level is the circuitry that feeds the persistent-error part, which is protected next in order to reduce the transitions between TMR and non-TMR logic.
The third level is the circuitry downstream of the persistent-error part, again with the aim of reducing the transitions between TMR and non-TMR logic.
The fourth level is the part that is independent of the circuitry that produces persistent errors.
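One way to picture this ranking is as a static analysis over a netlist graph. The sketch below makes several assumptions that are not from the source: the netlist is a dictionary from each block to the blocks it drives, the persistent part is approximated as every block that lies on a feedback loop, and the block names are invented.

```python
def reachable(graph, starts):
    """Nodes reachable from `starts` by following at least one edge."""
    seen = set()
    stack = list(starts)
    while stack:
        node = stack.pop()
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def tmr_priorities(graph):
    """Assign each block a TMR priority level from 1 (highest) to 4."""
    nodes = set(graph) | {dst for dsts in graph.values() for dst in dsts}
    reverse = {node: [] for node in nodes}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].append(src)

    # Assumption: a block on a feedback loop holds state and can err persistently.
    persistent = {node for node in nodes if node in reachable(graph, [node])}
    upstream = reachable(reverse, persistent) - persistent
    downstream = reachable(graph, persistent) - persistent - upstream

    priorities = {}
    for node in nodes:
        if node in persistent:
            priorities[node] = 1
        elif node in upstream:
            priorities[node] = 2
        elif node in downstream:
            priorities[node] = 3
        else:
            priorities[node] = 4
    return priorities


# Hypothetical netlist: an accumulator loop (add <-> reg) plus unrelated logic.
netlist = {"in": ["add"], "add": ["reg"], "reg": ["add", "out"], "cfg": ["led"]}
levels = tmr_priorities(netlist)
```

With this netlist the loop blocks `add` and `reg` get level 1, `in` gets level 2, `out` gets level 3, and `cfg` and `led` get level 4. Redundant resources would then be spent level by level until the budget runs out.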

The circuit can be partitioned by static analysis. One issue is that in standard global TMR all inputs, outputs and clocks are triplicated, whereas with partial TMR the I/O and clocks may not be made redundant. Like logic that is not protected by TMR, unprotected clocks and I/O can produce errors that go undetected.

Experimental results show that, because this method concentrates on the parts of the circuit that can produce persistent errors, the probability of a persistent error falls quickly as more redundant resources are used, until almost all such errors are eliminated. Partial TMR can therefore balance resources against reliability, maximizing resource utilization with minimal impact on reliability.

4 Conclusion

With the rapid development of FPGAs, chip integration keeps rising and operating voltage keeps falling, which reduces FPGA reliability under radiation. The impact of soft faults, represented by SEUs, is growing in particular, so fault-tolerance measures must be taken when a system is implemented on an SRAM-based FPGA.

Building on the reliability advantages of traditional TMR, fault-tolerance measures for space applications that combine local sensitive circuit TMR with FPGA partial dynamic reconfiguration can effectively protect circuits from soft faults caused by space particles. TMR based on locally sensitive circuits will be a major development direction for TMR technology.
