3 problems to overcome to achieve fault self-healing, and out-of-the-box solutions

An analysis of how an enterprise can build automated fault handling, based on product design thinking

Handling alarms manually has always been a sore point for operations and maintenance (O&M). On New Year's Day, at weddings, on weekend outings with your spouse and children, and at other good times, you as an O&M engineer always seem to be worrying about the IT system and never stray far from your laptop.

So many years have passed; why is it still so hard? The O&M industry is supposed to be moving to AIOps, yet I am still handling alarms manually. What should I do?

Today we will talk about the three problems you need to overcome to achieve fault self-healing, along with out-of-the-box solutions.

1. The basic process of self-healing

What is the point of automation? Human experience is abstracted and solidified into program logic. This holds for industry (the third industrial revolution) as well as for the Internet.

Take a disk alarm as an example. The first thing an O&M engineer thinks of is logging in to the server and cleaning up the disk.

(The process of manually processing alarms)

Next, let us break down the logic behind it.

1.1 Abstracting the Alarm Processing Flow

1) Pull the disk alarm

2) Write a disk cleanup script or job task

3) Design a module that links the pulled disk alarm to the module that calls the script

(Failure self-healing process simplified version V1)
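
The three steps of the V1 flow can be sketched as a small dispatcher. This is only a minimal sketch, not the original implementation: the alarm fields (`ip`, `type`), the `HANDLERS` table, and the functions `pull_alarms` and `clean_disk` are hypothetical names, and the cleanup itself is only simulated.

```python
from typing import Callable, Dict, List

Alarm = Dict[str, str]  # hypothetical alarm shape: {"ip": ..., "type": ...}


def clean_disk(alarm: Alarm) -> str:
    """Stand-in for the disk cleanup script or job task (step 2)."""
    # A real job would run the cleanup on the host; here we only report it.
    return f"cleaned disk on {alarm['ip']}"


# Step 3: the link between an alarm type and the script to call.
HANDLERS: Dict[str, Callable[[Alarm], str]] = {"disk": clean_disk}


def pull_alarms() -> List[Alarm]:
    """Stand-in for step 1; a real module would query the monitoring system."""
    return [{"ip": "10.0.0.1", "type": "disk"}, {"ip": "10.0.0.2", "type": "ping"}]


def heal(alarms: List[Alarm]) -> List[str]:
    """Run the matching handler for each alarm; unknown types are left to humans."""
    results = []
    for alarm in alarms:
        handler = HANDLERS.get(alarm["type"])
        if handler is None:
            results.append(f"no handler for {alarm['type']} on {alarm['ip']}")
        else:
            results.append(handler(alarm))
    return results


if __name__ == "__main__":
    for line in heal(pull_alarms()):
        print(line)
```

Keeping the alarm-to-script link in a table rather than in `if` statements means a new alarm type only needs a new entry.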

1.2 Resolving Resources through the CMDB

Different modules need different disk cleanup procedures. How do we solve that?

In this case, a CMDB (which maps the relationships between devices, people, and services) needs to be introduced so that the IP in the alarm can be resolved to its module. This way, a disk alarm at the access layer, the logic layer, or the storage layer is matched to the disk cleanup solution for that layer.

(Failure self-healing process simplified version V2)
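
The CMDB lookup in V2 amounts to two mappings: IP to module, then module to cleanup package. The following is a minimal sketch with in-memory dictionaries standing in for the CMDB; the IPs, module names, and package names are all made up for illustration.

```python
from typing import Dict, Optional

# Hypothetical CMDB data: IP -> module, and module -> disk cleanup package.
CMDB_IP_TO_MODULE: Dict[str, str] = {
    "10.0.0.1": "access",
    "10.0.1.1": "logic",
    "10.0.2.1": "storage",
}
DISK_PACKAGES: Dict[str, str] = {
    "access": "clean_access_logs",
    "logic": "clean_core_dumps",
    "storage": "archive_old_data",
}


def resolve_package(ip: str) -> Optional[str]:
    """Resolve an alarm IP to its module, then to that module's cleanup package."""
    module = CMDB_IP_TO_MODULE.get(ip)
    if module is None:
        return None  # unknown host: do not guess, leave it for manual handling
    return DISK_PACKAGES.get(module)


if __name__ == "__main__":
    print(resolve_package("10.0.1.1"))
```

Returning `None` for an unknown host is a deliberate choice in this sketch: running a cleanup package on a host the CMDB cannot identify is riskier than doing nothing.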

1.3 Connecting to Enterprise Internal Gateways

Fault self-healing may need to notify the user after it handles a fault. Besides invoking jobs, it may also need to invoke the enterprise's internal gateways, such as those for restarting servers or applying for servers.

Using the ESB at the PaaS layer is one solution. The ESB encapsulates the enterprise's internal gateways and handles concerns such as permission verification, rate limiting, access statistics, route distribution, and self-service onboarding. Do not call the bare interfaces directly.

(Notification Scheme for Self-healing Failure)
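
To make the role of the ESB concrete, here is a toy facade that puts permission verification, rate limiting, access statistics, and routing in front of an internal interface. It is a sketch under stated assumptions and not the API of any real ESB product: the `Gateway` class, its methods, and the `send_notice` route are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Optional, Set


class Gateway:
    """Toy facade in front of internal interfaces (all names are hypothetical)."""

    def __init__(self, max_calls: int, per_seconds: float) -> None:
        self.max_calls = max_calls
        self.per_seconds = per_seconds
        self.routes: Dict[str, Callable[..., str]] = {}
        self.permissions: Dict[str, Set[str]] = defaultdict(set)
        self.calls: Dict[str, Deque[float]] = defaultdict(deque)
        self.stats: Dict[str, int] = defaultdict(int)

    def register(self, name: str, func: Callable[..., str]) -> None:
        """Self-service onboarding: expose an internal interface under a route name."""
        self.routes[name] = func

    def grant(self, app: str, name: str) -> None:
        """Allow the calling application `app` to use route `name`."""
        self.permissions[app].add(name)

    def call(self, app: str, name: str, now: Optional[float] = None, **kwargs) -> str:
        now = time.monotonic() if now is None else now
        if name not in self.permissions[app]:
            raise PermissionError(f"{app} may not call {name}")
        window = self.calls[app]
        while window and now - window[0] >= self.per_seconds:
            window.popleft()  # forget calls that have left the time window
        if len(window) >= self.max_calls:
            raise RuntimeError(f"rate limit exceeded for {app}")
        window.append(now)
        self.stats[name] += 1  # access statistics
        return self.routes[name](**kwargs)  # route distribution


if __name__ == "__main__":
    gw = Gateway(max_calls=2, per_seconds=60)
    gw.register("send_notice", lambda user, text: f"to {user}: {text}")
    gw.grant("self_healing", "send_notice")
    print(gw.call("self_healing", "send_notice", user="ops", text="disk cleaned"))
```

The self-healing service only ever talks to `call`, so checks and statistics are applied uniformly instead of being reimplemented around each bare interface.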

After this round of exploration, the architecture of fault self-healing looks like the following.

(failure self-healing process)

1.4 Connecting to Internal Monitoring Products

Wait, it seems we have not yet said how to connect to the company's internal monitoring products. Take Zabbix and Open-Falcon as examples.

1.4.1 Connecting to Zabbix

The article "When Zabbix Meets Fault Self-healing" introduced a scheme for pulling Zabbix alarms: a script is invoked through a Zabbix action, and the script pushes the Zabbix alarm to the alarm pull module of fault self-healing.

Pushing (or a callback) ensures that alarms are received in real time.

(Zabbix push alert example)

(Zabbix script to call push alert)
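
A push script of this kind might look like the following. This is a hedged sketch, not the script from the original figure: it assumes the action passes the alert subject and message to the script, that the message template is written as `key=value` lines, and that the self-healing alarm module accepts JSON over HTTP at the hypothetical `PUSH_URL`.

```python
import json
import urllib.request
from typing import Dict

# Hypothetical endpoint of the self-healing alarm pull module.
PUSH_URL = "http://self-healing.example.internal/api/alarms"


def parse_message(message: str) -> Dict[str, str]:
    """Parse 'key=value' lines (an assumed message template) into a dict."""
    fields = {}
    for line in message.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            fields[key.strip()] = value.strip()
    return fields


def build_request(subject: str, message: str) -> urllib.request.Request:
    """Build the HTTP POST that pushes one alarm as JSON."""
    payload = {"source": "zabbix", "subject": subject, **parse_message(message)}
    return urllib.request.Request(
        PUSH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def push(subject: str, message: str) -> int:
    """Send the alarm and return the HTTP status code."""
    with urllib.request.urlopen(build_request(subject, message), timeout=5) as resp:
        return resp.status
```

When used as a command-line script, `push` would be called with the subject and message taken from the arguments. That wiring is left out so the sketch has no side effects when imported.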

For a case study of connecting to Zabbix, you can refer to Chen Liang's article "The Unattended Operations We Wanted to Do in Those Years".

Besides Zabbix, Open-Falcon has an active community in China, so a scheme for pulling its alerts is introduced here as well.

1.4.2 Connecting to Open-Falcon

The solution is similar to the one for Zabbix, but Open-Falcon provides a callback feature directly, which simplifies the process.

(Open-Falcon Configuration Callback Address)

After receiving the alert sent by Open-Falcon, parse the corresponding fields.

If the internal CMDB identifies hosts by IP, a conversion layer is required. Because Open-Falcon's resource identifier, the endpoint, is the host name by default, the CMDB's auto-discovery feature needs to report host names automatically and provide a function that resolves a host name to an IP.
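
The parsing and conversion steps can be sketched as follows. The sketch assumes the callback arrives as an HTTP GET whose query string carries fields such as `endpoint`, `metric`, and `status`; check the field names against your Open-Falcon version. The `HOSTNAME_TO_IP` table is a hypothetical stand-in for data reported by CMDB auto-discovery.

```python
from typing import Dict, Optional
from urllib.parse import parse_qs, urlsplit

# Hypothetical table filled by CMDB auto-discovery: host name -> IP.
HOSTNAME_TO_IP: Dict[str, str] = {"nginx-01": "10.0.0.1"}


def parse_callback(url: str) -> Dict[str, str]:
    """Extract the query parameters of a callback URL, keeping the first value of each."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


def to_alarm(url: str) -> Optional[Dict[str, str]]:
    """Convert a callback into an IP-keyed alarm, or None if the host is unknown."""
    fields = parse_callback(url)
    ip = HOSTNAME_TO_IP.get(fields.get("endpoint", ""))
    if ip is None:
        return None
    return {
        "ip": ip,
        "metric": fields.get("metric", ""),
        "status": fields.get("status", ""),
    }


if __name__ == "__main__":
    print(to_alarm("http://heal.example/cb?endpoint=nginx-01&metric=df.bytes.free.percent&status=PROBLEM"))
```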

The following is a self-healing example for a disk alarm on the Nginx module. The alarm is matched to the Nginx module's disk cleanup package, which cleans up the module's log files. The entire process takes less than 30 seconds.

(Example of self-healing disk alert)

2. The two sides of failure self-healing

Automatic fault handling is a double-edged sword.

The authenticity of each alarm must be guaranteed: once a false alarm is handled automatically, the result can be painful.

For example, the network fluctuates and PING alarms occur in a batch, while the servers are in fact running normally. If you restart the servers at this point, it is game over.

How do we solve this? Analyze the pattern behind the problem.

Such alarms occur in batches, so you can add a convergence module after the alarm pull module.

For example, if Y alarms occur within time window X, ask O&M for approval instead of acting automatically.

If alarms that use the same package occur on the same host within period X, the later alarms inside the convergence time window are skipped. For example, if you receive a process alarm and a port alarm at the same time, you do not need to bring the process up twice.

In addition, if the original monitoring system has no convergence capability, you can use this function for alarm summarization: the convergence logic is the same, and only the way of converging differs.
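
Both convergence rules fit in a few lines. The following is a minimal sketch, assuming alarms arrive one at a time with a timestamp in seconds; the `Converger` class and the decision labels `execute`, `skip`, and `approve` are hypothetical names.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class Converger:
    """Decide what to do with each alarm: 'execute', 'skip', or 'approve'."""

    def __init__(self, window: float, max_alarms: int) -> None:
        self.window = window          # X: length of the time window in seconds
        self.max_alarms = max_alarms  # Y: alarms in the window that trigger approval
        self.recent: Deque[float] = deque()
        self.last_run: Dict[Tuple[str, str], float] = {}

    def decide(self, host: str, package: str, now: float) -> str:
        # Rule 1: count all alarms in the window; a burst needs human approval.
        self.recent.append(now)
        while self.recent and now - self.recent[0] >= self.window:
            self.recent.popleft()
        if len(self.recent) >= self.max_alarms:
            return "approve"
        # Rule 2: the same package on the same host runs once per window.
        key = (host, package)
        last = self.last_run.get(key)
        if last is not None and now - last < self.window:
            return "skip"
        self.last_run[key] = now
        return "execute"


if __name__ == "__main__":
    c = Converger(window=60, max_alarms=3)
    print(c.decide("h1", "restart_proc", now=0))
    print(c.decide("h1", "restart_proc", now=1))
    print(c.decide("h2", "restart_proc", now=2))
```

In this sketch the burst check runs first, so a batch of PING alarms caused by a network fluctuation goes to a human before any package is executed.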

3. Complex Alarm Processing Plan - Combination Package

The technical solution described above handles logically simple alarms. How do we handle a complex scenario such as fault replacement?

For example, module A is an important module and a PING unreachable alarm occurs. First verify whether module A has really failed; if it has, the next step is to take a backup machine from the resource pool ... perform the fault replacement, and so on. Any step can go wrong, so the abnormal branches must be considered as well.

A tree structure can solve this problem. A binary tree is sufficient for most scenarios (one branch for success and one for failure).

(Example of combination packages)

The picture above shows a self-healing handling plan, which can be called a combination package.

The concept of atoms is also introduced here. Atoms are assembled to satisfy various scenarios, which is the same idea as resource orchestration.

Note: If you need a ternary tree, you can use a combination package as an atomic package (node).
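
A combination package can be modeled as a binary tree whose nodes are atoms, each with a success branch and a failure branch. The sketch below is an illustration under assumptions rather than the product's actual data model: the `Node` class, the `run` function, and the atom names in the fault replacement example are hypothetical, and the atoms are simulated with lambdas.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """One atom in a combination package, with a branch per outcome."""
    name: str
    action: Callable[[], bool]           # returns True on success
    on_success: Optional["Node"] = None
    on_failure: Optional["Node"] = None


def run(node: Optional[Node]) -> List[str]:
    """Walk the tree from the root and return the names of the atoms executed."""
    path = []
    while node is not None:
        path.append(node.name)
        node = node.on_success if node.action() else node.on_failure
    return path


def build_replacement_package(is_faulty: bool, pool_has_spare: bool) -> Node:
    """Fault replacement: verify the fault, take a spare, replace; notify O&M on errors."""
    notify = Node("notify_ops", lambda: True)
    replace = Node("replace_host", lambda: True, on_failure=notify)
    get_spare = Node("get_spare", lambda: pool_has_spare,
                     on_success=replace, on_failure=notify)
    # If verification does not confirm the fault, the package simply ends.
    return Node("verify_fault", lambda: is_faulty, on_success=get_spare)


if __name__ == "__main__":
    print(run(build_replacement_package(is_faulty=True, pool_has_spare=False)))
```

Because `run` returns once a branch ends, a whole package can itself be wrapped as the `action` of a node in a larger tree, which is how the note about ternary trees can be realized.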

4. Fault self-healing technology architecture

Having covered the basic process of fault self-healing, its two sides, and the handling of complex alarms, we arrive at a technical architecture for fault self-healing.

I believe this is a technical analysis of fault self-healing that has been verified by the industry, and it can serve as a reference for automated handling solutions within your company.

5. Conclusion

With AIOps in full swing, we need to exercise restraint and give priority to resolving the main contradiction rather than building castles in the air.

As with a product roadmap, usability comes first, then user experience, and finally scalability and ecosystem, with each stage put into practice in turn.

Finally, I hope that O&M colleagues can escape the drudgery of primitive operations work as soon as possible, seize the industry's development trend, master the core technology, and realize their value in the process of change!
