Whether you are a supplier or a user, nothing is harder to face than a system failure. When one occurs, all the pressure falls on the IT staff: the attention of management, the project schedule, everything can depend on whether the fault is resolved in time.
Drawing on many years of work experience, the author summarizes the issues that arise in the fault analysis process from five angles: psychology, responsibility, communication, getting to the root of the matter, and knowledge structure, and shares them with fellow IT technicians.
1. Introduction
A fault can strike any computer system at any time: during trial operation, after the system officially goes live, on a system that has run stably for many years, or right after a minor upgrade. Its negative impact can be large or small: so large that the entire system is paralyzed, no business can be handled, and every business outlet closes; or so small that a single function on a single terminal cannot be completed.
Once a system has failed, the cause must be analyzed and measures taken as soon as possible to clear the fault. In general, the difficulty of fault analysis is not directly proportional to the fault's negative impact. The cause of a paralysis of the entire system may be found in one minute: the central computer room lost power. The cause of one business function not working properly may take a day or even several weeks to find: a problem with the length of one input parameter, somewhere in a program of up to 500,000 lines. What the difficulty of fault analysis is proportional to is the complexity of the system. Complexity here does not mean the number of devices or the length of the software; it means the number of types of devices and software and the number of vendors involved. The more complicated the system, and the more kinds of equipment, software, vendors, and people involved, the more difficult fault analysis is likely to be.
Take a call center system as an example. It includes not only a computer system but also a telephone exchange system, with a wide variety of equipment and software and complex technology, including PBX equipment, minicomputers, network devices, storage devices, Signaling System No. 7, CTI middleware, IVR middleware, transaction middleware, web middleware, database software, and the C, Delphi, and Java programming languages. Its suppliers, partners, and related business systems are also more complicated than those of an ordinary system. A call center is therefore a highly complex system, and analyzing its faults is correspondingly difficult.
Clearing a system fault quickly requires a complete set of coordination mechanisms and processes. Most IT companies with strict quality management systems have now defined their own procedures for handling major faults, and fault analysis is one link in those procedures, and the most critical one. However, because fault analysis involves many technical details and draws on the all-round abilities of the people involved, process alone is not enough: fault analysis has its own inherent laws, which technical staff need to master.
Drawing on many years of work experience, this paper summarizes the issues that need attention during fault analysis and shares them with fellow IT technicians, in the hope that readers will find them helpful.
2. The necessary psychological preparation
2.1. Faults always happen, even if you have tested.
Most software engineers are conscientious and confident. They are meticulous during development and testing, and once the tests pass they relax, believing that their system is now solid. When the system fails, they find the reality hard to accept: "How could I have made a mistake? Impossible." Or, once it is confirmed that the system really is wrong, they are embarrassed: "How could I have made such a mistake?", and a sense of frustration sets in. Neither attitude helps. In a complex software system, no matter how high the code quality or how thorough the testing, errors can in theory always exist. If a fault has occurred, confirm it as soon as possible and accept the reality; then, without brooding, analyze it quickly and clear it. Users will understand you and may even thank you. Do not let users see you as someone who will not admit a mistake, will not accept the evidence, and is then helpless to fix it. The key is to be prepared both mentally and in ability; blind confidence with no preparation ends badly.
2.2. Faults always have a reason, even if you don't know it.
Computer business systems are highly logical constructs. Every operation has its consequences, and a fault is one such consequence. Sometimes, faced with a fault, we complain in frustration: "There is no reason for this!" or "It was working fine!" But the fault has occurred and cannot be denied; you should shoulder the responsibility, stay composed, and analyze it. If you say such things, the only thing shown to be "unreasonable" is not the computer system but yourself.
2.3. Faults always require you to stay rational.
The process of resolving a fault is full of struggles of every kind.
However unexpected the events, however awkward the situation, however stubborn the engineers, and however arrogant the vendor's technical support, as the person in charge of resolving the fault you must always stay rational and keep your emotions under control. In anger, people's thinking always lacks logic. Once you lose your composure, you will say things you should not say and do things you should not do. Your goal is to solve the problem; a quarrel only complicates it and makes later cooperation and communication harder, which is one more obstacle to the solution. So never let yourself lose control. There is no need to get angry: however difficult the problem, it will eventually be solved.
2.4. Faults can always be cleared, as long as you work hard.
"This is baffling." "I really don't understand." After you have tried every possibility you can think of and still cannot find the cause, you feel depressed, at the end of your rope, and ready to give up. This is when you must hold on and keep your confidence: believe in the logic of the computer and in your own analysis. The 101st possibility may be the answer.
At this point, do not be impatient. Summarize what you know, look back over it, and consider the problem from more angles. Think a little more and push a little further, and a way forward will open up.
3. Dare to take responsibility
3.1. Faults can never be covered up, if you only want to hide them instead of analyzing them.
When a fault occurs and few people have witnessed it, you may be afraid. You dare not tell anyone, you do not report it, you do not even do basic handling, and you pray: "Please don't happen again, let this just pass." That is impossible. A fault is a fault: as long as the conditions are present, it will occur again. The best way to stop it from recurring is to analyze and eliminate it. Hiding it will not work.
3.2. Faults always recur, if you do not eliminate them.
Once a fault has happened, it must be eliminated as soon as possible. Even if you do not yet know the cause, keep it in mind; there is no alternative. Do not count on luck. If a few days pass without a recurrence and you feel relieved and assume the matter is over, you are wrong: it may well strike again on the busiest day, and then luck will not save you. Users are not to be trifled with. After the same fault has happened again and again, what will you tell them?
3.3. Faults may always be caused by you, unless there is evidence otherwise.
The system is complex, the division of labor is fine-grained, and everyone looks after their own part. When a fault occurs, there is always some expert who jumps out immediately: "This has nothing to do with me. My part cannot be wrong... in other systems our product has never...". Yet in the end, when nobody else can find the cause, it turns out to be his problem. When there is a problem, you cannot blindly rule yourself out. Even if you are sure in your own mind, calm down and analyze anyway. Whatever the analysis shows, even if the mistake is not yours, it is still a contribution to solving the problem.
It may even resolve the whole problem. Likewise, if someone tells you that a fault is your fault, calmly ask where the evidence is. Whether or not the other party can provide conclusive evidence, sincerely show a positive attitude, start analyzing immediately, and keep communicating. Even if you feel wronged, be patient and listen: whether or not the other party is right, what they say helps the analysis.
3.4. Faults can never be cleared without active participation.
In a system with many kinds of equipment and software, troubleshooting is not something one person, or even one company, can do alone. Without your active participation, the problem becomes harder to solve and takes longer. Moreover, simple avoidance or passive participation does you no good: your inaction becomes an excuse for others to attack you and makes you the center of the conflict, even if the fault really has nothing to do with you. The wise course is to take part actively. When all relevant parties participate actively, troubleshooting goes much faster.
4. Appropriate communication skills
4.1. Faults must always be made known to the relevant people, if you don't want more trouble.
When a fault has happened there is of course no need to tell the whole world; it is not good news, after all. But those who should know must be told. This does not add trouble; it helps. If you tell them first, fewer people will come asking you. Tell the relevant people, such as business representatives, project managers, and department managers, what happened and that you are working hard on the analysis. When others ask them, they will be prepared and will speak for you, or at least will not be caught off guard. As for exactly who should know, that is for you to judge: tell neither too many people nor too few.
4.2. There are always people who can help you with a fault, if you set aside your pride.
After a fault occurs, do a quick analysis first and judge whether it is well within your technical ability. If you find that things look bad and it is beyond you, do not struggle alone; ask others. "Among any three people, one can be my teacher" and "do not be ashamed to ask" are sound maxims. Asking others is the first-choice way to solve a problem, so do not hesitate; most people are happy to answer your question, if they really know the answer. There is skill involved, though. Do not ask just anyone, because your time is limited, and do not ask idle questions, because other people's time is precious too. Choose relevant resources to contact, such as vendor technical support, technical websites, members of the original project team, colleagues, peers, and professional forums. They should be experts in the area or have special knowledge of the system. Note that for after-sales issues you should not ask the vendor's pre-sales technical support. Before you ask, be clear about exactly what question you are asking. There is no need to describe the whole history of the fault: the other person may not know what your system does, and saying too much only confuses them.
If you have not even thought the problem through yourself, do not rush to ask; hold the question back, because a question asked that way is wasted.
4.3. Faults always need you to pool the strengths of many parties and take the lead in resolving them.
For a system with a complex architecture, complex logic, and complex interfaces, the knowledge needed to analyze and resolve a fault is likely to exceed your own. You then have to contact the relevant departments on the user side and the vendors of related supporting systems, form a virtual project team, and run multi-party joint tests. Your role is the virtual project manager of this emergency project. You must be clearer than anyone about the purpose and expected result of every joint test; you must summarize and record every test result; and you must handle every important communication yourself, because conveying correct information is what matters most while solving the problem. In such multi-party coordination, you face the whole while the other participants face only a part. Their work can be interrupted, but your line of thought must stay coherent, with one purpose: analyze the cause of the fault and find a solution.
5. The habit of getting to the bottom of things
5.1. Faults are always misreported, if you do not ask in person.
Between the moment the system went wrong and the moment you heard about it, the news may have passed through many people: the person involved, that person's supervisor, the user contact, the business representative, the project manager, and finally you. Usually you then confirm the details through the user contact and start analyzing. In fact, whatever the user's technical level, you should contact the person involved as soon as possible and ask in person about the time, place, and symptoms of the fault, to obtain first-hand information. A fault report is inevitably distorted as it is passed along, and a subtle difference in symptoms can lead to two completely different lines of analysis. Confirming the fault symptoms yourself is therefore your most important step, and one that is often overlooked. Many people analyze for a long time, find that something does not fit, and only then discover that things never happened as described. Time and energy have been wasted, and the fix has been delayed.
5.2. Faults always hide their real cause, if you ask too few questions.
Once a fault has occurred, you can always find some direct cause. Suppose the symptom is that a business transaction cannot be submitted, and analysis shows that several services in the system are down. Should you simply restart those services? Not necessarily. Ask further: why did these services go down? Do the services themselves have a problem? Did they go down together? In what order? Are they related to one another? And so on. This way you may find an underlying cause: one of the services is not robust enough and, under certain conditions, a memory fault brings it down. But is that the root cause? Ask a few more questions: what abnormal condition leads to the memory fault?
Further investigation may then show that the system received an incomplete result when querying the database, with some fields unexpectedly empty. Was the query itself incomplete? Was the validity check after the query inadequate? Or did the database hit a limit? Every further question is a challenge to yourself. In analyzing a fault there is no room for fear or slackness; only the spirit and courage to keep asking until you reach the bottom will uncover the truth. Otherwise, not only will the fault not be resolved effectively, but a new fault may be triggered.
5.3. Faults always require you to think independently and keep a clear head.
When you run into a problem, seeking help from experts is necessary. The experts may be experienced vendor staff, senior members of a professional technical forum, friends you know well, colleagues, your technical director, or a technical guru in your company. All of them may make judgments from their own experience, give you suggestions, or even tell you outright: "Stop looking, the cause is this...". Their suggestions may be identical, similar, different, or even contradictory. What should you do? Which is right? Whom should you listen to? Some people defer to the vendor: "The vendor's people say it is like this...". Some people defer to rank: "Manager ×× said the reason is this...".
In technical matters there is no high or low rank, and you should consult others with an open mind. But you are the one analyzing and solving the problem, so keep thinking independently. That does not mean being stubborn and asking no one; it means not following others blindly. You need to know not only what the answer is but why. When you consult an expert, ask why and have him explain the reasoning. Do not cling to saving face at this point. If you do not understand, keep asking: even in an area unfamiliar to you, a real expert can explain it in simple technical language. If you still cannot understand, note down the expert's explanation as given, then check references or ask other experts. In any case, you must understand the reasoning behind an expert's recommendation; applying it without knowing its consequences will often mislead your analysis. Experts can only ever give recommendations; only you can draw the conclusion. Draw widely on the strengths of others, but do not apply their advice mechanically; trust yourself, and do not worship blindly.
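The chain of questions in 5.2 ends at a query result whose fields are unexpectedly empty and a validity check that did not catch it. As a minimal sketch of what such a check after a query might look like, assuming rows arrive as dictionaries: the function name `check_rows`, the `FieldProblem` record, and the sample field names are hypothetical and do not come from any system described here.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping, Sequence


@dataclass
class FieldProblem:
    """One finding: which row, which field, and why it is unusable."""
    row_index: int
    field: str
    reason: str


def check_rows(rows: Sequence[Mapping[str, Any]],
               required_fields: Sequence[str]) -> List[FieldProblem]:
    """Return every missing or empty required field found in the rows.

    An empty list means the rows passed the check and can be handed on.
    """
    problems: List[FieldProblem] = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if field not in row:
                problems.append(FieldProblem(i, field, "missing"))
            elif row[field] is None or row[field] == "":
                problems.append(FieldProblem(i, field, "empty"))
    return problems


# Hypothetical query result: the second row has an empty account,
# the third row lacks the account field altogether.
rows = [
    {"account": "A1", "amount": 10},
    {"account": None, "amount": 5},
    {"amount": 3},
]
for problem in check_rows(rows, ["account", "amount"]):
    print(problem.row_index, problem.field, problem.reason)
```

Reporting every problem instead of stopping at the first one is a deliberate choice in this sketch: when tracing a fault, the full list shows whether one field or many were affected, which is itself a clue to the cause.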
6. Comprehensive knowledge structure
6.1. Faults can never be left entirely to others.
Asking others is good, but you have work of your own to do. Others mostly give you hints and ideas. Occasionally they hand you the answer, but that is only occasional; most problems need you to decompose, synthesize, associate, and verify. If the help from others is a set of points, you are the one who selects among them and connects them: different choices of points connect into different lines, and only one line is the real route. This is a higher-level ability and demands more of you. Analyzing a problem is far from as simple as making a phone call; you have to keep learning and extend the breadth and depth of your knowledge. Only when you really understand the system and the business can you settle on the final answer.
6.2. Faults always happen in the part you consider "transparent".
Current systems use multi-layer architectures with clearly separated levels. Because the lower platform is "transparent" to the upper layer, an upper-layer developer only needs to understand the technology of his own layer and need not understand the underlying technology. For ordinary developers, layering lowers the demands placed on them, and they may treat the lower layers as transparent. However, the real data still has to flow from the high layers down through the low layers and back up again. A failure at any level blocks the data flow, and the system inevitably fails; and the fault often occurs precisely in the place that was treated as "transparent". From the standpoint of fault analysis, therefore, the person in overall charge must have comprehensive knowledge. For him nothing is "transparent". He needs a simple, clear understanding of every level. He does not need to be familiar with every detail, but he does need a clear outline: which components does the data pass through, what logic does each apply, and what processing does each perform?
What kind of failure occurs if any one of these links goes wrong? Only with this outline can you locate the fault point from the symptoms and find the cause as quickly as possible.
6.3. In a fault, it is always ability that determines the time, the cost, and the customer relationship.
When a fault has happened, everyone wants it solved quickly. Whether it is comes down to ability, and the ability to analyze problems comes from day-to-day accumulation: the more you know and the richer your experience, the faster you solve the problem. Do not wait until trouble arrives to start cramming. Study and summarize regularly to extend your breadth and depth, and at the critical moment your ability will show naturally.
7. Summary
The construction, maintenance, upgrade, and rebuilding of an IT system form a continuous process, during which faults of different kinds will keep appearing. From the analysis above, we know that faults are not to be feared: as long as we are fully prepared in mindset, working attitude, working skills, and accumulated knowledge, any system problem can be solved. The emergence, analysis, and resolution of system faults is the process by which the system is continuously improved and optimized, and also the process by which engineers accumulate problem-solving experience. After a period of tempering, you will face any problem calmly.
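As an illustration of the outline described in 6.2, the following minimal sketch walks the layers that the data passes through and reports the first one that does not respond. It assumes each layer can be checked by a simple probe function; the helper `first_failing_layer`, the layer list, and the probes are all hypothetical, and checking from the lowest layer upward is one possible ordering, not a prescribed one.

```python
from typing import Callable, List, Optional, Tuple

# A probe is a zero-argument check that returns True when its layer responds.
Probe = Callable[[], bool]


def first_failing_layer(layers: List[Tuple[str, Probe]]) -> Optional[str]:
    """Check the layers in the given order.

    Returns the name of the first layer whose probe returns False or
    raises an exception, or None if every probe passes.
    """
    for name, probe in layers:
        try:
            healthy = probe()
        except Exception:
            # A probe that blows up is treated the same as a failed probe.
            healthy = False
        if not healthy:
            return name
    return None


# Hypothetical layers, listed from the lowest level upward.
# Real probes would ping a host, open a connection, and so on.
layers = [
    ("network", lambda: True),
    ("database", lambda: True),
    ("transaction middleware", lambda: False),  # simulated fault
    ("web middleware", lambda: True),
]
print(first_failing_layer(layers))  # prints "transaction middleware"
```

Starting at the bottom reflects the point of 6.2: a layer that the upper levels treat as "transparent" is checked first, so a fault there is not mistaken for a problem in the layers that depend on it.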