Basic fault-location methods for the RS6000 minicomputer

xiaoxiao, 2021-03-05

Poor me. The company's RS6000 won't start, nobody could be found to fix it, and so the job has landed on me, AIX and all. Google turned up this article, so I am putting it here. God bless...

Author: nicolaszhou. Source: http://blog.blogchina.com/article_115800.1295168.html

First, the definition of the fault.

What happened on the system? What can the system still do, and what can it no longer do? When did the failure occur? Was anyone performing an unusual operation at the time? Does the failure follow a pattern: is it periodic or random, and how often does it occur? Is one machine affected, or several? Has anything changed recently, such as newly installed hardware or software, or modified settings?

Second, the collection of fault information.

1) Collecting fault information is essential for judging and diagnosing the failure and for repairing the system.

2) System error log (errlog). The errdemon process starts automatically at system boot and records hardware, software, and other errors. The log file is /var/adm/ras/errlog; it can be backed up or copied to another machine. The errpt command reads it (ordinary user privileges are sufficient):

   # errpt | more            list the short error summary
   # errpt -d H              list all hardware errors
   # errpt -d S              list all software errors
   # errpt -aj error_id      list the detailed report for one error
   # errpt -aj 0502f666      (0502f666 is the error_id)

   In the summary, column T (type) is P permanent, T temporary, or U unknown; permanent errors deserve attention. Column C (class) is H hardware, S software, O user, or U unknown.

3) LED codes on the operator panel.
   . 8-digit code: the system fault light usually comes on at the same time. Some models also display the location code of the failed device.
   . 4-digit code: usually Exxx.
   . 3-digit code: usually shown as 0yyy; only the last three digits count.
   . For 8-digit and 4-digit codes, see the System Service Guide. For 3-digit codes, see Diagnostic Information for Multiple Bus Systems.

   Flashing 888 means the system has crashed, for either hardware or software reasons. Press the RESET key to display the codes that follow.
   888-102: generally a software failure; a dump is produced (888-102-207 is the exception).
   888-102-xxx-0c9: the system is writing the dump; please wait.
   888-102-xxx-0c0: the dump is complete; the machine can be turned off.
   888-103 or 888-105: hardware failure, usually followed by an SRN and a location code.

4) SMS (System Management Services) error log. To enter the SMS menu, press 1 when the keyboard icon appears on the main console (the LED shows E1F1). Select "Utilities", then "Error Log", and copy down the 8-digit fault code. (The system boot sequence can also be changed in SMS.)

5) Mail.

   # mail

   The system mails error reports to the root user. If a fault has not been checked and repaired, the system usually keeps reminding root.

6) Run the diagnostic program (diag) to check and diagnose system hardware. Use diag whenever a hardware failure is suspected:

   # diag

   Select Advanced Diagnostics; of the two choices, Problem Determination and System Verification, select Problem Determination (PD). diag reports the SRN, the name of the suspect device with a percentage, the location code, and so on. On PCI models, run diag within 7 days of the system error so that it can analyze the sense data in the error log.

7) Other commands for collecting system information.

   lsdev -C    system device information
               # lsdev -Cc disk
   lspv        physical volume information
               # lspv
   lsvg        volume group information
               # lsvg datavg
               # lsvg -l rootvg
   lslpp       installed fileset information
               # lslpp -l | grep 23100020
   lsattr      device attribute settings
               # lsattr -El ent2
   lscfg       VPD (Vital Product Data)
               # lscfg -vl ssa1

   Different hardware devices have different VPD, and the format and content vary. The part number and microcode level are usually the most useful fields. Note: the FRU (Field Replaceable Unit) number is the actual spare part number.

Third, the hardware fault location method.

Fault location on IBM minicomputers relies on three kinds of information shown on the display panel of the I/O cabinet: checkpoints, error codes, and SRNs.

Checkpoints

Checkpoints are the series of codes shown on the I/O cabinet display while the system runs its initial program load (IPL) after power is applied. The IPL begins when AC power is connected to the system and has four phases:

Phase 1: service processor initialization. Phase 1 begins when AC power is connected and lasts until OK appears on the I/O cabinet display panel. 8xxx or 9xxx checkpoint codes are displayed in this phase.

Phase 2: hardware initialization. Phase 2 begins when the white power button on the I/O cabinet is pressed, and is driven by the service processor. 9xxx checkpoints appear in this phase. 91FF is the last code and marks the start of phase 3.

Phase 3: system firmware initialization. A system processor takes over control and continues to initialize system resources. Exxx codes are displayed in this phase. E105 is the last code and marks the start of phase 4, the AIX boot. Location codes (each identifying a part of the system) are also displayed during this phase.

Phase 4: AIX boot. While AIX boots, the display panel shows 0xxx codes, with the location code on the second row. Phase 4 ends when the AIX login prompt appears; after that the display panel shows nothing.

Error codes

When the system detects an error while running, an 8-digit code is shown on the display panel, and the location code of the suspect hardware is shown on the second line.

SRNs (Service Request Numbers)

When the system detects an error while running, an SRN is shown on the display panel in the form xxx-xxx and is also recorded in the AIX error log.

Every one of these codes has a corresponding resolution procedure. Because there are so many codes, write the code down when a problem occurs and call the IBM service hotline.

System startup sequence: the system cannot boot

The system stops in stage 1: the cause may be a failed power supply, system board, CPU, memory, or other hardware. Record the fault code and report it to the IBM engineer.

The system stops in stage 2: the cause may be a corrupted bootlist or an I/O subsystem failure. Try to enter the SMS menu, then check and correct the boot sequence. If no hard disk is listed or selectable when choosing the boot list, the hard disk may have failed. If no SCSI device appears at all, the connection may be at fault.

The system stops in stage 3: the cause may be corrupted data on the hard disk, an error in a system configuration file, or an I/O subsystem failure.
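The code ranges described above can be turned into a small lookup, which is handy when someone reads a panel code over the phone. This is only a sketch of the ranges given in this article: the function name classify_panel_code and the returned strings are made up for illustration, and it does not replace the service guide for the specific model.

```python
import re

def classify_panel_code(code: str) -> str:
    """Classify a display-panel code using only the ranges from the article.

    Hypothetical helper; the labels are illustrative, not IBM terminology.
    """
    c = code.strip().upper()
    if c == "OK":
        return "phase 1 complete"
    if c == "888" or c.startswith("888-"):
        return "flashing 888 (system crash)"
    if re.fullmatch(r"[0-9A-Z]{3}-[0-9A-Z]{3}", c):
        return "SRN"
    if re.fullmatch(r"[0-9A-Z]{8}", c):
        return "8-digit error code"
    if re.fullmatch(r"[0-9A-F]{3}", c):
        c = "0" + c  # 3-digit codes are usually shown as 0yyy
    if re.fullmatch(r"[0-9A-F]{4}", c):
        if c[0] == "8":
            return "phase 1 checkpoint"
        if c[0] == "9":
            # 9xxx is listed for both phase 1 and phase 2
            return "phase 1 or phase 2 checkpoint"
        if c[0] == "E":
            return "phase 3 checkpoint"
        if c[0] == "0":
            return "phase 4 code (AIX boot)"
    return "unrecognized"

if __name__ == "__main__":
    for sample in ("91FF", "e105", "551", "888-102"):
        print(sample, "->", classify_panel_code(sample))
```

Because 9xxx checkpoints appear in both phase 1 and phase 2, the sketch reports that ambiguity instead of guessing; whether the white power button has been pressed yet tells the two apart.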

The system stops at 551, 555, or 557 in the third stage of startup (stage 3). Possible causes: a corrupted file system, a corrupted file system log (jfslog), or a bad hard disk in rootvg.

Repair method:

Boot from the system installation CD or a system backup tape (its version must match the operating system on the hard disk). In the boot menu select option 3, "Start Maintenance Mode for System Recovery" > "Access a Root Volume Group" > "Access this Volume Group and start a shell before mounting file systems".

Format the file system log (jfslog):

   # /usr/sbin/logform /dev/hd8

Check and repair the file systems:

   # fsck -y /dev/hd1       (/home file system)
   # fsck -y /dev/hd2       (/usr file system)
   # fsck -y /dev/hd3       (/tmp file system)
   # fsck -y /dev/hd4       (/ file system)
   # fsck -y /dev/hd9var    (/var file system)
   ...

Leave the shell with the exit command; the file systems are then mounted automatically.

Rebuild the boot image:

   # lslv -m hd5                  find the disk that holds the boot image, for example hdisk0
   # bosboot -ad /dev/hdisk0
   # bootlist -m normal hdisk0    rebuild the boot sequence

Reboot the system:

   # shutdown -Fr

If the steps above do not work, restore the system from the system backup tape. If the backup tape cannot restore it either, check the hardware with the Diagnostics CD-ROM.

The CDE graphical interface hangs

Do not change network parameters (such as the host name or IP address) while CDE is running. To change network adapter settings, exit the CDE graphical environment, choose command-line login, and make the change in the character interface.

If CDE has already hung, log in remotely with telnet, find all dt-related processes, and kill them with the kill command:

   # ps -ef | grep dt
   ...
   # kill PID

Check the current host name:

   # hostname
   tscf50

Check whether the host name maps to a valid IP address:

   # netstat -i | grep tscf50
   tr0* 1500 9.185.4 tscf50 506049 0 28247 0 0

Change the host name or the IP address so that the host name corresponds to a currently valid IP address:

   # smitty tcpip

Restart the CDE interface:

   # /etc/rc.dt

In an HACMP environment, alias the host name to 127.0.0.1:

   # cat /etc/hosts
   127.0.0.1 loopback localhost tscf50 # loopback (lo0) name/address

System dump

When the system crashes, AIX takes a dump (a snapshot of system memory). The machine then flashes a code of the form 888 102 xxx 0cx:

   0c9   The system dump is in progress. The 0c9 state may last more than 2 minutes; do not power off or press RESET, and wait for the dump to finish.
   0c0   The dump completed successfully; the machine can now be powered off.
   0c2   A dump was started manually.
   0c4   The dump device is too small; only part of the information was saved.
   0c5   The dump failed for an unknown reason.
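As a quick reference, the flashing 888 sequence and the dump status codes above can be decoded mechanically. The sketch below encodes only what this article states; decode_888 and DUMP_STATUS are hypothetical names, and the meanings should be confirmed against the service documentation for the actual machine.

```python
import re

# Dump status codes exactly as listed in the article.
DUMP_STATUS = {
    "0c0": "dump completed successfully; the machine can be powered off",
    "0c2": "dump started manually",
    "0c4": "dump device too small; only part of the information was saved",
    "0c5": "dump failed for an unknown reason",
    "0c9": "dump in progress; wait, do not power off or press RESET",
}

def decode_888(sequence: str) -> dict:
    """Decode a sequence such as '888-102-700-0c9' (hypothetical helper)."""
    tokens = [t.lower() for t in re.split(r"[\s-]+", sequence.strip()) if t]
    if not tokens or tokens[0] != "888":
        raise ValueError("not a flashing-888 sequence")
    result = {"kind": "unknown", "dump": None}
    if len(tokens) > 1:
        if tokens[1] == "102":
            result["kind"] = "software"   # a dump is normally produced
        elif tokens[1] in ("103", "105"):
            result["kind"] = "hardware"   # SRN and location code follow
    if result["kind"] == "software" and len(tokens) > 2:
        # The last field carries the dump status, if it is one we know.
        result["dump"] = DUMP_STATUS.get(tokens[-1])
    return result

if __name__ == "__main__":
    print(decode_888("888-102-700-0c9"))
    print(decode_888("888 103"))
```

The middle field (700 in the example) is an arbitrary placeholder for the xxx in the article; the sketch ignores it and looks only at the second and last fields.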

Please credit the original address when reposting: https://www.9cbs.com/read-37292.html
