Like an ordinary PC, a server needs memory, but electromagnetic interference and sudden voltage fluctuations can still flip bits in that memory and crash the system. Although the probability of a memory error is low today, it is still too high for the stability a server must provide, so server memory generally incorporates error-correction techniques to reduce the chance of downtime. The main techniques are parity checking, ECC, and IBM's Chipkill-Correct ECC. Let's briefly introduce each of them.
Parity

The principle of parity is very simple. Data in memory is stored in binary, so every bit is either 0 or 1, and we can add one extra bit that records whether the number of 1s in the data is odd or even. Concretely, to store an N-bit word (x0, x1, ..., xN-1) we first XOR all of its bits to get the check bit c = x0 xor x1 xor ... xor xN-1 and store it in an extra bit; when the data is read back, the check bit is XORed with the data bits again (c xor x0 xor x1 xor ... xor xN-1), and if the result is 0 the data is intact, otherwise an error has occurred and must be handled. Since a byte is 8 bits, one parity bit is usually added for every 8 bits of data. The weakness of parity is also obvious: if an even number of bits flip, the errors cancel out and go undetected, and parity can only detect errors, never locate or correct them (the first sketch at the end of this article walks through this check). That is why ECC memory came along.

ECC technology

ECC (Error Correcting Code, also read as Error Checking and Correcting) memory, like parity, requires extra space to store a check code, but the number of bits it occupies does not grow linearly with the data width. Specifically, the ECC code for 8 bits of data takes 5 bits, 16 bits of data need only one more bit (6 bits), 32 bits of data need a 7-bit ECC code, and so on. The length of the ECC code therefore grows roughly logarithmically with the data width, and once the data width reaches 64 bits or more the advantage of ECC becomes obvious (the second sketch below illustrates this). More importantly, if a single bit is in error, ECC can not only detect it but correct it, and it can also detect (though not correct) errors of two to four bits, although the chance of such errors is very low. Verifying an ECC code is more complicated than checking parity and requires dedicated chip support, so an ordinary PC motherboard does not necessarily support ECC memory; and because the system must wait for the verification result, ECC checking slows the system down by roughly 2% to 3%. That small cost, traded for system stability, is well worth it.

Chipkill-Correct ECC technology

As servers take on more work and memory capacities grow, the probability of memory errors rises as well. According to statistics, a 1GB memory system has roughly a 7% chance of an error within three years, which falls far short of the stability requirements of an enterprise-level server. So in 1997 IBM introduced Chipkill-Correct ECC technology, which can cut the chance of an error within three years on a 1GB memory system to 0.06%. Its principle is similar to RAID for hard disks, so it is also called RAID-M (Redundant Array of Inexpensive DRAMs for Memory). The concrete approach is to combine several bytes of data into one "ECC word" and then store its bits separately across several different memory chips, so that the failure of any single chip touches only one bit of each word; fault tolerance is thus greatly enhanced, and the storage layout is a bit like RAID 5 (the last sketch below illustrates the idea).
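To make the parity calculation concrete, here is a minimal Python sketch of even parity over one byte. It is only an illustration of the idea described above (function names such as even_parity_bit are made up for this example), not how a memory controller actually implements it in hardware.

    def even_parity_bit(data, width=8):
        # XOR all data bits together; storing this bit alongside the data
        # makes the total number of 1s (data + parity) even.
        parity = 0
        for i in range(width):
            parity ^= (data >> i) & 1
        return parity

    def parity_ok(data, stored_parity, width=8):
        # On read-back, recompute the parity and compare with the stored bit.
        return even_parity_bit(data, width) == stored_parity

    byte = 0b10110010
    p = even_parity_bit(byte)

    # A single flipped bit is detected ...
    print(parity_ok(byte ^ 0b00000100, p))   # False: error caught

    # ... but flipping an even number of bits slips through, and parity
    # can never tell us which bit to repair.
    print(parity_ok(byte ^ 0b00001100, p))   # True: error missed

The second print shows exactly the weakness mentioned above: two simultaneous flips cancel out in the XOR, so parity alone cannot guarantee integrity.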
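The overhead figures quoted above (5 check bits for 8 data bits, 6 for 16, 7 for 32, 8 for 64) match a classic Hamming single-error-correcting code plus one extra overall parity bit, a scheme commonly called SECDED. The following Python sketch, under that assumption, encodes one byte into a 13-bit word and shows a single flipped bit being corrected; real ECC memory works on wider words and in hardware, and the function names here are invented for the example.

    def secded_encode(data_bits):
        # Encode 8 data bits (list of 0/1) into a 13-bit word:
        # positions 1..12 form a Hamming code (check bits at 1, 2, 4, 8),
        # position 0 is an overall parity bit covering everything.
        assert len(data_bits) == 8
        code = [0] * 13
        data_positions = [p for p in range(1, 13) if p & (p - 1) != 0]
        for pos, bit in zip(data_positions, data_bits):
            code[pos] = bit
        for p in (1, 2, 4, 8):
            # Each check bit makes its parity group (positions whose index
            # contains that power of two) even.
            parity = 0
            for pos in range(1, 13):
                if pos != p and (pos & p):
                    parity ^= code[pos]
            code[p] = parity
        code[0] = sum(code[1:]) % 2
        return code

    def secded_decode(code):
        # Returns (data_bits, status); status is 'ok', 'corrected' or 'double_error'.
        syndrome = 0
        for pos in range(1, 13):
            if code[pos]:
                syndrome ^= pos          # XOR of the positions of all 1-bits
        overall = sum(code) % 2          # should be even for a clean word
        code = code[:]
        if syndrome == 0 and overall == 0:
            status = 'ok'
        elif overall == 1:
            # Odd overall parity: exactly one bit flipped, and the syndrome
            # points at it (syndrome 0 means the parity bit itself flipped).
            code[syndrome] ^= 1
            status = 'corrected'
        else:
            # Even overall parity but nonzero syndrome: two bits flipped,
            # detectable but not correctable.
            status = 'double_error'
        data_positions = [p for p in range(1, 13) if p & (p - 1) != 0]
        return [code[p] for p in data_positions], status

    data = [1, 0, 1, 1, 0, 0, 1, 0]
    word = secded_encode(data)
    word[6] ^= 1                          # simulate a single-bit memory error
    recovered, status = secded_decode(word)
    print(status, recovered == data)      # corrected True

Note the overhead: 4 Hamming check bits plus 1 overall parity bit for 8 data bits gives exactly the 5 extra bits quoted above, and doubling the data width adds only one more check bit, which is the logarithmic growth the article describes.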
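Finally, a conceptual sketch of the Chipkill-style striping idea, in the spirit of the RAID comparison above. This is a simplified illustration under my own assumptions about the layout (one chip per bit position of a 13-bit word like the SECDED sketch produces), not IBM's actual implementation: because every chip holds at most one bit of any given ECC word, even the failure of an entire chip costs each word only a single bit, which the ECC code can still correct. The short usage at the end reuses the secded_encode and secded_decode functions from the previous sketch.

    CODEWORD_BITS = 13         # e.g. the 13-bit SECDED words from the sketch above
    NUM_CHIPS = CODEWORD_BITS  # hypothetical layout: one chip per bit position

    def stripe(codewords):
        # chips[c][w] holds bit c of codeword w, so no chip ever stores
        # two bits of the same ECC word.
        return [[word[c] for word in codewords] for c in range(NUM_CHIPS)]

    def fail_chip(chips, dead):
        # Simulate a whole-chip failure: every bit stored on chip `dead`
        # reads back inverted.
        chips = [row[:] for row in chips]
        chips[dead] = [b ^ 1 for b in chips[dead]]
        return chips

    def gather(chips, w):
        # Reassemble codeword w from all chips. After fail_chip, each word
        # differs from the original in exactly one bit, which is still within
        # what a single-error-correcting ECC code can repair.
        return [chips[c][w] for c in range(NUM_CHIPS)]

    words = [secded_encode([1, 0, 1, 1, 0, 0, 1, 0]) for _ in range(4)]
    chips = fail_chip(stripe(words), dead=5)
    print(all(secded_decode(gather(chips, w))[1] == 'corrected'
              for w in range(len(words))))   # True: every word survives the dead chip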