At present, the XX traffic police has gradually automated and computerized part of its work. With the implementation of the "XX Traffic Police Information Management System", the work of the XX traffic police will be further informatized: within a given scope we will leave simple manual labor behind, and most of the routine, cumbersome work will be handled by the system. In the future, our work will depend to a large extent on the information the computer provides, that is, on data. The completeness of that data will directly affect our daily work, and damage to it can bring irrecoverable losses: lost accident files, chaotic financial data, vanished property records. In general, the factors affecting data integrity fall into the following five types:
- Hardware failure
- Network failure
- Logic problems
- Accidental catastrophic events
- Human factors
1. Hardware failure
No machine, however high its performance, can run for long without any failure, and computers are no exception. Common hardware failures that affect data integrity are:
- Disk failure
- I/O controller failure
- Power failure
- Memory failure
- Failure of backup media and devices
- Chip and motherboard failure
2. Network failure
On a LAN, data travels between machines over a transmission medium, and the cables connecting the equipment are constantly exposed to interference and physical damage. Such events can make communication between computers difficult or impossible and ultimately lead to damaged or lost data. Network faults usually fall into three areas:
- The network interface card and its driver are in practice inseparable. In most cases a failure of the card or driver does not damage data; it merely keeps users from accessing it. But when the network interface card in a network server fails, the server usually stops running, and it is hard to guarantee that the files open at that moment escape corruption.
- The load imposed by data crossing the network is often heavy. If the buffers in network devices such as routers and bridges are too small, packets get dropped; conversely, if those buffers are too large, the delay introduced by the queued data may cause sessions to time out. Incorrect network wiring can also cause network faults that affect data integrity.
- Radiation. Since data transmission itself involves electron movement, radiation can damage data. To control radiation, use shielded twisted pair or a fiber-optic system for the network wiring.
3. Logic problems
Software is another important threat to data integrity. Software problems affect data integrity in several ways:
- Software errors
- File damage
- Data exchange errors
- Capacity errors
- Inappropriate requirements
- Operating system errors
Specifically:
Software errors cover a wide range of defects, usually related to the logic of the application.
File damage means a file has been destroyed by some physical or network problem; files can also be damaged by defects in system control or application logic. Especially troublesome is the case where a damaged file is read by other processes: the data those processes then produce is also wrong, a problem that is very hard to deal with.
Data exchange errors arise when a file produced during format conversion does not have the correct format.
Capacity errors occur when running software exhausts a system resource, such as memory.
Every operating system has its own bugs; this is well known and surprises no one. In addition, the system's application programming interfaces (APIs) are used by third-party developers to serve end users: they write their software products against the published API functions, and if those APIs do not behave as documented, data corruption follows.
4. Catastrophic events
Common catastrophic events are:
- Fire
- Storms: tornadoes, typhoons, blizzards, etc.
- Industrial accidents
- Deliberate destruction / terrorist activity
5. Human factors
Human activity affects data integrity in many ways. Common threats that people pose to data integrity include:
- Accidents
- Lack of experience
- Pressure / panic
- Poor communication
- Deliberate sabotage and theft
Data integrity can be improved in two ways: first, by using preventive techniques to ward off events that endanger it; second, once data integrity has been damaged, by using effective recovery means to restore the damaged data. For the specific circumstances of the XX traffic police, we list some technologies for protecting data integrity and recovering lost data:
- Backup
- Mirroring technology
- Archiving
- Dump
- Hierarchical storage management
- Parity checking
- Disaster recovery planning
- Pre-failure analysis
- Power conditioning systems
- System security programs
Backup
Backup is the most common way to restore a failed system and to guard against data loss. A backup operation copies correct data to tape or other media so that, when the system's data integrity is damaged, the most recent backup can be restored to the machine. No network or system administrator has any excuse for failing to back up.
Mirroring technology
Mirroring technology applies the principle of the physical mirror to computing: data is copied from one computer (or server) to another computer (or server).
Mirroring is generally done in one of two ways in a computer system:
- Logically, by copying the file system of a computer or network system to another computer or server on the network.
- Strictly at the physical layer, for example mirroring a disk drive, an I/O subsystem, or an entire machine.
Archive
In a computer or network system, archiving has two senses: first, copying files from the system's online storage to tape or optical media for long-term preservation; second, copying files off the online storage and deleting the originals at the same time, freeing storage space on the network. Archiving lets you move files from online storage to permanent media and so strengthens the protection of the file system.
Dump
A dump is much the same as a backup, with one difference: the tapes made for a dump are stored somewhere else, off-site. That is its biggest difference from a backup.
Parity
Parity checking provides a monitoring mechanism that keeps unpredictable memory errors from silently corrupting a server's data and destroying data integrity.
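To illustrate the idea at its simplest, here is a minimal Python sketch of an even-parity bit over one byte (purely illustrative; real memory parity is done in hardware, not in application code):

```python
def parity_bit(byte: int) -> int:
    """Even parity: return 1 if the byte has an odd number of 1-bits,
    so that data bits plus parity bit always hold an even count."""
    return bin(byte & 0xFF).count("1") % 2

def check(byte: int, stored_parity: int) -> bool:
    """Any single flipped bit changes the parity and is detected."""
    return parity_bit(byte) == stored_parity

data = 0b10110010                 # four 1-bits -> parity 0
p = parity_bit(data)
assert check(data, p)             # intact byte passes
assert not check(data ^ 0b1, p)   # a one-bit error is caught
```

Note that parity detects a single-bit error but cannot correct it; detection is what protects data integrity here.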
Disaster recovery plan
The damage disasters inflict on computer systems is enormous; a disaster recovery plan sets out how to rebuild the system from the ruins.
Pre-failure analysis
Pre-failure analysis rests on the fact that the damage or aging of a component is not instantaneous but a process: during it, errors grow more frequent and the equipment begins to behave erratically. By analyzing and judging these symptoms, one can prepare to eliminate the fault in advance.
System security program
2. Network backup system
The goal of a network backup system is to recover, as completely as possible, the data and system information needed to restore a computer or a computer network system.
Network backup really means backing up the files of every computer on the network; in effect it is a backup system for the entire network. It mainly covers the following:
- File backup and recovery
- Database backup and recovery
- System disaster recovery
- Backup task management
Because the complexity of a LAN grows with every additional operating platform and network application, making a complete backup of the system becomes correspondingly harder. It is not a problem a simple copy can solve, nor one that can be reduced to a single simple requirement.
2.1. Types of backup and recovery operations
For most network administrators, backup and recovery is a heavy duty, attended to carefully every day with no room for lapses. The most basic questions of backup are: to guarantee that every system can be restored, how much must be backed up, and when?
2.1.1 Backup
Full backup
A full backup writes all files to the backup media. Full backups are popular because they are the most direct way to overcome the system's insecurity and the simplest to operate: the network administrator knows with certainty that the network system can be restored to its state as of the day of the backup. Even so, the volume of data is often so large that a full backup cannot be made every day rather than only on weekends. In fact, for many reasons, equipment constraints among them, few people like making such massive backups daily.
Incremental backup
An incremental backup copies only the files changed since the last backup, that is, only the updated files. Incremental backup is the most efficient way to back up: if only an incremental backup need be made each day, then besides the large saving in time, the burden on the system's performance and capacity is also eased.
Everything has two sides, and for all its advantages incremental backup has drawbacks. It usually relies on file system attributes to identify changed files, which is sometimes unreliable. That weakness can be remedied by maintaining a file-system database or other records to identify newly updated files, which may be precise and reliable but is also liable to cause other system problems, some of them unforeseeable. Another problem with incremental backup is that restoring data from the whole series of tapes can take a very long time.
Experienced network administrators therefore usually combine incremental backups with full backups. The combination still gives a quick daily backup while reducing the number of tapes needed at recovery time.
Differential backup
A differential backup copies the files updated since the last full backup. It resembles an incremental backup, except that every day it copies all the files updated since the last full backup. Consequently each daily backup takes a little longer than the day before, until the next full backup resets the cycle.
Differential backups can select files by changes in file attributes or by tracking the updated files.
The main advantage of differential backup is that a complete restore of a system needs only two sets of tapes: the last full backup tape and the last differential backup tape.
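The three strategies differ only in which files they select. As a minimal sketch (assuming, as cautioned above, that file modification times are a trustworthy change indicator, and using a hypothetical /data directory), in Python:

```python
import time
from pathlib import Path

def files_changed_since(root: str, since: float):
    """Yield files whose modification time is newer than `since`."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_mtime > since:
            yield path

root = "/data"                                   # directory to protect (hypothetical)
last_full_backup = time.time() - 7 * 86400       # illustrative: a week ago
last_any_backup  = time.time() - 1 * 86400       # illustrative: yesterday

full         = [p for p in Path(root).rglob("*") if p.is_file()]  # everything
incremental  = list(files_changed_since(root, last_any_backup))   # since the last backup of any kind
differential = list(files_changed_since(root, last_full_backup))  # since the last FULL backup
```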
Backup on demand
An on-demand backup is a backup operation performed outside the normal backup schedule, as circumstances require. For example, you may want to back up just a few files or directories, or back up the necessary information so that an upgrade can proceed more safely.
On-demand backups can also make up for gaps in redundancy management or long-term dumps.
Exclusion
Exclusion is not a way of backing up; it is simply a way of guaranteeing that files you do not want backed up are never copied. The files in question may be large but unimportant, or they may be excluded for technical reasons, since avoiding the trouble that backing them up always causes is the better course.
2.1.2 Recovery operations
Recovery operations usually fall into two categories:
- Full backup recovery
- Individual file recovery
In addition, there is an operation called redirected recovery.
Full recovery
Full recovery is usually used after a disaster or when the system is upgraded or consolidated.
The method is simple: dump all of the given system's information stored on the media back onto the machine. Depending on the backup method used, several sets of tapes may be needed.
Experience suggests starting with the most recent backup tape, because it holds the files currently in use, which end users are always eager to get back once the system is repaired. Then use the last full backup tape, or whichever tape holds the most files. After that come all the remaining related tapes, in any order. When the recovery finishes, check the latest error log promptly.
Individual file recovery
Requests to recover individual files are far more common than full restores, for no reason deeper than the demands of end users.
Typically the user needs the latest version of a file on the media, because the online copy has just been corrupted or deleted. For most backup products this is a fairly simple operation: browse the backup database or directory, find the file, and run a recovery. Many products also allow files to be selected for recovery from the media log.
Sometimes an older version of a file is needed. Most of today's software products provide this capability, some better than others. Products that select files from a file-system list are usually faster than those that select older versions through the log: the file-system list only needs to be searched, whereas the log method must walk through log records one by one until the correct version of the file turns up. Building an index on the log reduces this problem.
Redirected recovery
Redirected recovery restores backed-up files to another location or a different system, rather than to the place the data occupied when the backup was made. A redirected recovery can be a full recovery or an individual file recovery.
In general, recovery operations run into problems more readily than backup operations do. A backup merely copies information off a disk, while a recovery must create files on the target system, and file creation can go wrong in many ways: capacity limits, permission problems, file overwriting, and more.
A backup operation does not need to know much about the system; it just copies the specified information. A recovery operation must know which files need to be restored and which do not. Suppose a large application has been deleted and a newly installed application now occupies its place; suppose further that one day the system fails and must be recovered from tape. The recovery must recognize that the old application was removed, skip restoring it, and restore the new application instead; otherwise the server's disk fills up and the system crashes all over again. Rather than trusting your own cleverness, find out in advance how the backup software handles such cases.
2.2. Composition of network backup systems
On the surface backup looks very simple, but backup and recovery software that really does the job contains a great deal of complexity. For a thorough understanding of network backup, the components of network backup and the composition of a network backup system are described below.
2.2.1 Components of network backup
Network backup has the following four basic components:
- Target: any system that is backed up or recovered
- Tool: the system that performs the backup task, for example copying data from the target to tape
- Device: the storage hardware, such as a tape drive, that the backup is written to
- Channel: the cables and connections that link the device, the tool, and the networked computers; in LAN backup this is typically a SCSI bus
Basic backup systems
There are two basic backup systems:
- Standalone server backup
- Workstation backup
Standalone server backup is the simplest backup system, built by joining the four components above: the system consists of a server with a SCSI tape drive attached directly to it.
Workstation backup evolved from standalone server backup: the tool, the SCSI bus and the device are moved to a dedicated workstation on the network.
Server-to-server backup
A server-to-server backup system resembles both standalone server backup and workstation backup, and it is the most commonly used form of LAN backup.
Dedicated network backup server
Concerned that the burden of backup could cause faults or other problems on production servers, some departments and institutions place the tool, the SCSI bus and the device on a dedicated server system. The method resembles workstation backup, except that, for the sake of the backup system's performance and compatibility, the workstation is replaced by a server.
2.2.2 Composition of the backup system
Backup is a system with the following components:
- Physical host system
- Logical host system
- I/O bus
- Peripheral devices
- Device driver software
- Backup storage media
- Backup plan (scheduling)
- Operator (execution)
- Physical target system
- Logical target system
- Network connection
- System log
- System monitoring
- System management
These components must work together to form a reliable system. When modifying the backup system, make sure the load on the individual components stays balanced.
Each of these components is discussed in detail below.
Physical host system
The physical host system is the machine on which the main backup logic runs. It can be a high-performance PC, a UNIX workstation, or any hardware used for backup. Since the physical host is hardware, its CPU and I/O bus vary from machine to machine, so backup performance is limited by the machine itself.
Logical host system
The logical host system is in effect the operating system serving the backup system. The OS provides I/O functions according to its own architecture, so backup performance has a great deal to do with the operating system.
I/O bus
The I/O bus comprises the machine's internal bus and the external bus, such as SCSI, described above. The internal bus transfers data; the external bus connects the storage devices.
At present, most PC systems, including those with high-speed bus structures such as EISA and PCI, transfer data at less than 5 MBps. When traffic reaches this limit, the system bus has become the bottleneck, and the storage hardware is no longer the limiting factor.
Some UNIX systems have faster bus structures, around 15 MBps.
The most common external bus for storage devices is SCSI. It is worth noting that most SCSI bus speeds exceed the system bus speed.
Another bus is PCI, a structure that can be scaled to accommodate high-speed data transfer.
SCSI technology comparison:
    Bus type                          Speed (MBps)    Devices connectable
    Conventional SCSI, 8-bit          5               7
    Fast SCSI (enhanced protocol)     10              7
    Wide SCSI, 16-bit                 10              15
    Fast Wide SCSI                    20              15
    Ultra SCSI, 8-bit                 20              15
    Ultra SCSI, 16-bit                40              15
Multiple SCSI devices can be attached to a single SCSI PC adapter using a technique called a "daisy chain".
Peripheral devices
Peripheral devices are the devices that read and write the data: tape drives, disk drives, optical drives, RAID systems, and so on. Most of these devices are slower than the system bus and cannot make full use of the SCSI bus's transfer speed.
Device driver software
Device driver software is the low-level code that interfaces with the device and controls its operation. The Advanced SCSI Programming Interface (ASPI) is the de facto standard in the PC network market, so it is fair to say that all backup systems support ASPI.
Different device drivers can have a great impact on the performance and reliability of a SCSI system. In general, replacing a SCSI driver is not a good idea unless there is sufficient reason.
Backup storage media
The storage media in a backup system are mainly tapes and optical disks; they are inseparable from the devices that read and write them.
Backup plan
The backup plan specifies what each day's backup must do: how to back up which data. Some backup systems provide considerable flexibility and automation for planning backup operations.
Operator
The operator, also called the backup tool, is the program that carries out the backup operations, that is, the program responsible for most of the work in a backup. It directly affects the efficiency of the operation and even affects the recovery operation.
Physical target system
The physical target system is the machine whose data is being copied. Like the hardware platform of the backup host system, the hardware platform of the target machine affects backup performance.
Logical target system
The logical target system, also known as the agent, is of course the target's operating system and application software. For backup purposes, the logical target responds to the requests of the executor: the agent's main task is to deliver files and other system data to the backup tool by some means. A logical target system must know the details of the target file system and of the system data that lives outside that file system.
Poor-quality target software seriously hurts the overall performance of a backup operation and can even crash the backup tool. A very slow agent can keep the backup work from finishing on time.
Internet connection
Network connections can be routers, bridges, switches, hubs, cables, or anything else to connect to anything between computers on the network. When the data is transmitted in the network, there is a common phenomenon if the network device overload is running and starting a packet, including file corruption, loss of goals, and even malfunction of the backup system. Because of this, there are some real understanding of the backup system load on the network before investing in the network connection device.
Network protocol
Network protocols include IPX/SPX, TCP/IP, and so on. Which services the network provides, and how reliably, is another of the headaches of LAN backup: protocol problems sometimes reduce backup performance and can even close or fail communication sessions, leaving the backup system in a state that is hard to predict.
System log
The system log can be understood as a database file. It records which files were backed up, when they were backed up, what their file-system attributes were, and whatever other information the developers of the backup tool considered important.
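The kind of information such a log holds can be pictured as a small catalog database. A minimal sketch using Python's built-in sqlite3 (the table layout is our own invention for illustration, not any product's actual format):

```python
import sqlite3, time

con = sqlite3.connect("backup_log.db")
con.execute("""CREATE TABLE IF NOT EXISTS backed_up_files (
    path TEXT, size INTEGER, mtime REAL,
    backup_time REAL, tape_label TEXT)""")

def record(path, size, mtime, tape_label):
    """Log one file copied to tape, so a later restore can find it."""
    con.execute("INSERT INTO backed_up_files VALUES (?,?,?,?,?)",
                (path, size, mtime, time.time(), tape_label))
    con.commit()

def versions_of(path):
    """List every backed-up version of a file, newest first."""
    return con.execute(
        "SELECT tape_label, backup_time FROM backed_up_files "
        "WHERE path = ? ORDER BY backup_time DESC", (path,)).fetchall()
```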
System monitoring
System monitoring is the administrator's interface. In a client/server network system it runs as a GUI on a client platform, while the backup storage device is attached to the server. While a backup is running, the monitor must send data across the network, adding load and lowering the backup system's performance. So if you do not need to watch the backup work, it is best to close the backup system's monitoring interface.
System Management
As network systems grow, the status of the backup system matters more and more on the network, so this function has become a necessity: it lets administrators observe the state of backup operations and provides the details of each backup. In addition, warnings and other problems can be reported through the Simple Network Management Protocol (SNMP).
2.3. Backup and recovery equipment and media
The devices and media used for backup and recovery in a backup system are:
- Tape media
- Optical media
Tape media
There are several reasons why tape is used as the principal backup medium:
- Tape has good magnetization characteristics, making data easy to read and write
- Data on a tape is not disturbed by adjacent data on the same tape
- The layers of a wound tape do not separate or peel from one another
- Tape has good tensile strength and does not break easily
- Tape is soft and flexible, so it can be wound through the tape drive and bent easily
For these reasons, tape was chosen as a medium dedicated to data recording.
Recording data on tape requires well-developed error correction techniques to guarantee that the data can be written and read correctly; typically about 30% of the tape surface is used to hold error correction information. When data is written to tape, the error correction data is written with it, guarding against tape degradation before the tape is next used. If the original data on a tape cannot be read correctly, the error correction information is used to compute the value of the lost bytes; if the tape drive cannot rebuild the data, it reports an error to the SCSI controller and the system warns of a media error. During writing, a trailing head tests each block just written to make sure it can be read correctly. If the test fails, the tape automatically advances to a fresh position and the write is retried. After several rewrites the drive gives up and reports a fatal media error to the SCSI controller; the backup operation then fails until a new tape is loaded into the drive.
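The read-after-write cycle just described can be pictured schematically as follows (an illustrative Python sketch; real drives do this in firmware, and the `tape` object here, with its `write`, `read_back` and `skip_to_new_position` methods, is hypothetical):

```python
class FatalMediumError(Exception):
    """Reported to the SCSI controller after repeated write failures."""

def write_block(tape, block, max_retries=3):
    """Write a block, verify it through the trailing read head,
    and retry at a fresh tape position if verification fails."""
    for _ in range(max_retries):
        tape.write(block)
        if tape.read_back() == block:     # read-after-write check
            return                        # block landed safely
        tape.skip_to_new_position()       # bad spot: advance and retry
    raise FatalMediumError("block could not be written; load a new tape")
```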
Tape types
Tape can be divided among the following technologies:
QIC (Quarter-Inch Cartridge), the 1/4-inch cassette tape. This medium is regarded as a low-end solution for standalone backup systems; its capacity and speed are low, and it cannot serve a LAN system.
4mm tape, known as DDS. Its storage capacity reaches 4 GB, and DDS-2 reaches 8 GB.
8mm tape, whose uncompressed capacity reaches 7 GB; extra-long (160 m) tapes reach 14 GB. This type of tape is more interchangeable than 4mm tape.
Digital Linear Tape (DLT), whose performance and capacity are better still. The DLT2000 can write 10 GB of data, or up to 20 GB with compression; the DLT4000 holds 20 GB, reaching 40 GB with compression.
3480/3490, the high-speed device media used on mainframe host systems.
Tape maintenance
The data saved on tape is wealth and a resource, so maintenance of tape media and devices is indispensable. The usual maintenance measures are:
- Clean the tape drive regularly
- Run every stored tape end to end at least once a year; this keeps the tape supple and improves its reliability
- When the backup system reports more and more tape errors, first suspect the head and clean it; if many errors remain, consider replacing the head
Optical media
Optical media technology turns laser light reflected from the surface of the medium into information: 0 and 1 reflect light differently, so the optical drive fires a laser beam at the track and detects the difference in the reflected light.
- Magneto-optical media
- Read-only compact disc (CD-ROM)
Magneto-optical (MO) discs are the most durable and wear-resistant of all existing media. They allow very fast random access to data, a property that makes MO particularly suitable for hierarchical storage management applications. However, because MO capacity still cannot compare with tape, MO is not widely used in backup systems.
The read-only compact disc, CD-ROM, cannot yet meet the requirements of network backup because of the technical difficulty of writing the medium at speed.
Techniques for improving backup performance
When a large amount of information must be backed up, performance becomes a very important issue. Techniques used to improve network backup performance include:
- Tape RAID technology
- Device streaming
- Tape interleaving
- Compression
Tape RAID
Tape is the device medium most commonly used in backup systems. The time the tape spends positioning under the recording head is a bottleneck and an important factor limiting backup speed, and an effective way to attack this bottleneck is a tape RAID system. The concept of tape RAID is similar to disk RAID: the data is "striped" across multiple tape devices, so a particularly fast transfer rate can be obtained. However, because the tape must keep moving during operation, a drive left waiting for its next data causes a sharp drop in speed, a significant shortcoming of the RAID approach. The method also has a reliability problem in recovery, since restoring data correctly requires precise positioning and timing across multiple tape devices, a difficult task. Even so, the technique remains promising where the highest speed and capacity are required.
Device streaming
Device streaming is the state in which a tape drive moves the tape at its optimal speed; only in this state does the drive achieve optimal performance. Obviously, this requires every device in a tape RAID system to remain streaming.
To that end, the SCSI host adapter must feed data to the device buffer continuously. Unfortunately, most LANs cannot deliver data to the backup application fast enough to keep the device's buffer full. In other words, device streaming can improve backup performance, but keeping a device streaming 100% of the time is difficult.
Tape interval
The tape interval will connect data from several targets together and written on the same disk in the same drive. This is actually written on the tape together. This solves the problems mentioned above.
Compression
A device with a built-in compression chip can improve backup performance. Such devices compress the data as they write it to the media, and performance rises by a factor slightly below or equal to the compression ratio. For most data on a PC LAN the compression ratio reaches 2:1, which means the device streams twice as fast when writing compressed data!
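The arithmetic is worth making explicit with a tiny sketch (the drive rate is an illustrative number, not a figure from the text):

```python
native_rate = 1.5          # MB/s the drive writes to the medium (illustrative)
compression_ratio = 2.0    # typical for PC LAN data, per the text

# The drive compresses on the fly, so the host can feed it
# uncompressed data at up to native_rate * compression_ratio.
effective_rate = native_rate * compression_ratio
print(f"effective throughput: {effective_rate} MB/s")   # 3.0 MB/s
```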
Backup performance can also be improved through the network itself. In a large backup system, multiple SCSI controllers can be used to raise the operating efficiency of the SCSI devices; but do not put too many devices on one SCSI host adapter, to which usually no more than three devices should be attached.
2.4. Tape rotation
Tape is without question the main backup medium of a network backup system, and tape rotation is one of the first problems met when establishing a backup strategy. Tape rotation is simply the method by which tapes are used over the course of backups: rules established in advance determine which tapes should be used when. Since the data lives on tape, recovery depends on those tapes. If the amount of information is small and few tapes hold it, the rotation scheme need not be elaborate; but when many tapes hold data, a system for managing the tapes becomes very useful and is a great help to data recovery.
The main function of tape rotation is to determine when previously backed-up data may be overwritten by new data, or, put the other way, which tapes from which periods must not be overwritten. For example, if the rotation policy specifies that the month-end tape is kept for three months, the rotation scheme helps ensure that the data of the past three months is never overwritten on those tapes. This reduces the human errors possible during backup, such as inserting the wrong tape into the drive at the wrong time and losing or damaging data.
Another benefit of tape rotation comes with automated tape systems: an autoloader combined with the tape rotation rules eliminates the errors people introduce and makes recovery operations predictable.
The main tape rotation schemes are the following:
A/B rotation. The tapes are divided into two sets: "A" is used on even days and "B" on odd days, or vice versa. This scheme does not preserve data for long.
Weekly rotation. The tape is changed once a week. The method works well when the volume of data is small.
Daily rotation. The tape is changed every day, requiring seven tapes marked Monday through Sunday. This scheme is most effective when combined with a full backup plus differential or incremental backups.
Monthly rotation. The usual implementation makes a full backup at the start of the month and then backs up the changes onto the other tapes over the rest of the month.
Grandfather-Father-Son (GFS). This is a combination of the daily, weekly, and monthly rotations.
Calendar-rule rotation, which rotates the media on a calendar schedule: a retention time is set for each operation rather than for a group of tapes as a whole.
Mixed rotation, which supplements the day-to-day backups with on-demand backups.
Unlimited incremental. In this mode a full backup is made only when the system is first brought up; afterwards only incremental backups are performed. At restore time the system can merge the data of the multiple backups and write it to other, larger media. This mode depends on an exact database to operate correctly.
Besides the rotation modes above there are others, such as differential rotation and the Tower-of-Hanoi rotation scheme, which are not described here.
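To make a rotation rule concrete, here is a minimal Python sketch of one plausible Grandfather-Father-Son labeling rule (the calendar conventions, monthly on the last day, weekly on Friday, daily otherwise, are our own illustrative choice; sites vary):

```python
import datetime

def gfs_tape_for(day: datetime.date) -> str:
    """Pick the tape set a backup lands on under a GFS rotation:
    - last day of the month -> monthly ('grandfather') tape
    - Friday                -> weekly  ('father') tape
    - other days            -> daily   ('son') tape, reused weekly
    """
    if (day + datetime.timedelta(days=1)).month != day.month:
        return f"monthly-{day:%Y-%m}"      # kept longest
    if day.weekday() == 4:                 # Friday
        return f"weekly-{day:%U}"          # week number of the year
    return f"daily-{day:%a}"               # e.g. daily-Mon, overwritten weekly

print(gfs_tape_for(datetime.date(2024, 5, 30)))   # daily-Thu
print(gfs_tape_for(datetime.date(2024, 5, 31)))   # monthly-2024-05
```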
2.5. Design of backup system
Network backup means backing up the files of every computer on the network; in effect it is a backup system for the whole network system. Therefore, when designing for a specific network system, first make a detailed analysis of the network's current state, and only on that basis design the backup scheme according to the actual backup requirements. One must not proceed on assumptions, nor simply copy someone else's scheme.
System analysis and backup requirements
The analysis of the system's current state covers:
- The operating platforms of the network system
- The database management systems used on the network
- The applications running on the network
- The network system's structure and the chosen servers, etc.
The requirements on the network backup system include:
- How long the backups must be retained
- Whether database backup must be performed online
- Whether low-cost backup is required across servers on different operating platforms
- Whether an automatic recovery mechanism is needed
- The requirements on recovery time
- The requirements on running a system monitoring program
- The requirements on automation of the backup system
- The requirements for backing up the information on the online front-desk workstations
- A description of the backup measures already in use, etc.
Backup solution design
A complete backup solution should cover the selection of backup software and backup media, as well as the daily backup regime and emergency measures for disasters.
Backup software
The choice of backup software is critical to a network system, and the chosen product must satisfy all of the requirements identified above.
For a PC LAN, CA's ARC Server can fairly be called a backup product that meets the requirements above. CA's backup software is organized as a main program plus options: the main program performs most of the common functions, and the more specialized functions are completed by the individual options.
Backup media
The usual first choice of backup medium is tape; of course, other media can also be considered according to the actual situation.
Daily backup system
For the daily backup system, once tape has been chosen as the backup medium, one or several of the modes described in the previous section, "Tape rotation", can be selected as the daily backup regime.
Implementation of the backup scheme
Implementing the backup solution covers the following:
- Installation, including the application systems, the backup software, and the tape drives
- Developing the daily backup strategy
- File backup
- Database backup
- Network operating system backup
- Workstation content backup
Backup scheme design based on CA ARC Server
Backing up a network system involves file backup, database backup, application backup and more, and in most environments it must work across platforms. Designing a backup scheme that suits the actual situation of the system is critical.
The common backup modes are centralized backup and local backup. Which mode a network system uses depends to a large extent on the size of the network. Centralized backup is better suited to small network systems: its advantage is low hardware investment and simplicity; its main disadvantage is that it demands high network speed.
Large network systems should use local backup: the large network is divided into a number of small subnets, and each subnet is backed up in a centralized way. The advantages of local backup are independence from network speed, high backup speed, and short response time; its main disadvantage is higher hardware investment, since each subnet requires its own backup system.
Introduction to CA ARC Server
CA ARC Server is cross-platform network data backup software. It provides comprehensive product support for data protection, disaster recovery and virus protection, and it has become a de facto industry standard.
ARC Server has the following features:
- Full support and protection for the NetWare and Windows NT operating systems
- Support for open file backup
- Support for backing up databases such as Sybase, Oracle, Btrieve, etc.
- Support for comprehensive network backup, from the servers to the workstations
- Virus scanning before backup, so that backups can be kept virus-free
- Automatic unattended backup
- Support for disaster recovery
An ARC Server backup system is organized as a main module plus options. The main backup program performs only the universal backup functions; the more specialized backup functions are implemented by the options. The ARC Server main module comes in two versions, ARC Server for NetWare and ARC Server for Windows NT, running on NetWare and Windows NT respectively.
CA ARC Server Backup Solution
Environment
Two NetWare servers, one a file server, the other a database server running Btrieve. Data and system backup of the entire network is required.
Schemes
Scheme 1: use the database server as the backup server, with the ARC Server software configured as ARC Server for NetWare + Disaster Recovery Option.
This achieves the following:
- Backup of inactive files across the whole network
- Database backup in the closed state
- Backup of key system information (NDS or Bindery)
- System disaster recovery
Scheme 2: use the database server as the backup server, with the ARC Server software configured as ARC Server for NetWare + Disaster Recovery Option + Backup Agent for Btrieve.
This achieves the following:
- Backup of inactive files across the whole network
- Database backup in the open state
- Backup of key system information (NDS or Bindery)
- System disaster recovery
Scheme 3: use the database server as the backup server, with a tape library as the backup hardware. The ARC Server software is configured as ARC Server for NetWare + Disaster Recovery Option + Backup Agent for Btrieve + Backup Agent for Open Files + Tape Library option.
This achieves the following:
- Backup of all files across the whole network, including files in the active state
- Database backup in the open state
- Backup of key system information (NDS or Bindery)
- System disaster recovery
- RAID fault tolerance for the backup data
- Unattended backup
- With the backup agent for the corresponding platform installed on each workstation, data backup of the Windows 95/98, Macintosh and DOS platforms
2.6. Misunderstandings about backup
For a long time people have viewed backup as nothing more than a simple copy off the disk, ignoring the importance of backup and misunderstanding what it is.
2.6.1 Why back up?
Before discussing the misunderstandings, consider the reasons for backing up data in the first place:
- Physical damage to hardware, above all damage to the hard disk, causing data loss
- Human error, such as accidentally deleting files or reformatting the hard disk
- Hackers breaking into the computer network system remotely and destroying key data
- Virus infection of the hard disk or the files on it
- Natural disasters damaging the network system
- Power surges damaging the data on the hard disk
- Electromagnetic interference wiping out documents
In short, data backup is not something that can be dispensed with; a backup regime must be established for every computer system or computer network system.
Nevertheless, misunderstandings about data backup persist.
2.6.2 The misunderstandings about backup
The misunderstandings about backup mainly concern the following three things:
- Copying
- Disk arrays
- Using the backup commands provided by the system
Copying
Copying is one means of implementing data backup, but it is not the whole of backup, because:
- A copy cannot preserve a file's history
- A backup must also save directory service records and other important system information
Disk arrays
Disk arrays (RAID) are designed with a degree of safety in mind, but treating RAID as backup is a mistake, because:
- RAID's main purpose is to keep the data online (instantly available)
- RAID does not retain second or older historical versions of the information
- What happens if two disks are destroyed at once?
- In cost terms, the investment in RAID exceeds the investment in a tape drive, and the gap widens as capacity grows
System backup commands
The backup commands provided by the systems people are familiar with, such as backup, are not the same as real backup. The main differences show in the following respects:
- No fault tolerance
- No openness
- Unable to back up heterogeneous networks
- Powerless against large and very large databases
3. Archive and hierarchical storage management
Archiving and hierarchical storage management (HSM) are two different ways of moving data off the online system. In the information society data grows fast and in bulk, a growth often called the information explosion. The continuous growth of data in network systems puts a very practical question to people: how should the data in a network system be managed? How can the most effective decisions about managing network data be made? That is the subject of this chapter.
3.1. Basic concept of archiving
Data integrity is usually threatened by problems in the online system. Faults and errors in the system itself are one common source of damage to data integrity, but human error, and deliberate human action, are another. A common way to minimize such problems is to move data off the online system to offline storage, transferring it from the machine onto removable media, which shields the data from those threats and damage. Archiving, like backup, is an effective and straightforward measure taken for the sake of data integrity.
The purpose of archiving
If the main purpose of backup is to restore data damaged or lost for some reason, the purpose of archiving is to preserve data for a long time, even permanently. Backup horizons are comparatively short, a day or a week, usually no more than two or three months; correspondingly, archiving does not need to happen nearly as often as backup.
Definition of archiving
Archiving means copying or packaging data for long-term historical preservation. One of its most important roles is to let the LAN system administrator delete files or information from the server disk and fetch it from offline storage when it is needed. This not only makes full and reasonable use of disk resources but also gives the security and integrity of that data good protection: a double benefit.
Archive operations
There are two main archive operations:
- Historical archiving
- Capacity management
Historical archiving
As mentioned earlier, one reason for archiving is long-term preservation of data. An institution or department accumulates masses of documents, including personnel records, analytical reports of every kind, drawings, images, and so on. Some of these files may have no further use value, but a great many retain use value or historical value; archiving them for long-term preservation, so that they can be restored at urgent need, has an economic and historical significance that is beyond doubt.
Capacity management
Besides the long-term storage of valuable data, the other reason for archiving files is that the online system holds too much data: the free space on the server disk keeps shrinking, and once the disk is nearly full, an attempt by the operating system to write more data than the disk can hold is likely to do serious damage to data integrity. In a LAN environment this can affect the data of many users at once.
The most common response to a server disk capacity problem is for the network administrator or system administrator, on discovering it, to immediately delete the files and directories considered unimportant. Fortunately, when that turns out to be a mistake, the backup system can recover them.
Deleting files to reclaim disk space is nothing new; many people have long done it. The file-system management tools for the purpose are growing more useful and more powerful; what they cannot provide, however, is the device and media support that archiving requires.
Selection of archive files
Whenever you decide to archive files in a system, the question to consider is which files to archive: in effect, the archiving policy. Typically, four file-system variables serve as the basis for selecting the files to archive. The four variables are:
- File size
- Time (since the file was last updated)
- Directory
- Owner
Policies built from experience with these four variables work well, though this is not absolute: sometimes a simple policy makes management easier. The specific use of the four variables is illustrated below.
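As a minimal sketch of such a policy in Python (all thresholds, the eligible directory, and the owner check are illustrative assumptions; st_uid is POSIX-only):

```python
import time
from pathlib import Path

def should_archive(path: Path,
                   min_size=10 * 2**20,        # variable 1: size (assumed 10 MB)
                   max_idle_days=180,          # variable 2: time since last update
                   roots=("/data/reports",),   # variable 3: directory
                   owners=None):               # variable 4: owner (None = any)
    """Combine the four file-system variables into one archive decision."""
    st = path.stat()
    idle_days = (time.time() - st.st_mtime) / 86400
    in_root = any(str(path).startswith(r) for r in roots)
    owner_ok = owners is None or st.st_uid in owners
    return (st.st_size >= min_size and idle_days >= max_idle_days
            and in_root and owner_ok)
```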
3.2. Methods of archiving
Many products can archive files today; some are system-oriented software, some application-oriented. Unfortunately, although almost all document management software includes methods for selecting files, whether for historical archiving or for capacity management, so far no such product includes device support for tape or optical drives. Their archiving function is therefore limited: they can only move the selected files to a special directory and call the archiving task done.
The usual archiving methods are the following:
- Document management
- Compressed archiving
- Archiving with a backup system
- Image systems
Each of these methods is described in detail below.
Document management
The objects managed by a document management system are defined groups of documents on the network. The search methods provided include keywords, text strings, file names and so on, even fuzzy-matching algorithms, so that the required document can be found quickly.
When a document management system archives files, it first selects the files meeting the archive criterion, which is usually time (for example, the time of last update); it then moves them into a directory serving as a staging area, where they remain until they are written to removable media or deleted.
The manager of the document management system should check this directory regularly and write its files to removable media, usually tape. This is a usable way to archive with a document management system, but it has two shortcomings:
- There is no redundancy built into the tape management
- The process mixes manual and automatic steps and has no integrity check, so there is no guarantee that every file has been written to tape before it is deleted
Compressed archiving
Compressed archiving is a popular archiving method on PC networks. Its basic principle is to compress the data with a compression tool so that it occupies less disk space, then copy it somewhere else or delete it.
The method uses elementary file-system tools to identify and select files according to their size, time of last update and owner. After checking the selected files, the administrator compresses them with the compression tool and deletes the originals. During compression, the tool combines the files into one large compressed archive, rather than producing one compressed file per original. Such compression can typically save up to 70% of the disk space.
The compressed-archive method requires the LAN system administrator to record all the relevant information, including directories, tapes, and the contents of the compressed files.
Although compressed archiving is popular on PC networks and suits some data management needs, its fatal weakness is that it invites mistakes. The process follows no fixed rules, which leaves much room for human error; it gives end users no simple tool for finding the files they need; and it enforces no redundancy policy to ensure that data is not lost to media damage.
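A minimal sketch of the combine-compress-delete cycle in Python, using the standard tarfile module (file paths are assumed relative; the verification step guards against deleting a file that never made it into the archive):

```python
import tarfile, os

def compress_and_archive(paths, archive_name="archive-2024Q1.tar.gz"):
    """Combine the selected files into one compressed archive,
    then delete the originals, but only after verifying that
    each file is really present in the archive."""
    with tarfile.open(archive_name, "w:gz") as tar:
        for p in paths:
            tar.add(p)
    with tarfile.open(archive_name, "r:gz") as tar:
        stored = set(tar.getnames())
    for p in paths:
        if p in stored:           # relative paths, e.g. "reports/1998.txt"
            os.remove(p)
```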
Archiving with a backup system
Another common archiving method uses the backup system. Here, archiving first determines which files to archive, then runs the backup software to store those files on removable media, and finally deletes them. The greatest merit of this method is that it provides the device and media writing functions that document management systems and compressed archiving lack. Even so, a backup-system archive still establishes no mechanism for media redundancy unless the user arranges it personally. That can be done by writing the same data several times to different tapes, for example by running three separate backup operations onto three different tapes.
Note also that tapes used for archiving with the backup system may be picked up unintentionally and reused in ordinary backup operations. That is exactly why redundant copies are so important.
Some LAN backup products also provide archive-like features, called disk grooming or file migration, but they offer no verification of the files on the media. Even when the backup system keeps process logs or databases that track the files, that information is not always accessible to end users; nor are the backup system's logs or databases kept for long. As a remedy, it is best to back up the log file or database onto the same tape when archiving files; a separate system can even be used to help track archived files. That adds an extra burden now, but it will help when files must be recovered in the future.
Image systems
In general, image systems are not suitable for archiving large file systems, but they do provide archiving for certain high-volume applications. In the way users locate files, image systems and document management systems resemble each other; the main difference is that image systems usually have integrated device support. Most image system products include an application package supporting a jukebox (optical library), and a jukebox device lets the system run efficiently without consuming server disk storage capacity.
Image systems are a good fit for paperless office applications, in which documents are entered electronically and stored as records. Institutions that must track large volumes of account activity data, such as customs, commodity inspection departments, tax and insurance, can use an image system archive to store, quickly and effectively, files that will be consulted in the future, without resorting to older technologies such as microfilm.
An image system's data never truly rests on the server, so some will say it is not real file-system archiving; but it is indeed a method for accessing large amounts of historical information on the network. The image system is an online system whose data is read directly from the optical media rather than copied to the server for access. It is a rather special application and cannot be used for general-purpose file-system operations: to put files into an image system, you must use the data input technology the image system itself provides, such as paper scanning or print imaging.
An image system is not designed to handle raw data; its purpose is to store image data. An archive, by contrast, should be able to handle every useful data type, image data included.
3.3. Media and redundancy in archiving
Data archived on media usually has to be kept for a long time, so the media storing archived data must be handled and maintained with care.
Media storage
The medium mainly used for archived data is tape. Several factors affect the life of the data on a tape:
- Temperature
- Humidity
- Electromagnetic radiation
- Pollutants and smoke
Hot, humid conditions are the natural enemy of magnetic media holding archived data. To preserve the data on the media, store the media in the best environment you can: about 40 degrees Fahrenheit and under 30% relative humidity is ideal, though achieving it is very difficult, and even where it can be done the cost is very high.
Storing media in a strong electromagnetic field is inadvisable. Radar systems and electric welders are the most severe sources of high-power electromagnetic noise; either can completely destroy the data on the media. In an office environment, locations near an elevator shaft or near power cabling are unsuitable for storage. Keep tapes in a solid metal cabinet or a safe; otherwise even ambient hazards such as electromagnetic radiation can damage the data.
Tape stored on the shelf for a particularly long time often turns brittle: when it is finally used, magnetic powder flakes off, and loss of data is not uncommon. To prevent this, run each tape from end to end in a clean tape drive periodically; this helps keep the tape supple and keeps the magnetic coating from shedding.
Redundancy
The purpose of redundancy is to guarantee enough copies of the media that the accidental destruction of some media does no harm.
There are two basic ways to establish redundant data:
- Run the archive operation several times, or duplicate the media
- Copy the data on the media to other media after the archive operation has completed
To keep data for a truly long time, besides strictly observing media-handling procedures, use long-lived media and make fresh copies periodically, so that the archived data survives even longer.
3.4. Hierarchical Storage Management (HSM)
Hierarchical storage management (HSM) is an automated system that provides archiving functions; to users and to administrators, HSM is completely transparent.
HSM differs from archiving in that HSM does not simply delete files: it leaves a small file in the original place, and when the user needs to access the original file, that small file automatically brings the original back. Another difference is that HSM systems use the term "migration" rather than "archiving".
HSM functional components
HSM is a multi-level media system; besides its devices and media, it needs several software components to work correctly. HSM is built on the following three simple functions:
l Automatic migration
l Automatic recall
l Stub files
Let us look at each of these three functional components in detail.
Automatic migration
Before discussing the automation, let us define "migration". Migration means copying files to removable media and then deleting those files from the server; it is, plainly, the operation that performs the archive function. In HSM, the migration component also takes on the task of creating stub files. A set of parameters determines when files should be migrated and which files are migrated. Automatic migration is typically implemented by a monitoring process the system creates, which checks whether a configured storage threshold (a percentage of disk capacity) has been exceeded, or whether a scheduled trigger has fired. If so, the HSM system intervenes and begins copying and deleting files. Typically, the threshold is set between 85% and 95% of disk capacity. Once migration has begun, a problem can arise: migration might continue indefinitely and start moving files that ought to remain on online storage. There must therefore be a second threshold that determines when to stop migrating; it is called the low-water mark. Typical low-water marks lie between 60% and 70% of disk capacity.
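To illustrate how the two thresholds interact, here is a minimal sketch of a migration monitor in Python. It is not taken from any particular HSM product; the threshold values, the `disk_usage_percent` helper, and the candidate-selection rule (least recently accessed files first) are assumptions made for the example, and `copy_to_archive` and `make_stub` are hypothetical callbacks supplied by the caller.

```python
import os
import shutil

HIGH_WATER = 90   # assumed: start migrating when disk usage exceeds 90%
LOW_WATER = 65    # assumed: stop migrating once usage falls below 65%

def disk_usage_percent(volume):
    """Return the percentage of the volume that is in use."""
    usage = shutil.disk_usage(volume)
    return 100 * usage.used / usage.total

def candidate_files(volume):
    """Least recently accessed files first -- an assumed policy."""
    files = [os.path.join(dirpath, name)
             for dirpath, _, names in os.walk(volume) for name in names]
    return sorted(files, key=os.path.getatime)

def migrate(volume, copy_to_archive, make_stub):
    """One migration pass: runs only when the high-water mark is exceeded."""
    if disk_usage_percent(volume) < HIGH_WATER:
        return
    for path in candidate_files(volume):
        copy_to_archive(path)   # copy the file to removable media
        make_stub(path)         # replace it on disk with a small stub
        if disk_usage_percent(volume) <= LOW_WATER:
            break               # low-water mark reached: stop migrating
```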
Automatic recall
What is automatic recall? Automatic recall means retrieving a file from the HSM system when a user tries to access its stub file. The process should be transparent to the user, and the faster the better.
The automatic recall process is roughly as follows:
First, a mechanism that recognizes stub files is created in the system. This can be done by a program that intercepts every file open, or by modifying the file system so that it recognizes stub files itself.
Second, once a stub file has been identified, some information is read from it and sent to the HSM system.
Third, when the HSM system receives this information, it decides which medium holds the file, loads the correct medium, restores the file to disk, and overwrites the stub file with the original file.
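A stub-based recall might look like the following sketch, continuing the example above. The stub format (a small JSON record naming the medium and the file's location on it) and the `STUB_MARKER` signature are invented for illustration; a real HSM system hooks the file system itself rather than requiring explicit calls, and `load_medium` is a hypothetical function that fetches bytes from the archive media.

```python
import json
import os

STUB_MARKER = "HSM-STUB-V1"   # assumed signature identifying a stub file

def make_stub(path, medium_id, offset):
    """Replace `path` with a small stub recording where the original went."""
    stub = {"marker": STUB_MARKER, "medium": medium_id,
            "offset": offset, "size": os.path.getsize(path)}
    with open(path, "w") as f:
        json.dump(stub, f)

def is_stub(path):
    try:
        with open(path) as f:
            return json.load(f).get("marker") == STUB_MARKER
    except (ValueError, OSError):
        return False

def recall(path, load_medium):
    """If `path` is a stub, fetch the original file back from HSM media."""
    if not is_stub(path):
        return
    with open(path) as f:
        stub = json.load(f)
    data = load_medium(stub["medium"], stub["offset"], stub["size"])
    with open(path, "wb") as f:
        f.write(data)            # the original file overwrites the stub
```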
The biggest problem with automatic recall is that the time needed to retrieve a file from HSM media can be far larger than users expect. An HSM system based on tape is slower than one based on optical media.
Stub files
The stub file is also known as a placeholder file. It is created by a dedicated HSM component and replaces the original file. The stub has the same file name as the original but is much smaller, and its contents record which removable medium holds the original file.
At least two issues remain unresolved in the practical use of stub files. First, renaming: a new file created with the same name as a stub will overwrite the stub, making the earlier version it represented unrecoverable. LAN-based HSM systems still cannot handle renames. Second, stub files may be moved from one server to another during reorganizations, splits, or mergers, and they must keep working after the move. Although HSM software provides migration tools for managing stub files, some difficult problems here still await a solution.
The storage hierarchy
In fact, HSM is a multi-level media system, and a great advantage of such a system is that data can move between different media. This media structure, however, is not as important as the stub file mechanism. Even so, HSM as an archiving method is not complete without a discussion of its media strategy.
Two-level hierarchies
One of the main ideas behind a two-level HSM system is that migrated data needs only one tier of storage, the near-line tier. When the amount of data reaches the capacity of the near-line device, simply remove the oldest media and store them in a suitable environment, such as a cool, dry place.
The two-level hierarchy suits most LAN systems.
Three-level hierarchies
The three-level hierarchy is the most common among HSM systems. Here, migrated files are first stored on a near-line device for quick recall, and later moved to off-line media. In a three-level hierarchy, the off-line tier can itself be another automated system, such as a large tape library. A system with an optical jukebox as near-line storage and such a tape library as the off-line tier can provide slower, but still automatic, file recall from the off-line storage subsystem.
Near-line media and equipment
The common near-line devices are optical jukeboxes and rewritable CD subsystems. A major disadvantage of these near-line devices is that their storage capacity is generally small.
Tape can also serve as near-line storage, but tapes take a long time to load and still more time to locate data after loading. If you decide to use tape as near-line storage, use a multi-drive autoloader to cope with file recalls from many users. 4mm and 8mm devices are recommended because they load faster than DLT drives; DLT, on the other hand, offers greater capacity.
Near-line storage can also be optimized by using disks.
Off-line media and equipment
Off-line media usually means tape. Rewritable CDs can also serve as off-line media.
HSM work process
The working process of HSM is as follows:
First, files are selected according to the HSM system's selection criteria and copied to HSM media. When a file has been copied correctly, a stub file with the same name is created; it occupies far less disk space than the original. When a user later accesses the stub file, the HSM system recovers the original file from the correct HSM medium.
HSM and network structure
An HSM system can be attached to the network as a shared device, serving the migration of data from multiple servers, or it can be connected one-to-one to a single server. With a one-to-one connection, all data movement passes over the SCSI bus and adds nothing to network traffic. This does not mean a one-to-one connection never meets network traffic problems, because the HSM system must still do part of its work over the network. For example, when the data in a server's storage reaches the HSM system's high-water mark, file migration begins; the network's bandwidth cannot necessarily move so much data quickly, and the likely result is that the network system stops working properly or even goes down. For this reason, when planning an HSM deployment, consider how the data will move through the network, and avoid pushing large volumes of data through a bridge or router to the point of saturating it.
To avoid such situations, some HSM systems pre-process the files to be migrated. The idea is to predict which files will need migration, then copy them to near-line storage during off-peak hours. This reduces the data that must move during the migration operation itself, whose main remaining jobs are deleting the files and creating and retaining the stub files.
Fault-tolerant technology is a powerful means of building highly reliable network systems and an active research field in its own right. This section briefly reviews the history of reliability technology and surveys its prospects.
4.1.1 Review of History
Performance, price, and reliability are the three major elements of a network system. Long-term research into making network systems highly reliable has produced two approaches. The first is to construct a "perfect" system containing no faults, using correct design and quality control to keep faults from being introduced into the system. In practice this is impossible, so once a fault occurs, its effects must be removed by detection and verification, and the system must recover automatically. The second approach is fault tolerance: when specified hardware or software faults occur, the system can still execute a specified set of programs, the programs are neither interrupted nor modified by the faults, and their results contain no errors caused by them. The basic idea of fault tolerance is to design the network system's architecture carefully, using redundant resources to mask the effects of failures, so that the system recovers automatically or comes to a safe stop.
Research on fault tolerance began very early. In 1952, von Neumann gave five lectures on fault-tolerance theory in the United States; his pioneering work became the foundation of later fault-tolerance research.
Initially, designers were inspired by the observation that several diodes working in parallel are more reliable than a single diode, which led to quadded (four-fold redundant) circuits. From majority voting among components came triple modular redundancy and N-modular redundancy structures, and the error-correcting code theory developed for communications was quickly absorbed to improve the reliability of information transmission, storage, and computation. At the end of the 1960s, self-checking, self-repairing machines, represented by the STAR fault-tolerant computer, marked the entry of fault tolerance into a new period in both theory and practice.
The 1970s were a flourishing period for fault-tolerant technology. Its main successes include the ESS series of processors for telephone switching systems, the software-implemented fault-tolerant SIFT computer, the fault-tolerant multiprocessor FTMP, and the voting multiprocessor C.vmp.
In the 1980s, VLSI and microcomputers developed rapidly and were widely adopted, and research on fault tolerance spread through the whole industry along with the computers themselves. Many companies produced fault-tolerant computers, such as the Stratus fault-tolerant series, the IBM System/88, and the Tandem-16, which were commercialized and entered the market. It is widely believed that the era in which fault tolerance is an essential feature of every digital system has arrived, and fault-tolerant systems have accordingly developed from single machines into distributed systems.
4.1.2 Prospects
With the further development of computer network systems, network reliability matters more and more, for the following main reasons:
Improvements in network performance have increased system complexity, and higher server clock speeds make systems more prone to error, so careful reliability design is required.
Network applications are no longer confined to computer rooms, which exposes systems to harsher conditions; systems must therefore be able to withstand hostile environments.
Networks have moved into society at large, and their users are no longer professionals; systems must therefore tolerate all kinds of operator error.
Maintenance costs are rising relative to the hardware cost of network systems, so greater system reliability is needed to keep maintenance costs down.
Fault-tolerant technology will therefore develop in the following directions:
As VLSI circuit complexity grows and faults are buried ever deeper in the chips, on-chip fault tolerance will emerge, and dynamic redundancy techniques will be applied within VLSI itself.
With the continued development of network systems, fault-tolerant architectures will draw on the achievements of network research, bringing global management, parallel operation, autonomous control, redundancy, and in-network error handling into the study of high-performance, highly reliable distributed fault-tolerant systems.
More research will be devoted to software reliability techniques.
Analytical and experimental methods for evaluating fault-tolerant performance will advance.
A comprehensive design methodology for fault-tolerant systems will be proposed in theoretical research.
4.2. Classification of fault-tolerant systems
The ultimate goal of a fault-tolerant system directly affects its design principles and design choices, so different fault-tolerant systems must be designed for different application environments.
By practical application, fault-tolerant systems can be divided into five types.
4.2.1 High Availability System
Availability is the probability that the system is operational at a given time. High-availability systems typically serve general-purpose computing, running an unpredictable variety of user programs. Because such systems target the commercial market, they depart from standard designs as little as possible. Hamming-coded memory, bus parity, timeout counters, diagnostics, and software validity checks are their main redundancy methods; as the list suggests, the fault coverage of such systems is low. In multiprocessor systems, however, a fault, once discovered, can be isolated so that the system keeps running, if necessary in a degraded mode.
4.2.2 Long Life System
A long-life system cannot receive human maintenance during its lifetime (usually more than five years); it is typical of control systems aboard spacecraft and satellites. Long-life systems are characterized by high redundancy: there are enough spares to absorb the many faults that appear over the years, and redundancy management can be automatic or remote.
4.2.3 Delayed-Maintenance Systems
This type is closely related to the long-life system: it must tolerate faults between periodic maintenance visits and keep the system alive until the next one. Such fault-tolerant systems are used where on-site maintenance is very difficult or prohibitively expensive, and where adding redundancy costs less than standing ready to repair at any moment. Computers operating aboard aircraft, ships, and tanks, for example, usually must wait until the vehicle returns to base to be repaired.
Vehicle-borne, airborne, and ship-borne computer systems are thus typically delayed-maintenance fault-tolerant systems.
4.2.4 High Performance Computing System
High-performance computing systems (such as signal processors) are sensitive both to transient faults (caused by tight timing tolerances) and to permanent faults (caused by sheer complexity). To raise system performance, lengthen the mean time between failures, and recover automatically from transient faults, such systems must be fault tolerant. Examples include the CRAY-1, SLAC, and the dual 370/165.
4.2.5 Mission-Critical Computing Systems
The strictest demands on fault-tolerant computing arise in real-time applications where an error may endanger human life or cause major economic loss. Such systems require not only correct results but also the shortest possible recovery from faults, so that the running application is unaffected.
4.3. Fault-tolerant system implementation methods
Fault-tolerant systems can be implemented in several ways, depending on how critical the tasks are and how much investment the user can bear. Several common methods are analyzed below.
4.3.1 Idle Spare Parts
"Idle spares" means just what it says: a spare component is configured in the system, and this is indeed a way to provide fault tolerance. When the original component fails, the idle spare is no longer "idle"; it takes over the function of the failed part. A simple example of this type of fault tolerance is a slow printer connected to the system that is used only when the primary printing system fails.
4.3.2 Load Balancing
Load balancing is another way to provide fault tolerance. In this arrangement, two components share one task; if one of them fails, the other immediately takes over the whole load the failed component had been carrying. Load balancing is typically used in servers with dual power supplies: if one supply fails, the other assumes twice the load.
It should be stressed that having two power supplies does not by itself make a system load balanced. For example, one supply might power the storage devices while the other powers the rest of the system.
Load balancing also appears in network systems as symmetric multiprocessing. In a symmetric multiprocessor, every processor can perform any work in the system, so the system keeps its load balanced across the processors. For this reason, symmetric multiprocessing can provide fault tolerance at the CPU level.
4.3.3 Mirroring
Mirroring is a commonly used way of implementing fault tolerance. In mirroring, two components do exactly the same work; if one fails, the other carries on. The method is usually applied to disk subsystems: two controllers write identical data to the same sectors of two disks of the same model. NetWare SFT III and Sentinel are typical examples. Mirroring requires two systems that together perform a single task; when a fault occurs, the system detects it and switches to single-subsystem operation.
Mirroring has proven to work well for disk systems, but mirroring an entire system is challenging: it is difficult to apply mirroring across two machines in areas such as internal bus transfers, and difficult to detect whole-system failure.
4.3.4 Replication
Replication, also known as delayed mirroring, is a variant of mirroring. Replication requires two systems, a primary and an auxiliary: the auxiliary receives data from the primary, with a certain delay. When the primary fails, the auxiliary takes its place as the working system, and users can restart their work close to the point of failure. The main difference between replication and mirroring is this delay: data created on the primary is copied to the auxiliary some time later. In other words, the replica is not an exact mirror, and the interruption at switchover is not negligible. Even so, replication is used in high-availability systems, which can be designed around it. Replication now runs very quickly, copying data as soon as the primary has written it to disk, which reduces the data lost on the network.
If the auxiliary system is to replace the primary, it must reproduce the primary's security information and mechanisms, including user IDs, login scripts, user names, and other authorization data. Some replication products on the market do not copy these automatically; the network manager must set the security parameters by hand to ensure users can log in fully and safely. Doing so, of course, slows the auxiliary system's takeover when the primary fails.
4.3.5 Redundant System Components
Duplicating certain key components in a system strengthens its fault tolerance. The components usually duplicated are:
l Main processor
l Power supply
l I/O devices and channels
Some redundant components must be planned into the system design; others can be added after the system is installed.
Power supply
Dual power supplies are now common in network systems. The two supplies are load balanced: while the system runs, both provide power, and when one fails, the other automatically takes over the entire load to keep the system operating. This requires that each supply alone have the capacity to power the full load.
A system with dual power supplies is usually also fitted with other redundant components, such as NICs and I/O cards. All these added redundant devices consume extra power and give off extra heat, so the system's heat dissipation must be considered: ventilation must be good enough to carry away all the additional heat.
I/O devices and channels
Moving data from memory to disk or other storage media is a very complex process, and one that occurs very frequently, so the high failure rate of storage devices is not surprising.
To keep data available despite such failures, the natural approach is redundant devices and I/O controllers. The common methods are redundant-disk mirroring and redundant-disk duplexing: in mirroring, both disks are attached to a single controller; in duplexing, each disk is attached to its own controller. Duplexing offers better safety and speed than mirroring, because the extra controller can take over when a disk controller fails, and the two controllers can read simultaneously, improving system performance.
Main processor
Although the main processor in a network system does not fail often, when it does, the consequences for the whole network are unthinkable. To improve the system's reliability, adding a redundant CPU is therefore a good choice.
The challenge of redundant CPU configurations lies in memory, buffers, and task management. The auxiliary CPU must track the primary CPU's operation precisely without disturbing it. One implementation applies mirroring between the primary and auxiliary processors: if the primary fails, the auxiliary already holds the necessary state in its memory and can take over control of the system.
Symmetric multiprocessing provides a degree of system-level fault tolerance. In a dual-CPU machine, for example, if one CPU fails, the system can run on the other, although the processes that were running on the failed CPU may be lost. How well this works depends heavily on the operating system's ability to manage the memory shared by tasks on different processors; the failure of one processor in a multiprocessor system can still crash the whole machine. Whether the system can keep running after a CPU failure is the real test of this fault-tolerance mechanism.
4.3.6 Redundancy of storage systems
The storage subsystem is the part of a network system most prone to failure. The most popular ways of building redundancy into storage systems are:
l Disk mirroring
l Disk duplexing
l RAID
Disk mirroring
Mirroring is the most common way to implement storage redundancy. The two disks of a mirrored pair should be formatted identically, or trouble will follow, and the partition sizes of the primary and auxiliary disks should match: if the primary's partition is larger than the auxiliary's, mirroring stops once the data stored on the primary reaches the auxiliary's capacity.
Disk mirroring adds some overhead to write operations: a write is complete only when both disks of the mirrored pair have written the same data, which takes longer than writing to a single disk. Read operations, by contrast, can be faster than with a single drive: while one disk services a read, the other can already be positioning its head at the next block to be read, reducing the delay caused by head seeks.
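The write penalty and the read benefit can both be seen in a small sketch. This is a toy model in Python, with the two "disks" represented as seekable file objects; a real mirror operates at the device-driver level, and the alternating-read policy shown is just one assumed scheduling rule.

```python
class MirroredPair:
    """Toy model of a mirrored disk pair: two identical block devices."""

    def __init__(self, disk_a, disk_b, block_size=4096):
        self.disks = [disk_a, disk_b]
        self.block_size = block_size

    def write_block(self, block_no, data):
        # A write completes only when BOTH disks have written the data,
        # so it takes longer than writing to a single disk.
        for disk in self.disks:
            disk.seek(block_no * self.block_size)
            disk.write(data)
            disk.flush()

    def read_block(self, block_no):
        # Reads alternate between the disks, so one disk can already be
        # seeking toward the next block while the other serves this one.
        disk = self.disks[block_no % 2]
        disk.seek(block_no * self.block_size)
        return disk.read(self.block_size)
```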
Disk duplexing
Adding a second I/O controller to a mirrored disk pair, so that each disk has its own controller, is called disk duplexing. It improves system performance by reducing contention on the I/O bus. The I/O bus is essentially serial, not parallel: every device on a bus shares it with the others, and only one device can transfer at a time. If each disk of the mirrored pair has its own controller, contention for the bus is greatly reduced.
RAID
RAID (Redundant Array of Independent Disks) is a storage system in which a failed disk can be replaced without any downtime; it is a way of guaranteeing the non-stop operation of the disk subsystem.
Another advantage of RAID is that its data transfer rate is much higher than a single disk's, so data can be read from RAID quickly. This is because the array can keep the bus continuously supplied with data in a way a single disk cannot.
RAID levels
There are many ways to implement a redundant disk array; the choice depends on performance, cost, and the downtime requirements. RAID implementations are described by level, and four levels are in common use:
l Level 0 RAID
l Level 1 RAID
l Level 3 RAID
l Level 5 RAID
Level 0 RAID stripes data across four disks with no check information. A level 0 system has no built-in redundancy; it is typically used where the stability of the data matters little but high-speed transfer is required. Its greatest weakness is that the failure of any one disk destroys the data on all of them. As an improvement, some products implement level 0 by writing to the disks in sequence rather than striping; when one disk fails, the data on the remaining three survives, though at a price in performance. Level 1 RAID is disk mirroring, with no striping. Systems of this type cost more, because every disk is paired with an additional disk as its redundant copy; their write speed is relatively slow, but their read performance is good.
Level 3 RAID writes striped data across four disks and writes the check information to a dedicated fifth check disk. In such a system, if one disk fails, a new disk can be inserted into the RAID slot, and the data on it rebuilt by computation from the remaining three disks and the check disk.
Level 5 RAID is similar to level 3, with about 20% of capacity given over to redundancy, but the data is striped across all five disks and the check information is striped as well. In this system any disk can be replaced, and the replacement's data is rebuilt from the data on the remaining disks.
Check information
Of the RAID implementations above, all but the level 1 and level 0 systems carry check information. Redundant disk arrays use the exclusive-or (XOR) algorithm to build the check information written to disk. The computation is performed by hardware chips rather than by the host processor, so it is quite fast.
The main function of the check information is to reconstruct the data that was on a failed disk, using a check-reconstruction algorithm, when the failed disk is replaced.
The RAID controller uses the check information to rebuild the lost data on the new disk inserted into the RAID slot; this process is called check reconstruction.
Check reconstruction is a fairly complex process. If interrupted, it must remember how far it has gone; the disks must be kept synchronized; and write operations must be coordinated. If new data has to be written to the array while rebuilding is under way, matters become more complicated still. Check reconstruction causes a large drop in system performance while it runs.
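The XOR scheme itself is easy to demonstrate. In the sketch below (plain Python standing in for the dedicated hardware a real controller uses), the check block is the XOR of the data blocks, and a lost block is rebuilt by XOR-ing the surviving blocks with the check block; the four-byte blocks are invented for the example.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Four data blocks striped across four disks (as in a level 3 array).
d = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(*d)          # written to the dedicated check disk

# Suppose disk 2 fails: rebuild its block from the survivors plus parity.
rebuilt = xor_blocks(d[0], d[1], d[3], parity)
assert rebuilt == d[2]           # the lost data is recovered exactly
```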
Device replacement
A RAID system provides two ways to replace devices:
l Hot swapping
l Hot spares
Hot swapping means that a device can be inserted into or removed from its slot while the redundant disk array continues to provide disk I/O to the system. A hot spare is an extra drive kept in a slot of the RAID system that can be brought into the array automatically when any disk fails; such devices are often shared among the slots of several RAID arrays.
RAID controller
A redundant disk array consists of many disks, yet to the host's I/O controller the RAID system looks like a single disk. Inside the RAID system is another controller, the component that actually executes all disk I/O; it is responsible for many operations, including generating the check information on writes and performing check reconstruction. Many of a RAID system's capabilities are determined by this controller.
Making the RAID controller itself redundant removes it as a single point of failure and provides fault tolerance for the redundant disk array as a whole.
4.4. Network redundancy
Network redundancy means that the transmission media and all the other network connection devices have alternate paths, so the network can keep running. This section discusses ways to improve the reliability of the backbone and of network interconnection equipment.
4.4.1 Redundancy of the backbone
The topology of the backbone should be designed with fault tolerance in mind. Mesh structures, dual-star or tree structures, dual-core switches, redundant wiring between wiring closets, and the like all help guarantee that the network has no single point of failure.
Dual backbones
The backbone connects the servers and other service devices on the network, and typically runs at a higher speed so the servers can perform better. When the network serves the servers in this way, a backbone failure leaves the servers effectively useless even if they are still running, because access to them is cut off. This is why dual backbones are used: in a dual-backbone network system, if the primary backbone fails, the auxiliary backbone takes over data transmission. The dual-backbone concept is independent of topology and can be implemented on Token Ring, Ethernet, or FDDI.
In practice, the auxiliary backbone is laid along the same route as the primary one.
4.4.2 Switching control equipment
A network system contains hubs, concentrators, or switching devices. In 10Base-T and ATM networks controlled by switches, every machine's connection to the network passes through some switching device. In these networks, redundancy can be built by providing auxiliary high-speed connections between the devices; such network devices can accurately detect failed segments and find available auxiliary paths to carry the data traffic.
Switching control equipment can be managed through a network management program, which means that when part of the network fails, the failure appears on the management console and can be dealt with quickly. In addition, switching control can discover failing segments in advance by analyzing data traffic or bit error rates: once the traffic or the error rate exceeds a certain value, it is a sign that some segment is about to fail.
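The early-warning idea can be sketched as follows: poll each segment's error counters and flag any segment whose bit error rate crosses a threshold. The threshold value and the `read_counters` callback are assumptions; a real device would expose such counters through SNMP or its own management console.

```python
ERROR_RATE_THRESHOLD = 1e-6   # assumed bit error rate that signals trouble

def check_segments(segments, read_counters):
    """Return the segments whose error rate suggests imminent failure.

    `read_counters(seg)` is assumed to return (error_bits, total_bits)
    for the polling interval, e.g. gathered via SNMP.
    """
    failing = []
    for seg in segments:
        errors, total = read_counters(seg)
        if total and errors / total > ERROR_RATE_THRESHOLD:
            failing.append(seg)
    return failing
```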
Typically, network switching control devices are designed as modular, hot-swappable circuit boards. The advantage of this design is that when a chip on a board inside the device fails, the board can immediately be replaced with a new one.
Dual power supplies and battery backup for switching control devices can likewise prolong the network's uptime.
4.4.3 Routers
Routers are among the most flexible network connection devices in a network system; they direct the flow of data through the network. Most network systems now use switching routers, whose performance is 10 to 100 times that of an ordinary router at about one tenth of the price.
A switching router uses VRRP (the Virtual Router Redundancy Protocol) together with the OSPF protocol so that two switching routers back each other up and route around failed connections.
In addition, a switching router protects time-sensitive applications (whose data streams generally carry high priority) through sophisticated queue-management mechanisms. Good queue management can also perform flow control and traffic shaping, ensuring that data streams do not congest the switch and that output flows smoothly. Another function of the switching router is to reserve the required bandwidth dynamically and control application-layer information flows through RSVP (the Resource Reservation Protocol), distinguishing among different flows and providing quality-of-service guarantees.
If a server fails and a standby server is started in a backup room or backup center, how do users reach the server at its new location? Where there is no direct network connection between the user equipment and the replacement server, the routers' settings can be changed to establish connections to the server at its new location. In extreme cases the user equipment can be moved as well: with help from the telephone company and the network service provider, a router added at the new site can establish a temporary network carrying the traffic between users and servers.
4.4.4 PIPES software
The network redundancy described above is implemented by hardware. Implementing network redundancy in software, although less common, is also an option.
PIPES, produced by PeerLogic in the United States, can route around failed lines in the network, transparently transmitting users' data over other network connections. The PIPES network shares a directory service that knows every possible route between all the machines running PIPES on the network. The software's intelligent error-control function lets PIPES dynamically and transparently use other routes, including routes over different communication protocols, to keep network communication alive when the original route fails. Note that PIPES is not designed to run on every machine on the network; it is generally used as a local development platform for building distributed applications that need redundancy and fault-tolerance services, so using PIPES calls for some planning and development work.
In today's information society, data is wealth. A database failure often paralyzes an entire institution; yet, unfortunately, no database system can be guaranteed never to fail. There are two ways of dealing with database system failures: the first is to make the system as reliable as possible; the second is to restore the database to its original state after the system fails. The first alone is not enough; there must also be the second, that is, technology that can restore the database to its former state after a failure.
5.1. Database Backup Evaluation
A database failure may result in lost data; for the lost data to be restorable, the database system must be backed up. Before backing up, a comprehensive assessment of the database backup is necessary.
5.1.1 Characteristics of Databases
A database in a network system differs from the other applications on the network. The following briefly introduces the characteristics of databases that bear on backing them up.
Multi-user
The servers in a network system exist to share resources, yet most files stored on a server are accessed by one user at a time. A database on the network, by contrast, is accessed by many users at once. This means that any management operation on the database, including backup, affects the working efficiency of many users, not just one.
High availability
A network database must be highly available: a multi-user database needs long access and update hours to complete batch tasks and to serve users in other time zones.
The "backup window" in database backup is the period between two working periods during which the database can be backed up; during the rest of the time, it cannot. The window is usually scheduled for when the system is "quiet": the LAN is doing nothing and files are closed, so the backup can proceed without disturbing users.
Frequent updates
Continuous updating of its data is a database system's reason for being. A file server generally does not see many disk writes, but a database system, with its many users, performs far more writes per second than a file server.
Large files
Databases typically have more data to back up than other applications, and less time in which to back it up. Moreover, if the backup runs past the backup window, further problems with user access and system performance arise, because the database must then respond to backup requests and user requests at the same time.
5.1.2 Evaluation of Backup Schemes
Evaluating a database backup scheme means analyzing the following issues before formulating the backup solution, then making an assessment on that basis:
l What in the database must be protected
l What the loss of that data would cost
l What the backups themselves will cost
Although data is wealth and a running database brings an institution great help and benefit, the costs of different levels of backup protection must be weighed when planning database backup. If data worth 10,000 yuan can be re-acquired, and might be lost perhaps once in three years, then spending 5,000 yuan every year to protect it is meaningless. Before backing up a database, therefore, consider the following cost and risk questions: Can we afford the cost? If we cannot, is there another approach we can afford?
Will the measures adopted actually improve the status quo?
Will the measures cause other problems during implementation? For example, what will they affect while the system is in use, and will they reduce working efficiency? And so on.
Are the measures worth it? What would be lost in the worst case?
Technical assessment
Database backup is usually an all-or-nothing affair: if you do not back up the entire database, what you restore to the system cannot be used. For most database systems, any change to the database requires the whole database to be backed up again. For this reason, the backup technique must be evaluated before the database backup is made.
The biggest problem in backing up a database is backing up open files, since doing so may cost the backup copy its data integrity.
Among the characteristics of online databases discussed earlier are two that matter here: frequent updates and accessibility whenever needed. To provide them, the database system must keep its database files open while it runs. This means the database files may be updated during the backup.
Updates during a database backup fall into the following cases:
l Updates to a part of the file already copied
If a database update arrives during the backup and lands in a region of the file that has already been copied, and no other part of the file is touched, the backup copy remains internally consistent: should the system need recovery, the file can be restored to its state at the start of the backup. Re-entering the updates made after the backup started then brings the database back to its state just before the failure.
l Updates to a part of the file not yet copied
This type of update is no problem either. If the database must be recovered, the restored file will be in a consistent state that already includes the update. To restore the database to its state just before the failure, only the updates that occurred after the backup ended need to be re-entered.
l Updates spanning copied and not-yet-copied parts of the file
In this case the backup copy of the file mixes information from before the change (in the part copied at point A) with information from after it (in the part copied at point B). The backup copy of the database file has lost its integrity: the affected data may become meaningless, and may even crash the database system.
Cold backup
As the cases above show, a backup taken while updates are being written to the database files can still be meaningful, but it is not a good method. The best way to prevent the worst case is to shut the database down before starting the backup; a backup taken with the database shut down is called a cold backup.
Cold backups are usually run when the system is unattended. The easiest way is to create a batch job that first shuts down the database, then backs up the database files, and then restarts the database.
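Such a batch job might look like the following sketch, written here in Python. The `db_admin` and `backup_tool` commands, and the paths, are placeholders: the real shutdown, backup, and startup commands depend entirely on the database product in use.

```python
import subprocess

def cold_backup():
    # 1. Shut the database down so its files are closed and consistent.
    subprocess.run(["db_admin", "shutdown"], check=True)    # placeholder command
    try:
        # 2. Back up the closed database files to tape (or other media).
        subprocess.run(["backup_tool", "/data/db", "/dev/tape0"],
                       check=True)                          # placeholder command
    finally:
        # 3. Restart the database even if the backup itself failed.
        subprocess.run(["db_admin", "startup"], check=True)  # placeholder command

if __name__ == "__main__":
    cold_backup()   # typically scheduled to run in the quiet backup window
```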
5.2. Types of database backup
The commonly used types of database backup are:
l Cold backup
l Hot backup
l Logical backup
5.2.1 Cold backup
Cold backup was discussed in the preceding section. The idea is to shut the database system down and back it up while nothing can access it. This method is best at preserving data integrity. If the database is too large to be backed up within the backup window, however, other methods must be considered.
5.2.2 Hot backup
A hot backup is one taken while the database is running and updates may still arrive. Hot backup relies on the system's log files: while the backup runs, update and change commands are "stacked" in the log rather than truly written to the database records. With the updates queued and the database not actually changing, the database can be backed up in a consistent state. The fatal drawback of hot backup is its risk, for three reasons. First, if the system crashes during the backup, all the transactions stacked in the log file are lost, that is, the data is lost. Second, a hot backup requires the database administrator (DBA) to monitor system resources carefully, ensuring that the log files do not exhaust the storage space, which would be unacceptable. Finally, the log files themselves must be backed up so the data can be rebuilt, and they must be coordinated with their database files, which adds complexity.
5.2.3 Logical backup
A logical backup uses software to extract the data from the database and write it, with its structure, into an output file. The output file is not a database table but an image of all the data in the tables. In most client/server architectures, Structured Query Language (SQL) is used to create the output file. The process is slow, and impractical for full backups of large databases. It is, however, a good choice for incremental backups, that is, backing up only the data that has changed since the last backup.
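Here is a minimal sketch of a logical incremental backup in Python, using SQLite purely as a stand-in for the client/server DBMS. The `last_modified` column used to identify changed rows is an assumption (the application would have to maintain it), and the `repr`-based value formatting is a simplification of proper SQL quoting.

```python
import sqlite3

def logical_incremental_backup(db_path, table, since, out_path):
    """Export rows of `table` changed after `since` as SQL statements.

    Assumes the table has a `last_modified` timestamp column.
    """
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE last_modified > ?", (since,))
    cols = [c[0] for c in cur.description]
    with open(out_path, "w") as out:
        for row in cur:
            # repr() is a rough stand-in for real SQL value escaping.
            values = ", ".join(repr(v) for v in row)
            out.write(f"INSERT INTO {table} ({', '.join(cols)}) "
                      f"VALUES ({values});\n")
    conn.close()
```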
Recovering data from a logical backup requires generating the inverse SQL statements. Although the process is very time-consuming and its time overhead large, the results are well worth it.
5.3. Database backup performance
The performance of a database backup is described by two parameters: the amount of data copied to tape and the time the job takes. Between data volume and time there is a stubborn tension. If all the data can be moved to tape within the backup window, there is no problem; if it cannot, a very serious problem arises.
Typically, there are several ways to improve database backup performance:
Upgrade the database management system.
Use faster backup devices.
Back up to disk. The disk can be on the same system or on another system on the LAN. If an entire volume or server can be dedicated as the backup disk, this method works best.
Use a local backup device. In this case, make sure the SCSI host adapter can sustain high-speed data transfer; the backup device should also be attached to a separate SCSI interface.
Back up from the raw disk partition. Reading data directly from the disk partition, rather than through file system API calls, speeds up the backup.
In addition to the methods above, St. Bernard Software's Open File Manager product can also improve database backup performance. The product recognizes the backup process by NLM or by a designated user ID and password. It intercepts each disk write just before it happens, copying the old data into a disk cache area before allowing the update to overwrite it. When the backup process later reaches such a disk block, the block's contents are read from the cache area instead of from the disk. In this way an open file can be backed up without losing integrity: the backup preserves the file as it was at T = 0.
Open File Manager can also handle several small database files together, diverting their updates to buffers in the same way. This ensures that they are backed up as a single consistent set, with full data integrity.
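The copy-before-write idea behind such products can be sketched as follows. This is not Open File Manager's actual code, only an illustration of the mechanism: old block contents are saved to a cache before each overwrite, and the backup reads cached blocks in preference to the live disk.

```python
class OpenFileSnapshot:
    """Toy copy-before-write cache: backup sees the file as it was at T=0."""

    def __init__(self, blocks):
        self.blocks = blocks     # the "live" file, as a list of block buffers
        self.cache = {}          # block number -> contents saved at T=0

    def write_block(self, n, data):
        # Before the update overwrites old data, keep a copy for the backup.
        if n not in self.cache:
            self.cache[n] = self.blocks[n]
        self.blocks[n] = data

    def backup_read(self, n):
        # The backup process gets the T=0 contents, cached or live.
        return self.cache.get(n, self.blocks[n])
```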
5.4. System and Network Integrity
Protecting the integrity of the database, beyond the techniques already discussed, can also be achieved through high reliability of the system and the network.
5.4.1 Server protection
The server is the primary machine on the LAN; to protect the integrity of a network database, the server itself must be protected. Ways to protect the server include:
Power conditioning, to ensure the server can run long enough to complete the database backup.
Environmental management: the server should be placed in an air-conditioned room, vents and surroundings should be kept clean, and both should be checked and cleaned regularly.
The room housing the server should have strengthened physical security.
Carry out hardware replacement in the server carefully and on schedule, thereby improving the reliability of the server's hardware.
Where possible, use an auxiliary server to provide real-time failover.
Use mirroring or any other form of replication to provide some degree of fault tolerance. The system receiving the replicated data should be able to take over the online work after the original system fails. Schemes of this type reduce the loss to the network database after a system failure; they do not, however, help with failures that strike in the middle of an update on the primary.
5.4.2 Protection of the client
For database integrity, protecting the client or workstation is as important as protecting the server. Client protection can proceed along the following lines:
Power conditioning, to guarantee the power the client needs for normal operation.
Battery backup, so the client can keep running until files are saved and transactions are completed.
Regular replacement of client or workstation hardware.
5.4.3 Network connection
The network connection comprises the cables, hubs, routers, and similar devices between the servers and the workstations or clients. Cable installation should be done to a professional standard with quality-assured components, and network management tools should monitor data transmission across the network. Power-conditioning equipment, including battery backup, should also protect all network connection components. If possible, design auxiliary network connection paths, that is, redundant network paths such as a dual backbone or switch-controlled connections, so that a network connection fault can be detected quickly and users reconnected.
5.5. Recovery of the database
A database system has only two measures against failure:
l Improve the reliability of the system as much as possible
l After the system fails, restore the database to its original state
Reliability alone is far from enough, because in any system, however reliable, failure is ultimately inevitable. Restoring the database to its original state after a system failure is the province of recovery technology.
5.5.1 Types of Recovery Technology
Recovery technology can be divided roughly into three types:
l Recovery based on backups alone
l Recovery based on backups and running logs
l Recovery based on multiple backups
Recovery based on backups alone
Recovery based on backups alone evolved from file system recovery techniques: the data on disk is periodically copied or dumped onto tape. Since the tape is stored off-line, a failure of the running system does not affect it. When the database fails, the most recent backup tape is used to restore it; that is, the database on the backup tape is copied back to the disk location where the original resided. With this approach the database can be restored only to the state of the most recent backup, and all updates between that backup and the failure are lost: the longer the backup interval, the more update data is lost.
At any given time, the data in a database is generally only partially updated; it is rare for all of it to change. If only the updated physical blocks are dumped, the volume of the dump shrinks dramatically and the dump takes far less time. The dump frequency can then be raised, reducing the amount of updated data lost at failure. This kind of dump is called an incremental dump.
Recovery based on incremental dumps is simple and adds no overhead to normal database operation; its greatest drawback is that it cannot restore the database to its most recent consistent state. It therefore suits only small, less critical database systems.
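The block-level idea can be sketched as follows: track which physical blocks a write touches, and dump only those blocks. The in-memory dirty-block set here is a stand-in for what a real system would keep persistently.

```python
class IncrementalDumper:
    """Toy block store that dumps only blocks updated since the last dump."""

    def __init__(self, blocks):
        self.blocks = blocks      # block number -> contents
        self.dirty = set()        # blocks updated since the last dump

    def write_block(self, n, data):
        self.blocks[n] = data
        self.dirty.add(n)

    def incremental_dump(self):
        # Dump only the changed blocks, then clear the dirty set.
        dump = {n: self.blocks[n] for n in self.dirty}
        self.dirty.clear()
        return dump               # far smaller than a full dump
```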
Recovery based on backups and logs
The system keeps a running log recording the state of the database; it generally includes three kinds of content:
l Before images (BI)
l After images (AI)
l Transaction status
Before image
The before image is the content of the affected physical blocks as they were before a transaction updated them, recorded in units of physical blocks. In recovery, the before image is used to return the database to its pre-update state, that is, to withdraw the update; this operation is called undo.
After image
The after image is the opposite: it is the content of the physical blocks after a transaction has updated the database, again recorded in units of physical blocks. In recovery, the after image is used to bring the database forward to its post-update state, which amounts to re-applying the update; this operation is called redo.
Transaction status
The transaction status recorded in the log captures how far each transaction has progressed, so that each can be treated correctly when the database is restored.
A committed transaction is one that has executed successfully; the data it wrote may be accessed by other transactions.
A failed transaction is one whose effects on the database must be eliminated; this treatment is called rollback.
Recovery based on backups and logs works as follows: when the database fails, the most recent backup is loaded first; then, following the log, the before images of uncommitted transactions are written back to undo them, which is called backward recovery, and the after images of committed transactions are re-applied where necessary to redo them, which is called forward recovery.
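A sketch of backward and forward recovery over a simple log follows. Each log record carries the transaction ID, the block number, and the before and after images; the record layout is invented for the example, and `committed` stands for whatever the log's transaction-status records indicate.

```python
def recover(blocks, log, committed):
    """Restore `blocks` (block no. -> contents) using a BI/AI log.

    `log` is a list of (txn_id, block_no, before_image, after_image);
    `committed` is the set of transaction IDs recorded as committed.
    """
    # Backward recovery: undo uncommitted transactions, newest first,
    # by writing their before images back.
    for txn, n, before, after in reversed(log):
        if txn not in committed:
            blocks[n] = before
    # Forward recovery: redo committed transactions in log order,
    # by re-applying their after images.
    for txn, n, before, after in log:
        if txn in committed:
            blocks[n] = after
    return blocks
```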
The disadvantage of this technique is that maintaining the running log consumes considerable storage space and affects database performance. Its advantage is that the database can be recovered to its most recent consistent state. Most database management systems support this recovery technique.
Recovery based on multiple backups
Recovery based on multiple backups requires that each backup have an independent failure mode, so that the backups can be used to recover one another. An independent failure mode means that no single failure can destroy more than one backup. The key to obtaining independent failure modes is to keep each backup's supporting environment independent: no shared power supplies, disks, controllers, or CPUs. Where reliability requirements are moderate, disk mirroring is used: the database is stored as a dual backup on two separate disk systems. To keep the failure modes independent, the two disk systems have their own controllers and CPUs, though each can switch over to cover the other. On reads, either disk may be read; on writes, the same content is written to both disks. When the data on one disk is lost, it can be recovered from the other.
Recovery based on multiple backups is also used in distributed database systems. There, data backups are kept on different nodes, originally for performance or other reasons, and because the nodes are separate, their failure modes are independent as well.
5.5.2 Shadow-page recovery technology
Each relation (file) has a page table, each entry of which is a pointer to one page (block) of the relation. When a page is updated, the old page is left unchanged and the new content is written to a freshly allocated page. At commit, the pointer in the page table is switched from the old page to the new one, so updating the page pointer is what commits the update. The old page in effect serves as a before image; because the storage medium may still fail, it is retained, and is called the shadow page.
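The pointer switch at the heart of shadow paging can be sketched as follows. Pages and the page table are plain Python structures invented for the example; the single assignment that replaces the old page table with the new one plays the role of the atomic commit.

```python
class ShadowPagedFile:
    """Toy shadow-paging store: updates never touch old pages in place."""

    def __init__(self, pages):
        self.pages = list(pages)                    # page storage
        self.page_table = list(range(len(pages)))   # current page table

    def update(self, updates):
        """Apply {slot: new_content} as one transaction."""
        new_table = list(self.page_table)   # work on a copy of the table
        for slot, content in updates.items():
            self.pages.append(content)      # write content to a fresh page
            new_table[slot] = len(self.pages) - 1
        # Commit: one pointer switch. Until this line, readers still see
        # the old pages (the shadow); a crash before it loses nothing.
        self.page_table = new_table

    def read(self, slot):
        return self.pages[self.page_table[slot]]
```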
Before a transaction commits, other transactions can access only the old pages; after it commits, they access the new ones. If a fault occurs during execution, before the commit, the database state is the before image; if it occurs after the commit, the database state is the after image. This naturally satisfies the consistency requirements on the data. Backups and after images are needed only when the database itself is damaged; when it is undamaged, no recovery measures are required. Shadow-page recovery has the following limitations and disadvantages:
l At any one time, a file allows only one transaction to update it
l At commit, the master record is generally limited to one page, so the number of files is limited by the size of the master record
l The size of a file is limited by the size of its page table, and the size of the page table is limited by the buffer size
l Because pages are never updated in place, it is hard to keep a file's pages contiguous on disk
Shadow-page recovery is therefore generally used in small database systems; it is not suitable for large ones.
5.5.3 Types of failure and recovery countermeasures
The recovery power of any recovery method is limited; in general, a method is effective only against certain classes of failure and useless outside them. Every method described above requires a backup; if the backup itself is destroyed by irresistible forces, the recovery methods described below are powerless. The failures that the usual recovery methods do handle, the high-probability cases, can be divided into three categories.
Transaction failure
Causes of transaction failure:
l The transaction cannot be executed to completion and is aborted
l An operator error, or a change of mind, leads to a request to undo the transaction
l The transaction is aborted by the system for scheduling reasons
Transaction failures happen frequently. They necessarily occur before the transaction commits, and a cancelled transaction can no longer be committed. The following measures are taken to recover from a transaction failure (a sketch follows the list):
l Clear the message queue of the abandoned transaction
l If the transaction has already made updates, undo them
l Delete the transaction's ID from the active transaction list and release the resources the transaction occupies
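A minimal Python sketch of these three steps, assuming a simple in-memory model in which undo is done from saved before-images (all names are hypothetical):

    # Sketch of aborting a failed transaction: discard its messages,
    # undo its updates from before-images, then drop it from the
    # active-transaction set and release its locks. Illustrative only.
    def abort_transaction(txn_id, db, message_queues, active_txns,
                          before_images, locks):
        # 1. Clear the message queue of the abandoned transaction.
        message_queues.pop(txn_id, None)
        # 2. Undo updates already made, restoring the before-images
        #    in reverse order.
        for item, old_value in reversed(before_images.get(txn_id, [])):
            db[item] = old_value
        before_images.pop(txn_id, None)
        # 3. Remove the transaction from the active set and release
        #    the resources (here, locks) it holds.
        active_txns.discard(txn_id)
        for item in list(locks):
            if locks[item] == txn_id:
                del locks[item]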
System failure
The system referred to here includes the operating system and the database management system. When the system crashes it must be restarted; the data in memory may be lost, but the data in the database is not damaged. Causes of system failure are:
l Power failure
l Hardware or software failures other than failures of the storage media holding the database
l A restart of the operating system or the database management system
To restore the database to a consistent state after a restart, an UNDO operation is performed on every transaction that had not committed, and a REDO operation on every transaction that had committed. A sketch of this procedure follows.
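The restart procedure can be sketched as two passes over the log, a textbook-style simplification rather than any particular DBMS's algorithm; the tuple layout of the log records below is invented for the example.

    # Sketch of restart recovery after a system failure: scan the log,
    # REDO committed transactions with their after-images and UNDO the
    # rest with their before-images.
    def restart_recovery(log, db):
        committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
        # REDO forward: reapply after-images of committed transactions.
        for rec in log:
            if rec[0] == "UPDATE" and rec[1] in committed:
                _, txn, item, before, after = rec
                db[item] = after
        # UNDO backward: restore before-images of uncommitted ones.
        for rec in reversed(log):
            if rec[0] == "UPDATE" and rec[1] not in committed:
                _, txn, item, before, after = rec
                db[item] = before

    # Example log record: ("UPDATE", txn, item, before-image, after-image)
    log = [("UPDATE", "T1", "x", 0, 1), ("COMMIT", "T1"),
           ("UPDATE", "T2", "y", 5, 9)]          # T2 never committed
    db = {"x": 0, "y": 9}
    restart_recovery(log, db)                    # db -> {"x": 1, "y": 5}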
Media failure
Media failure refers to a disk failure that damages the database, such as a scratched disk surface or a damaged head.
Modern DBMSs generally provide measures for recovering the database to the most recent consistent state after a media failure. The process is as follows (see the sketch after the list):
l Repair the system, replacing the disk if necessary
l If the system has crashed, restart it
l Load the most recent backup
l Using the after-images in the log, redo all transactions committed since that backup
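Taken together, the process amounts to reloading the backup and rolling forward from the log, roughly as in this sketch (the log record layout and names are invented for illustration):

    # Sketch of media recovery: reload the most recent backup, then
    # roll forward by redoing, from the log's after-images, every
    # transaction that committed after that backup was taken.
    def media_recovery(backup, log, backup_time):
        db = dict(backup)                       # load the nearest backup
        committed = {rec[1] for rec in log
                     if rec[0] == "COMMIT" and rec[2] > backup_time}
        for rec in log:
            if rec[0] == "UPDATE" and rec[1] in committed:
                _, txn, item, after = rec
                db[item] = after                # redo with the after-image
        return db

    # Example records: ("UPDATE", txn, item, after-image)
    #                  ("COMMIT", txn, commit_time)
    log = [("UPDATE", "T1", "x", 7), ("COMMIT", "T1", 130)]
    print(media_recovery({"x": 0}, log, backup_time=100))   # {'x': 7}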
The cost of recovering a database from a media failure is high: the log must supply the after-images of all the transactions to be redone, and the workload is very large. These costs must be paid, however, to ensure the safety of the data.
No computer system or computer network system is immune to natural or man-made disasters, especially large-scale environmental threats such as earthquakes, fires, and violent storms that can destroy an entire building. Recovering a computer network after such a disaster may be one of the most challenging tasks a professional system administrator or network administrator ever faces. After a disaster it may be impossible to return to the former daily routine: the management tools used in everyday work may be gone, and there may not even be an assistant to help. All of this makes post-disaster recovery difficult to a degree that is hard to imagine and hard to predict. The preparation done beforehand is therefore the key to the success of the recovery. This chapter provides only an outline of the information network users need in a disaster recovery plan, and it can serve as a model for such a plan. With this basic knowledge, you can arrange the time and resources to prepare a comprehensive disaster recovery plan.
6.1. Preparation for Disaster Recovery Plan
A disaster recovery plan determines what must be done to restore the computer network system after it suffers a catastrophic strike or destruction. We must therefore consider carefully how to recover the network as quickly as possible after such a disaster and keep the resulting losses to a minimum.
Developing a disaster recovery plan is very important for any network user. Unfortunately, the plan is often set aside: some LAN users have never considered the need for one, or do not understand, or only half understand, its significance.
The first problem in preparing a network disaster recovery plan is deciding where to start, and this is also a matter of the principles on which the plan is built.
6.1.1 Planning from the worst case
The degree of damage a disaster may inflict on a computer network system cannot be estimated in advance. When developing a disaster recovery plan we should start from the worst case, such as the complete destruction of the network system, so that time can be arranged and existing resources fully exploited to create a widely applicable disaster recovery plan.
6.1.2 Take advantage of existing resources
Among all kinds of resources, human resources are undoubtedly the most valuable. Besides the leaders in charge of the network and the network administrators, developing a computer network disaster recovery plan should involve people who know about rebuilding premises and arranging services. Their work is not necessarily the recovery of the computer network system itself; they can be assigned to research and plan the tasks of resuming work after the disaster. If these resources within the organization can be drawn on, a great deal of time can be saved in preparing the disaster recovery plan.
6.1.3 An executable plan
Typically, a network system may take several years to design and build; suddenly rebuilding it within a few days or weeks is easier said than done! This demands a plan backed by all the skills and the detailed organizational arrangements needed to guarantee success.
Recovery from a disaster should be the collective effort of a team, but unless every team member clearly understands the guiding principles and procedures, the team's work becomes extremely difficult. Whatever the action taken in disaster recovery, it should be made plainly and unambiguously clear to everyone who needs to understand it. A disaster recovery plan that changes constantly and is never communicated effectively to others is almost as bad as no plan at all.
A document prepared in advance helps to ward off the pressure, suspicion, and blame of others. When a catastrophic event suddenly occurs, the network system can be rebuilt from the ruins according to the established guidelines.
6.2. Disaster recovery methodology
The basis of disaster recovery methodology is understanding and mastering the network system's data and the storage of its backups, so that you can work out what needs to be done when a disaster occurs and how to do it.
There are several different analytical approaches to disaster recovery, but their basic principles are the same. The main points are as follows:
l Risk analysis
l Risk assessment
l Application priority
l Establishing recovery requirements in writing
l Testing and adoption of the plan
l Distribution and maintenance of the plan
6.3. Disaster recovery plan
The ultimate goal of the disaster recovery effort is to produce a "disaster recovery plan document" that can be put into effect after a disaster. Such a document has considerable cohesive force: by giving each member of the recovery team a list of actions that must be followed, it lets everyone on the team apply their abilities and carries the recovery work through in an orderly way.
6.3.1 Backup of data
Preparation for disasters should begin with guaranteeing that the data can be recovered. Although a disaster recovery plan does not necessarily include the backup operation itself, good and reliable backup operations are a prerequisite of any such plan; otherwise, talk of recovery is a waste of time. When considering the disaster recovery plan, therefore, the following precautions should already be in place in case disaster strikes:
l Perform a backup operation every day and check the integrity of the backup
l Regularly rotate tapes to an offline location to ensure recoverability when an on-site disaster occurs
l Understand, become familiar with, and master how to use the backup system for data recovery
It is well known that backup operations sometimes go wrong: a backup that lost data, or that was never actually written to the media, is useless for recovering the system. Backups therefore need to be checked to verify their completeness; a sketch of such a check is given below.
Regularly moving backup media off line prevents a single disaster from destroying everything. Establishing a rule that regularly transfers tapes to an offline location ensures that the data can still be reached when an emergency occurs.
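As a concrete illustration, a daily check might recompute a checksum for every backed-up file and compare it with the original; the Python sketch below does exactly that (the directory paths are examples only):

    # Sketch of a daily backup-integrity check: compare SHA-256
    # checksums of each source file against its backup copy.
    import hashlib
    from pathlib import Path

    def checksum(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_backup(source_dir, backup_dir):
        bad = []
        for src in Path(source_dir).rglob("*"):
            if src.is_file():
                copy = Path(backup_dir) / src.relative_to(source_dir)
                if not copy.exists() or checksum(src) != checksum(copy):
                    bad.append(str(src))
        return bad    # an empty list means the backup verified cleanly

    # Example (hypothetical paths):
    print(verify_backup("/data", "/mnt/backup/data"))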
6.3.2 Risk Analysis
According to the disaster recovery methodology, risk analysis is the first step in preparing the disaster recovery plan. Its central content includes the following three questions:
l What is at risk?
l What can go wrong?
l What if it happens?
What is at risk?
What is at risk in a disaster? The question is easy to ask, but answering it as part of the disaster recovery plan is not so simple. All the components of the network system must be considered, including servers, workstations or clients, data, and the communication devices that link the system to the outside world. A structural diagram of all the components in the network system helps in creating the list of items that will need to be replaced after a disaster. Remember that software also needs to be replaced; all software products must be identified, including the file system tools used for network operation.
A single omission from the list of items to be replaced can easily bring the post-disaster recovery to a halt. For example, without the serial cable that connects the modem, remote work is impossible.
People, especially key personnel, are a particularly important risk, yet they are often forgotten. If key personnel are injured in the disaster, others have to take over their duties. Cross-training can therefore reduce the impact of a key person being unable to take part in the recovery work.
In addition, it is necessary to prepare for each post a manual covering the most important applications.
What can go wrong?
"What can go wrong?" is an intriguing question in disaster recovery planning, and the answers range from the straightforward to the almost unbelievable: unpredictable disasters do happen in this world. Fires, floods, and storms are quite common, and there should be practical, preventive countermeasures for them. In a fire, for example, the heat, the smoke, and the water discharged by the automatic sprinkler system and by firefighting are all destructive to a computer network system; storage media are easily ruined by high temperature and smoke. Cleaning up the toxic residue after a fire takes time, which means the system and its data may be out of reach for quite a while. As a countermeasure, well-trained specialists in protective clothing can enter the building, retrieve the media from the equipment, and try to recover the data from the disks.
Unfortunately, human error and deliberate human damage may be the most likely causes of data loss or destruction. Such failures should be taken as seriously as any other kind of disaster threatening the network system.
What if it happens?
This question calls for some financial budgeting. It is helpful to draw up several budgets for different levels of protection and preparation. Even if the cost of guarding against certain threats cannot be paid now, it is still worth knowing what those threats are, so that protection can be improved in the future.
6.3.3 Risk Assessment
The risk assessment referred to here should be understood as assessing the commercial loss caused by the interruption of network services. Typically, the loss caused by a disaster can be decomposed as follows:
l The actual cost of replacing the network system equipment
l Production loss
l Opportunity loss
l Loss of reputation
The actual loss of system equipment and software is relatively easy to calculate; its value can be worked out from the inventory of the network system.
The production loss caused by the outage of the network system can be calculated from records of previous production.
The assessment of opportunity loss covers two aspects: first, the sales income lost because of the network failure; second, the income lost by the marketing organization.
Loss of reputation cannot be measured in concrete figures, but it too must be included in the assessment. A simple sketch of how the four components might be totalled follows.
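The arithmetic itself is straightforward; the sketch below merely totals the four components (every figure and name is an invented placeholder, not real data):

    # Sketch of a disaster-loss estimate combining the four components.
    # All figures are illustrative placeholders only.
    equipment_replacement = 120_000   # from the network inventory
    production_loss       =  80_000   # from previous production records
    opportunity_loss      =  50_000   # lost sales + marketing income
    reputation_loss       =  30_000   # intangible; a stated estimate

    total_loss = (equipment_replacement + production_loss
                  + opportunity_loss + reputation_loss)
    print(f"Estimated loss per disaster event: {total_loss}")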
6.3.4 Application Priority
When system recovery begins after a disaster, the applications restored first should be those most urgently needed for production and operations; energy and time should not be wasted on unimportant systems and data. An organization typically has many departments and therefore many applications, and every department will list the applications of its own sector as "most important" even when they are not. It is therefore both important and necessary for senior management to help determine the order in which systems are recovered.
The disaster recovery plan should include a list giving the order of system recovery, and it should be signed by senior executives to forestall arguments.
Once you know what needs to be recovered, examine what is required to restore its functionality. An application on the network involves the server systems on which it stores its data, the workstations where processing is done, the printers or fax machines used for I/O, the network connections that tie these together, and the application software itself. Client/server structures and distributed applications add further complexity.
When the network administrator settles the priority levels of the applications with senior executives, the minimum number of workstations needed to make each system usable must also be determined. After the systems are running normally, the scale of the network can be increased step by step.
One benefit of recovering by application is that restoring a single application takes less time than restoring an entire server. However, this approach demands a deeper understanding of the system than is usual. First, you need to know where the data the application uses is located and what it depends on in the file system; if system files contain application information, such as Windows .ini files, you must ensure that those files are recovered along with the application. Second, you have to know how to use the backup system to perform this kind of selective recovery. A sketch of a per-application restore manifest is given below.
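One way to capture this knowledge in advance is a per-application restore manifest listing the data locations, dependent system files, and minimum workstation counts (every path and name below is a hypothetical example):

    # Sketch of a per-application restore manifest: where the data
    # lives and which system files (e.g. Windows .ini files) must be
    # restored with it. All paths and names are hypothetical.
    RESTORE_MANIFEST = {
        "accident-files": {
            "priority": 1,
            "data": [r"D:\appdata\accidents"],
            "system_files": [r"C:\WINDOWS\accident.ini"],
            "min_workstations": 3,
        },
        "finance": {
            "priority": 2,
            "data": [r"D:\appdata\finance"],
            "system_files": [r"C:\WINDOWS\finance.ini"],
            "min_workstations": 2,
        },
    }

    # Restore applications in priority order.
    for name, app in sorted(RESTORE_MANIFEST.items(),
                            key=lambda kv: kv[1]["priority"]):
        print(f"restore {name}: {app['data'] + app['system_files']}")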
To reduce the time needed to get a working network up and running, consolidating these applications onto a single server may be the fastest course.
6.3.5 Establishing recovery requirements
The core of establishing recovery requirements in the disaster recovery plan is deciding the acceptable length of time within which a functioning network must be running again, the so-called Recovery Time Objective (RTO). The RTO that is set should be tested to ensure it is actually achievable. Different applications have different RTOs, as the sketch below illustrates.
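Whether a tested recovery actually meets the objective can be checked mechanically, as in this sketch (the applications and hour values are invented examples):

    # Sketch of an RTO check: compare the measured recovery time of
    # each application with its Recovery Time Objective. The hours
    # below are invented examples for illustration.
    rto_hours      = {"accident-files": 4, "finance": 24, "email": 48}
    measured_hours = {"accident-files": 6, "finance": 20, "email": 30}

    for app, rto in rto_hours.items():
        actual = measured_hours[app]
        status = "meets RTO" if actual <= rto else "MISSES RTO"
        print(f"{app}: recovered in {actual} h, objective {rto} h -> {status}")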
6.3.6 Generating the actual disaster recovery document
Disaster recovery plan
The main contents of the network disaster recovery plan are:
l Personnel notification lists, phone numbers, maps, and addresses
l Priorities, responsibilities, relationships, and procedures
l Acquisition and purchasing information
l Network diagrams
l Systems, configurations, and tape backups
Make sure it is clear who must be notified when disaster strikes; for example, if a fire breaks out, call the fire brigade first. Without up-to-date phone numbers and addresses it is hard to reach the people you need. Maps showing the location of the temporary operations center and the offline facilities save a great deal of time, and it is also useful to show alternative routes in case the usual route is blocked.
In considering how to respond to a disaster, concentrate on the established priorities: restoration of the highest-priority applications should begin at once. People should be given clear instructions and responsibilities, and the relationships between tasks should be set down in writing so that any bottlenecks become visible. Finally, the document should include the detailed steps and tasks for carrying out installation and recovery operations accurately.
Acquisition and purchasing information tells you how to issue purchase orders and have equipment shipped to the temporary operations center. This means providing the vendors' addresses and the required methods of transportation.
Network diagrams greatly simplify the task of rebuilding the network. A detailed diagram of the original network helps you rebuild it quickly and put it into operation. Labelling the cables will spare you a great deal of confusion later.
If you can stockpile some replacement systems capable of handling various tasks, the recovery operation gains time. After a replacement system starts running, restore it to the original configuration according to the configuration information recorded in the delivery report.
You need to make sure a working tape backup system is available; if possible, keep a standby system offline, including the SCSI adapter, cables, and device driver software. If the working backup system is upgraded, upgrade the offline system as well; otherwise you may meet incompatible tape formats, databases, or other problems that make the data unrestorable.
Management support
One difficulty of disaster recovery planning in a network environment is that computer network technology changes very quickly, bringing new equipment and new application systems. The written plan should therefore be adjusted or updated regularly, for example once a year.
6.3.7 Testing and adoption of the plan
Testing the prepared plan lets you prove to yourself that it really is feasible, and the more open-minded the testing, the better. The purpose of testing is to find whatever problems the plan contains, not merely to verify that it is feasible; when an error is found, write it down and revise the plan.
Testing should follow the plan's contents section by section. Testing phone numbers, addresses, purchasing information, and the like is fairly simple, but testing data recovery is not.
Test the backup software to see whether the high-priority applications are restored as expected. This should be done on a separate, isolated network to avoid server license conflicts. Once the data has been recovered, test whether users can access it; this requires connecting several workstations to the network to simulate real end users with accounts on the original server. At this point you may need to amend the plan to include the administrative information needed to set up end-user access. Test each operation in the plan, and then verify that the same results can be obtained on the production network.
6.3.8 Distribution and maintenance of the plan
Once the plan has been tested and proved usable, it must be distributed to those who need it. Try to control its distribution and make sure that multiple versions of the plan are not left in circulation. It is also necessary to keep spare copies of the plan at the offline facility or at other places near the workplace, and to retain a list of everyone who holds a copy. When the plan is updated, the earlier versions must be replaced and withdrawn.
Maintaining the plan is easier: the most important part is re-evaluating the application systems and determining which of them now matter most. If the backup system has been replaced, make sure the revision includes information on how to operate the new or upgraded backup system. Maintaining the disaster recovery plan also helps those involved keep in communication with one another.