Journaling File Systems for Linux
Original article: http://www.byte.com/column/BYT20000524S0001
Original author: Moshe Bar. Translated by Brimmer.

Over the last 12 months, Linux has consolidated its position as a server operating system. Just as clustering is important for enterprise-class applications, so is the journaling file system. Why is a journaling file system important? How does it work? Which journaling file systems are available for Linux?

A journaling file system is safer than a traditional file system because it tracks changes to the disk contents in a separate log. Much like a relational database (RDBMS), a journaling file system can commit or roll back changes to the file system as transactions.

Ext2 falls short

Although Linux can support a wide variety of file systems, almost all Linux distributions use ext2 as the default. Other file systems Linux supports include FAT, VFAT, HPFS (OS/2), NTFS (Windows NT), Sun's UFS, and more. The designers of ext2 were mainly concerned with efficiency and performance. Ext2 does not write a file's metadata (information about the file, such as its permissions, owner, and creation and access times) at the same time as the file's contents. In other words, Linux first writes the file contents and writes the file's metadata later, when it gets around to it. If the power suddenly fails after the file contents have been written but before the metadata has, the file system is left in an inconsistent state. On a system that writes large numbers of files (for example, a free web e-mail service like Hotmail), this can have very serious consequences.

A journaling file system helps solve this problem. Suppose you are updating a directory. You have changed 23 file entries in the fifth block of this huge directory. While that block is being written, the power suddenly fails; the block is left only partially written, that is, corrupted.
When the machine restarts, Linux (like other Unix systems) runs a program called fsck (file system check) that scans the entire file system, making sure that every block is correctly allocated or free. It will find the corrupted directory block and try to repair it, but there is no guarantee that fsck can actually repair the damage; quite often it cannot. When that happens, all the file entries in the directory may be lost, which means lost files.

If the file system is large, the fsck scan takes a long time. On a machine with billions of files, fsck may run for more than 10 hours. During that time the system is unavailable, which is a very long downtime. A journaling file system avoids this.

How does a file system work?

A file system stores data on a storage device by allocating blocks to each file. It must maintain the allocation information for each file's blocks, and that allocation information must itself be stored on disk. DOS and Windows users may remember the FAT file system. Different file systems allocate and read file blocks in different ways.
There are two common allocation strategies: block allocation and extent allocation. Block allocation hands out individual blocks to a file as it grows, while extent allocation hands out a series of consecutive blocks whenever a file runs out of disk space.

Traditional Unix file systems use block allocation, which provides a flexible and efficient allocation policy: blocks on disk are assigned to a file as needed, which minimizes wasted storage space. But when a file grows slowly, its blocks end up scattered across the disk. This leads to excessive seek times, because reading the file may mean reading blocks at random rather than sequentially, which is very inefficient. Random placement can be mitigated by an allocation policy that places blocks contiguously whenever possible; a smart block allocator can achieve largely contiguous allocation, which reduces seek time. However, once the free space of the whole file system becomes fragmented, contiguous allocation is no longer possible.

Whenever a file is extended, the block allocation algorithm must also write out information about the location of the newly allocated block. If only one block is added per extension, a great deal of extra disk I/O is needed just to write this structural information about the file's blocks — this structural information is the metadata mentioned earlier. Metadata is written to the storage device synchronously, which means that an operation that changes a file's size does not complete until all of its metadata writes have completed. As a result, metadata operations significantly reduce the performance of the whole file system.

Extent-based allocation

An extent-based file system allocates many consecutive blocks at once.
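The contrast between the two strategies can be sketched in a few lines of Python. This is a hypothetical, heavily simplified model (the "disk" is just a counter of the next free block; real file systems track free space in on-disk bitmaps or B-trees), but it shows why extent allocation needs far fewer metadata records:

```python
def block_allocate(next_free, blocks_needed):
    """Block allocation: hand out one block per request.
    Every block added to the file is a separate metadata record."""
    allocated = []
    for _ in range(blocks_needed):
        allocated.append(next_free)   # one block -> one bookkeeping entry
        next_free += 1
    return allocated, next_free

def extent_allocate(next_free, blocks_needed, extent_size=8):
    """Extent allocation: hand out runs of consecutive blocks.
    Metadata records only (start, length) per extent."""
    extents = []
    while blocks_needed > 0:
        run = min(extent_size, blocks_needed)
        extents.append((next_free, run))  # one record covers `run` blocks
        next_free += run
        blocks_needed -= run
    return extents, next_free
```

For a 20-block file, block allocation produces 20 bookkeeping entries, while extent allocation with 8-block extents produces only 3 — which is why the metadata I/O burden of the two schemes differs so much.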
When a file is created, many blocks are allocated at the same time; when the file is extended, another large run of blocks is allocated at once. The file system's metadata is written when the file is created, and as long as the file does not outgrow its allocated blocks, no further metadata needs to be written (until more blocks are required again). This streamlines the allocation bookkeeping and allows a large amount of data to be sent to the storage device in a single operation, which reduces the time a SCSI device spends writing data.

An extent-based file system performs well when reading sequential files, because the file's blocks are laid out contiguously. However, if the I/O pattern is random, the advantage of extent allocation is very limited. For example, to read an extent-allocated file sequentially, we only need to read the starting block and the file length; all the blocks can then be read in order, so the metadata overhead of a sequential read is small. Conversely, if the file is read randomly, we must first look up the address of each required block and then read its contents, which makes extent allocation behave much like block allocation.

Ext2 improves write performance by delaying writes, so that a large amount of data can be written at once instead of a little at a time, which improves system efficiency. Similarly, when reading, ext2 fetches whole clusters of blocks, that is, it uses a read-ahead policy. This improves ext2's read performance by reducing the number of I/O operations that each transfer only a small amount of data. The size of the block or block cluster is fixed at compile time; how to choose the cluster size is beyond the scope of this article.
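The delayed-write behavior described above can be sketched as a small write-clustering buffer. This is an illustrative model, not ext2's actual implementation: small writes accumulate in memory and are flushed to "disk" only in large chunks, so the count of disk operations stays low:

```python
class WriteCluster:
    """Toy model of write clustering: buffer small writes in memory
    and flush them to the device one large chunk at a time."""

    def __init__(self, cluster_size=4096):
        self.cluster_size = cluster_size
        self.buffer = b""
        self.io_ops = 0          # how many times we touched the "disk"

    def write(self, data):
        self.buffer += data
        # Flush only once a full cluster has accumulated.
        while len(self.buffer) >= self.cluster_size:
            self._flush(self.buffer[:self.cluster_size])
            self.buffer = self.buffer[self.cluster_size:]

    def _flush(self, chunk):
        self.io_ops += 1         # one large write instead of many small ones

    def sync(self):
        # Force out any partially filled cluster (e.g. at unmount).
        if self.buffer:
            self._flush(self.buffer)
            self.buffer = b""
```

Sixty-four writes of 128 bytes each (8 KB in total) cost only two 4 KB device operations instead of sixty-four small ones, which is the whole point of clustering.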
However, it is fair to say that cluster size has a great impact on file system performance, and choosing it is a very important aspect of file system design.
File systems such as Veritas's use write clustering, and by default they use 512-byte blocks rather than 1 KB blocks. If ext2 used 4 KB blocks instead of 1 KB, there would be roughly a 20% performance improvement; however, to avoid wasting disk space, the ext2 designers recommend 1 KB blocks.

How does a journaling file system solve the problem?

First a warning: this section is easily misunderstood. A journaling file system does solve some of the problems mentioned above, but it introduces new ones. The design idea of a journaling file system is to track changes to the file system rather than the contents of the file system. To explain, let us compare ext2 and a journaling file system with an example: what happens when we change the contents of the file "test.file"?

First, assume that the inode of "test.file" references four blocks. The data blocks holding "test.file" are numbered 3110, 3111, 3506, and 3507 (the blocks between 3111 and 3506 have already been assigned to other files, so the run is not contiguous). To read the whole file, the disk head first seeks to block 3110 and reads two blocks, then jumps to block 3506 and reads two more.

Suppose you change the third block. The file system reads the third block, changes it, and then writes it back to the same location, block 3506. If you append content to the file, free blocks must be allocated from somewhere else.

On a journaling file system, things are different. The journaling file system does not overwrite block 3506; it writes a new copy of the third block to disk, together with a new copy of "test.file"'s inode. The inode list in memory is then updated so that "test.file" uses the new inode. Every change, addition, and modification is recorded in an area of the file system called the "log" (or journal).
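The "test.file" example above can be sketched as a small in-memory model. This is purely illustrative (a dictionary stands in for the disk, and the block numbers follow the text); it contrasts the in-place ext2-style update with the copy-on-write, logged update of a journaling file system:

```python
disk = {}                      # block number -> block contents
inode = [3110, 3111, 3506, 3507]   # blocks of "test.file", per the text
journal = []                   # ordered log of file-system changes

def ext2_style_update(index, data):
    """In-place update: the old block contents are overwritten.
    A crash mid-write can leave the block half old, half new."""
    disk[inode[index]] = data

def journaled_update(index, data, free_block):
    """Copy-on-write update: write the new data to a fresh block,
    record the change in the log, and point a *copy* of the inode
    at the new block. The old block is untouched until a checkpoint."""
    disk[free_block] = data
    new_inode = list(inode)
    new_inode[index] = free_block
    journal.append(("update", index, free_block))  # recorded in the log
    return new_inode
```

Updating the third block (index 2) with free block 4000 leaves the old block 3506 intact, so a crash at any point leaves either the old version or the new version fully valid, never a half-written block.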
At intervals, at a "checkpoint", the file system updates the inodes on disk and frees the old blocks that files no longer use (for example, the original third block of "test.file"). This is why a journaling file system recovers quickly after a crash: all it needs to recover is the handful of blocks recorded in the log. After a power failure, "fsck" needs only a few seconds of scan time.

This is what I meant by "solves some of the problems". The file system pays for the extra safety with system overhead: every update and most log operations require synchronous writes, which means more disk I/O. System administrators therefore face a question: is a safer file system worth sacrificing some performance? Most will decide based on the actual workload. There is little point in putting the /usr directory on a journaling file system, because /usr is mostly read-only. But you might well consider putting /var, or a directory holding e-mail spool files, on one. Fortunately, such file systems are available for Linux.

One problem with journaling file systems is that they are more prone to fragmentation. Because of the way they allocate blocks, fragmentation builds up easily. The ext2 file system also fragments, though perhaps not as badly. Backing the file system up to tape every month and restoring it not only solves this problem, it also verifies that your backup/restore procedure works. If you want the benefits, you always pay some price, don't you?

Available journaling file systems for Linux

As I write this, two journaling file systems are still in development, and three are available.
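Crash recovery from the log can be sketched as a simple replay loop. Again this is an illustrative model, not any real file system's code: recovery reads only the few journal records and re-applies them to the last checkpointed state, without inspecting any block not mentioned in the log, which is why it takes seconds where a full fsck scan can take hours:

```python
def replay_journal(checkpointed_inode, journal):
    """Re-apply each logged change to a copy of the inode as it stood
    at the last checkpoint. Records use the (op, index, new_block)
    format assumed in this sketch."""
    recovered = list(checkpointed_inode)
    for op, index, new_block in journal:
        if op == "update":
            recovered[index] = new_block
    return recovered
```

With the earlier example, replaying a single logged update of the third block yields an inode pointing at the new block, and the rest of the file system needs no checking at all.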
The two still in development are SGI's XFS (http://oss.sgi.com/projects/xfs/) and the Veritas (www.veritas.com) file system with its volume manager. Both were announced five months ago, but their source code is not yet available. SGI's XFS is the XFS implemented on IRIX (SGI's Unix), and SGI has announced that XFS will be released as open source.

The journaling file systems you can get right away are ReiserFS and IBM's JFS. Both are open source, and many talented people are working on them. The developers of JFS (Journaled File System technology for Linux) include the main developers of JFS on AIX (IBM's Unix). JFS has been thoroughly tested on AIX; it is reliable, fast, and easy to use. ReiserFS applies some new techniques, for example a unified namespace, and is very promising (NameSys). Some Linux distributions already include ReiserFS as an option; SuSE 6.4 makes it easy to use. The latest version of ReiserFS is 3.5.12, and its test results are very exciting. The test used the "Postmark" benchmark: 50,000 transactions against 20,000 files. The results:

Sun E450, 1 GB, Solaris 2.6, VxFS (Veritas) file system: 22 transactions/second
Sun E450, 1 GB, Solaris 2.6, UFS file system: 23 transactions/second
Dual P-III, 1 GB, Linux 2.2.13, standard ext2: 93 transactions/second
Dual P-III, 1 GB, Linux 2.2.13, ReiserFS 3.5.5 journaling beta: 196 transactions/second
Dual P-III, 1 GB, Linux 2.2.13, ReiserFS 3.5.5 journaling beta, mount options notail, genericread: 847 transactions/second

(The Sun machines used Barracuda disks; the x86 machines used Cheetah disks.)

The other journaling file system is JFS. JFS has not yet been adopted by any Linux distribution because it is only at version 0.7; however, it is progressing quickly. The author of this article has joined the JFS project and is now the maintainer of the JFS FAQ.

ReiserFS and JFS are both very easy to install.
After downloading the JFS package, the following steps remain:
1) Unpack jfsXXX.tar.gz.
2) The package contains patches for the kernel (2.2.14, 2.2.15, and 2.3.99). Copy the appropriate patch into the kernel source directory, usually under /usr/src.