gzip principle and implementation of: JIURL Home: http://jiurl.yeah.net Date: 2004-3-15
Gzip uses the DEFLATE algorithm to compress. Zlib, and graphic format PNG, the compression algorithm used is also a deflate algorithm. From Gzip's source code, we learned the principle and implementation of the DEFALTE algorithm. I read the GZIP version of Gzip-1.2.4. Below we will have an analysis and description of the DEFLATE algorithm. First briefly introduce the basic principles, and then detail the implementation. 1 Gzip The basic principle of the compression algorithm used Gzip For files to be compressed, first use a variant of the LZ77 algorithm to compress, and use the HUFFMAN encoded method for the resulting result (actually Gzip depending on the situation, select the use of static HUFFMAN encoding or dynamics Huffman encoding, details are described in implementation). Therefore, it understands the principle of compression of the LZ77 algorithm and Huffman encoding. Let's make a brief introduction to the LZ77 algorithm and Huffman coding. 1.1 LZ77 Algorithm This algorithm is proposed by Jacob Ziv and Abraham Lempel in 1977, so the compression principle named LZ77.1.1.1 LZ77 algorithm If there are two contents in the file, then just know the position of the previous piece and Size, we can determine the contents of the latter. So we can use (the distance between the two, the length of the same content), replacing the last piece. Due to (the distance between the two, the length of the same content) This pair of information is smaller than the size of the replaced content, so the file is compressed. Let's take an example below. There is a file as follows http://jiurl.yeah.net http://jiuml.nease.Net some of the contents, the previous section has already appeared, and the part thereof in () is the same part. Http://jiurl.yeah.net (http: // jiuml.) Nease (.NET) We use (distance between the two) such a pair of information such a pair of information, replacing the last piece. Http://jiurl.yeah.net (22, 13) NEASE (23, 4) (22, 13), 22 is the distance between the same content block and the current position, 13 is the length of the same content. (23, 4), 23 is the distance between the same content block and the current position, 4 is the same as the same content. Due to (the distance between the two, the length of the same content) This pair of information is smaller than the size of the replaced content, so the file is compressed. 1.1.2 LZ77 Use the sliding window to find a matching string LZ77 algorithm using the "Sliding Window" method to find the same part in the file, which is the matching string. Let's take a note here that it refers to a sequence of an arbitrary byte, not just a sequence of those bytes that can be displayed in a text file. The string here emphasizes its location in the file, and its length varies with the case. LZ77 starts at the beginning of the file, and one byte one byte is processed. A fixed size window (before the current processing byte, and next to the current processing byte), as the process is constantly sliding, it is like the shadow of the aircraft slipped the earth. For each byte in the file, the string started with the current processing byte, and each string in the window matches the longest matching string. Each string in the window refers to the string starting in each byte in the window. If the current processing byte starts in the window, there is a matching string in the window, and uses the (distance, matching length) such a pair of information, and then replaces the current string, and then the next byte after the string that has just been processed, Continue processing.
If the current processing byte starts in the window do not match the string in the window, the output is currently processed by the output. When the first byte in the file is handled, the window is still on the current processing byte, that is, it has not slipped to the file, and there is no content in the window, and the byte being processed will not be changed. As the process continues to move, the window is increasing into the file, and finally the entire window slides into the file, then the entire window slides backwards until the entire file ends. 1.1.3 Compression and Separation Use the LZ77 algorithm In order to decompress, it can distinguish between "no matching bytes" and "distance, matching length) pair", we also need to do not match the word The section "or" distance, matching length) is before, put a top, indicate that "there is no matching byte" or "distance between distance, matching length) pair." We use 0 to indicate "no matching bytes", use 1 to "(distance between distance, matching length)." In practice, we will fix (distance, matching length), "distance between" distance "and" matching length ". Due to the number of bits used in the "distance", we use fixed-size windows, such as 32kb size, then use 15 bits (2 ^ 15 = 32K) to save 0-32K Any value of the range. In practice, we will also limit the maximum matching length, so that the number of bits used by "matching length" is fixed. In practice, we will also set a minimum matching length, and we only think is a match only when the match length of the two strings is greater than the minimum match length. We will give an example to explain the reason for this. For example, "Distance" uses 15 bits, "" The length "uses 8 bits," the distance, the matching length) pair "will use 23 bits, which is 1 bit 3 bytes. If the matching length is less than 3 bytes, then "there is no compression, but will not be compressed, but will increase, but will need a minimum match length. Compression: From the beginning of the file to the end, one byte one byte is processed. Use the string that is currently processed bytes, and match each string in the sliding window to find the longest matching string. If the current processing byte starts in the window, you can first output a flag, indicating that the following is a (distance, matching length) pair, then output (distance, matching length) pair, Then continue processing from the next byte after the string that has just been processed. If the current processing byte starts in the window do not match the string in the window, then output a flag bit, indicating that the following is a byte that has no change, and then does not make the output of the change currently processed bytes, and then processes the current processing byte. The next byte. Unzip: From the beginning of the file to the end, read a sign bit each time, by this flag bit to determine the following is a (between the distance, matching length) or a byte that is not changed. If it is a (distance, matching length) pair, read the (distance, matching length between the fixed bit number), and then output the matching string to the current position according to the information in the pair. If it is a byte that is not changed, you will read a byte and then output this byte.
We can see that LZ77 needs to do a lot of matching work when it is compressed, and it is necessary to do when it is decompressed, that is to say that the decompression will be more fast relative to compression. This is a huge advantage for the case where it is necessary to compress and multiplex, is a huge advantage. 1.2 Huffman Encoding Introduction 1.2.1 Huffman encoded compression principle We regard the value of a certain length in the file as a symbol, such as 256 values of the 8-bit length, that is, the 256 value of the byte is a symbol. We re-coded these symbols based on the frequencies that appear in the file based on these symbols. For many times, we use fewer bits that we use more bit to repay. In this way, some part of the file becomes less, and some part of the number varies, since the portion of the portion is larger than the larger portion, the size of the entire file is still reduced, so the file is compressed. 1.2.2 Huffman Coding Use the Huffman tree to generate the encoding to make Huffman encoding, first read the entire file again, in the process of reading, count each symbol (we look at 256 symbols) The number of occurrences. The HUFFMAN tree is then established according to the number of symbols, and new encodings for each symbol is obtained through the Huffman tree. For symbols having more times in the file, its HUFFMAN encoded bit is less. For symbols having fewer frequent times in the file, its HUFFMAN encoded bit is more. Then replace each byte in the file to their new encoding. Create a Huffman tree: see all symbols as a node, and the value of this node is its number of times. Further regarding these nodes as a tree having only one node. Each time you find two trees from all trees, build a parent node for these two trees, then the two trees and their father nodes form a new tree, the value of this new tree is The value of the two subtots. Such reciprocation until the last tree turned into a tree. We got a huffman tree. Get a huffman encoding through the Huffman tree: this Huffman tree is a binary tree, and all of its leaf nodes are all symbols, and its intermediate nodes are constantly establishing during the process of producing the Huffman tree. We labeled 0 to the path to the path of the HUFFMAN tree to the path of its left sub-node. Now we start from the root node, the path to all leaf nodes is a sequence of 0 and 1. We use the root nodes to 0 and 1 sequences on a leaf node path, as a Huffman encoding of this leaf node. Let's take a look at an example. There is a file of the following abbbbcccddde we count the number of occurrences of each symbol, and the process of establishing the HUFFMAN tree as shown below with the final HUFFMAN tree, we can get the HUFFMAN encoding of each symbol. A is 110b for 01D to 10E 111 We can see that the Huffman tree is guaranteed, the number of symbols, the number of HUFFMAN has fewer encoded bits, the number of times the number of times, the resulting Huffman encoding bit Number. The length of the HUFFMAN encoding of each symbol is different, that is, becomes a long code. For bell-length codes, a problem may be encountered, which may not be able to distinguish these codes in re-encoding files. For example, the encoding of A is 000, B encoded is 0001, and c code is 1, then when 0001 is encountered, 0001 represents AC or represents B. This problem occurs is that the encoding of A is the encoded prefix of B.
Since the HUFFMAN is encoded to 0 and 1 of the root node to the leaf node path, the path of a leaf node cannot be the prefix of another leaf node path, so a Huffman encoding is impossible to encode another HUFFMAN. The prefix, which ensures that Huffman coding can distinguish. 1.2.3 Compression and Using Huffman Coding To get compressed HUFFMAN trees, we need to save the tree information in the compressed file, which is to save the number of occurrences of each symbol. Compressed: Read the file, count the number of occurrences of each symbol. According to the number of occurrences of each symbol, establish a Huffman tree to get the HUFFMAN encoding of each symbol. Saving information of each symbolic number of times in a compressed file, replacing each symbol in the file with its HUFFMAN encoding, and output. Unzip: Information that is saved in the compressed file, each symbol. According to the number of occurrences of each symbol, establish a Huffman tree to get the HUFFMAN encoding of each symbol. Replace each huffman encoding in the compressed file with its corresponding symbol, and output. 2 Gzip's implementation of compression algorithm we divide Gzip's implementation into many parts, one by one, which is the last part of this article. The various implementation techniques used in Gzip, the author of Gzip's authors are described in the source code. 2. See the implementation of the matching string for a string search matching string requires a lot of matching work, and we also need to find a matching string for many many strings. Therefore, Gzip use hash table in the implementation of the matching string to increase the speed. The goal to be reached is that for the current string, we have to find a string of each matching length to achieve the smallest matching string in its previous window, and identify the longest matching length. In Gzip, the minimum match length is 3, that is, two strings, at least the first three bytes, can be calculated. Why is the minimum match length of 3 and will be explained later. Gzip is in every string encountered, first insert it into a "dictionary". This is a string that matches it in the future, you can directly find this string directly from the Dictionary. Insert is not inserted, check is not a chaos. When inserted, use the first three bytes of the insert string to calculate the inserted "Dictionary" position, and then save the start position of the insert string in this Dictionary location. When I find out, use the top three bytes of the search string, calculate the "Dictionary" location, because insertion and use of the same calculation method, so if the top three bytes of the two strings are the same, The calculated "Dictionary" location is definitely the same, so it can be directly in the "Dictionary" location, and the start position of the string that is saved in the previous insert is taken. So I found a string, I found a string, and the top three bytes of this string were the same as their own (actually there is a great possibility that the reason will be explained later, so I found a matching string. If there are multiple strings, their top three bytes are the same, then their "dictionary" position is the same, they will be chained into a chain, put it in that "dictionary" position. So, if a string, find a "dictionary" position, and find a chain, all the strings of the top three bytes, all on this chain. That is, all matching strings before the current string are chained in a chain, placed in a "dictionary" location.
And the current string uses its first three bytes, you can get this "dictionary" location (getting a Dictionary "location, which first links yourself into this chain), and find it The chain has all of its matching strings, so find the longest match, which is to traverse every string on this chain, see and which strings are the largest. Let's see more specific instructions below to find the implementation of matching strings. The "Dictionary" we mentioned earlier is an array called Head [] (whose HEAD, will be described later). The "Dictionary" of our previous "Dictionary" is placed in a variable called INS_H. The chain we mentioned earlier is in an array called Prev []. Insert: The current byte is a starting strStart byte. Through the STRSTART, STRSTART 1, STRSTART 2, these three bytes, using a design of the INS_H, which is inserted. Then save the position of the current byte, TTRSTART, and STRSTART is saved in Head [INS_H]. Note by strStart, Strstart 1, strStart 2, these three bytes (that is, the first three bytes of strings at Strstart, that is, the current bytes and after the two bytes) determine INS_H. It is also strStart in HEAD [INS_H], which is where this string begins. Determine if there is a match: the first three bytes of the current string, use the hash letter counting INS_H, if the value of Head [INS_H] is not empty, then the value in Head [INS_H] is saved here before The other string position, and the top three bytes of the string calculate the INS_H calculated by the top three bytes of the current string. That is to say, there may be match. If the value of Head [INS_H] is empty, then there is certainly no match. The hash function used by Gzip: The hash function used by Gzip, calculates an INS_H with three bytes because the minimum matched is three bytes. For the same three bytes, INS_H obtained by hash function is inevitably identical. And the three bytes can get the same INS_H through the hash function, but this is not tight. When gzip discovers Head [INS_H], it means that there may be matching strings, will be on the chain Each string is compared to the true string. So the string on a chain is only the same as the value of the top three bytes, and the top three bytes are not necessarily the same. But this has been greatly reduced the range of string comparisons. Let's emphasize the same string of the top three bytes, inevitably in the same chain. In the same chain, not necessarily the top three bytes. Different three bytes may get the same result because three bytes, a total of 24 bits, 2 ^ 24 possible values. The calculation results of the three bytes of hash functions are 15 bits, and there are 2 ^ 15 possible values. That is to say, 2 ^ 24 values, corresponding to 2 ^ 15 values, inevitably, more, that is, there must be a variety of three bytes of value, calculated using this hash function They are all identical. And the reason why we use the hash function is that in fact, we are just in the range of a window size (later will be seen) look for matching strings, the size range of a window is very limited, three bytes that can appear The value combination is also very limited, which will be much less than 2 ^ 24, which is efficient to use the appropriate hash function. There are two functions in the top three bytes of all the strings in the first three bytes: Head [INS_H].
One role is a position where the first three byte calculations is the string of INS_H. Another role is an index in the prev [] array, with this index in Prev [], the previous three byte calculation results are the position of the string of INS_H. That is, the value of Prev [Head [INS_H]] (not empty) is the position of the string of INS_H for the previous three byte calculations. There are two values of prev []. One role is a position where the first three byte calculations is the string of INS_H. Another role is an index in the prev [] array. With this index in prev [], the previous three byte calculations will be found to be the bit of the string of INS_H. That is, the value of prev [] (not empty) is the position of the string of the previous three byte calculations as INS_H. Until prev [] is empty, the end is indicated. Let's give an example, string, 0abcd abce, abcf_abcg, calculates INS_H when processed to ABCG. At this time, the head [INS_H] is 11, ie the start position of "abcf abcg". At this time, the PREV [11] is 6, namely the start position of "ABCE ABCF ABCG". At this time, the PREV [6] is 1, ie the start position of "Abcd Abcf Abcg". At this time, it is 0 in PREV [1]. Indicates the end of the chain. We saw all the three letters of the ABC ABC, and the chain was together, from Head could have been looking for until 0. The establishment of the chain: Gzip When the current string is processed each time, INS_H is calculated using the first three bytes of the current string, and then the current string is also inserted into the corresponding chain, that is, the current string The location is saved in Head [INS_H], and at this time, Head [INS_H] (not empty) is the start position of the previous string. So this time you need the position of the previous string, that is, the original Head [INS_H] in the chain. So the current value of the current head [INS_H], indexes with the current string, saved to Prev []. Then, the Head [INS_H] is then assigned to the position of the current string. If the position of the current string is strStart, it is prev [strStart] = Head [INS_H]; Head [INS_H] = strStart; in this way, each time a string is added to the chain, the chain is formed. Now we know that the top three bytes are calculated to get all the strings of the same INS_H, and the head [INS_H] is a chain head, the earlier string in the PREV [] array. The name of the HEAD array and the prev array is also reacted. Chain features: The larger the distance between before (prev) and the current processing location. For example, the current processing string calculates INS_H, and the value in Head [INS_H] is not empty, then Head [INS_H] is the most similar match of the current processing string distance, and it is found in PREV [] The string, the farther from the distance. Insertion of the string started by the byte in the string: We said, all the strings starting with all bytes will be inserted into the "Dictionary". For the determined matching string, the string begins in the matching string, still being inserted into the Dictionary so that the rear string can match them. Note: For the 0th byte in the file, the situation is very special, and the position of the string start is 0.
Therefore, after the first three bytes of the 0 string calculate the INS_H, the position saved in the head [INS_H] is 0. The determination of whether or not it is possible, it is to be 0, and the value of HEAD [INS_H] is a string start position. So the string of 0th byte begins with its particularity, it will not be used to match, but this situation will only occur in the 0th byte, so it usually does not affect, even if it affects, it will be extremely small. . For example, the matching of the file content is found in JIURL JIURL, and [] part of the part. JIURL J [IURL] 2.2 Lazy Matches (lazy match) For the start of the current byte, after the maximum match is found, Gzip does not immediately decide to use this string to replace. That look at this matching length is satisfactory, if the match is not satisfactory, and the next byte starts the string of the string, then Gzip find the longest match of the string started by the next byte, see if it is better than Now this long. This is lazy match. If you are more than this long, you will not use this match. If you are shorter than this short, you will be sure to use this match now. Let's give an example, when the string 0ABC BCDE ABCDE processes the 10th byte, it is "ABCDE" A, where the maximum match is found, [] part. 0ABC BCDE [ABC] DE, then look at the next byte, which is the 11th byte, that is, 'Abcde "B, find the longest match, [] part .0ABC BCDE A [BCDE] found that the second matching match length is not used, and it is not used for the first match. We also see that if you use the first match, you will miss a longer match stroke. Matching lazy Under the premise, lazy, match, no limit, lazy, match the longer matching strings, will still be lazy, if this lazy match, found a longer match string, then last time The matching string found by the lazy match is not available. Matching lazy, matching is conditional. Matching lazy, matching must meet two conditions, first, the next string started by the byte, must match the string, if the next process The string started by the byte does not match the string, then determine the current matching string, not lazy matching. Second, the current matching string matching length, Gzip is not satisfactory, that is, the current match length is less than max_lazy_match (MAX_LAZY_MATCH is fixed in fixed Under the compression level, there is a fixed value). Discussion: We can see the reason for another attempt. If the current string matches, you may miss a longer match. Use lazy matching will improve. However, from my simple analysis, use lazy, matching to compression ratio seems to be very limited. 2.3 greater than 64kb file, the implementation window of the window implementation window: actually, the current string (the current processing byte starts ) Just finding a matching string in its previous window, that is, it is just to look for a matching string within a certain size before it. This limit will be explained later. Gzip window size is wsize, 32kb There is a buffer called Window [], the size of 2 windows is 64kb. The content of the file will be read in this window [], and we will do the LZ77 part of the LZ77 in Window []. The result will be placed in other buffers. Gzip's content in Window [], starting from the beginning, one byte one byte backward process.
There is a pointer called strStart (actually an index), pointing to the current processing byte, when the string started with the currently processed byte does not match, the output of the change is currently processed, and the strStart moves a byte backward. When the string started by the current processing byte finds a match, the output (match length, the separation distance) pair, the strStart moves the length byte backward. We end this part of Strstart to Window [], called Lookahead Buffer, and look forward to the buffer. The reason is that when we handle the current byte, it is necessary to read the byte of the byte to make a string. In a variable lookahead, the number of bytes left in the transfers are preserved. Lookahead, starting to initialize the size of the entire read content, as the process progresses, STRSTAR is continuously moved, and the superfine check buffer continues to decrease, and the value of Lookahead is also reduced. We need to limit the scope of the lookup matching string is the size of a window (which is described later), that is, it can only look for a matching string within 32KB before the current processing byte. However, since the processing is in the two window size, it is performed in a buffer of 64KB, so the distance between the strings and the current strings on the matchchard chain is very likely to exceed 32 kB. So how is GZIP to achieve this limit? Gzip achieves this limit by the judgment condition when matching. When the current string calculates INS_H, it is found that the Head [INS_H] value is not empty (Head [INS_H] is a string start position), indicating that the current string may have a match string, saved this value in hash_head. At this time, it is necessary to make a limiting scope, strStart - hash_head <= window size, strStart-hash_head is the distance between the current string and the nearest matching string, (Note that the head and the distance of the current string have recently, The more the distance between the front (prev) and the current processing location is, that is to say whether the distance between the current string and the nearest matching string is within the range of one window. If not, then other strings on the chain are definitely farther, and no match is not allowed to be within the scope of a window. If it is within the range of a window, you will need to find the longest matching string on the chain. When comparing each string, it is also necessary to judge whether the distance of the current string and the string exceeds a window. If you exceed it, you cannot match. In practice, Gzip intends to make the code simple points, the distance limit is smaller than the size of a window. Less than 64KB files: When initialization, you will first read 64kb content from the file to Window []. For files less than 64kb, the entire file is read into Window []. In Window [], the LZ77 is processed, from the beginning until the file ends. More than 64KB file: Every byte is to judge the lookahead Since the size of the ultra-probably view buffer we can use is, the maximum match length (258 bytes, followed by) plus the minimum match length, that is, the next string started by the next process, you can find a biggest match The length of the length, after a match, it is necessary to pre-read a minimum match length to calculate the after INS_H. Whether it is a file greater than 64kb, or less than 64KB file, with the process of processing, it will eventually go to the end of the file, and the LookAhead Since the content of the second window has been copied into the first window, we can read the new content into the second window, and the content of the 32KB before the new content is the original second window. At this time, in the modified dictionary, there is still information of all strings in the second window, that is, the new content, can continue to use the strings within the range of the front window, compress, this is also reasonable. 2.4 Other Issues 1 Let now explain why the minimum match length is 3 bytes. This is due to, in Gzip, (matching length, separation distance) pair, "matching length" ranges from 3-258, which is 256 possible values, and 8bit is required to save. The "separation distance" ranges from 0 to 32K, and 15bit is required to save. So a (matching length, separation distance) requires 23-bit, one bit of 3 bytes. If the matching string is less than 3 bytes, use (matching length, separation distance) to replace, not only without compression, but will increase. So save (matching length, separation distance), determines the minimum match length to be at least 3 bytes. The reason for the maximum matching length is 258 is that various factors determines that the match length is saved with 8 bits, and the maximum of 8 bits is 255. In practice, "matching length" in (matching length, separation distance) is saved, the actual matching length - minimum match length, so the actual match length of 258 corresponds to 258. When matching, the matching length is judged to ensure that the maximum match length is guaranteed, the match is stopped. That is, even if there are two strings, the same portion exceeds the maximum matching length, it also matches the maximum matching length. The number of digits and window sizes for saving the separation distance is determined by each other, and the various factors are integrated, and the window size is determined, and the number of bits used in saving the separation distance is determined. 2.5 Gzip's LZ77 part of the implementation of the point Gzip's LZ77 part is mainly in the function defalte (). The buffer WINDOW [] is used to place the content read in the file. l_buf [], d_buf [], flag_buf [] is used to place the result of LZ77 compression. Each byte in l_buf [] is a one-free byte, or a matching pair of matching lengths - 3. l_buf [] shared inbuf []. Every unsigned short in D_BUF [] is a matching pair of matching. Each of the flag_buf [] is a flag to indicate that the corresponding byte in l_buf [] does not match the byte, or a matching pair of matching lengths - 3. Prev [], Head [] is used to store dictionary information. In fact, HEAD is the macro definition prev wsize. During the initialization process, lm_init () is called. In LM_INIT (), 2 windows are read from the input file, that is, 64KB of content into Window []. The number of read bytes returned in Lookahead. Update_hash, initialize INS_H using both bytes in Window. DEFLATE (), in one processing loop, first insert_string puts the current string into the dictionary, INSERT_STRING is a macro, the role is to calculate the current string of INS_H with the hash function, then put the contents of the original head [INS_H], link into the chain In (placed in the prev), the original Head [INS_H] is saved in the Hash_Head variable, and the matching judgment is used later, and then the start position of the current string is saved in Head [INS_H]. It is said that the content saved in hash_head is not empty, indicating that there is content on the matchchard. Call LONGEST_MATCH () Look for the longest match on the matchchard chain. The content saved in hash_head is empty, indicating the string started by the current byte, and there is no match in the window. Due to the use of Lazy Match, the case where the judgment is more complicated. Match the output of the string, or the output of the resulting byte is called the function ct_tally (). For matching strings, after output, you also need to use INSERT_STRING to match each byte in the matching string, and insert the string of each byte in the matching string into the dictionary. In CT_TALLY (), put the incoming "no matching byte" or "match length -3" in l_buf [], then do a statistical number of statistics for future HUFFMAN encoding, if incoming matches In the incoming parameter, there is a separation distance, and the separation distance is stored in D_BUF []. According to the incoming parameters, it can be judged which case, then set the corresponding flag in one variable, each 8 sign bits, it is one byte, saving to FLAG_BUF []. There are still some judgments, we will explain later. 2.6 The result of the block output LZ77 compression is placed, L_Buf [], D_BUF [], FLAG_BUF []. For the compression results of the LZ77, a block output is performed using a piece of output or divided into a plurality of outputs (after the LZ77 compresses a certain portion, one block output is performed). The size of the block is not fixed. When the output is output, the compression results of LZ77 are used, and Huffman encoding is performed, and finally the result of the HUFFMAN encoding is output to the Outbuf [] buffer. Work in Huffman encoding and output, and do in flush_block (). In ct_tally (), it is determined. If some conditions are met, after returning from CT_TALLY (), the result of the existing LZ77, the HUFFMAN encoding is performed, and output into a block. At the end of the full file processing, the deflate () function will end, the result of the LZ77 will perform HUFFMAN encoding and output to a block. In CT_TALLY (), the number of bytes in l_buf [] is increased by 0x1000 (each byte is a one-free byte or a matching length), that is, 4096. The case where the compressed is estimated, to determine if this block is better, if you feel better, you will output a block. If you feel bad, you will not output it first. When L_BUF [] is full, or D_BUF [] will definitely compress the existing LZ77 compression, Huffman encoding, output to a block. If you decide to output a piece, you will only establish a Huffman tree for this piece, which will be compressed by Huffman coding and output to Outbuf []. If it is a dynamic huffman encoding, the information of the tree is also output to Outbuf []. After output, INIT_BLOCK () will be called, initialize a new block, reinitialize some variables, including the node of the dynamic tree, which means that the statistics will be restarted for the future Huffman tree future. The size of the output block is not fixed. First, before the HUFFMAN encoding, the size of the content to be output is not fixed, and it is necessary to see the situation. After the HUFFMAN encoding is performed, it is not fixed. The size of the block is not fixed, so how to district the block when decompressed. There is a node indicating the end of the block, EOB, and outputs this node's encoding at the end of each output block, so when this node is encountered, it indicates that a block is over. 2 bits per block, used to indicate which coding method is used by this block, 00 denotes direct storage, 01 represents a static HUFFMAN encoding, 10 represents a dynamic hUffman encoding. The next 1 point indicates whether the block is the last piece, 0 means not, 1 is the last piece. Output a block, there is no impact on the content in the present dictionary, the next block, will still match the previously formed dictionary. 2.7 Static HUFFMAN Coding and Dynamic Huffman Encoding Static HUFFMAN Coding is to use GZIP to pre-define a set of encoding for compression and decompression, which does not need to be transmitted to generate trees. Dynamic huffman coding is the number of occurrences of statistics, establish a Huffman tree, generating HUFFMAN encoding of each symbol, compressed with this generated HUFFMAN, which requires information of the spanning tree. Gzip The static Huffman tree will also establish a static huffman tree, and the dynamic huffman tree, and then calculate the size of the block, the size of the block, the generated block, and the generated block, and the generated block, as well as the generation Huffman tree coding, generating blocks. The comparison is performed, and the HUFFMAN encoding is performed using a small method of generating blocks. For static trees, it is not necessary to transfer the part of the information to generate a tree. The dynamic tree needs to pass this information. When the file is relatively small, the information transmitting the generated tree is not worth, but it will make the compressed file be large. That is to say that when the file is relatively small, there may be a small block encoding using a static HUFFMAN encoding than using a dynamic HUFFMAN encoding. 2.8 The generation of the resulting deflate algorithm is based on the Huffman tree, and several rules have been added. We call such a tree as a deflate tree, so that you can get all the nodes as long as you know the number of nodes from all positions. Code. The reason for this is to reduce information that needs to be stored in a compressed compressed file. To understand how to generate Huffman coding, deflate must figure out some Huffman trees, and the properties of the deflate tree, the following is a simple study of the Huffman tree and the Deflate tree. The nature of the Huffman tree 1 leaf node is N, then the summary point of the whole tree is 2N-1. Simply prove the instructions, the procrit, the smallest tree, that is, only three nodes, one root node, the tree of the two leaves nodes. Then do the smallest addition to the tree in any part of the tree is also in line with. So it is in line. 2 The coding of the leftmost leaf node is 0, but the length is not necessarily. In the DEFLATE, the nature of the huffman tree of additional conditions is added 1 The same size of the coded value of the long leaf node is continuous, and the right side is a larger than the left. 2 (n 1) length The leftmost leaf node (that is, the value of the leaf node of the coded value) is the value of the leftmostmost left leaf node (the maximum leaf node of the encoded value). 1, then become a long one (that is, 1 bit). 3 n-long leaf nodes, the rightmost leaf node (that is, the maximum leaf node of the coded value) is the value of the leftmost leaf node (that is, the leaf node of the coded value) plus The number of n-length leaf nodes is reduced by 1.4 (n 1) long left left leaf node (that is, the smallest coding value of the coded value) is the N-bit long leftmost leaf node. (I.e., the value of the leaf node of the coded value), adds the number of N-bit long leaf nodes, and then becomes a length (that is, 1 bit). There are also some trees, such as the maximum possible coding number of a certain depth of the tree. From all encoded bit, get all encoded encodings: Statistically the number of codes on each bit is placed in BL_Count []. According to the value in BL_Count [], the minimum coding value on each bit is calculated, and in Next_code []. Calculation method is, code = (Code BL_COUNT [BITS-1]) << 1; reason is the nature of the deflate binary tree, (n 1) length of the left left leaf node (that is, the coded value minimum leaf node The value of the value of the N-bit long left side (that is, the leaf node of the coded value is the leaf node), plus the number of n-ranks, and then becomes a length (that is, left shift 1 Bit). Then follow the code value to encode all code. The encoding method is, a certain length corresponding to Next_code [n], the first start is the encoding of the leftmost leaf node on the left, then , is the next position of the next leaf node, according to the next time The class is pushed until the leaf node of this bit is encoded. In fact, encoding is bi_reverse (next_code []). This reason why the encoding is that the nature of the deflate binary tree. 2.9 5 trees have 5 trees static_ltree [], static_dtree [], dyn_ltree [], DYN_DTREE [], BL_TREE []. For all the trees, the value of the symbol indicated by a leaf node is n, then the corresponding leaf node corresponding to this symbol is placed in Tree [n], such as the value of Static_ltree's leaf node 'a' is a decimal 97, then The leaf node of 'a' is placed in static_ltree [97]. Static_ltree [] When the static huffman encodes, the tree used to encode no changed bytes and match lengths. Static_dtree [] When static huffman encodes, the tree used to encode the separation distance. Dynamic HUFFMAN When encoded, the tree used to encode no modified bytes and match lengths. Dyn_dtree [] Dynamic HUFFMAN When encoded, the tree used to encode the separation distance. BL_TREE [] Dynamic HUFFMAN When encoded, the tree used to decompress the decompression to generate DYN_LTREE [] and DYN_DTREE []. The static tree is directly encoded for each leaf node when initialization. Dynamic tree, each time you want to output a piece of content, based on this piece of content, generate a dynamic tree, and then generate encoding for each leaf node according to the generated dynamic tree. Each time you want to output a piece of content, you will have the size of the block encoded with a static tree, and the size of the block obtained by using the dynamic tree, then who is generated, who is generated, who is small. Use static_ltree [], static_dtree [], use static_lttree [], to perform encoded output. With dynamic encoding, use DYN_LTREE [], DYN_DTREE [], BL_TREE [] to perform encoded output. 2.10 Leaf Node LTREE (used to encode the tree, static, dynamic all) of the non-changed byte and matching length, is a total of L_Codes, which is 286. 0-255 256 Leaves Node, is the 256 value of bytes 256 1 leaf node, which is end_block, used to indicate the end of the block. 257-285 29 leaf nodes are 29 scope of matching length. Dtree (used to encode the separation distance, static, dynamically, all) of a total D_CODES, that is, 30. 0-29 30 leaves nodes are 30 scope of the separation distance. BL_TREE's leaf node is a total of BL_CODES, that is, 19. 0-15 Indicates that the encoding length is 3-6 times before the encoding bit is 0-15.16. The next two indicate the number of repetitions. 17 Repeat the encoding bit of 0, 3-10 times, and the three afterwards indicate the number of repetitions. 18 Repeat the encoding bit of 0, 11-138 times, and then 7 points indicate the number of repetitions. 2.11 Static HUFFMAN Coding Initialization Base_Length [], Length_code [], Base_Dist [], dist_code []. Base_length [] is, 29 matching length range, the length value starts at each range. Length_code [] is the range to 256 possible matching lengths. For example, base_length [9] = 0xA, indicating that the start value of the nine range is 0xA. Length_code [11] = 9, indicating a matching length of the matching length of 11, belongs to the 9th range. Base_dist [], 30 matching ranges, each range of starting values, the smallest value in each range. Dist_code [], this is a bit special, with a total of 32K value, here 32K value, divided into two major categories, 0-255 This 256 values are one class, then they directly for dist_code []. 256-32K is a class, then they remove 7 bits, and the highest bit, the remaining 8 is index, 8 seats just at 256 items. What can be done is that the first maximum 32K distance needs a maximum of 15 bits, so the 16-bit highest bit will not be used, followed by the remaining boundaries of these ranges to a binary 1 000 0000 integer multiple. For example, the matching distance is 10, less than 256, so it belongs to class DIST_CODE [10] = 6, Class 6. The matching distance is 10K, greater than 256, so it belongs to class dist_code [256 10k >> 7] = dist_code [256 10240 >> 7] = dist_code [256 80] = dist_code [336] = 0x1a = 26, belongs to 26 Class, 26 categories range from 8193-12288, 10240 is within this range. Specifies the length of each Literal. (There will be 288 Literal. Including 256 bytes 1 EOB 29 match length range = 286. Many 2 are for the tree. ) And count each of the Literal numbings in BL_COUNT []. According to the value in BL_Count [], the minimum coding value on each bit is calculated, and in Next_code []. Calculation method is, code = (Code BL_COUNT [BITS-1]) << 1; reason is the nature of the deflate binary tree, (n 1) length of the left left leaf node (that is, the coded value minimum leaf node The value of the value of the N-bit long left side (that is, the leaf node of the coded value is the leaf node), plus the number of n-ranks, and then becomes a length (that is, left shift 1 Bit). Then from the Litral value of 0 and to the Litral value. Coded for each Litral. The encoding method is, a certain length corresponding to Next_code [n], the first start is the encoding of the leftmost leaf node on the left, then , is the next position of the next leaf node, according to the next time The class is pushed until the leaf node of this bit is encoded. In fact, encoding is bi_reverse (next_code []). For example, Tree [n] .code = bi_reverse (next_code [len] ; this time next_code [len] value is binary 00110000, 0x30tree [n] .code final assigned to binary 00001100, 0x0c, we get it Static_ltree [], it stores Litral corresponding encoding with Litral's value. For example, 'a' is a decimal 97, static_ltree [97] .code = 0x0089 static_ltree [97] .len = 8. The encoding of A is a binary 10001001. Code for static_dtree. This code is very simple. Since all nodes are 5 long (specified), according to the DEFLATE binary tree, the leftmost leaf node is encoded to 0, and each time it adds 1, until all the leaves node . Note that BI_REVERSE () is also required. That is, the inverse stream encoded as "bit stream corresponding to the path of the root start to a leaf node." The resulting output of the LZ77 processing is encoded with the HUFFMAN encoding. At this time, each byte is Litral or the length -min_match in l_buf []. D_buf [] is a matching distance, each item is 16 bits. Flag_buf [] The corresponding byte in Inbuf [] is LITERAL or the flag of the matching length -min_match, such as the FLAG_BUF No. I. Read each bit in the Flag_buf, which is judged. If 0, the byte in the corresponding L_Buf is Literal. If it is 1, the byte representing the corresponding L_BUF is the matching length -min_match. For Litral, use the value of l_buf [] to do the index, you can get the encoding, and encoded length in Static_ltree, and then output this encoding. For match length -min_match, use l_buf [] to index, first get the range of this matching length in length_code [], a total of 29 scope. That is to say, the matching length -min_match range is (3..258), a total of 256 possible values. These 256 possible values are allocated in 29 scope. We use the value of l_buf [] to index, in length_code [], the range of this matching length is obtained. Then use the range value 256 1 to Litral. Use this Literal to make an index, get the encoding in Static_ltree, and the encoding length, and then output this encoding. Then use the range value to be indexed, and the number of additional bits of this range is obtained in extra_lbits []. If the number of additional bit numbers are not 0, the additional bit is output. Attached to the value in INBUF [], is the corresponding base_length (] of the matching length -min_match minus this range. Then remove from D_BUF [], match the distance. When the matching distance is less than 256, the index is made with a matching distance, and the corresponding range value is removed in DIST_CODE. When the matching distance is not less than 256, the 7 bits are shifted with the matching distance, that is, the corresponding range value is taken in DIST_CODE 256 with high 8 bits. For matching distance, the matching distance of the matching distance is the value of (1..32, 768), a total of 32K possible values. It is divided into 30 scope. Since the value of the matching distance is in the range, (1..32, 768), the matching distance is used in 15 bits. Then do the index with the range of distances, and the encoding is obtained in Static_Dtree [], and the encoding length is obtained. Then output this encoding. Then use the range value of the distance to obtain the number of digits of the additional bit of this range in extra_dbits []. If the number of additional bit numbers are not 0, the additional bit is output. The additional bit of the output is dist-base_dist [code]. For example, take out a Dist 10. Dist_code [10] = 6, indicating that it belongs to the sixth range. Then check extra_dbits, extra_dbits [6] = 2, indicating that there are two extra bits. Local int Near extra_dbits [d_codes] / * extra bits for each distance code * / = {0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13}; first output static_dtree [6] .Len bit stream, static_dree [6] .code. (Static_dtree is 5) then outputs the bitstream of the extra_dbits [6] bit, 10-base_dist [6] = 10-8 = 2 = binary 10. After sending each byte in the Inbuf, the encoding of end_block is finally transmitted. 2.12 Dynamic HUFFMAN Codes Determine all Literal, matching length range, and the number of occurrences of matching distance range. In the process of LZ77 compression, CT_TALLY () is called for each Litral or match. In CT_TALLY (), if it is a Literal, DYN_LTREE [LC] .freq . If it is a match, DYN_LTREE [LENGTH_CODE [LC] Literals 1] .freq , DYN_DTREE [D_CODE (DIST)]. FREQ . Call build_tree () establishes Literal and match the length range (which is the leaf node of DYN_LTREE, a total of 286), and is encoded for them (Literal and matching length range). In the spanning tree, Heap [] is a buffer used to assist the generated tree. First of all, all the number of times in Tree [] is not 0 (that is, index, such as Tree [0x61] is the correspondence item of 'a'), and in HEAP []. Tree [] The number of elements is 2 * l_codes 1, l_codes is the number of leaves nodes, 286. From the nature of the Huffman binary tree, the leaf node is n, then the summary point of this tree is 2N-1. Tree [] will be used to save the generated tree. Tree [] The first L_CODES item used to store the leaf node. For example, the node information of 'A' is placed in Tree [0x61]. L_codes After the item is used to place an intermediate node. HEAP [] will be used to generate temporary content from the process of the tree. HEAP [] The size is also 2 * l_codes 1. Its front l_codes is used to generate the nodes in the tree, and the beginning is the leaf node. With the spanning tree, the two leaf nodes are removed and put them into their intermediate nodes. After L_Codes, it is used forward. During the spanning tree, all nodes (roots, intermediates, leaves) will be placed here in the order of weight. When generating a length in the future, it needs to be used. PqdownHeap (Tree, Smallst); the role is to find the nodes in the HEAP, find the smallest FREQ, put it in Heap [1]. The process of spanning the tree is, each time you find two minimal nodes, then give them a form of a final. And the corresponding contents of their tree [] points to their father 's fingers. And remove these two nodes in HEAP, and add their parent nodes to HEAP. Heaplen is the number of points in HEAP, each time due to 2 knots, plus 1 node, so each time will make Heaplen, that is, the node number is less. Waiting to Heaplen, that is, the number of nodes, when less than 2, indicating that the tree has already been done. After the tree is good, the domains in Tree [] are set up, and gen_bitlen () is called to generate a length for them. In Gen_bitlen (), starting from the root, the root length is 0, down, and the bit is set for each node. It is determined whether or not the leaf node, the method of judging is to see if it is greater than the maximum code, where the largest code is 286. When you encounter a leaf node, dynamically encoding the total length of the total length of the entire file, and the total length of the total number of static encoding is performed. BL_COUNT [BITS] ; used to generate encoding for a while. Since the number of appearances of this node is stored in the FREQ field in the leaf node, there is now a long position, so the dynamic bit of the node can be calculated. And all the dynamic distractions of all nodes is a total length. With the number of occurrences, for static, the node length is set, and it can also be calculated. Finally, the gen_codes () is called to generate encoding for all leaf nodes. The method in static huffman is the same. Calling build_tree () to establish a tree that matches the range (also from DYN_DTREE's leaf node, a total of 30), and encodes them (matching distance range). The method that generates DYN_LTREE is the same. Call build_bl_tree () to generate a tree for L (Literal & Match Length) and D (Matching Distance) and encodes these character. Call the Scan_Tree statistics of the coding length in a tree. DYN_LTREE and DYN_DTREE are statistically respectively. Scan_tree ((ct_data near *) Dyn_ltree, l_desc.max_code); scan_tree ((ct_data near *) Dyn_dtree, d_desc.max_code); statistical results are placed in BL_Tree []. It is easy to understand the work made in Scan_Tree. For example, BL_TREE [0] .freq represents the number of encoders of the encoding bit of 0. BL_TREE [10] .freq represents the number of encoders of the encoding bit of 10. BL_TREE [16] .freq indicates that the number of occurrences of continuous coding lengths is the same, and the number of occurrences of this situation. Finally, build_tree () builds the tree (that is the 19 cases), and encodes them (that is, 19 cases). Send a node bit of the BL_TREE encoded. In the DEFALTE algorithm, you can get the encoding of the leaf node as long as you know the distinction of each of the leaves of the tree. So we need to send LTREE's 286 leaf nodes, we need to send the length of 30 leaf nodes of Dtree. First, a variant of the maximum leaf node value of three trees is sent. Send_bits (LCODES-257, 5); Send LTREE effective maximum leaf node value 1-257send_bits (dcodes-1, 5); send DTREE effective maximum leaf node value 1-1send_bits (BlCodes-4, 4); send BL_TREE effective maximum leaf node value 1-4. LTREE The maximum leaf node value determines the number of LTREE's leaf node sectors of LTREE to be sent. Just send it to the maximum leaf node number. For example, the LTREE effective maximum leaf node value is 0x102, then we only need to send a length of the first 0x103 in LTREE, and tell the decompression program, send 0x103. Sending the bit length of the BL_Tree, paying that the order sent is sent in the order specified in BL_ORDER []. Call Send_Tree () Send DYN_LTREE, the bit length of DYN_DTREE. In Send_Tree (), use the same method in scan_tree (), first look at which one of 19 cases of 19 leaf nodes of BL_TREE, after which one is determined, according to this The corresponding leaf node, encoding in BL_Tree, sends this encoding. Until the length of these positions. The resulting output of the LZ77 processing is encoded with the HUFFMAN encoding. The method used by the static huffman encoding is the same. 2.13 Target No. 1, save the LZ77 to indicate that "no change bytes" or "matching information pair" flag. Due to the Gzip implementation, the range of matching lengths and byte values are made as different leaves nodes. For example, the byte of the value is 1, and a matching length of a value of 1, the value is the same, but they are different leaf nodes, their coding is also different. In this way, when unzipped, you can directly distinguish it, you don't have to output the indication bit. This savings should be helpful for improvement in compression ratios. When the static huffman encodes, the encoding itself does not have a compressive effect, but will benefit from this savings. Second, the contents of the leaf node. In the implementation of Gzip, we represent the content represented by the leaf node, more than just a fixed value, but some represent a range of values, (and then use more bits to represent this range) A value), and representative. This implementation method is quite good, it is very worth reference. Decompression is not said, the reason will be seen.