Gzip Gzip Zlib PNG compression algorithm source code Detailed Author: JIURL Home: http://jiurl.yeah.net Date: 2004-3-1
(Test Edition) Gzip, Zlib, and graphic format PNG, use the same compression algorithm deflate. Through the analysis of Gzip source, we make a detailed description of the DEFLATE compression algorithm. I read the GZIP version of Gzip-1.2.4. We do three degrees of explanation of algorithms. The first extent, for the basic principles of the compression algorithm used by Gzip. The second extent, a description of the method of achieving the Gzip compression algorithm. The third degree, the source code level of Gzip is realized. If you have time, I suggest you don't look at the content below, try to understand how its compressed decompression is implemented by reading the Gzip source code, which will be a very interesting intelligence game, don't miss . When another mystery was unwrapped, then, like Tang Bo Hu comrade, "Generous Nobi Gold". (Poetry of Xiao Tang, in addition to another unfamily egg Cao Xueqin, it seems that it is not too much.) 1 Gzip's basic principle of compression algorithm Gzip For files to be compressed, first use the LZ77 algorithm to compress, and the result Compression using the Huffman encoding method. Therefore, we will explain the principles of LZ77 and Huffman encodings. 1.1 ... 1.2 ... 2 Gzip compression algorithm implementation method 2.1 LZ77 algorithm Gzip implementation First, Gzip reads 64KB of content to a buffer called Window from the file to be compressed. For the sake of simplicity, we will explain to the compression of 32KB or less. For us to use 32KB below, Gzip reads the entire file into the Window buffer. Then use a variable called STRSTART in the Window array, move from 0 to move backward. STRSTART is in each location, in its previous area, finds a string that matches the headed strings of the string started by the current strStart, and attempts to find the longest matching string from these matching strings. If the current string started by the current strStart, you can find a matching string of 3 bytes. The current strs started the length of the matching length, which will be replaced by a
In order to improve the comparative speed, Gzip uses a hash table. This is the key to Gzip implementation LZ77. This hash table is an array called HEAD (later we will see why this buffer is called Head). Gzip's three bytes of the string in Windows, which are strstart, strstart 1, strStart 2, and use a design-design hash function to calculate, get an insertion position INS_H. That is, three bytes of the string are used to determine an insertion position. Then put the position of the string, that is, the value of strStart, saved in the INS_H item of the HEAD array. We can see why do so so. When the HEAD array is not inserted with any value, all is 0. When the three bytes of the current string of somewhere determine an INS_H, and the position of the current string at that time is STRSTART in HEAD [INS_H]. The other place, when the first three bytes of the current string, when the three bytes are used, then use the hash function to calculate, because the same three bytes, the same hash The function, the resulting INS_H necessarily and the same INS_H is the same. So I will find that HEAD [INS_H] is not 0. This explains, there is a first three bytes and the same string to save your own position in this, and now the value saved in Head [INS_H] is the start position of the string, we can find that string. That string is at least the top 3 bytes of the top 3 bytes of the current string (we can see this statement later, here is convenient), we can find that string and further comparison, See how long it can do so. We now explain that the same three bytes, the INS_H obtained by the hash function must be the same. And different three bytes, is it possible to get the same INS_H through the hash function, I don't study this hash function, not clear, but the general hash function is like this, so it is very large. It will be this situation, that is, different three bytes, it is possible to get the same INS_H through the hash function, but this is not tight, we find that it may be a comparison of strings after matching. In a file, there may be many strings of the first three bytes of the string, that is, what they calculate the same, how can you guarantee that every string in them? Gzip uses a chain to chain them together. Gzip When the current string is inserted into the INS_H of the current series of the current series of Head, the value of the original Head [INS_H] will be saved to an array called Prev, and the position is in the location. The current strStart is now. This calculates INS_H when the current string somewhere is calculated, and when head [INS_H] is not available, you can find the same string of the same string of the first three bytes in Prev [Head [INS_H]]. For this, we will explain it. Example, string 0abcDabceabcfabcg ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ INS_H is calculated from ABC. At this time, the Head [INS_H] is 13, ie the beginning of "abcg". At this time, the start position of 9, ie "abfabcg" is 9, ie. At this time, PREV [9] is 5, ie "abceabcfabcg" start position. At this time, PREV [5] is 1, ie "abcDabceabcfabcg" start position. At this time, it is 0 in PREV [1].