ZIP compression principle and implementation


http://www.blueidea.com/bbs/newsdetail.asp?id=1819267&page=2&posts=&daysprune=5&lp=1

Lossless data compression is a wonderful thing to think about: a string of arbitrary data can be converted, according to certain rules, into data only 1/2 to 1/5 of its original length, and then restored to exactly its original form according to the corresponding rules. It sounds cool.

For half a year I had been working with MFC and the SDK, and I grew disappointed and dissatisfied: programming this way has no substantive difference from writing DHTML. You simply call the various functions Microsoft provides. You don't need to create the window yourself; in multithreaded programming you don't need to allocate CPU time yourself. I have written drivers too, and it is the same story there: there is the DDK (Microsoft's driver development kit), complete of course with the DDK "reference manual", so you don't even need to build a simple data structure yourself. Everything is a function, a function, a function...

Microsoft's senior programmers wrote functions so that we could use them for application programming. I do not mean to belittle application programmers here; it is these application engineers who build the bridge between science and society, who may later move into sales or management, drawing on the wisdom and experience they have accumulated in society through their own efforts.

In purely technical terms, though, the bar is not that high, is it? First-rate companies such as Microsoft, Sybase and Oracle stand at the forefront of technology, and that is exactly what gives them their huge markets. Behind them lie the fields most worthy of experts' sustained research: operating systems, compilers, databases. These empire-like enterprises are great in a way that Chinese-style notions such as "experience" and "being able to endure hardship" cannot cover; without a hard technical foundation, a modern management philosophy, and powerful market capability, it would be impossible. Since we are interested in technology, and are still at the starting stage, why must we turn straight to "management", to becoming one of the "young talents"? How many of those so-called "successful people" can there really be, and with such impetuousness, how large a vision and pattern can one's chest hold?

When I discovered that VC is just a widely used programming tool and does not in itself represent "knowledge" or "technology", I was somewhat lost. What I knew was not mine: MFC, SDK and DDK are the work of Microsoft's engineers. What they do is exactly what I wanted to do, or rather, I too wanted to be at that level. Now I know they are experts, but this need not remain a dream: one day I will do it, so why not say my thoughts out loud.

At that time, the system I worked on needed a compression module. The team lead found the zlib library and did not let me write the compression algorithm myself. Standing in the company's position I understood, truly understood, how long doing it yourself would take. But a stubbornness hidden in my heart drove me to find material on the principles of compression. I had no idea that I was about to open a door and enter a magical world of "data structures", or that the front line of "computer art" would open up right where I stood.

Top "Computer Art", or further refinered that "computer programming art", sounds very well, very elegant, but when you want to enter professional compression algorithms, I want to do the first thing to do is : Forget your age, education, forget your social identity, forget the programming language, forget all terms such as "object-oriented", "three-layer architecture". Take yourself as a child, have a pair of well-known eyes, is full of tireless, simple curiosity, the only premise is a normal brain with human rational thinking. Let's start a magical compression algorithm journey:

1. The principles:

Repetition exists in computer data in two forms, and ZIP compresses both of them.

The first is phrase repetition, that is, repeats of three or more bytes. For this kind of repetition, ZIP uses two numbers: 1. the distance from the current compression position back to the repeated phrase; 2. the length of the repetition. Assuming each of these two numbers takes one byte, the data is shortened, which is easy to understand.

One byte has 256 possible values (0 to 255); three bytes have 256 * 256 * 256, more than 16 million, possible combinations, and the number of possibilities for longer phrases grows exponentially, so the probability of repetition seems extremely low. In fact, all kinds of data tend to repeat: in a paper, technical terms repeat; in a novel, the names of people and places repeat; in a gradient background picture, the pixels along the horizontal direction repeat; in program source code, keywords repeat (when we write a program, how many times do we Copy and Paste?). In data of tens of KB in an uncompressed format, plenty of phrase repetition tends to occur. After compression in the manner just described, the phrase-repetition tendency is completely destroyed, so a second round of phrase compression generally achieves nothing.
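To make the (distance, length) idea concrete, here is a minimal brute-force sketch in C++. The token layout and the names are ours for illustration only, not the actual ZIP (deflate) format, which restricts matches to a 32 KB sliding window, finds them through hash chains rather than a linear scan, and packs the pair into a compact bit format:

    #include <cstdio>
    #include <string>
    #include <vector>

    struct Token {                 // a toy token: either a literal byte...
        bool matched;
        unsigned char literal;     // valid when matched == false
        int distance, length;      // ...or a (distance, length) pair when matched == true
    };

    std::vector<Token> phraseCompress(const std::string& data) {
        std::vector<Token> out;
        size_t pos = 0;
        while (pos < data.size()) {
            int bestLen = 0, bestDist = 0;
            // Brute-force: try every earlier position as the start of a match.
            for (size_t start = 0; start < pos; ++start) {
                size_t len = 0;
                while (pos + len < data.size() && data[start + len] == data[pos + len])
                    ++len;
                if ((int)len > bestLen) {
                    bestLen = (int)len;
                    bestDist = (int)(pos - start);
                }
            }
            if (bestLen >= 3) {    // only repeats of three or more bytes pay off
                out.push_back({true, 0, bestDist, bestLen});
                pos += bestLen;
            } else {
                out.push_back({false, (unsigned char)data[pos], 0, 0});
                ++pos;
            }
        }
        return out;
    }

    int main() {
        for (const Token& t : phraseCompress("abcde_abcde_abcde")) {
            if (t.matched) printf("(dist=%d, len=%d) ", t.distance, t.length);
            else printf("'%c' ", t.literal);
        }
        printf("\n");  // prints: 'a' 'b' 'c' 'd' 'e' '_' (dist=6, len=11)
    }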

The second kind of repetition is single-byte repetition. A byte has only 256 possible values, so this repetition is inevitable. Some byte values occur many times while others occur only a few times; statistically, there is a tendency toward an uneven distribution. This is easy to understand: in an ASCII text file, some symbols may be used rarely while letters and digits are used often, and the usage frequency of each letter differs too; it is said that the letter e has the highest probability of use. Many pictures present a dark tone or a light tone, with dark (or light) pixels in the majority. (By the way: the PNG picture format is a lossless compression whose core algorithm is the same as ZIP's; the main difference between it and a ZIP file is that, being an image format, it stores information such as the image's dimensions and color depth in its file header.) The results of the phrase compression mentioned above show the same tendency: matches tend to occur at positions relatively close to the current compression position, and match lengths tend to be relatively short (within 20 bytes). All this makes compression possible: re-encode the 256 byte values so that the more frequent bytes use shorter codes and the less frequent bytes use longer codes. If the bytes made shorter outnumber the bytes made longer, the total length of the file decreases, and the more uneven the byte usage, the greater the compression.
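This unevenness is easy to observe directly: histogram the byte values of any text buffer and the counts come out anything but flat. A minimal C++ sketch (the sample string is our own):

    #include <cstdio>
    #include <string>

    int main() {
        std::string data = "this is a small sample of ascii text, where letters and "
                           "spaces occur far more often than the other symbols do.";
        long long count[256] = {0};              // one bucket per possible byte value
        for (unsigned char c : data) ++count[c];

        for (int v = 0; v < 256; ++v)
            if (count[v] > 0)
                printf("0x%02x '%c': %lld\n", v, (v >= 32 && v < 127) ? v : '.', count[v]);
    }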

Before discussing the requirements and methods of coding compression further, one point first: coding compression must be done after phrase compression, because coding compression destroys the original eight-bit byte boundaries, which would also destroy the file's phrase-repetition tendency (unless the data were decoded first). Moreover, the result of phrase compression, the remaining unmatched single and double bytes together with the match distances and lengths, still has an uneven value distribution and so can still benefit from coding compression. Therefore the order of the two compression stages cannot be swapped. After coding compression, a continuous stream of bits replaces the original bytes, and the uneven-value tendency of the original file is completely destroyed and randomized. By statistics, random values tend toward uniformity (like a coin-tossing experiment: throw a thousand times, and heads and tails each come up close to 500 times). Therefore, the result of coding compression can no longer be coding-compressed.

Phrase compression and coding compression are the only two lossless compression methods computer science has developed so far, and since neither of them can be repeated, a compressed file cannot be compressed again (in fact, a compression algorithm that could be applied repeatedly is unimaginable, because it would eventually compress everything down to 0 bytes).

The tendency toward phrase repetition and the tendency toward an uneven byte-value distribution are the foundations that make compression possible. As shown, the order of the two compression stages cannot be swapped. Next, a closer look at why a compressed file cannot be compressed again, and then the requirements and methods of coding compression.

A compressed file cannot be compressed again because:

1. Phrase compression removes repeats of three or more bytes; what it leaves is unmatched single and double bytes together with the (distance, length) pairs. This output may of course still contain repeats of three or more bytes, but the probability is extremely low, because three bytes have 256 * 256 * 256, more than 16 million, possible values.

So phrase compression can only exploit the phrase repetition that "naturally exists" in the original file; squeezing its output again, where the odds of a match are on the order of one in ten million, is pointless.

2. Coding compression exploits the differing usage frequencies of individual bytes: fixed-length codes become variable-length codes, with shorter codes for frequently used bytes and longer codes for rarely used bytes, which achieves compression. If the "result" of coding compression is read back in units of one byte, the usage frequency of each byte value will be roughly equal, because the new byte values occur essentially at random. With equal frequencies there is nothing to gain from swapping code lengths, since the bytes made shorter would not outnumber the bytes made longer.

So coding compression can only exploit the unevenness of single-byte frequencies that "naturally exists" in the original file; after compression the frequencies become essentially random, and compressing again loses its meaning.

Now for the requirements and methods of coding compression. First, in order to represent single characters with variable-length codes, the codes must satisfy the "prefix code" requirement: a shorter code must never be the prefix of a longer code. Put the other way around, no character's code may be another character's code plus a few trailing 0 or 1 bits; otherwise the decompressor cannot decode.

Look at the simplest example of a prefix code:

Symbol   Code
a        0
b        10
c        110
d        1110
e        11110

With the code table above, you can easily separate the real information content from the following stream of binary bits:

1110010101110110111100010 - dabbdceaab
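A decoder for this table needs nothing more than a greedy left-to-right scan: because no code is a prefix of another, the moment the accumulated bits match a table entry, a symbol is complete. A minimal C++ sketch:

    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
        const std::map<std::string, char> table = {
            {"0", 'a'}, {"10", 'b'}, {"110", 'c'}, {"1110", 'd'}, {"11110", 'e'}};
        std::string bits = "1110010101110110111100010", cur, out;
        for (char bit : bits) {
            cur += bit;                       // accumulate bits...
            auto it = table.find(cur);
            if (it != table.end()) {          // ...until they match a complete code
                out += it->second;
                cur.clear();
            }
        }
        printf("%s\n", out.c_str());          // prints: dabbdceaab
    }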

To construct a coding system that meets this requirement, a binary tree is the ideal choice. Examine the binary tree below:

                root
              /      \
            ( )       ( )
           /   \     /   \
          a    ( )  d     e
              /   \
             b     c

The characters to be encoded are always placed on the leaves. Suppose that, walking from the root down to a leaf, turning left counts as 0 and turning right counts as 1; then a character's code is the path from the root to the leaf that carries it. Precisely because characters can appear only on leaves, the path to one character can never be the prefix of the path to another, and we get prefix codes that meet the requirement:

a: 00   b: 010   c: 011   d: 10   e: 11

Next, look at the process of coding compression:

To simplify the problem, assume that a file contains only five characters, a, b, c, d, e, and that their occurrence counts are:

a: 6 times

b: 15 times

c: 2 times

d: 9 times

e: 1 time

If we give these five characters fixed-length codes: a: 000 b: 001 c: 010 d: 011 e: 100

Then the length of the entire file is 3*6 + 3*15 + 3*2 + 3*9 + 3*1 = 99 bits.

Represent these five codes with a binary tree (the number on each leaf node is its usage count; the number on each non-leaf node is the sum of its children's counts):

                 root
                /    \
              32      1
             /  \      \
           21    11     1
          /  \   / \     \
         6   15 2   9     1
        (a) (b)(c) (d)   (e)

(An internal node that has only one child serves no purpose and can be removed, its child taking its place.)

              root
             /    \
           32      1
          /  \    (e)
        21    11
       /  \   / \
      6   15 2   9
     (a) (b)(c) (d)

The codes are now: a: 000 b: 001 c: 010 d: 011 e: 1, which still satisfies the "prefix code" requirement.

Step 1: whenever a lower-level node is found whose count is greater than some upper-level node's count, swap their positions and recompute the values of the non-leaf nodes.

First swap the 11 and the 1: the 11 characters under the 11-node each shorten by one bit, the single character e lengthens by one bit, and the total file length shortens by 10 bits.
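Stated generally (in our own notation; the article only implies the rule): swapping a subtree whose leaf counts total w1 at depth d1 with a node of weight w2 at a shallower depth d2 saves (d1 - d2) * (w1 - w2) bits, a genuine saving whenever the deeper weight w1 is the larger. Here that is (2 - 1) * (11 - 1) = 10 bits.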

              root
             /    \
           22      11
          /  \    /  \
        21    1  2    9
       /  \  (e)(c)  (d)
      6   15
     (a) (b)

Then swap 15 with 1, and 6 with 2, finally obtaining this tree:

              root
             /    \
           18      15
          /  \    /  \
         3    15 6    9
        / \  (b)(a)  (d)
       2   1
      (c) (e)

At this point the value of every upper-level node is greater than or equal to the value of every lower-level node, and it seems nothing further can be compressed. But if we combine the two smallest nodes on each level, we often find that room for compression remains.

Step 2: combine the two smallest nodes on each level into a new node, and recompute the values of the related nodes.

In the tree above, the first, second and fourth levels have only one or two nodes, which cannot be recombined, but the third level has four nodes. We combine the smallest two, 3 and 6, and recompute the values of the related nodes, giving the tree below:

              root
             /    \
            9      24
           / \    /  \
          3   6  15   9
         / \ (a)(b)  (d)
        2   1
       (c) (e)

Then, repeat the first step.

This time the 9 on the second level is smaller than the 15 on the third level, so the two can be swapped: the 9 characters each lengthen by one bit, the 15 characters each shorten by one bit, and the total file length shortens by 6 bits. Then recompute the values of the related nodes.

              root
             /    \
           15      18
          (b)     /  \
                 9    9
                / \  (d)
               3   6
              / \ (a)
             2   1
            (c) (e)

Now all upper-level nodes are greater than or equal to all lower-level nodes, and combining the two smallest nodes on any level can no longer produce a parent node smaller than another node on the same or a lower level, so no further improvement is possible.

At this point the length of the entire file is 3*6 + 1*15 + 4*2 + 2*9 + 4*1 = 63 bits.

From this we can see a basic premise of coding compression: the values of the nodes must differ widely from one another, so that the sum of some two nodes is smaller than some other node on the same level or a lower level; only then does swapping nodes bring a benefit.

Therefore, the byte usage frequencies in the original file must differ considerably; otherwise the sum of no two nodes' frequencies will be smaller than the frequency of another node on the same or a lower level, and nothing can be compressed. Conversely, the greater the disparity, the more often two nodes' sum is smaller than same-level or lower-level nodes, and the greater the benefit from swapping nodes.

In this example, after repeating the two steps the optimal binary tree was obtained, but there is no guarantee that these two steps always yield an optimal binary tree. Look at another example:

                root
               /    \
             12      7
            /  \
           5    7
          / \   / \
         2   3  3   4
        /\  /\  /\  /\
       1 1 1 2 1 2 2 2

In this example, all upper-level nodes are greater than or equal to all lower-level nodes, and the two smallest nodes on each level have already been combined, yet the tree can still be optimized further:

                root
               /    \
             12      7
            /  \
           4    8
          / \   / \
         2   2  4   4
        /\  /\  /\  /\
       1 1 1 1 2 2 2 2

By swapping two leaves on the lowest level (a 1 and a 2 at the same depth, which by itself changes nothing), the 8 on the third level becomes greater than the 7 on the second level, and those two can then be swapped to shorten the file further.

Here we can draw a conclusion: an optimal binary coding tree (one in which no upper-level node can profitably be swapped with a lower-level node) must satisfy the following two conditions:

1. All upper-level nodes are greater than or equal to all lower-level nodes.

2. For any node, let its larger child be M and its smaller child be N; then every node on any given level of the subtree under M must be greater than or equal to every node on the corresponding level of the subtree under N.

When these two conditions hold, no level can produce a smaller node to swap downward, nor a larger node to swap upward. The two examples above are relatively simple; in an actual file a byte has 256 possible values, so the binary tree has as many as 256 leaves and needs constant adjustment during construction, and the final tree can be very complicated. Fortunately, there is a very elegant algorithm that builds an optimal binary tree in one pass, proposed by D. A. Huffman. Let us first introduce the steps of the Huffman algorithm, and then prove that the tree obtained through such simple steps is indeed an optimal binary tree.

The steps of the Huffman algorithm are as follows:

· From all the nodes, find the two with the smallest values and build a parent node for them, whose value is the sum of these two nodes.

· Then remove these two nodes from the node sequence and add their parent node into the sequence.

· Repeat the previous two steps until only one node remains in the sequence. At that point an optimal binary tree has been built, and its root is this remaining node.
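These steps translate almost line for line into code. Below is a minimal C++ sketch, using our own Node type and a min-priority queue as the "node sequence"; it illustrates the steps above and is not code from zlib or any actual ZIP implementation. When two nodes have equal weight the tie may be broken either way, so the individual codes can differ from those derived earlier, but the code lengths and the 63-bit total come out the same:

    #include <cstdio>
    #include <queue>
    #include <string>
    #include <vector>

    struct Node {
        long long weight;                    // occurrence count (or sum of children)
        char symbol;                         // meaningful only for leaves
        Node *left = nullptr, *right = nullptr;
    };

    struct ByWeight {                        // orders the priority queue as a min-heap
        bool operator()(const Node* a, const Node* b) const {
            return a->weight > b->weight;
        }
    };

    // Walk the finished tree: left turn = 0, right turn = 1.
    void printCodes(const Node* n, const std::string& code) {
        if (!n->left && !n->right) {
            printf("%c: %s (%zu bits)\n", n->symbol, code.c_str(), code.size());
            return;
        }
        printCodes(n->left, code + "0");
        printCodes(n->right, code + "1");
    }

    int main() {
        // The node sequence, initially the five leaves a(6) b(15) c(2) d(9) e(1).
        std::priority_queue<Node*, std::vector<Node*>, ByWeight> seq;
        const char sym[] = {'a', 'b', 'c', 'd', 'e'};
        const long long freq[] = {6, 15, 2, 9, 1};
        for (int i = 0; i < 5; ++i) seq.push(new Node{freq[i], sym[i]});

        while (seq.size() > 1) {             // repeat until one node remains
            Node* a = seq.top(); seq.pop();  // take the two smallest nodes,
            Node* b = seq.top(); seq.pop();
            seq.push(new Node{a->weight + b->weight, 0, a, b}); // build their parent
        }
        printCodes(seq.top(), "");
        // Code lengths: b=1, d=2, a=3, c=4, e=4; total 1*15+2*9+3*6+4*2+4*1 = 63 bits.
        return 0;
    }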

Let us watch the Huffman tree being built with the example above.

The initial node sequence is:

a(6)  b(15)  c(2)  d(9)  e(1)

Combine the smallest two, c and e:

a(6)   b(15)   d(9)     3
                       / \
                      c   e

Continuing in this way, the tree finally obtained looks like this:

              root
             /    \
           15      18
          (b)     /  \
                 9    9
                (d)  / \
                    6   3
                   (a) / \
                      2   1
                     (c) (e)

At this point each character's code length is the same as in the tree we adjusted into shape earlier, so the total file length is also the same: 3*6 + 1*15 + 4*2 + 2*9 + 4*1 = 63 bits.

The node sequence changes as follows at each step of building the Huffman tree:

6 15 2 9 1

6 15 9 3

15 9 9

15 18

33

Below we use backward induction to prove that, for any collection of different node values, the tree built by the Huffman algorithm is always an optimal binary tree:

Examine the construction process backward. When the node sequence contains only two nodes (for example, the 15 and 18 in the example above), the tree is certainly optimal: one is coded 0 and the other 1, and no further optimization is possible.

Then walk back one step at a time: each step removes one node from the sequence and adds back two, and throughout this process the tree remains an optimal binary tree, because:

1. By the Huffman construction, the two newly added nodes are the smallest two in the current node sequence, so the parent of any other pair of nodes is greater than or equal to the parent of these two. As long as the previous step gave an optimal binary tree, the parents of any other pair of nodes must lie on a level above or equal to theirs, so these two nodes must lie on the lowest level of the current tree.

2. These two new nodes are the smallest, so they cannot be swapped with any node on an upper level. This satisfies the first condition of the optimal binary tree stated earlier.

3. As long as the previous step gave an optimal binary tree, then since the two new nodes are the smallest, they cannot combine with any other node on their level to produce a node smaller than their parent; their parent is less than or equal to the parent of any other pair on that level, so the second condition of the optimal binary tree is still satisfied as well. Stepping back like this one step at a time, the tree is an optimal binary tree at every stage of the Huffman process.

Since each step removes two nodes from the node sequence and adds one, the Huffman tree is built in a total of (original number of nodes - 1) steps; for all its simplicity, the Huffman algorithm is an exquisitely crafted coding compression algorithm.

Appendix: for the Huffman tree, "The Art of Computer Programming" has a completely different proof, whose gist is:

1. The number of internal nodes (non-leaf nodes) of a binary coding tree equals the number of external nodes (leaf nodes) minus 1.

2. The sum of the weighted path lengths of a binary coding tree's external nodes (each value multiplied by its path length) equals the sum of the values of all internal nodes. (Both of these can be proved by mathematical induction on the number of nodes, and are left as exercises.)

3. Examine the construction of the Huffman tree: when there is only one internal node, it is an optimal binary tree.

4. Stepping forward, two minimal external nodes are added, combining to produce a new internal node; the original set of internal nodes is minimal if and only if it remains minimal after this new internal node is added. (Because the two smallest nodes are combined together, they sit on the lowest level; compared with attaching them to any other nodes, their weighted path lengths increase by the least.)

5. As the number of internal nodes grows one by one, the total of the internal-node set remains minimal throughout, so the finished tree has the minimal total weighted path length, i.e., it is optimal.
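As a quick check of lemmas 1 and 2 on the Huffman tree built above: it has 4 internal nodes (33, 18, 9, 3) and 5 leaves, and 4 = 5 - 1; meanwhile 33 + 18 + 9 + 3 = 63, exactly the total file length in bits computed from the code lengths.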

2. Implementation

If there were no compression programs in the world and we had only the compression principles above, there would still be a long way to go before arriving at a program that can compress data of most formats and contents. When we start to write such a program, we find many puzzles waiting to be solved one by one. The following sections describe these problems in order and analyze how the ZIP algorithm solves them. Many of them have universal significance beyond compression itself, finding matches, sorting arrays, and so on, so let us go deep into them and think them through.

To be continued...

When reposting, please credit the original source: https://www.9cbs.com/read-75624.html
