About HUFFMAN compression
0. The principle Huffman encoding is a variable length coding method, which is created by American mathematician David Huffman, a special conversion form of a binary tree. The principle of encoding is to convert the code of the number of times to a shorter length of length, and the number of times the number of use can use a long encoding, and the uniquely soluble solubility of the encoding is maintained. The most fundamental principle of the Huffman algorithm is that the cumulative (coded length of the character's statistics * characters) is minimized, that is, the weight (the code length of the character's statistics * characters).
Huffman tree
Huffman Tree is a special conversion form of a binary tree. The following is an example of the component Huffman tree: such as the following data, abfacgcahgbbaacecdfgfaeabbb advance row statistics A (8) B (6) c (4) d (1) e (2) f (3) g (3) h (1) brackets Inside is the number of statistics
Generate Huffman Tree: The two nodes (Node) each time you take into a node (Node), and add the cumulative value as the cumulative value of the new contact, the top layer is the root node (root) Note: list The smallest node is that all nodes including the merged nodes, the nodes that have been merged are not in the list.
The process is as follows: 1: D h (2) 2: de h (4) 3: f g (6) 4: C DEH (8) 5: B FG (12) 6: a CDEH (16) 7: ACDEH BFG (28)
So transformation into Huffman tree is
HUFFMAN tree layer
Root ┌┴┐ ACDEH BFG 1 ┌┴┐┌┴┐ CDEH A b FG 2 ┌┴┐ ┌┴┐ DEH C F G 3 ┌┴┐ DH E 4┌┴┐D H 5
Take the left is 1 right side is 0. Note: The number of layers is the number of digits or is the code length, weight = code length * The number of statistics of this word.
Code bit number weight a = 10 2 16 b = 01 2 12 C = 110 3 12 d = 11111 5 5 E = 1110 4 8 f = 001 3 9 g = 000 2 6 h = 11110 5 5
It can be seen that the Huffman code is unique decodable. If you read 110, it must be C, and there is no code starting at 110.
If you do not use the Huffman algorithm, use ordinary encoding, what is the result?
HUFFMAN tree layer
Root ┌┴┐ ABCD EFGH 1 ┌┴┐ ┌┴┐ AB CD EF GH 2┌┴┐┌┴┐┌┴┐ ┌┴┐A B C D E F G H 3
Take the left is 1 right, there is 0
Code bit number weight a = 111 3 24 b = 110 3 18 c = 101 3 12 D = 100 3 3 E = 011 3 6 f = 010 3 9 g = 001 3 9 h = 000 3 3 Using Huffman encoding The weight accumulation is 73. If you use ordinary minimum coding, use 84 character lengths. From this comparison, you can see how Huffman is compressed.
2. Encoding and decoding encoding: ABCDEFGH generated encoding generated by the Huffman tree corresponds to the file, and retains the original Huffman tree, mainly the information of the coding segment. Generally, 511 units are required to store 256 elements to store HUFFMAN trees, each HUFFMAN tree must have the following structures: Code, CHAR, LEFT, RIGHT, PROBILITY, usually uses an array structure. Because only you need to use Code during decoding, you only need to record the encoding of each element.
Decoding: Using the HUFFMAN encoding saved in the file, the encoding is interpreted, and the variable length coding is converted into a fixed length code.
3. Development Because Huffman coding needs to be scan twice, the first time is the statistics, the second is the code write file, which greatly affects the speed, so some people have invented Enhanced Huffman Aglorithm. This algorithm only scans a file, dynamically generates the HUFFMAN tree, that is, re-encoding a HUFFMAN tree per read to achieve the purpose of improving speed. Dynamic restore technology is used during decoding.