Data compression predates the computer. Long before electronic computing, military scientists and mathematicians were already concerned with how to store and transmit messages efficiently. With the birth and development of information theory, data compression grew from an ad hoc craft into a systematic technology. Data compression falls into two categories: lossless compression and lossy compression. Lossless compression means that when the compressed data is reconstructed (restored, or decompressed), the reconstructed data is exactly identical to the original; it is used wherever the reconstructed signal must agree with the original completely, the compression of disk files being a very common example. At the current state of the art, lossless algorithms can generally compress ordinary documents to between 1/2 and 1/4 of their original size. Lossy compression means that the data reconstructed from the compressed data differs from the original, but not in a way that causes the information expressed by the original to be misunderstood; it is suitable where the reconstructed signal need not be identical to the original. Images and sound, for example, can be compressed lossily, because they usually carry more data than our visual and auditory systems can absorb: discarding some of it does not distort the meaning the image or sound conveys, yet it greatly increases the compression ratio.
Compression techniques can be classified as follows:
Compression technology
  General-purpose lossless data compression
    Statistical-model-based coding: Huffman coding (e.g. the fax standards), arithmetic coding (e.g. JPEG's entropy coder)
    Dictionary-based coding: LZ77 (PKZIP, LHarc, ARJ), LZ78 (UNIX Compress), LZW (GIF, PostScript, etc.)
  Multimedia data compression (mostly lossy)
    Audio compression: MP3, etc.
    Image compression: bi-level, grayscale/color and vector images (JPEG, GIF, WMF, etc.)
    Video compression: MPEG-2, AVI, etc.
Long before advanced applications such as the Compact and Compress programs under UNIX appeared, scientists had found in their research that ordinary data contains a certain redundancy, and that by adopting a suitable model and coding method this redundancy can be reduced. Claude Shannon of Bell Labs and R. M. Fano of MIT proposed, almost simultaneously, the earliest coding method aimed at data compression, the Shannon-Fano code. D. A. Huffman published his paper "A Method for the Construction of Minimum-Redundancy Codes" in 1952; from then on, data compression began to be implemented in commercial programs and applied in many technical fields. The Compact compressor on UNIX systems is a concrete implementation of 0th-order adaptive Huffman coding. In the early 1980s Huffman coding was implemented on CP/M and DOS systems, its representative program being SQ. Huffman's paper opened the era of data compression technology: through the 1960s, 1970s, and even the 1980s, the field was almost completely dominated by Huffman coding and its variants. If it were not for the two Israelis mentioned below, we might still be living today with files squeezed only by Huffman's combinations of 0s and 1s. By the 1980s, mathematicians were no longer satisfied with some fatal weaknesses of Huffman coding. Starting from a new angle while following Huffman's leading idea, they designed a more precise coding method that approaches the "entropy" limit of information theory more closely: arithmetic coding. Thanks to its exquisite design, arithmetic coding finally let people push toward the limit of data compression: it can be shown that arithmetic coding squeezes out the redundancy of a message to the greatest extent, expressing the original content with the fewest possible symbols. Of course, arithmetic coding also brings new challenges to programmers and computers: implementing and running it requires harder programming work and faster computer systems. On the same machine, arithmetic coding achieves the best compression but consumes several times more computation, which is why it cannot be found in our everyday compression tools. So, can compression surpass Huffman coding without increasing the program's demands on system resources and time? For that we must thank the two Israelis introduced below. Until 1977, research on data compression concentrated on entropy, character and word frequencies, and statistical models, with researchers racking their brains for faster and better ways to improve Huffman-style coding. After 1977 everything changed. In 1977 the Israelis Jacob Ziv and Abraham Lempel published "A Universal Algorithm for Sequential Data Compression"; in 1978 they published its sequel, "Compression of Individual Sequences via Variable-Rate Coding". The compression techniques proposed in these two papers are called LZ77 and LZ78 respectively (for whatever reason, the initials of the authors' names are reversed). Simply put, these two methods depart completely from the traditional line of thought running from Shannon through Huffman to arithmetic coding, and coding based on this idea came to be called "dictionary" coding.
Dictionary coding not only compresses better than Huffman coding in most cases; it is easy to implement well, and its compression and decompression speeds are also remarkable. In 1984 Terry Welch published "A Technique for High-Performance Data Compression", describing the results of his research at the Sperry Research Center (now part of Unisys). He implemented a variant of the LZ78 algorithm, LZW. LZW inherits the good compression and the speed of LZ77 and LZ78, is easier to follow in its algorithmic description (some researchers consider Welch's paper easier to understand than Ziv and Lempel's), and is relatively simple to implement. Soon a Compress program using the LZW algorithm appeared; with excellent performance and good documentation it quickly became the standard compression program on UNIX. It was followed by the ARC program in the MS-DOS environment (System Enhancement Associates, 1985) and by imitations such as PKWARE's PKARC. For a time LZ78 and LZW ruled the UNIX and DOS platforms. From the mid-1980s on, people kept improving LZ77, and a number of compression programs still in use today were born: Haruyasu Yoshizaki's LHarc and Robert Jung's ARJ are two famous examples. LZ77 thus came to share today's general-purpose data compression field with LZ78 and LZW. At present dictionary-based compression has widely recognized standards, from the old PKZIP to today's WinZip; especially with the popularity of file transfer on the Internet, the ZIP format has become a de facto standard, and virtually no general-purpose file compression or archiving system fails to support it.

This chapter introduces the most mature and most widely used lossless compression coding techniques, including Huffman coding, arithmetic coding, RLE coding, and dictionary coding. Note that some compression algorithms are protected by US patent law (for example, parts of the LZW algorithm and some high-order arithmetic compression algorithms).

4.1 Shannon-Fano and Huffman Coding

4.1.1 Shannon-Fano Coding

The Shannon-Fano coding algorithm rests on two basic concepts:

1. Entropy
(1) Entropy is a measure of the amount of information, indicating how much information in a message genuinely needs to be encoded. The less likely an event is (the smaller its probability), the more information its occurrence carries.
(2) The amount of information of an event (sometimes called its surprise) is expressed as Ii = -log2 pi, where pi is the probability of the i-th event and 0 < pi <= 1. With logarithms taken to base 2, the unit of entropy is the bit.

2. Entropy of a source S
According to Shannon's theory, the entropy of a source S is defined as H(S) = sum over i of pi * log2(1/pi), where pi is the probability that symbol si appears in S, and log2(1/pi) is the amount of information contained in si, i.e. the number of bits needed to encode si. For example, for an image represented with 256 gray levels in which every gray level is equally likely, pi = 1/256 and encoding each pixel requires log2 256 = 8 bits. This uniform case is the maximum entropy distribution; the minimum entropy distribution is the one in which a single symbol has probability 1 and all remaining symbols have probability 0, giving H = 0 bits (with the convention 0*log2 0 = 0).
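As a small illustration of these formulas, the following Python sketch (purely illustrative; the function names are not from the text) computes the information content of each symbol and the entropy-based bit count of an arbitrary string. It anticipates the worked string example discussed next.

    # Illustrative sketch: per-symbol information content and total entropy bits.
    from collections import Counter
    from math import log2

    def symbol_information(s):
        """Return {symbol: (probability, information content in bits)}."""
        counts = Counter(s)
        n = len(s)
        return {sym: (c / n, -log2(c / n)) for sym, c in counts.items()}

    def total_bits(s):
        """Entropy-based lower bound on the bits needed to code the whole string."""
        counts = Counter(s)
        return sum(c * -log2(c / len(s)) for c in counts.values())

    if __name__ == "__main__":
        msg = "aabbaccbaa"                          # the example string below
        for sym, (p, info) in sorted(symbol_information(msg).items()):
            print(f"{sym}: p = {p:.1f}, information = {info:.3f} bits")
        print(f"total: {total_bits(msg):.3f} bits")   # about 14.855 bits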
For example, consider a string over the three characters A, B, C: "aabbaccbaa". The string length is 10; the characters A, B, C appear 5, 3, and 2 times, so their probabilities are 0.5, 0.3, and 0.2 respectively. Their information contents are

  EA = -log2(0.5) = 1
  EB = -log2(0.3) = 1.737
  EC = -log2(0.2) = 2.322

and the entropy of the whole message, i.e. the number of bits needed for the whole string, is EA*5 + EB*3 + EC*2 = 14.855 bits. If the ASCII codes commonly used in computers were used instead, the string would need a full 80 bits! Why can information be compressed without losing any of its content? Simply put, use fewer bits for the symbols that occur more often; that is the basic principle of data compression. (How can the binary digits 0 and 1 represent a fractional number of bits? It is difficult, but not impossible; once a way is found to represent fractional bit counts accurately, coding can approach the lossless limit. Arithmetic coding, introduced later in this chapter, does exactly this.)

[Example 4.1] A grayscale image consists of 40 pixels with 5 gray levels, denoted by the symbols A, B, C, D, and E. Among the 40 pixels, 15 have gray level A, 7 have gray level B, 7 have gray level C, and so on, as shown in Table 4-01. If the five gray levels are represented with 3 bits each, i.e. every pixel is coded with 3 digits (fixed-length coding), then 120 bits are needed to encode the image.

Table 4-01 Number of occurrences of each symbol in the image
  Symbol             A    B    C    D    E
  Number of pixels   15   7    7    6    5

According to Shannon's theory, the entropy of this image is
  H(S) = (15/40)*log2(40/15) + (7/40)*log2(40/7) + ... + (5/40)*log2(40/5) = 2.196
which means that on average each symbol can be expressed with 2.196 bits, so the 40 pixels need only 87.84 bits.

This coding idea was first elaborated and implemented by Shannon (1948) and Fano (1949), and is therefore known as the Shannon-Fano algorithm. The method codes from the top down. First the symbols are sorted by frequency (or probability) in descending order, e.g. A, B, C, D, E as shown in Table 4-02. Then the set is recursively divided into two parts whose total counts are as nearly equal as possible, one part receiving a 0 and the other a 1, as shown in Figure 4-01. Coded this way the image needs 91 bits in total, an actual compression ratio of about 1.3:1.

Table 4-02 Shannon-Fano algorithm example
  Symbol   Count (pi)    log2(1/pi)   Assigned code   Bits needed
  A        15 (0.375)    1.4150       00              30
  B         7 (0.175)    2.5145       01              14
  C         7 (0.175)    2.5145       10              14
  D         6 (0.150)    2.7369       110             18
  E         5 (0.125)    3.0000       111             15

Figure 4-01 Shannon-Fano coding
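A compact way to see the splitting procedure at work is the following Python sketch (an illustrative implementation, not the one used by any of the programs mentioned in this chapter); applied to the counts of Table 4-02 it reproduces the codes and the 91-bit total.

    # Illustrative Shannon-Fano coder: sort by falling count, recursively split
    # each group into two halves of (as nearly as possible) equal total count.
    def shannon_fano(freqs):
        """freqs: dict symbol -> count. Returns dict symbol -> code string."""
        symbols = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
        codes = {sym: "" for sym, _ in symbols}

        def split(group):
            if len(group) < 2:
                return
            total = sum(c for _, c in group)
            best_cut, best_diff, running = 1, None, 0
            for i, (_, c) in enumerate(group[:-1]):
                running += c
                diff = abs(2 * running - total)   # imbalance of this cut point
                if best_diff is None or diff < best_diff:
                    best_diff, best_cut = diff, i + 1
            for sym, _ in group[:best_cut]:
                codes[sym] += "0"
            for sym, _ in group[best_cut:]:
                codes[sym] += "1"
            split(group[:best_cut])
            split(group[best_cut:])

        split(symbols)
        return codes

    if __name__ == "__main__":
        freqs = {"A": 15, "B": 7, "C": 7, "D": 6, "E": 5}   # Table 4-02
        codes = shannon_fano(freqs)
        print(codes)
        print("total bits:", sum(freqs[s] * len(codes[s]) for s in freqs))  # 91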
4.1.2 Huffman Coding

Huffman proposed another coding method in 1952, one that constructs the code from the bottom up. It is illustrated here with a concrete example:
(1) Initialization: sort the symbols by probability in descending order, as shown in Table 4-03 and Figure 4-02.
(2) Merge the two symbols with the smallest probabilities into one node, as D and E in Figure 4-02 form node P1.
(3) Repeat step 2 to obtain nodes P2, P3, and P4, forming a "tree" in which P4 is the root node.
(4) On every path from the root P4 down to the "leaf" of each symbol, mark one branch "0" and the other "1". Which branch gets the "1" and which the "0" does not matter; the resulting codes differ, but the code lengths, and hence the average code length, are the same.
(5) Reading from the root node P4 down to each leaf gives the code of the corresponding symbol, as shown in Table 4-03.
(6) According to Shannon's theory, the entropy of this image is
  H(S) = (15/39)*log2(39/15) + (7/39)*log2(39/7) + ... + (5/39)*log2(39/5) = 2.1859
and the compression ratio achieved is 1.37:1.

Table 4-03 Huffman code example
  Symbol   Count (pi)     log2(1/pi)   Assigned code   Bits needed
  A        15 (0.3846)    1.38         0               15
  B         7 (0.1795)    2.48         100             21
  C         6 (0.1538)    2.70         101             18
  D         6 (0.1538)    2.70         110             18
  E         5 (0.1282)    2.96         111             15

Figure 4-02 Huffman coding method

Huffman code words have variable length, yet no extra synchronization code is needed: it is a prefix code. For example, if the first bit of a code string is 0, that symbol must be A, because no other symbol's code begins with 0, so the following bit already belongs to the next symbol's code. Similarly, if "110" appears, it must represent symbol D. If a "dictionary" explaining the meaning of each code, i.e. a code book, is written out in advance, decoding can be carried out code word by code word according to that code book. As with Shannon-Fano coding, the codes are self-punctuating: no marker symbols (special codes that separate symbols during decoding) need to be inserted into the encoded stream.

Two problems arise when using Huffman coding:
1. Huffman codes have no error protection. During decoding, if the code string contains no errors it is decoded correctly; but if there is an error, even a single wrong bit, then not only is that code word decoded wrongly, worse still a whole stretch that follows becomes garbled, a phenomenon called error propagation. The computer is powerless against such errors: it cannot even tell that something has gone wrong, let alone correct it.
2. Huffman codes are variable-length codes, so it is difficult to jump to an arbitrary position in the compressed file and start decoding there; this has to be taken into account before the coded data is stored.

Despite this, Huffman coding is still widely used. The coding efficiency of Huffman coding is higher than that of Shannon-Fano coding.
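The bottom-up construction can likewise be sketched in a few lines of Python (illustrative only; it uses a binary heap to pick the two least probable nodes, and the particular 0/1 assignment may differ from Table 4-03, although the code lengths and the 87-bit total are the same).

    # Illustrative Huffman coder: repeatedly merge the two least frequent nodes,
    # then read the codes off the resulting tree.
    import heapq

    def huffman_codes(freqs):
        """freqs: dict symbol -> count. Returns dict symbol -> binary code string."""
        heap = [[count, i, sym] for i, (sym, count) in enumerate(freqs.items())]
        heapq.heapify(heap)
        if len(heap) == 1:
            return {heap[0][2]: "0"}
        nodes = {}                        # merged node -> (left child, right child)
        next_id = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)      # the two smallest counts form a new node
            hi = heapq.heappop(heap)
            node = ("N", next_id)
            nodes[node] = (lo[2], hi[2])
            heapq.heappush(heap, [lo[0] + hi[0], next_id, node])
            next_id += 1
        codes = {}
        def walk(item, prefix):
            if item in nodes:
                left, right = nodes[item]
                walk(left, prefix + "0")  # the 0/1 assignment is arbitrary
                walk(right, prefix + "1")
            else:
                codes[item] = prefix
        walk(heap[0][2], "")
        return codes

    if __name__ == "__main__":
        freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}    # Table 4-03
        codes = huffman_codes(freqs)
        print(codes)
        print("total bits:", sum(freqs[s] * len(codes[s]) for s in freqs))  # 87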
4.2 Arithmetic Coding

Arithmetic coding plays an important role in image compression standards (such as JPEG and JBIG). In arithmetic coding a message is encoded as a real number between 0 and 1. Arithmetic coding works with two basic parameters: the probabilities of the symbols and their coding intervals. The probabilities of the source symbols determine the efficiency of the compression, and they also determine the intervals assigned to the source symbols during encoding; these intervals lie within [0, 1). The intervals produced during encoding in turn determine the compressed output. The encoding process of an arithmetic coder can be explained with the following example.

[Example 4.2] Suppose the source symbols are {00, 01, 10, 11} and their probabilities are {0.1, 0.4, 0.2, 0.3}. According to these probabilities the interval [0, 1) can be divided into four sub-intervals: [0, 0.1), [0.1, 0.5), [0.5, 0.7), and [0.7, 1), where [x, y) denotes a half-open interval containing x but not y. This information is summarized in Table 4-04.

Table 4-04 Source symbols, probabilities, and initial coding intervals
  Symbol            00          01           10           11
  Probability       0.1         0.4          0.2          0.3
  Initial interval  [0, 0.1)    [0.1, 0.5)   [0.5, 0.7)   [0.7, 1)

Suppose the binary message sequence to be encoded is: 10 00 11 00 10 11 01. The first symbol entering the encoder is 10, whose coding interval is [0.5, 0.7). The coding interval of the second symbol, 00, is [0, 0.1), so the first one-tenth of [0.5, 0.7) is taken as the new interval [0.5, 0.52). Continuing in the same way, the new interval after encoding the third symbol 11 is [0.514, 0.52), after the fourth symbol 00 it is [0.514, 0.5146), and so on. The encoded output of the message can be any number in the final interval.
The entire encoding process is shown in Figure 4-03. The complete encoding and decoding of this example are shown in Table 4-05 and Table 4-06 respectively.

Table 4-05 Encoding process
  Step   Input symbol   New interval            Explanation
  1      10             [0.5, 0.7)              the symbol's initial interval
  2      00             [0.5, 0.52)             first 1/10 of [0.5, 0.7)
  3      11             [0.514, 0.52)           last 3/10 of [0.5, 0.52)
  4      00             [0.514, 0.5146)         first 1/10 of [0.514, 0.52)
  5      10             [0.5143, 0.51442)       from 5/10 to 7/10 of [0.514, 0.5146)
  6      11             [0.514384, 0.51442)     last 3/10 of [0.5143, 0.51442)
  7      01             [0.5143876, 0.514402)   from 1/10 to 5/10 of [0.514384, 0.51442)
  8      choose a number from [0.5143876, 0.514402) as the output, e.g. 0.5143876

Table 4-06 Decoding process (received value 0.51439)
  Step   Interval                Decoded symbol   Explanation
  1      [0.5, 0.7)              10               0.51439 lies in [0.5, 0.7)
  2      [0.5, 0.52)             00               in the first 1/10 of [0.5, 0.7)
  3      [0.514, 0.52)           11               in the last 3/10 of [0.5, 0.52)
  4      [0.514, 0.5146)         00               in the first 1/10 of [0.514, 0.52)
  5      [0.5143, 0.51442)       10               between 5/10 and 7/10 of [0.514, 0.5146)
  6      [0.514384, 0.51442)     11               in the last 3/10 of [0.5143, 0.51442)
  7      [0.5143876, 0.514402)   01               between 1/10 and 5/10 of [0.514384, 0.51442)
  8      decoding stops; the decoded message is 10 00 11 00 10 11 01

Based on the example above, the calculation can be summarized as follows. Consider an alphabet of m symbols i = 1, 2, ..., m with probabilities p(i) = pi, and let Pi = p1 + p2 + ... + pi be the cumulative probability, with P0 = 0. Let xn denote the n-th input symbol and [Ln, Rn) the n-th sub-interval, where L0 = 0 and d0 = 1; Ln is the left boundary, Rn the right boundary, and dn = Rn - Ln the interval length. The encoding steps are as follows.

Step 1: Assign each symbol a sub-interval of [0, 1) whose length equals its probability. The initial sub-interval of the first symbol x1 = i is I1 = [L1, R1) = [P(i-1), Pi). Let d1 = R1 - L1, L = L1, and R = R1.

Step 2: Write L and R in binary as L = 0.u1u2u3... and R = 0.v1v2v3..., where each ui and vi is either 0 or 1.
  If u1 differs from v1, send nothing and go to step 3; if u1 = v1, transmit the bit u1.
  Then compare u2 and v2: if they differ, send nothing and go to step 3; if u2 = v2, transmit the bit u2.
  ... Continue comparing in this way until the two bits differ, then go to step 3.

Step 3: Increase n by 1 and read the next symbol. If the n-th input symbol is xn = i, subdivide the previous interval as
  Ln = L(n-1) + d(n-1)*P(i-1),  Rn = L(n-1) + d(n-1)*Pi,  dn = Rn - Ln = d(n-1)*pi.
Let L = Ln and R = Rn, then go to step 2.

[Example 4.3] A source has 4 symbols whose probabilities are shown in Table 4-07.

Table 4-07 Symbol probabilities
  Source symbol ai   1          2             3               4
  Probability pi     p1 = 0.5   p2 = 0.25     p3 = 0.125      p4 = 0.125
  Initial interval   [0, 0.5)   [0.5, 0.75)   [0.75, 0.875)   [0.875, 1)

The input sequence is xn: 2, 1, 3, .... The encoding process, shown in Figure 4-04, runs as follows. The first symbol is x1 = 2, i.e. i = 2, which defines the initial interval [0.5, 0.75), so d1 = 0.25. The binary expansions of the left and right boundaries are L = 0.5 = 0.1(B) and R = 0.75 = 0.11(B). By step 2, u1 = v1 = 1, so 1 is sent; since u2 differs from v2, go to step 3. The second symbol is x2 = 1, i.e. i = 1, with sub-interval [0.5, 0.625), giving d2 = 0.125. The binary boundaries are L = 0.5 = 0.100...(B) and R = 0.625 = 0.101(B). By step 2, u2 = v2 = 0, so 0 is sent; u3 and v3 differ, so after sending 0 go to step 3. The third symbol is x3 = 3, i.e. i = 3, with sub-interval [0.59375, 0.609375), giving d3 = 0.015625. The binary boundaries are L = 0.59375 = 0.10011(B) and
R = 0.609375 = 0.100111(B). By step 2, u3 = v3 = 0, u4 = v4 = 1, and u5 = v5 = 1, but u6 and v6 differ, so after sending 011 go to step 3. ... The bits sent so far are therefore 10011.... The last symbol to be encoded is the end-of-message symbol.

Figure 4-04 The concept of arithmetic coding

In this example, the first bit the arithmetic decoder receives is "1", which restricts the interval to [0.5, 1); but three symbols (2, 3, and 4) have code intervals within this range, so the first bit does not carry enough information to decode. After the second bit arrives the received prefix becomes "10", corresponding to the interval [0.5, 0.75); since this interval lies entirely inside the interval of symbol 2, the first symbol can be determined to be 2. The decoding situation after each received bit is shown in Table 4-08.

Table 4-08 Decoding process
  Received bits   Interval               Decoded output
  1               [0.5, 1)               -
  10              [0.5, 0.75)            2
  100             [0.5, 0.625)           1
  1001            [0.5625, 0.625)        -
  10011           [0.59375, 0.609375)    3
  ...             ...                    ...

In the example above we assumed that both the encoder and the decoder know the length of the message, so the decoder does not keep running without limit. In practice a dedicated terminating symbol is included, and decoding stops when the decoder sees the terminator. Several points deserve attention in arithmetic coding:
(1) Because the precision of a real computer is not unlimited, overflow is an obvious problem; however, most machines offer 16-, 32-, or 64-bit precision, and the problem can be solved with a scaling technique.
(2) The arithmetic encoder produces only one code word for the entire message, a real number in [0, 1), so the decoder cannot begin decoding until it has received all the bits representing this real number.
(3) Arithmetic coding is also very sensitive to errors; a single wrong bit causes the whole message to be decoded incorrectly.
Arithmetic coding can be static or adaptive. In static arithmetic coding the probabilities of the source symbols are fixed. In adaptive arithmetic coding the probabilities are modified dynamically according to the frequencies of the symbols seen so far during encoding; this process of estimating the source symbol probabilities while encoding is called modeling. Dynamic arithmetic coding was developed because the precise probabilities of a source are usually hard to know in advance, and expecting to know them is unrealistic. When compressing a message we cannot expect the arithmetic coder to achieve maximum efficiency from the start; the most effective approach is to estimate the probabilities during encoding. Dynamic modeling is therefore the key to the compression efficiency of an arithmetic coder.
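The interval-narrowing procedure of Example 4.2 can be sketched as follows (a floating-point toy version suitable only for short messages; practical coders use the incremental integer arithmetic described in the steps above, together with a terminator or a known message length):

    # Illustrative arithmetic coder: narrow [low, high) symbol by symbol.
    def arithmetic_encode(message, model):
        """model: list of (symbol, probability). Returns the final interval."""
        cum, sub = 0.0, {}                      # cumulative sub-intervals (Table 4-04)
        for sym, p in model:
            sub[sym] = (cum, cum + p)
            cum += p
        low, high = 0.0, 1.0
        for sym in message:
            width = high - low
            s_low, s_high = sub[sym]
            low, high = low + width * s_low, low + width * s_high
        return low, high        # any number in [low, high) encodes the message

    def arithmetic_decode(value, model, n_symbols):
        cum, ranges = 0.0, []
        for sym, p in model:
            ranges.append((sym, cum, cum + p))
            cum += p
        out, low, high = [], 0.0, 1.0
        for _ in range(n_symbols):
            width = high - low
            for sym, s_low, s_high in ranges:
                if low + width * s_low <= value < low + width * s_high:
                    out.append(sym)
                    low, high = low + width * s_low, low + width * s_high
                    break
        return out

    if __name__ == "__main__":
        model = [("00", 0.1), ("01", 0.4), ("10", 0.2), ("11", 0.3)]
        msg = ["10", "00", "11", "00", "10", "11", "01"]
        print(arithmetic_encode(msg, model))            # ~[0.5143876, 0.514402)
        # 0.51439 is the value used in Table 4-06; it lies inside the final interval.
        print(arithmetic_decode(0.51439, model, len(msg)) == msg)   # True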
4.3 RLE Coding

An image often contains patches of the same color. Within these patches many rows have the same color, or many consecutive pixels in one row have the same color value. In such cases there is no need to store the color value of every pixel: it is enough to store one color value together with the number of pixels sharing it, or one color value together with the number of rows sharing it. This kind of compression coding is called run-length encoding (RLE); a stretch of consecutive pixels with the same color is called a run, and the number of pixels in it is the run length.

Suppose the pixel values of the n-th row of a grayscale image are as shown in Figure 4-05. RLE stores each run as a pair: a run length followed by the color value of the run. In the code of Figure 4-05 the numbers printed in bold are the run lengths and the numbers following them are the pixel color values; for example, the bold 50 followed by 8 means that 50 consecutive pixels share the same color value, namely 8. Comparing the number of codes before and after RLE coding, this row requires 73 values before encoding but only a handful afterwards, so the amount of data before and after compression is roughly in the ratio 7:1, i.e. the compression ratio is about 7:1. This shows that RLE is indeed a compression technique, and an intuitive and very economical one. When decoding, the same rule used for encoding is applied in reverse, and the restored data is exactly identical to the data before compression.

How large a compression ratio RLE can achieve depends mainly on the characteristics of the image itself. The larger the blocks of identical color and the fewer such blocks there are, the higher the compression ratio; conversely, the smaller the compression ratio. RLE is especially suitable for computer-generated images and is very effective at reducing their storage space. For natural images, however, RLE is less satisfactory: runs of identically colored pixels within a row are often short, and consecutive rows with identical values are rarer still. If RLE is applied anyway, not only may the image data fail to shrink, it may even grow. Note that this does not mean RLE cannot be used on natural images at all; rather, RLE is rarely used on its own for natural images and is usually combined with other compression coding techniques.
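A minimal run-length coder along these lines might look as follows (a sketch: the scan line used in the test is hypothetical, since the data of Figure 4-05 is not reproduced here, and real formats add escape conventions on top of this basic scheme):

    # Illustrative RLE: store each run as a (run length, value) pair.
    def rle_encode(pixels):
        runs = []
        for value in pixels:
            if runs and runs[-1][1] == value:
                runs[-1][0] += 1                 # extend the current run
            else:
                runs.append([1, value])          # start a new run
        return [(length, value) for length, value in runs]

    def rle_decode(runs):
        out = []
        for length, value in runs:
            out.extend([value] * length)
        return out

    if __name__ == "__main__":
        row = [8] * 50 + [3] * 20 + [0] * 3      # hypothetical scan line
        encoded = rle_encode(row)
        print(encoded)                           # [(50, 8), (20, 3), (3, 0)]
        assert rle_decode(encoded) == row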
4.4 Dictionary Coding

In many situations data must be encoded without knowing its statistical characteristics, and sometimes those characteristics cannot be known in advance at all. For such cases a number of data compression methods have been proposed that still try to achieve the highest possible compression ratio; they are collectively called universal coding techniques, and dictionary coding belongs to this family.

4.4.1 The Idea of Dictionary Coding

Dictionary coding exploits the fact that the data itself contains repeated patterns; text files and raster images, for example, have this property. There are many dictionary coding methods, but they fall into two broad classes.

The first class of dictionary algorithms checks whether the character sequence currently being compressed has already appeared in the previously processed input and, if so, outputs a pointer to the earlier occurrence of the repeated string instead of the string itself. This coding is shown in Figure 4-06.

Figure 4-06 The concept of the first class of dictionary coding

Here the "dictionary" refers to the previously processed data itself, which is used to represent the repeated parts encountered during encoding. Coding algorithms of this class are based on the algorithm published in 1977 by Abraham Lempel and Jacob Ziv, known as the LZ77 algorithm, and on its improvements, for example the LZSS algorithm proposed by Storer and Szymanski.

The second class of dictionary algorithms tries to build a "dictionary of phrases" from the input data. A phrase here is not necessarily a meaningful phrase; it can be any combination of characters. When a "phrase" already in the dictionary appears again during encoding, the encoder outputs the phrase's index number in the dictionary rather than the phrase itself. This concept is shown in Figure 4-07.

Figure 4-07 The concept of the second class of dictionary coding

J. Ziv and A. Lempel first published this coding method in 1978. Building on their work, Terry A. Welch published an improvement of the algorithm in 1984, so this coding method is called LZW (Lempel-Ziv-Welch) compression; the algorithm was first applied in high-speed hard disk controllers.

4.4.2 The LZ77 Algorithm

To explain the principle of the LZ77 algorithm it helps to introduce the terms used by the algorithm:
(1) Input stream: the character sequence to be compressed.
(2) Character: the basic unit of the input data stream.
(3) Coding position: the position in the input data stream currently being encoded, i.e. the first character of the look-ahead buffer.
(4) Look-ahead buffer: the memory holding the characters from the coding position to the end of the input data stream.
(5) Window: a window of W characters ending just before the coding position, i.e. the last W characters already processed (a sliding window).
(6) Pointer: a pointer giving the starting position and the length of the matching string in the window.

The core of LZ77 encoding is to find, in the window, the longest string matching the beginning of the look-ahead buffer. The encoding algorithm proceeds as follows:
(1) Set the coding position to the start of the input data stream.
(2) Find the longest match in the window for the look-ahead buffer.
(3) Output the result in the form "(Pointer, Length) Character", where Pointer points to the matching string in the window, Length is the length of the match, and Character is the first non-matching character in the look-ahead buffer. When there is no match at all, output "(0, 0) Character".
(4) If the look-ahead buffer is not empty, move the coding position and the window forward Length + 1 characters and return to step 2.

[Example 4.4] The data to be encoded is shown in Table 4-09 and the encoding process in Table 4-10, where:
(1) the "Step" column gives the encoding step;
(2) the "Position" column gives the coding position, the first character of the input stream being position 1;
(3) the "Match" column shows the longest matching string found in the window;
(4) the "Character" column shows the first character in the look-ahead buffer after the match;
(5) the "Output" column is written in the form "(back_chars, chars_length) explicit_character".
Here (back_chars, chars_length) is the pointer to the matching string: it tells the decoder "go back back_chars characters in this window and copy chars_length characters to the output", and explicit_character is a literal character. For example, the output "(5, 2) C" in Table 4-10 tells the decoder to go back 5 characters and copy 2 characters, namely "AB".

Table 4-09 Input data stream
  Position    1  2  3  4  5  6  7  8  9
  Character   A  A  B  C  B  B  A  B  C

Table 4-10 Encoding process
  Step   Position   Match   Character   Output
  1      1          -       A           (0, 0) A
  2      2          A       B           (1, 1) B
  3      4          -       C           (0, 0) C
  4      5          B       B           (2, 1) B
  5      7          AB      C           (5, 2) C
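The steps of Example 4.4 can be reproduced with a small, unoptimized Python sketch (illustrative only: real LZ77 implementations bound the window, pack the pointers into bits, and use hashing to find matches quickly):

    # Illustrative LZ77 coder: emits (back, length, next_character) triples.
    def lz77_encode(data, window_size=16):
        out, pos = [], 0
        while pos < len(data):
            start = max(0, pos - window_size)
            best_len, best_back = 0, 0
            for back in range(1, pos - start + 1):      # candidate match starts
                length = 0
                while (pos + length < len(data) - 1 and
                       data[pos - back + (length % back)] == data[pos + length]):
                    length += 1                          # % back allows overlapping runs
                if length > best_len:
                    best_len, best_back = length, back
            next_char = data[pos + best_len]             # first unmatched character
            out.append((best_back, best_len, next_char))
            pos += best_len + 1
        return out

    def lz77_decode(tokens):
        data = []
        for back, length, char in tokens:
            for _ in range(length):
                data.append(data[-back])                 # copy from the window
            data.append(char)
        return "".join(data)

    if __name__ == "__main__":
        tokens = lz77_encode("AABCBBABC")                # the string of Table 4-09
        print(tokens)   # [(0,0,'A'), (1,1,'B'), (0,0,'C'), (2,1,'B'), (5,2,'C')]
        assert lz77_decode(tokens) == "AABCBBABC"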
4.4.3 The LZSS Algorithm

LZ77 handles the case where no matching string exists in the window by outputting a literal character, but this solution carries redundant information. The redundancy shows up in two ways: one is the empty pointer that accompanies every literal, the other is that the character output by the encoder may well be part of the next matching string. The LZSS algorithm solves this problem more effectively. Its idea is to output a pointer only when the matching string is longer than the pointer itself (i.e. at least a minimum match length), and to output the literal character otherwise. Because the compressed stream now contains both pointers and plain characters, an extra flag bit (ID bit) is needed to distinguish them.

The LZSS encoding algorithm proceeds as follows:
(1) Set the coding position to the start of the input data stream.
(2) Find the longest matching string in the window for the look-ahead buffer:
  1 Pointer := pointer to the matching string;
  2 Length := length of the matching string.
(3) Decide whether the match length is at least the minimum match length (Length >= MIN_LENGTH).
  If "yes": output the pointer and move the coding position forward Length characters.
  If "no": output the first character of the look-ahead buffer and move the coding position forward one character.
(4) If the look-ahead buffer is not empty, return to step 2.

[Example 4.5] The string to be encoded is shown in Table 4-11 and the encoding process in Table 4-12, where:
(1) the "Step" column gives the encoding step;
(2) the "Position" column gives the coding position, the first character of the input stream being position 1;
(3) the "Match" column shows the longest matching string found in the window;
(4) the "Output" column contains either
  1 a pointer (back_chars, chars_length), which tells the decoder "go back back_chars characters in this window and copy chars_length characters to the output", or
  2 the literal character itself, when the match length is below MIN_LENGTH.

Table 4-11 Input data stream
  Position    1  2  3  4  5  6  7  8  9  10  11
  Character   A  A  B  B  C  B  B  A  A  B   C

Table 4-12 Encoding process (MIN_LENGTH = 2)
  Step   Position   Match   Output
  1      1          -       A
  2      2          A       A
  3      3          -       B
  4      4          B       B
  5      5          -       C
  6      6          BB      (3, 2)
  7      8          AAB     (7, 3)
  8      11         C       C

In the same computing environment the LZSS algorithm achieves a higher compression ratio than LZ77, and decoding remains just as simple. This is why it became the basis for new algorithms; many subsequent file compression programs adopted the LZSS idea, for example PKZIP, ARJ, LHarc, and ZOO, differing only in the pointer lengths and window sizes they use. LZSS can also be used together with entropy coding: ARJ combines it with Huffman coding, early PKZIP combined it with Shannon-Fano coding, and PKZIP's later versions also switched to Huffman coding.
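The same skeleton, modified as described above with a MIN_LENGTH threshold and a flag distinguishing pointers from literals, gives an LZSS sketch that reproduces Table 4-12 (again purely illustrative; the token format is an assumption, not the format used by PKZIP or ARJ):

    # Illustrative LZSS coder: "P" tokens are pointers, "L" tokens are literals.
    MIN_LENGTH = 2

    def lzss_encode(data, window_size=16):
        out, pos = [], 0
        while pos < len(data):
            start = max(0, pos - window_size)
            best_len, best_back = 0, 0
            for back in range(1, pos - start + 1):
                length = 0
                while (pos + length < len(data) and
                       data[pos - back + (length % back)] == data[pos + length]):
                    length += 1
                if length > best_len:
                    best_len, best_back = length, back
            if best_len >= MIN_LENGTH:
                out.append(("P", best_back, best_len))    # pointer
                pos += best_len
            else:
                out.append(("L", data[pos]))              # literal character
                pos += 1
        return out

    def lzss_decode(tokens):
        data = []
        for token in tokens:
            if token[0] == "L":
                data.append(token[1])
            else:
                _, back, length = token
                for _ in range(length):
                    data.append(data[-back])
        return "".join(data)

    if __name__ == "__main__":
        tokens = lzss_encode("AABBCBBAABC")               # the string of Table 4-11
        print(tokens)  # literals A A B B C, then ('P',3,2), ('P',7,3), literal C
        assert lzss_decode(tokens) == "AABBCBBAABC"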
4.4.4 The LZ78 Algorithm

Before describing the LZ78 algorithm, we first define the terms and symbols it uses:
(1) Character stream (Charstream): the data sequence to be encoded.
(2) Character: the basic data unit in the character stream.
(3) Prefix: the character sequence immediately preceding a character.
(4) String: a prefix followed by a character.
(5) Code word: the basic data unit in the code word stream; it represents a string in the dictionary.
(6) Code word stream (Codestream): the sequence of code words and characters, i.e. the output of the encoder.
(7) Dictionary: a table of strings; every string in it is assigned a code word according to its index number in the dictionary.
(8) Current prefix: used in the encoding algorithm for the prefix currently being processed, denoted by the symbol P.
(9) Current character: used in the encoding algorithm for the character following the current prefix, denoted by the symbol C.
(10) Current code word: used in the decoding algorithm for the code word currently being processed, denoted by W; String.W denotes the string associated with the current code word.

1. Encoding algorithm
The idea of LZ78 encoding is to keep extracting new strings from the character stream, in plain words new "entries", and then to represent each entry by a "code", i.e. a code word. Encoding the character stream thus amounts to replacing strings by code words and producing a code word stream, thereby compressing the data. At the start of encoding the dictionary is empty and contains no strings. In that case the encoder outputs a special code word (for example "0") together with the first character c of the character stream, and adds this character c to the dictionary as a string consisting of a single character; whenever a similar situation arises later, the same is done. Once the dictionary contains some strings, if "current prefix P + current character C" is already in the dictionary, the prefix is extended with the character C, and this extension is repeated until a string "prefix + character" is obtained that is not yet in the dictionary. At that point the code word representing the current prefix P and the character C are output, P + C is added to the dictionary, and processing of the next prefix in the character stream begins. The output of the LZ78 encoder is thus a sequence of code word-character pairs (W, C); each time a pair is output, the string corresponding to the code word W is extended with the character C to form a new string, which is added to the dictionary.

The LZ78 encoding algorithm is as follows:
Step 1: At the start, the dictionary and the current prefix P are empty.
Step 2: Current character C := the next character in the character stream.
Step 3: Determine whether P + C is in the dictionary:
  (1) if "yes": extend P with C, i.e. P := P + C;
  (2) if "no": 1 output the code word corresponding to the current prefix P and the current character C; 2 add the string P + C to the dictionary; 3 set P := empty.
  (3) Determine whether there are still characters in the character stream to encode:
    1 if "yes": return to step 2;
    2 if "no": if the current prefix P is not empty, output the code word corresponding to it, then end the encoding.

2. Decoding algorithm
At the start of decoding the dictionary is also empty; it is rebuilt from the code word stream during decoding. Whenever a code word-character pair (W, C) is read from the code word stream, the code word W refers to a string already in the dictionary; the string String.W associated with the current code word, followed by the character C, is output to the character stream, and the string (String.W + C) is added to the dictionary. When decoding ends, the reconstructed dictionary is identical to the dictionary built during encoding. The LZ78 decoding algorithm is as follows:
Step 1: At the start, the dictionary is empty.
Step 2: Current code word W := the next code word in the code word stream.
Step 3: Current character C := the character that follows the code word.
Step 4: Output the string String.W of the current code word to the character stream, then output the character C.
Step 5: Add String.W + C to the dictionary.
Step 6: Determine whether there are still code words in the code word stream to decode:
  (1) if "yes", return to step 2;
  (2) if "no", end.

[Example 4.6] The string to be encoded is shown in Table 4-13 and the encoding process in Table 4-14, where:
(1) the "Step" column gives the encoding step;
(2) the "Position" column gives the current position in the input data;
(3) the "Dictionary" column shows the string added to the dictionary; its index number equals the "Step" number;
(4) the "Output" column is written in the form (current code word W, current character C), i.e. (W, C).

Table 4-13 Input string
  Position    1  2  3  4  5  6  7  8  9
  Character   A  B  B  C  B  C  A  B  A

Table 4-14 Encoding process
  Step   Position   Dictionary   Output
  1      1          A            (0, A)
  2      2          B            (0, B)
  3      3          BC           (2, C)
  4      5          BCA          (3, A)
  5      8          BA           (2, A)

The greatest advantage of LZ78 is that the number of string comparisons needed in each encoding step is reduced, while its compression ratio is similar to that of LZ77.
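The LZ78 steps translate almost directly into the following Python sketch (illustrative; a production coder would bound the dictionary size and emit fixed-width code words rather than Python tuples). Applied to the string of Table 4-13 it produces the five (W, C) pairs of Table 4-14.

    # Illustrative LZ78 coder: dictionary indexes start at 1, index 0 = empty string.
    def lz78_encode(data):
        dictionary = {}                    # string -> index
        out, prefix = [], ""
        for ch in data:
            if prefix + ch in dictionary:
                prefix += ch               # keep extending the current prefix
            else:
                out.append((dictionary.get(prefix, 0), ch))
                dictionary[prefix + ch] = len(dictionary) + 1
                prefix = ""
        if prefix:                         # flush a trailing prefix, if any
            out.append((dictionary[prefix], ""))
        return out

    def lz78_decode(tokens):
        dictionary = {0: ""}
        out = []
        for index, ch in tokens:
            entry = dictionary[index] + ch
            out.append(entry)
            dictionary[len(dictionary)] = entry
        return "".join(out)

    if __name__ == "__main__":
        tokens = lz78_encode("ABBCBCABA")                 # Table 4-13
        print(tokens)   # [(0,'A'), (0,'B'), (2,'C'), (3,'A'), (2,'A')]
        assert lz78_decode(tokens) == "ABBCBCABA"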
4.4.5 The LZW Algorithm

The LZW algorithm uses the same terminology as LZ78 and adds only one term, the prefix root (Root), meaning a string consisting of a single character. In its coding principle LZW differs from LZ78 in the following ways:
1 LZW outputs only code words, i.e. the indexes of strings in the dictionary. This means that the dictionary cannot be empty at the start; it must already contain all the single characters that may appear in the character stream, that is, all the prefix roots (Roots).
2 Since all possible single characters are already in the dictionary, each encoding step begins with a one-character prefix (OneCharacterPrefix), so the first new string added to the dictionary always consists of two characters.

The LZW encoding and decoding algorithms are described below.

1. Encoding algorithm
LZW encoding is organized around a conversion table called the dictionary. This table stores character sequences called prefixes (Prefix) and assigns each entry a code word (Code Word), or index number, as shown in Table 4-15. In effect the table extends the 8-bit ASCII character set: the added symbols represent variable-length ASCII strings that occur in the text or image. The extended codes can be represented with 9, 10, 11, 12, or more bits. Welch's implementation uses 12 bits; 12 bits can express 4096 different codes, so the conversion table holds 4096 entries, of which 256 are used for the predefined single characters and 3840 are available for the prefix strings.

Table 4-15 Dictionary
  Code word   Prefix
  1           ...
  ...         ...

The LZW encoder (whether implemented in software or hardware) performs the conversion between input and output by managing this dictionary. The input of the LZW encoder is a character stream, for example a string of 8-bit ASCII characters, and the output is a stream of code words of n bits (e.g. 12 bits); each code word represents a single character or a string of several characters. The LZW encoder uses a very practical analysis method called the greedy parsing algorithm. In greedy parsing, each analysis step examines the characters of the character stream sequentially and splits off the longest string that is already recognized, i.e. the longest prefix (Prefix) already present in the dictionary.
This known prefix (Prefix) plus the next input character C, the current character (Current Character), forms a new extended string Prefix + C. Whether this new string should be added to the dictionary depends on whether it is already there: if it is, the string becomes the new prefix and the next character is read; if it is not, the string is written into the dictionary as a new prefix (Prefix) and assigned a code word.

The LZW encoding algorithm proceeds as follows:
Step 1: At the start the dictionary contains all possible roots (Roots) and the current prefix P is empty.
Step 2: Current character C := the next character in the character stream.
Step 3: Determine whether the string P + C is in the dictionary:
  (1) if "yes": P := P + C (extend P with C);
  (2) if "no": 1 output the code word representing the current prefix P to the code word stream; 2 add the string P + C to the dictionary; 3 set P := C (P now contains only the single character C).
Step 4: Determine whether there are still characters in the character stream to encode:
  (1) if "yes", return to step 2;
  (2) if "no": 1 output the code word representing the current prefix P to the code word stream; 2 end.

The LZW encoding algorithm can also be written in pseudocode. At the start the dictionary is assumed to contain the n possible single-character strings, for example the 256 characters of the extended ASCII set:

  Dictionary[j] := the j-th single character, j = 1, 2, ..., n
  j := n + 1
  Prefix := read first character from Charstream
  while ((C := next character) != NULL)
      if Prefix + C is in Dictionary then Prefix := Prefix + C
      else output the code word for Prefix; Dictionary[j] := Prefix + C; j := j + 1; Prefix := C
  output the code word for Prefix

2. Decoding algorithm
The LZW decoding algorithm uses two further terms:
1 Current code word (Current Code Word): the code word currently being processed, denoted CW; String.CW denotes the current string, i.e. the one associated with CW.
2 Previous code word (Previous Code Word): the code word processed just before the current one, denoted PW; String.PW denotes the previous string.
At the start of LZW decoding the dictionary is the same as at the start of encoding: it contains all possible prefix roots (Roots). During decoding the algorithm remembers the previous code word (PW); after reading the current code word (CW) from the code word stream it outputs the current string String.CW, and then adds to the dictionary the previous string String.PW extended with the first character of String.CW.

The LZW decoding algorithm proceeds as follows:
Step 1: At the start of decoding the dictionary contains all possible prefix roots (Roots).
Step 2: CW := the first code word in the code word stream.
Step 3: Output the current string String.CW to the character stream.
Step 4: Previous code word PW := current code word CW.
Step 5: Current code word CW := the next code word in the code word stream.
Step 6: Determine whether the current string String.CW is in the dictionary:
  (1) if "yes": 1 output String.CW to the character stream; 2 add the string formed by String.PW plus the first character of String.CW to the dictionary;
  (2) if "no": 1 output to the character stream the string formed by String.PW plus the first character of String.PW; 2 add this same string to the dictionary.
In either case, if there are more code words in the code word stream, return to step 4.
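Putting the encoding and decoding steps together gives the following LZW sketch (an illustration under the assumption that the roots are the 256 byte values and that code words are kept as plain integers rather than packed 12-bit fields, as in Welch's description):

    # Illustrative LZW coder: the dictionary starts with all single characters.
    def lzw_encode(data):
        dictionary = {chr(i): i for i in range(256)}      # the "roots"
        out, prefix = [], ""
        for ch in data:
            if prefix + ch in dictionary:
                prefix += ch
            else:
                out.append(dictionary[prefix])
                dictionary[prefix + ch] = len(dictionary) # new string, next code
                prefix = ch
        if prefix:
            out.append(dictionary[prefix])
        return out

    def lzw_decode(codes):
        dictionary = {i: chr(i) for i in range(256)}
        prev = dictionary[codes[0]]
        out = [prev]
        for code in codes[1:]:
            if code in dictionary:
                entry = dictionary[code]
            else:                         # the "not yet in dictionary" case of step 6
                entry = prev + prev[0]
            out.append(entry)
            dictionary[len(dictionary)] = prev + entry[0]
            prev = entry
        return "".join(out)

    if __name__ == "__main__":
        codes = lzw_encode("ABABABA")
        print(codes)                      # [65, 66, 256, 258]
        assert lzw_decode(codes) == "ABABABA"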