Garbled-Code Algorithms
Preface
Anyone who has spent time online has surely run into "garbled text": the unrecognizable characters that sometimes appear when browsing the web or reading email. Many articles have already been written about garbled text, but they only explain how to recognize each encoding and decode it with tools; none of them describes in detail how the encoding algorithms are implemented. This article explains the encoding and decoding algorithms of the encodings most commonly used on the Internet. I hope it will be a useful reference for readers who want to understand these "garbled text" algorithms or implement them in their own programs. The source code in this article is written in C and organized as functions that can be used directly.
I. Common Encodings

1. UUencode

UUencode converts binary data into text so that binary files can be transmitted and exchanged over text-only transports. It is used heavily in mail systems and binary newsgroups, usually to attach binary files. Its distinguishing feature is that each full line starts with the letter "M". Below is a test file of mine, mogao.txt, encoded with UUencode:

begin 644 mogao.txt
m "0D) (" `@ (* & vpm " z / ocmzbt // bkh; <- "@ g7]] 7? .FUO9V% ohzrpu] 3 & n /: zmu] 6 ^ hzat96QN970Z R / R, # (N, 3 $ R C (p c $ s, chr, z.il / g4l: & c # 0h) ("` @ M ("` @ q * jxw / cmo / zyi-? WRM * cnfat = '`z r] m; v = a; ryb96yt: 75n fye =` t * m "0D) 16UA: 6QT; SIM; V = A; T`ss $ n; f5t # 0h) ("` @ * bhj * bhj * bhj * bhj * bhjm * bhj * bhj * bhj * bhj * bhj * bhj * bhj * bhj * bhj * bhj ("` @ ("` @ ("` @ ("` @ ("` @ M # 0H) ("` @ * b "s _ <' O,? 2Y, JRP [2VO * [m / c7wz.ll_w! R] ? Ca * / * ll.tmkrrmn / 'ts / ("j # 0h) (" `@ * bhj * bhj * bhj * bhj * bhj * bhj * bhj * bhj * bhj * bhj * bhj, * bhj * bhj * bhj * BHJ
`
end
You can save it as a separate file, mogao.uue, open it with WinZip, and extract mogao.txt.

The UUencode algorithm is simple. It places three bytes at a time, in order, into a 24-bit buffer, filling any missing positions with zeros. It then splits the buffer into four 6-bit parts, high-order bits first, and represents each part with one of the following 64 characters:

`!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

The file begins with the line "begin xxx filename" (xxx is the file mode, such as 644, and filename is the name of the encoded file) and ends with the line "end"; these two lines mark the start and end of the UUE data.

When encoding, read 45 bytes of the source file at a time. If fewer than 45 bytes are read, pad with NULL bytes up to a multiple of 3 (for example, 23 bytes are padded to 24). Then write to the target file one character whose ASCII value is 32 plus the number of bytes actually read, as the first character of the line, followed by the encoded characters and a newline. This is why a full line starts with "M": 32 + 45 = 77, the ASCII code of "M". When the source file is exhausted, write "`" (ASCII 96) and a newline to mark the end of the encoded data.

Decoding converts each group of 4 characters back into four 6-bit values, takes the useful low six bits of each, and puts them into a 24-bit buffer, which yields 3 bytes of binary data. Here are the UUencode encoding and decoding routines in C:

    /* UUencode encoding */
    void uue(unsigned char chasc[3], unsigned char chuue[4])
    /* chasc: the 3 unencoded binary bytes
       chuue: receives the 4 encoded UUE characters */
    {
        int i, k = 2;
        unsigned char t = 0;   /* bits carried over from the previous byte */

        for (i = 0; i < 3; i++) {
            *(chuue + i) = *(chasc + i) >> k;
            *(chuue + i) |= t;
            /* a 6-bit value of 0 is written as '`' (96), not as a space */
            if (*(chuue + i) == 0) *(chuue + i) += 96;
            else *(chuue + i) += 32;
            t = *(chasc + i) << (8 - k);
            t >>= 2;
            k += 2;
        }
        *(chuue + 3) = *(chasc + 2) & 63;
        if (*(chuue + 3) == 0) *(chuue + 3) += 96;
        else *(chuue + 3) += 32;
    }
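The line-level procedure described above (a length character, then 4 output characters for every 3 input bytes, with zero padding for a short final group) can be sketched as a single function. This is a minimal sketch rather than the author's program: the names uu_char and uu_encode_line are assumptions made for illustration, and writing the "begin"/"end" lines and the closing "`" line is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map a 6-bit value to its UUencode character: 0 becomes '`' (96),
   any other value v becomes the character with code v + 32. */
static char uu_char(unsigned v)
{
    v &= 63;
    return (char)(v ? v + 32 : 96);
}

/* Encode up to 45 bytes from src into one text line in dst.
   dst must hold at least 63 chars (1 length char + 60 + '\n' + '\0').
   Returns the number of characters written, not counting the '\0'. */
size_t uu_encode_line(const unsigned char *src, size_t n, char *dst)
{
    size_t i, o = 0;

    assert(n <= 45);                      /* one line carries at most 45 bytes */
    dst[o++] = uu_char((unsigned)n);      /* length character: 32 + n */
    for (i = 0; i < n; i += 3) {
        size_t k = (n - i < 3) ? n - i : 3;   /* real bytes in this group */
        unsigned char g[3] = {0, 0, 0};       /* missing bytes stay zero */
        memcpy(g, src + i, k);
        dst[o++] = uu_char(g[0] >> 2);
        dst[o++] = uu_char(((g[0] & 3u) << 4) | (g[1] >> 4));
        dst[o++] = uu_char(((g[1] & 15u) << 2) | (g[2] >> 6));
        dst[o++] = uu_char(g[2] & 63u);
    }
    dst[o++] = '\n';
    dst[o] = '\0';
    return o;
}
```

For example, encoding the three bytes "Cat" produces the line "#0V%T": "#" is the character with code 32 + 3, and "0V%T" are the four 6-bit values 16, 54, 5 and 52, each offset by 32. Encoding zero bytes produces the single character "`", which is the end-of-data line.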
The decoding routine:

    /* UUencode decoding */
    void unuue(unsigned char chuue[4], unsigned char chasc[3])
    /* chuue: the 4 UUE characters to decode (overwritten with their 6-bit values)
       chasc: receives the 3 decoded binary bytes */
    {
        int i, k = 2;
        unsigned char t = 0;

        if (*chuue == 96) *chuue = 0;
        else *chuue -= 32;
        for (i = 0; i < 3; i++) {
            *(chasc + i) = *(chuue + i) << k;
            k += 2;
            if (*(chuue + i + 1) == 96) *(chuue + i + 1) = 0;
            else *(chuue + i + 1) -= 32;
            t = *(chuue + i + 1) >> (8 - k);
            *(chasc + i) |= t;
        }
    }

2. XXencode

Having covered UUencode, XXencode has to be mentioned as well. Its encoding algorithm is basically the same as UUencode's; it just uses a different character set. The characters used by XXencode are:

+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

Compared with UUencode, it uses fewer special characters. Many tools that support UUencode also support XXencode. Its distinguishing feature is that each full line starts with the letter "h". Here is an example of XXencode:

begin 644 mogao.txt
h0EY760+U684qkh90uwjXhuWowwWfcPQB0UbLxxLTCapjNq3jcumkpxH4iwOu
hpxKycuVoNKliNLEu9mwmA16iAH2m9X6k9X2nAXcmAuCdgwbIgO4X1Ec760+U
h60+Ul8esrwXhjDutdBTrmh8XiaVoR5+u9mxhPqRVPmtWNKtoOLJi9atZR+o8
h0EY7FKpVOKloPndhPqRVPo+nBn2iPaJo1Ec760+U8Wce8Wce8Wce8Wce8Wce
h8Wce8Wce8Wce8Wce8Wce8Wce8Wce8Wce8Wce8Wce60+U60+U60+U60+U60+U
h1Ec760+U8W0nzQ59jATGtAemkvGqj98vhDXLruCggzr-mxTXj8D8ggCohfmm
hiw5onw6e1Ec760+U8Wce8Wce8Wce8Wce8Wce8Wce8Wce8Wce8Wce8Wce8Wce
A8Wce8Wce8Wce8Wce
+
end
You can save it as a separate file, mogao.xxe, open it with WinZip, and extract mogao.txt. Since the XXencode and UUencode algorithms are basically the same, and XXencode is even simpler to implement, I will not describe it in detail here. The XXencode encoding and decoding routines in C follow:

    /* XXencode encoding */
    void xxe(unsigned char chasc[3], unsigned char chxxe[4])
    /* chasc: the 3 unencoded binary bytes
       chxxe: receives the 4 encoded XXE characters */
    {
        int i;
        static char set[] =
            "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

        chxxe[0] = chasc[0] >> 2;
        chxxe[1] = ((chasc[0] << 4) & 48) | ((chasc[1] >> 4) & 15);
        chxxe[2] = ((chasc[1] << 2) & 60) | ((chasc[2] >> 6) & 3);
        chxxe[3] = chasc[2] & 63;
        for (i = 0; i < 4; i++)
            chxxe[i] = set[chxxe[i]];   /* table lookup */
    }
    /* Note: the first character of each line in the body of an XXencode file
       is obtained by looking up, in set[], the number of bytes actually read
       from the source file for that line. */

    /* XXencode decoding */
    unsigned char set(unsigned char ch)   /* reverse table lookup */
    {
        if (ch == 43) ch = 0;                        /* '+' */
        else if (ch == 45) ch = 1;                   /* '-' */
        else if (ch >= 48 && ch <= 57) ch -= 46;     /* '0'..'9' ->  2..11 */
        else if (ch >= 65 && ch <= 90) ch -= 53;     /* 'A'..'Z' -> 12..37 */
        else if (ch >= 97 && ch <= 122) ch -= 59;    /* 'a'..'z' -> 38..63 */
        return ch;
    }

    void unxxe(unsigned char chxxe[4], unsigned char chasc[3])
    /* chxxe: the 4 XXE characters to decode (overwritten with their 6-bit values)
       chasc: receives the 3 decoded binary bytes */
    {
        int k = 2, i;
        unsigned char t = 0;

        *chxxe = set(*chxxe);
        for (i = 0; i < 3; i++) {
            *(chxxe + i + 1) = set(*(chxxe + i + 1));
            *(chasc + i) = *(chxxe + i) << k;
            k += 2;
            t = *(chxxe + i + 1) >> (8 - k);
            *(chasc + i) |= t;
        }
    }
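Because UUencode and XXencode differ only in their character tables, the 3-byte to 4-character step can be written once and parameterized by the table. The following is a minimal sketch of that idea, not the author's code: UU_SET, XX_SET, encode_group and decode_group are names assumed here for illustration, and the decoder is deliberately strict (it does not accept a space as an alternative to "`" in UUencode data, which some decoders do).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* In each table, the character at index i represents the 6-bit value i. */
static const char UU_SET[] =
    "`!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
static const char XX_SET[] =
    "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

/* Encode 3 bytes as 4 characters taken from a 64-character table. */
void encode_group(const unsigned char in[3], const char *set, char out[4])
{
    assert(strlen(set) == 64);
    out[0] = set[in[0] >> 2];
    out[1] = set[((in[0] & 3) << 4) | (in[1] >> 4)];
    out[2] = set[((in[1] & 15) << 2) | (in[2] >> 6)];
    out[3] = set[in[2] & 63];
}

/* Decode 4 characters back into 3 bytes.
   Returns 0 on success, -1 if a character is not in the table. */
int decode_group(const char in[4], const char *set, unsigned char out[3])
{
    unsigned v[4];
    int i;

    for (i = 0; i < 4; i++) {
        /* strchr would also match the terminating '\0', so rule it out */
        const char *p = in[i] ? strchr(set, in[i]) : NULL;
        if (p == NULL)
            return -1;
        v[i] = (unsigned)(p - set);
    }
    out[0] = (unsigned char)((v[0] << 2) | (v[1] >> 4));
    out[1] = (unsigned char)(((v[1] & 15) << 4) | (v[2] >> 2));
    out[2] = (unsigned char)(((v[2] & 3) << 6) | v[3]);
    return 0;
}
```

With these tables, the bytes "Cat" encode to "0V%T" under UU_SET and to "Eq3o" under XX_SET; both come from the same four 6-bit values 16, 54, 5 and 52.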
3. Base64

Base64 and the Quoted-Printable encoding described below both belong to MIME (Multipurpose Internet Mail Extensions), a coding standard for multipart, multimedia email and WWW hypertext that is used to transmit non-text data such as graphics, sound and fax. MIME is defined in RFC 1341. Base64 is the most widely used encoding today: almost all email software uses it as the default binary encoding, and it has become practically synonymous with email encoding. Here is an example of Base64, which also shows how closely Base64 and email are tied together:

Content-Type: text/plain; charset="cn-gb"
Content-Transfer-Encoding: base64

CQkJICAgIKG2wtLC68vjt6i088irobcNCgnX99XfOm1vZ2Fvo6yw19TGu8a619W+o6h0ZWxuZXQ6Ly8yMDIuMTEyLjIwLjEzMjoyM6Ops8nUsaGjDQoJICAgICAgxKq438jtvP65pNf3ytKjumh0dHA6Ly9tb2dhby5iZW50aXVuLm5ldA0KCQkJRW1haWx0bzptb2dhb0AzNzEubmV0DQoJICAgKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqICAgICAgICAgICAgICAgDQoJICAgKiCz / CHLVMFS5MQYW7S2VLK7TPJX36OSS / 3BY9FJVKPKSSO0TRYYU8H0Z8IQDQOJICAGKIOQKIOQKIOQKIOQKIOQKIOQKIOQKIOQKIOQKIOQKIOQKIOQKIOQKIOQKIOQQKIOQKIOQKIOQKIOQKIOQQKIOQKIOQKIOQKIOQKIOQ
You can save it as a separate file, for example mogao.eml, and double-click it to open it with Outlook (the first two lines are header information; the encoded content starts on the fourth line).

The Base64 algorithm is close to the UUencode algorithm and just as simple. It places the input bytes, in order, into a 24-bit buffer, filling any missing positions with zeros. The buffer is then cut into four 6-bit parts, high-order bits first, and each part is represented by one of the following 64 characters:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

If only one or two bytes remain at the end of the input, the output is padded with the equals sign "=". This marks where the data ends, so that anything appended afterwards cannot confuse the decoder. Encoded lines are generally 76 characters long. Here are the Base64 encoding and decoding routines in C:

    /* Base64 encoding */
    void base64(unsigned char chasc[3], unsigned char chuue[4])
    /* chasc: the 3 unencoded binary bytes
       chuue: receives the 4 encoded Base64 characters */
    {
        int i, k = 2;
        unsigned char t = 0;   /* bits carried over from the previous byte */

        for (i = 0; i < 3; i++) {
            *(chuue + i) = *(chasc + i) >> k;
            *(chuue + i) |= t;
            t = *(chasc + i) << (8 - k);
            t >>= 2;
            k += 2;
        }
        *(chuue + 3) = *(chasc + 2) & 63;

        /* map each 6-bit value to its character */
        for (i = 0; i < 4; i++)
            if (*(chuue + i) <= 25) *(chuue + i) += 65;            /* 'A'..'Z' */
            else if (*(chuue + i) <= 51) *(chuue + i) += 71;       /* 'a'..'z' */
            else if (*(chuue + i) <= 61) *(chuue + i) -= 4;        /* '0'..'9' */
            else if (*(chuue + i) == 62) *(chuue + i) = 43;        /* '+' */
            else if (*(chuue + i) == 63) *(chuue + i) = 47;        /* '/' */
    }

    /* Base64 decoding */
    void unbase64(unsigned char chuue[4], unsigned char chasc[3])
    /* chuue: the 4 Base64 characters to decode (overwritten with their 6-bit values)
       chasc: receives the 3 decoded binary bytes */
    {
        int i, k = 2;
        unsigned char t = 0;

        /* map each character back to its 6-bit value */
        for (i = 0; i < 4; i++)
            if ((*(chuue + i) >= 65) && (*(chuue + i) <= 90)) *(chuue + i) -= 65;
            else if ((*(chuue + i) >= 97) && (*(chuue + i) <= 122)) *(chuue + i) -= 71;
            else if ((*(chuue + i) >= 48) && (*(chuue + i) <= 57)) *(chuue + i) += 4;
            else if (*(chuue + i) == 43) *(chuue + i) = 62;
            else if (*(chuue + i) == 47) *(chuue + i) = 63;
            else if (*(chuue + i) == 61) *(chuue + i) = 0;         /* '=' padding */

        for (i = 0; i < 3; i++) {
            *(chasc + i) = *(chuue + i) << k;
            k += 2;
            t = *(chuue + i + 1) >> (8 - k);
            *(chasc + i) |= t;
        }
    }
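The routines above work on one complete 3-byte group at a time, so the "=" padding for a final group of one or two bytes has to be handled around them. A minimal sketch of a whole-buffer encoder that does this is shown below; the names B64_SET and b64_encode are assumptions made for illustration, and breaking the output into 76-character lines is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char B64_SET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode n bytes from src as Base64 into dst, padding the last group
   with '='. dst must hold 4 * ((n + 2) / 3) + 1 chars.
   Returns the number of characters written, not counting the '\0'. */
size_t b64_encode(const unsigned char *src, size_t n, char *dst)
{
    size_t i, o = 0;

    assert(dst != NULL);
    for (i = 0; i < n; i += 3) {
        size_t k = (n - i < 3) ? n - i : 3;   /* real bytes in this group */
        unsigned char g[3] = {0, 0, 0};       /* missing bytes stay zero */
        memcpy(g, src + i, k);
        dst[o++] = B64_SET[g[0] >> 2];
        dst[o++] = B64_SET[((g[0] & 3) << 4) | (g[1] >> 4)];
        /* characters that would carry only padding bits become '=' */
        dst[o++] = (k > 1) ? B64_SET[((g[1] & 15) << 2) | (g[2] >> 6)] : '=';
        dst[o++] = (k > 2) ? B64_SET[g[2] & 63] : '=';
    }
    dst[o] = '\0';
    return o;
}
```

With this sketch, "Man" encodes to "TWFu", "Ma" to "TWE=", and "M" to "TQ==": one missing byte costs one "=", two missing bytes cost two.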
4. Quoted-Printable

Quoted-Printable, QP for short, is generally used in email systems. It is usually applied to text that contains only a small number of 8-bit characters; Foxmail, for example, uses it to encode the subject and the alias. This encoding is easy to recognize: it contains a lot of "=" signs. Here is an example:

MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
=A1=B6=C2=D2=C2=EB=CB=E3=B7=A8=B4=F3=C8=AB=A1=B7
=D7=F7=D5=DF:mogao=A3=AC=B0=D7=D4=C6=BB=C6=BA=D7=D5=BE=A3=A8telnet://202.112.20.132:23=A3=A9=B3=C9=D4=B1=A1=A3
=C4=AA=B8=DF=C8=ED=BC=FE=B9=A4=D7=F7=CA=D2=A3=BAhttp://mogao.bentiun.net
emailto:mogao@371.net
*********************************************
* =B3=FD=C1=CB=BC=C7=D2=E4=CA=B2=C3=B4=B6=BC=B2=BB=B4=F8=D7=DF=A3=AC=B3=FD=C1=CB=D7=E3=BC=A3=CA=B2=C3=B4=B6=BC=B2=BB=C1=F4=CF=C2 *
*********************************************

You can save it as a separate file, for example mogao.eml, and double-click it to open it with Outlook (the first two lines are header information; the encoded content starts on the fourth line).

QP's algorithm is arguably the simplest of all, and its coding efficiency is the lowest (each encoded byte expands 1:3). It is designed specifically for handling 8-bit characters. The algorithm is: read a character; if its ASCII code is greater than 127, that is, if its eighth (highest) bit is 1, encode it as "=" followed by the two hexadecimal digits of its value; otherwise leave it unchanged (some encoders also encode certain 7-bit characters). The code is very simple; see the following C:

    /* QP encoding */
    #include <stdio.h>

    void qp(unsigned char sour, unsigned char *first, unsigned char *second)
    /* sour:   the character to encode
       first:  receives the first character of the code
       second: receives the second character of the code */
    {
        if (sour > 127) {
            *first = sour >> 4;      /* high nibble */
            *second = sour & 15;     /* low nibble */
            if (*first > 9) *first += 55;    /* 10..15 -> 'A'..'F' */
            else *first += 48;               /*  0..9  -> '0'..'9' */
            if (*second > 9) *second += 55;
            else *second += 48;
            printf("%c%c%c", '=', *first, *second);
        }
    }
    /* QP decoding */
    void uqp(unsigned char *sour, unsigned char first, unsigned char second)
    /* sour:   receives the decoded character
       first:  the first character of the QP code (after the "=")
       second: the second character of the QP code */
    {
        if (first >= 65) first -= 55;    /* 'A'..'F' -> 10..15 */
        else first -= 48;                /* '0'..'9' ->  0..9  */
        if (second >= 65) second -= 55;
        else second -= 48;
        *sour = first << 4;
        *sour |= second;
    }

Now everyone can see why QP's coding efficiency is so low! See RFC 2045 for the detailed description and precise definition of QP.
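The per-character routines above can be wrapped into whole-string functions. The sketch below is a simplified illustration, not a complete RFC 2045 implementation: the names HEX, qp_encode and qp_decode are assumed for this example, the encoder additionally escapes "=" itself as "=3D" (which RFC 2045 requires so that decoding is unambiguous), and soft line breaks and the 76-character line limit are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char HEX[] = "0123456789ABCDEF";

/* Encode n bytes: 8-bit bytes and '=' become "=XX", everything else is
   copied unchanged. dst must hold 3 * n + 1 chars.
   Returns the output length, not counting the '\0'. */
size_t qp_encode(const unsigned char *src, size_t n, char *dst)
{
    size_t i, o = 0;

    assert(dst != NULL);
    for (i = 0; i < n; i++) {
        if (src[i] > 127 || src[i] == '=') {
            dst[o++] = '=';
            dst[o++] = HEX[src[i] >> 4];
            dst[o++] = HEX[src[i] & 15];
        } else {
            dst[o++] = (char)src[i];
        }
    }
    dst[o] = '\0';
    return o;
}

/* Decode a '\0'-terminated QP string into dst, which must hold
   strlen(src) bytes. A '=' not followed by two uppercase hex digits
   is copied unchanged. Returns the number of bytes written. */
size_t qp_decode(const char *src, unsigned char *dst)
{
    size_t o = 0;

    assert(src != NULL && dst != NULL);
    while (*src) {
        const char *hi, *lo;
        if (src[0] == '=' && src[1] && src[2] &&
            (hi = strchr(HEX, src[1])) != NULL &&
            (lo = strchr(HEX, src[2])) != NULL) {
            dst[o++] = (unsigned char)(((hi - HEX) << 4) | (lo - HEX));
            src += 3;
        } else {
            dst[o++] = (unsigned char)*src++;
        }
    }
    return o;
}
```

For instance, the five bytes 0xA1 0xB6 'a' '=' 'b' encode to the 11-character string "=A1=B6a=3Db", and decoding that string gives the original five bytes back.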
II. Chinese Character Encodings

1. GB code and Big5 code

The GB code is the Chinese character encoding used in mainland China, Singapore and other countries and regions. The Big5 code is the Chinese character encoding used in Taiwan. The two encodings are completely different, and conversion between them can only be done by table lookup. The conversion itself is therefore simple; the difficult part is generating the tables. Many articles have already covered this, so I will not go into detail here. My homepage has the source code of the "Chinese Character Transcoder V1.0" that I wrote, including these two tables, which can be used directly.
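The table lookup itself can be sketched in a few lines. The example below is an illustration only and is not taken from the author's transcoder: the struct gb_big5, TABLE, cmp_gb and gb_to_big5 names are assumptions, and the table holds just two entries (the characters of "中文"), whereas a real table has thousands of entries. It keeps the table sorted by GB code so that the standard bsearch function can be used.

```c
#include <assert.h>
#include <stdlib.h>

/* One table entry: a two-byte GB code and the Big5 code of the same character. */
struct gb_big5 {
    unsigned short gb;
    unsigned short big5;
};

/* Tiny illustrative table, sorted by GB code. */
static const struct gb_big5 TABLE[] = {
    { 0xCEC4, 0xA4E5 },   /* U+6587 */
    { 0xD6D0, 0xA4A4 },   /* U+4E2D */
};

/* Comparison callback for bsearch: key is a GB code, elem a table entry. */
static int cmp_gb(const void *key, const void *elem)
{
    unsigned short k = *(const unsigned short *)key;
    unsigned short e = ((const struct gb_big5 *)elem)->gb;
    return (k > e) - (k < e);
}

/* Return the Big5 code for a two-byte GB code, or 0 if it is not in the table. */
unsigned short gb_to_big5(unsigned short gb)
{
    const struct gb_big5 *hit;

    assert((gb & 0x8080) == 0x8080);   /* both GB bytes have the high bit set */
    hit = bsearch(&gb, TABLE, sizeof TABLE / sizeof TABLE[0],
                  sizeof TABLE[0], cmp_gb);
    return hit ? hit->big5 : 0;
}
```

The reverse direction works the same way with a second table sorted by Big5 code, which is why the author speaks of two tables.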
2. HZ code

The HZ code was defined so that Chinese text can pass through mail servers and gateways that transmit only 7-bit data; it is an encoding defined on top of the commonly used Chinese character code. Like the Quoted-Printable encoding described above, it encodes only 8-bit characters; 7-bit characters are passed through unchanged. This encoding is also easy to recognize: it contains many "~{" and "~}" markers, which always appear in pairs. Here is an example of HZ code:

~{!6BRBkKc7(4sH+!7~}
~{WwU_~}:mogao~{#,0WTF;F:WU>#(~}telnet://202.12.20.132:23~{#)3IT1!#~}
~{D*8_Hm<~9$WwJR#:~}http://mogao.bentiun.net
emailto:mogao@371.net
*********************************************
* ~{3}AK

You can open this text with "Antarctic Star" (NJStar) to view it. The algorithm is even simpler: read a character, and if it is an 8-bit character, clear its highest bit. Each run of consecutive 8-bit characters, with the highest bits cleared, is enclosed between "~{" and "~}" in the output. Decoding is the reverse: set the eighth bit of every character between "~{" and "~}" back to 1.

Conversion among the three codes described above comes up often. The "Chinese Character Transcoder V1.0" that I wrote converts easily among all three, and I have made its source code public for other netizens to study.

III. Other Commonly Used Encodings

1. Unicode

The most typical application of Unicode is the HTML encoding in IE4 and later versions. It could be called the single unifying character set under Windows. But it is still far from perfect: Win95 and Win98 support it only to a very limited extent, and it does not even have a complete set of standards. However, Microsoft's latest Office 2000 and the soon-to-be-released Windows 2000 will fully support Unicode. Unicode replacing the other codes is an inevitable trend. In the next couple of years, though, Unicode will not become dominant, and even after it does, the other codes will not die out immediately, because operating systems differ.
Chinese-language information about Unicode can be found in the documentation of Office 2000 and Windows 2000, and its official website is http://www.unicode.org/.

2. BinHex

BinHex is an encoding used on Macintosh computers (commonly known as "Apple computers") to represent and transmit binary files with printable characters. Its main use is attaching binary files in email programs. Most email programs do not support this format (Eudora does), but WinZip can decode it. For details, please consult the relevant Macintosh documentation.

IV. Summary

Because of space limitations, this article covers only the common encodings; there are many others, including the various encryption algorithms that also produce "garbled" output (I will introduce those in another article). All documents and source programs mentioned in this article can be downloaded from my homepage: http://mogao.bentium.net. If you have any comments on this article, please write to me; my e-mail address is mogao@371.net.

Note: the example file "mogao.txt" used in this article contains the following text:

"A Complete Collection of Garbled-Code Algorithms"
Author: mogao, member of the Baiyun Huanghe station (telnet://202.112.20.132:23).
Mogao Software Studio: http://mogao.bentiun.net
emailto:mogao@371.net
*********************************************
* Take nothing but memories, leave nothing but footprints *
*********************************************