Introduction to MIME Encoding
Subject: =?GB2312?B?xOO6w6Oh?=
This is the subject line of an email, but because it is encoded we cannot read it directly. Its original text is "你好!" ("Hello!"). Let's first look at the two MIME encoding methods.
Mail was originally encoded because many gateways on the Internet could not correctly transmit 8-bit characters, such as Chinese characters. The principle of the encoding is to convert the 8-bit content into 7-bit form so that it can be transferred correctly, and to restore the 8-bit content once the recipient has received it.
MIME stands for Multipurpose Internet Mail Extensions. Before MIME, mail was encoded with UUencode and similar schemes, but because the MIME algorithms are simple and easy to extend, MIME has become the mainstream mail encoding. It is used not only to transmit 8-bit characters but also to transmit binary files, such as images and audio in mail attachments, and many applications have been built on top of it. In terms of encoding, MIME defines two methods, Base64 and QP (Quoted-Printable):
Base64 is a general-purpose method with a simple principle: it represents every three bytes of data as four bytes, so that each of these four bytes carries only 6 bits of data. This avoids the problem of being able to transmit only 7-bit characters. Base64 is usually abbreviated as "B", and the Subject in this message uses Base64 encoding.
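The three-bytes-to-four regrouping can be illustrated with a short sketch. The helper name base64_group is hypothetical and only handles one complete group of exactly three bytes (no padding); in practice PHP's built-in base64_encode() does this work, and the last line checks the sketch against it.

```php
<?php
// Hypothetical helper: encode exactly three bytes as four Base64 characters.
function base64_group(string $three): string {
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
    // Pack the three bytes into one 24-bit integer.
    $n = (ord($three[0]) << 16) | (ord($three[1]) << 8) | ord($three[2]);
    $out = '';
    // Cut the 24 bits into four 6-bit values, most significant first,
    // and map each value to a 7-bit-safe character.
    for ($shift = 18; $shift >= 0; $shift -= 6) {
        $out .= $alphabet[($n >> $shift) & 0x3F];
    }
    return $out;
}

// The first three GB2312 bytes of the example subject.
$bytes = "\xC4\xE3\xBA";
echo base64_group($bytes), "\n";                          // xOO6
var_dump(base64_group($bytes) === base64_encode($bytes)); // bool(true)
```

Each output character comes from a 64-character alphabet, which is why it needs only 6 bits and survives a 7-bit transport.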
The other method is QP (Quoted-Printable), usually abbreviated as "Q". Its principle is to represent each 8-bit character as two hexadecimal digits preceded by "=". QP-encoded text therefore typically looks like this: =B3=C2=BF=A1=C7=E5=A3=AC=C4=FA=BA=C3=A3=A1.
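As a rough illustration of the "=" plus two hexadecimal digits rule, the sketch below escapes every byte of its input. The helper name qp_escape_all is hypothetical; a real Quoted-Printable encoder leaves most printable ASCII characters unescaped and wraps long lines, which this sketch does not attempt.

```php
<?php
// Hypothetical helper: write every byte as "=" followed by two uppercase hex digits.
function qp_escape_all(string $bytes): string {
    $out = '';
    for ($i = 0; $i < strlen($bytes); $i++) {
        $out .= sprintf('=%02X', ord($bytes[$i]));
    }
    return $out;
}

$raw = "\xC4\xFA\xBA\xC3\xA3\xA1";   // the last six bytes of the example above
$encoded = qp_escape_all($raw);
echo $encoded, "\n";                                    // =C4=FA=BA=C3=A3=A1
var_dump(quoted_printable_decode($encoded) === $raw);   // bool(true)
```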
PHP provides two functions that make decoding easy: base64_decode() and quoted_printable_decode(). The former decodes Base64, and the latter decodes QP.
Now let's look again at the content of Subject: =?GB2312?B?xOO6w6Oh?=. The header is not encoded as a whole; only part of it is, and that part is enclosed between the two markers =? and ?=. After the =? comes the character set of the text, here GB2312, then a ?, then a B indicating Base64 encoding, then another ? followed by the encoded text. With this analysis in mind, let's look at the following MIME decoding function. (The function was provided by Sadly, the webmaster of phpx.com; I put it into a class and made a few small modifications. Many thanks.)
function decode_mime($string) {
    $pos = strpos($string, '=?');
    if ($pos === false) {
        return $string; // no encoded word: return the text unchanged
    }
    $preceding = substr($string, 0, $pos); // save any preceding text
    // the MIME header spec (RFC 2047) says 75 characters is the longest a single encoded word can be
    $search = substr($string, $pos + 2, 75);
    $d1 = strpos($search, '?');
    if ($d1 === false) {
        return $string;
    }
    $charset = substr($string, $pos + 2, $d1); // extract the character set name
    $search = substr($search, $d1 + 1); // the part after the character set
    $d2 = strpos($search, '?');
    if ($d2 === false) {
        return $string;
    }
    $encoding = substr($search, 0, $d2); // the part between the two ?s is the encoding: Q or B
    $search = substr($search, $d2 + 1);
    $end = strpos($search, '?='); // everything from $d2 + 1 up to $end is the encoded text
    if ($end === false) {
        return $string;
    }
    $encoded_text = substr($search, 0, $end);
    // 6 is the number of delimiter characters removed: "=?", "?", "?" and "?="
    $rest = substr($string, strlen($preceding . $charset . $encoding . $encoded_text) + 6);
    switch ($encoding) {
        case 'Q':
        case 'q':
            // in the Q header encoding, "_" stands for a space
            $encoded_text = str_replace('_', ' ', $encoded_text);
            $decoded = quoted_printable_decode($encoded_text);
            if (strtolower($charset) == 'windows-1251') {
                // convert_cyr_string() was removed in PHP 8.0; iconv() can be used there instead
                $decoded = convert_cyr_string($decoded, 'w', 'k');
            }
            break;
        case 'B':
        case 'b':
            $decoded = base64_decode($encoded_text);
            if (strtolower($charset) == 'windows-1251') {
                $decoded = convert_cyr_string($decoded, 'w', 'k');
            }
            break;
        default:
            // unknown encoding: put the encoded word back unchanged
            $decoded = '=?' . $charset . '?' . $encoding . '?' . $encoded_text . '?=';
            break;
    }
    // decode any further encoded words in the remainder of the string
    return $preceding . $decoded . $this->decode_mime($rest);
}
This function uses recursion to decode text that contains encoded segments like the Subject above. Comments have been added to the code, and anyone with a little PHP programming experience should be able to follow it. The function relies on the two built-in functions base64_decode() and quoted_printable_decode(), but working through the mail source also requires a large number of string operations; fortunately, PHP's string handling is among the most capable of any language. The final statement, return $preceding . $decoded . $this->decode_mime($rest);, performs the recursive decoding. Because the function actually lives inside the MIME decoding class being introduced here, the call takes the form $this->decode_mime($rest).
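For comparison, the same left-to-right decoding can be sketched without recursion by letting a regular expression find each encoded word. This is a different technique from the article's method, and the function name decode_mime_words is hypothetical. Like the original, it returns the bytes in the message's own character set and does no charset conversion; it also does not remove the whitespace between adjacent encoded words.

```php
<?php
// Hypothetical alternative: decode every =?charset?encoding?text?= word in one pass.
function decode_mime_words(string $header): string {
    return preg_replace_callback(
        '/=\?([^?]+)\?([BbQq])\?([^?]*)\?=/',
        function (array $m): string {
            // $m[1] is the charset, $m[2] the encoding letter, $m[3] the encoded text.
            if (strtoupper($m[2]) === 'B') {
                return base64_decode($m[3]);
            }
            // In the Q header encoding an underscore stands for a space.
            return quoted_printable_decode(str_replace('_', ' ', $m[3]));
        },
        $header
    );
}

// Plain text before and between the encoded words is passed through untouched.
echo decode_mime_words('Re: =?US-ASCII?Q?weekly_report?= and =?US-ASCII?B?SGVsbG8=?='), "\n";
// Re: weekly report and Hello
```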
Next, let's look at the message body. Here are some of the MIME header fields, with a brief introduction to each (readers who want to learn more should consult the official MIME documentation).
MIME-Version: 1.0
Indicates the version of MIME in use, which is normally 1.0.
Content-Type: Defines the type of the body. This field is how we know what kind of content the body holds: text/plain is an unformatted text body, text/html is an HTML document, image/gif is a GIF image, and so on. The composite types commonly used in mail deserve special mention. The multipart type indicates that the body consists of several parts, and the subtype that follows describes the relationship between those parts. Three of them are used in mail:
multipart/alternative: The body consists of two parts, either of which can be chosen. Its main use is for a message whose body exists in both text and HTML format, so that one of the two can be selected for display. Mail clients that support HTML generally display the HTML body, and those that do not display the text body.
multipart/mixed: The parts of the document are mixed, which refers to the relationship between the body and its attachments. If the MIME type of a message is multipart/mixed, the message has attachments.
multipart/related: The parts of the document are related to each other. It is generally used to describe an HTML body and the pictures that belong to it.
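The choice a mail client makes for multipart/alternative can be sketched as follows. The function name choose_alternative and the array layout (one entry per part, with 'type' and 'body' keys) are assumptions made for this illustration only; they are not part of the decoding class discussed in this article.

```php
<?php
// Hypothetical helper: pick the body to display from the parts of a
// multipart/alternative message, depending on whether the client supports HTML.
function choose_alternative(array $parts, bool $supportsHtml): ?string {
    $wanted = $supportsHtml ? 'text/html' : 'text/plain';
    foreach ($parts as $part) {
        if (strtolower($part['type']) === $wanted) {
            return $part['body'];
        }
    }
    // Fall back to the first part when the preferred type is missing.
    return $parts[0]['body'] ?? null;
}

$parts = [
    ['type' => 'text/plain', 'body' => 'Hello'],
    ['type' => 'text/html',  'body' => '<p>Hello</p>'],
];
echo choose_alternative($parts, true), "\n";   // <p>Hello</p>
echo choose_alternative($parts, false), "\n";  // Hello
```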
These composite types can be nested. For example, a message that has an attachment and whose body is provided in both HTML and text format has the following structure:
Content-Type: multipart/mixed
  Part 1:
    Content-Type: multipart/alternative
      Text body
      HTML body
  Part 2:
    Attachment
End of message
Because a composite type consists of several parts, a separator is needed to tell the parts apart. This is the boundary="----=_NextPart_000_0007_01C03166.5B1E9510" parameter in the mail source file. Every Content-Type: multipart/* header carries such a parameter to mark the separation between its parts. The separator is a combination of unusual characters that cannot appear in the body. In the document, "--" followed by the boundary marks the beginning of a part, and "--" followed by the boundary and another "--" marks the end of the multipart section. Because composite types can be nested, a message may contain several boundaries. There is one more very important MIME header:
Content-Transfer-Encoding: base64
Indicates how this part of the document is encoded, that is, Base64 or QP (Quoted-Printable) as described above. Only with this information can we decode the part with the correct method.
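Putting the boundary and Content-Transfer-Encoding together, a simplified sketch of splitting a multipart body and decoding each part might look like this. The helper names split_multipart and decode_part and the boundary value example-boundary are hypothetical. The sketch assumes CRLF line endings and a single, non-nested multipart level, and it trims whitespace around each part, which a full parser would treat more carefully.

```php
<?php
// Hypothetical helper: split a multipart body into its parts using the boundary.
function split_multipart(string $body, string $boundary): array {
    // Ignore everything from the closing delimiter "--boundary--" onwards.
    $end = strpos($body, '--' . $boundary . '--');
    if ($end !== false) {
        $body = substr($body, 0, $end);
    }
    // Each part follows a "--boundary" delimiter.
    $chunks = explode('--' . $boundary, $body);
    array_shift($chunks); // drop any preamble before the first delimiter
    return array_map('trim', $chunks);
}

// Hypothetical helper: decode one part according to its Content-Transfer-Encoding.
function decode_part(string $part): string {
    // Headers and content are separated by the first empty line.
    $pieces  = explode("\r\n\r\n", $part, 2);
    $headers = $pieces[0];
    $content = $pieces[1] ?? '';
    $encoding = '7bit'; // assumed default when the header is absent
    if (preg_match('/^Content-Transfer-Encoding:\s*(\S+)/mi', $headers, $m)) {
        $encoding = strtolower($m[1]);
    }
    switch ($encoding) {
        case 'base64':
            return base64_decode($content);
        case 'quoted-printable':
            return quoted_printable_decode($content);
        default:
            return $content; // 7bit, 8bit, binary: nothing to decode
    }
}

$boundary = 'example-boundary';
$body = "--$boundary\r\n"
      . "Content-Type: text/plain\r\nContent-Transfer-Encoding: base64\r\n\r\nSGVsbG8sIE1JTUUh\r\n"
      . "--$boundary\r\n"
      . "Content-Type: text/plain\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\nprice =3D 5\r\n"
      . "--$boundary--\r\n";

foreach (split_multipart($body, $boundary) as $part) {
    echo decode_part($part), "\n";
}
// Hello, MIME!
// price = 5
```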