What's UTF-8?

xiaoxiao2021-03-06 84

Organizer: China Interactive Publishing Network (http://www.china-pub.com/)
RFC Document Chinese Translation Plan (http://www.china-pub.com/compters/emook/aboutemook.htm)
E-mail: Ouyang@china-pub.com
Translator: Chen Jianhua (Chjh21 Chjh@263.net)
Translation date: 2001-10-15
Copyright: China Interactive Publishing Network. This translation may be reprinted freely for non-commercial use, provided that the translator and copyright information of this document is retained.

Network Working Group                                  F. Yergeau
Request for Comments: 2279                             Alis Technologies
Obsoletes: 2044                                        January 1998
Category: Standards Track

RFC 2279 - UTF-8, a Transformation Format of ISO 10646

Status of this Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (1998). All Rights Reserved.

Abstract

ISO/IEC 10646-1 defines a multi-octet character set called the Universal Character Set (UCS), which encompasses most of the world's writing systems. (An octet is an 8-bit byte.) Multi-octet characters, however, are not compatible with many current applications and protocols, and this has led to the development of a few UCS transformation formats (UTF), each with different characteristics. UTF-8, the object of this memo, preserves the full US-ASCII range, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. This memo updates and replaces RFC 2044, in particular addressing the question of versions of the relevant standards.

Table of Contents

1. Introduction
2. UTF-8 Definition
3. Versions of the Standards
4. Examples
5. MIME Registration
6. Security Considerations
Acknowledgments
References
Author's Address
Full Copyright Statement

1. Introduction

ISO/IEC 10646-1 [ISO-10646] defines a multi-octet character set called the Universal Character Set (UCS), which encompasses most of the world's writing systems. Two multi-octet encodings are defined: a four-octet-per-character encoding called UCS-4, and a two-octet-per-character encoding called UCS-2. UCS-2 can address only the first 64K characters of the UCS (the Basic Multilingual Plane, BMP), outside of which there are currently no assignments.

It is worth noting that the Unicode standard [Unicode] defines the same character set. It further defines additional character properties and other application details of interest to implementors, but it does not define the UCS-4 encoding. Up to now, changes to Unicode and amendments to ISO/IEC 10646 have tracked each other, so that the character repertoires and code point assignments have remained in sync. The relevant standardization committees have committed to maintaining this very useful synchronism.

The UCS-2 and UCS-4 encodings, however, are hard to use in many current applications and protocols that assume 8-bit or even 7-bit characters. Even newer systems able to deal with 16-bit characters cannot process UCS-4 data. This situation has led to the development of so-called UCS transformation formats (UTF), each with different characteristics.

UTF-1 is of historical interest only, having been removed from ISO/IEC 10646. UTF-7 encodes the full BMP repertoire using only octets with the high-order bit clear (7-bit US-ASCII values, [US-ASCII]), and is therefore considered a mail-safe encoding ([RFC2152]). UTF-8, the object of this memo, uses all bits of an octet but preserves the full US-ASCII range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for a US-ASCII character and nothing else.
UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire into pairs of UCS-2 values from a reserved range. UTF-16 affects UTF-8 in that UCS-2 values from the reserved range must be treated specially in the UTF-8 transformation.

UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of octets, where the number of octets, and the value of each, depend on the integer value assigned to the character in ISO/IEC 10646. This transformation format has the following characteristics (all values are in hexadecimal):

- Character values from 0000 0000 to 0000 007F (the US-ASCII repertoire) correspond to octets 00 to 7F (7-bit US-ASCII values). A direct consequence is that a plain ASCII string is also a valid UTF-8 string.
- US-ASCII values do not appear otherwise in a UTF-8 encoded character stream. This provides compatibility with file systems and other software (for example, the printf() function in C libraries) that parse based on US-ASCII values but are transparent to other values.
- Round-trip conversion between UTF-8 and either UCS-4 or UCS-2 is easy.
- The first octet of a multi-octet sequence indicates the number of octets in the sequence.
- The octet values FE and FF never appear.
- Character boundaries are easily found from anywhere in an octet stream.
- The lexicographic sorting order of UCS-4 strings is preserved. This is of limited interest, since that sort order is not culturally valid in either case.
- The Boyer-Moore fast search algorithm can be used with UTF-8 data.
- UTF-8 strings can be recognized fairly reliably by a simple algorithm: the probability that a string of characters in any other encoding appears as valid UTF-8 is low, and it diminishes as the string length increases.

UTF-8 was originally a project of the X/Open Joint Internationalization Group (XOJIG), whose objective was to specify a File System Safe UCS Transformation Format [FSS_UTF] that is compatible with UNIX systems and supports multilingual text in a single encoding. The original authors were Gary Miller, Greger Leijonhufvud and John Entenmann. Later, Ken Thompson and Rob Pike did significant work on the formal definition of UTF-8.

A description can also be found in Unicode Technical Report #4 and in the Unicode Standard, version 2.0 [Unicode]. The definitive reference, including provisions for UTF-16 data within UTF-8, is Annex R of ISO/IEC 10646-1 [ISO-10646].

2. UTF-8 Definition

In UTF-8, characters are encoded using sequences of 1 to 6 octets. The only octet of a one-octet sequence has its high-order bit set to 0, and the remaining 7 bits encode the character value. In a sequence of n octets (n > 1), the initial octet has its n high-order bits set to 1, followed by a bit set to 0; the remaining bits of that octet contain bits from the value of the character being encoded. Each following octet has its high-order bit set to 1 and the next bit set to 0, leaving 6 bits in each to contain bits from the character being encoded.

The table below summarizes the format of these octet types. The letter x marks the bits available for encoding bits of the UCS-4 character value.

UCS-4 range (hex.)      UTF-8 octet sequence (binary)
0000 0000-0000 007F     0xxxxxxx
0000 0080-0000 07FF     110xxxxx 10xxxxxx
0000 0800-0000 FFFF     1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF     1111110x 10xxxxxx ... 10xxxxxx
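The table above can be read in both directions: from a character value to the number of octets it needs, and from an initial octet to the length of the sequence it starts. The following is a minimal Python sketch of both lookups. The names RANGE_LIMITS, octets_needed and length_from_first_octet are illustrative and do not come from the memo.

```python
# Inclusive upper bound of each UCS-4 range in the table,
# indexed by (sequence length - 1).
RANGE_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def octets_needed(value):
    """Return how many octets the table assigns to a UCS-4 value."""
    if value < 0:
        raise ValueError("UCS-4 values are non-negative")
    for length, limit in enumerate(RANGE_LIMITS, start=1):
        if value <= limit:
            return length
    raise ValueError("value exceeds 7FFF FFFF")


def length_from_first_octet(octet):
    """Return the sequence length announced by an initial octet."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("not an octet")
    if octet < 0x80:
        return 1  # 0xxxxxxx
    if octet < 0xC0:
        raise ValueError("10xxxxxx is a continuation octet")
    # Each bound is the first value that no longer matches the row's prefix.
    for length, bound in ((2, 0xE0), (3, 0xF0), (4, 0xF8), (5, 0xFC), (6, 0xFE)):
        if octet < bound:
            return length
    raise ValueError("FE and FF never appear in UTF-8")
```

For instance, octets_needed(0x2262) gives 3, and length_from_first_octet(0xE2) gives 3 as well, since E2 is 11100010 in binary.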

Encoding from UCS-4 to UTF-8 proceeds as follows:

1) Determine the number of octets required from the character value and the first column of the table above. It is important to note that the rows of the table are mutually exclusive: there is only one valid way to encode a given UCS-4 character.

2) Prepare the high-order bits of the octets according to the second column of the table.

3) Fill in the bits marked x from the bits of the character value, starting from the low-order bits of the character value and putting them first in the last octet of the sequence, then in the next to last, and so on until all x bits are filled in.
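The three encoding steps can be sketched in Python as shown here. This is an illustration under stated assumptions, not a reference implementation: the function name utf8_encode is hypothetical, and the sketch encodes any value from 0 to 7FFF FFFF without treating the UTF-16 reserved range D800 to DFFF specially.

```python
def utf8_encode(value):
    """Encode one UCS-4 value (0 to 7FFF FFFF) as a bytes object."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("not a UCS-4 value")
    if value <= 0x7F:
        return bytes([value])  # single octet: 0xxxxxxx

    # Step 1: choose the row of the table (2 to 6 octets).
    length = 2
    for limit in (0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF):
        if value <= limit:
            break
        length += 1

    octets = [0] * length
    # Step 3 for the continuation octets: take six bits at a time from the
    # low-order end and store them in the last octet first.
    for i in range(length - 1, 0, -1):
        octets[i] = 0x80 | (value & 0x3F)  # 10xxxxxx
        value >>= 6

    # Step 2 for the initial octet: `length` one bits followed by a zero.
    # Whatever is left of the value fills its x bits.
    prefix = (0xFF << (8 - length)) & 0xFF
    octets[0] = prefix | value
    return bytes(octets)
```

As a quick check, utf8_encode(0x2262).hex() evaluates to 'e289a2', which matches chr(0x2262).encode("utf-8") from Python's built-in encoder.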

In principle, the algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be obtained from the above by simply extending each UCS-2 character with two zero-valued octets. However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs, in Unicode parlance) are actually UCS-4 characters transformed through UTF-16, so they need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above.

Decoding from UTF-8 to UCS-4 proceeds as follows:

1) Initialize the 4 octets of the UCS-4 character with all bits set to 0.

2) Determine which bits encode the character value from the number of octets in the sequence and the second column of the table above (the bits marked x).

3) Distribute the bits from the sequence to the UCS-4 character, starting with the low-order bits from the last octet of the sequence and proceeding to the left until no x bits are left.

If the UTF-8 sequence is no more than three octets long, decoding can proceed directly to UCS-2.
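The surrogate-pair step and the decoding steps can be sketched in Python as follows. The names combine_surrogates, utf8_decode_one and MIN_VALUE are illustrative, not from the memo. Two choices in this sketch are assumptions of the example rather than requirements stated in the steps above: it accumulates bits from left to right (which produces the same value as filling from the last octet backwards), and it rejects any sequence that is longer than the table requires for its value.

```python
def combine_surrogates(high, low):
    """Undo UTF-16: map a pair of UCS-2 surrogate values to one UCS-4 value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Smallest value that legitimately needs a sequence of each length.
MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def utf8_decode_one(data, pos=0):
    """Decode the sequence starting at data[pos]; return (value, next_pos)."""
    first = data[pos]
    if first < 0x80:
        return first, pos + 1  # 0xxxxxxx
    if first < 0xC0 or first >= 0xFE:
        raise ValueError("invalid initial octet")

    # Count the leading one bits of the initial octet to get the length.
    length = 2
    while first & (0x80 >> length):
        length += 1
    if pos + length > len(data):
        raise ValueError("truncated sequence")

    value = first & (0x7F >> length)  # x bits of the initial octet
    for octet in data[pos + 1:pos + length]:
        if (octet & 0xC0) != 0x80:
            raise ValueError("expected a 10xxxxxx continuation octet")
        value = (value << 6) | (octet & 0x3F)

    if value < MIN_VALUE[length]:
        raise ValueError("sequence is longer than necessary for its value")
    return value, pos + length
```

With these definitions, utf8_decode_one(bytes([0xE2, 0x89, 0xA2])) returns (0x2262, 3), and combine_surrogates(0xD800, 0xDC00) returns 0x10000.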

NOTE - Actual implementations of the decoding algorithm above should protect against decoding invalid sequences. For instance, a naive implementation may (wrongly) decode the invalid UTF-8 sequence C0 80 into the character U+0000, which may have security consequences and/or cause other problems. See the Security Considerations section below.

A more detailed algorithm and formulae can be found in [FSS_UTF], [Unicode] or Annex R of [ISO-10646].

3. Versions of the Standards

ISO/IEC 10646 is updated from time to time by published amendments. Similarly, different versions of the Unicode standard exist: 1.0, 1.1 and 2.0 as of this writing. Each new version obsoletes and replaces the previous one, but implementations, and more significantly data, are not updated instantly.

In general, the changes amount to adding new characters, which does not pose particular problems for old data. Amendment 5 to ISO/IEC 10646, however, moved and expanded the Korean Hangul block, thereby making any previous data containing Hangul characters invalid under the new version. Unicode 2.0 has the same difference from Unicode 1.1. The official justification for allowing such an incompatible change was that no implementations and no data containing Hangul existed. The incident has been dubbed the "Korean mess", and the relevant committees have pledged never to make such an incompatible change again.

New versions, and in particular any incompatible changes, have consequences for MIME character encoding labels, which are discussed in section 5.

4. Examples

The UCS-2 sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391, 002E) may be encoded in UTF-8 as follows:

41 E2 89 A2 CE 91 2E

The UCS-2 sequence representing the Hangul characters for the Korean word "hangugo" (D55C, AD6D, C5B4) may be encoded as follows:

ED 95 9C EA B5 AD EC 96 B4

The UCS-2 sequence representing the Han characters for the Japanese word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows:

E6 97 A5 E6 9C AC E8 AA 9E

5. MIME Registration

This memo is meant to serve as the basis for registration of a MIME character set parameter (charset) [Charset-reg]. The proposed charset parameter value is "UTF-8". This string labels media types containing text consisting of characters from the repertoire of ISO/IEC 10646, including all amendments at least up to Amendment 5 (Korean block), encoded to a sequence of octets using the encoding scheme outlined above. UTF-8 is suitable for use in MIME content types under the "text" top-level type.

It is worth noting that the label "UTF-8" does not contain a version identification; it refers generically to ISO/IEC 10646. This is intentional, and the rationale is as follows.

A MIME charset label is designed to give just the information needed to interpret a sequence of bytes received on the wire as a sequence of characters, and nothing more (see RFC 2045, section 2.2, in [MIME]). As long as a character set standard does not change incompatibly, version numbers serve no purpose: nothing is gained by learning from the label that newly assigned, unknown characters may be received. The label itself teaches nothing about the new characters, which will be received anyway.

Hence, as long as the standards evolve compatibly, the advantage of labels that identify the version is only apparent. Version-dependent labels also have a disadvantage: when an older application receives data accompanied by a newer, unknown label, it may fail to recognize the label and be completely unable to deal with the data, whereas a generic, known label would have triggered mostly correct processing of data that may well contain no new characters.

The "Korean mess" (ISO/IEC 10646 Amendment 5) is an incompatible change, and in principle it contradicts the appropriateness of a version-independent MIME charset label as described above.

However, the compatibility problem can only appear with data containing Korean Hangul characters encoded according to Unicode 1.1 (or, equivalently, ISO/IEC 10646 before Amendment 5). There is arguably no such data to worry about, which is the very reason the incompatible change was deemed acceptable.

In practice, then, a version-independent label is warranted, provided that the label is understood to refer to all versions after Amendment 5 and that no further incompatible change actually occurs. Should incompatible changes occur in a later version of ISO/IEC 10646, the MIME charset label defined here will stay aligned with the previous version unless the IETF specifically decides otherwise.

It is also proposed to register the charset parameter value "UNICODE-1-1-UTF-8", for the exclusive purpose of labelling text data containing Hangul syllables encoded to UTF-8 without taking Amendment 5 of ISO/IEC 10646 into account (that is, using the code point assignments from before Amendment 5). Any other UTF-8 data SHOULD NOT use this label, in particular data that does not contain any Hangul syllables. It is strongly recommended not to create any new Hangul-containing data without taking Amendment 5 of ISO/IEC 10646 into account.

6. Security Considerations

Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker could exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack could be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when it is encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 and interpret it as a NUL character.
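The NUL case can be made concrete with a small Python sketch. The helper names naive_decode_two_octets and contains_nul_octet are hypothetical and exist only for this illustration. The final check relies on Python's built-in UTF-8 decoder, which raises UnicodeDecodeError for the octet C0.

```python
def naive_decode_two_octets(first, second):
    """Strip the marker bits without validating anything (unsafe)."""
    return ((first & 0x1F) << 6) | (second & 0x3F)


def contains_nul_octet(data):
    """A check applied to the encoded form: look for the single octet 00."""
    return 0x00 in data


payload = bytes([0x41, 0xC0, 0x80, 0x42])

# The check on the encoded octets finds no 00 octet, so the payload passes.
passes_filter = not contains_nul_octet(payload)

# A decoder that only masks bits then turns C0 80 into the value 0 (NUL).
smuggled_value = naive_decode_two_octets(payload[1], payload[2])

# A strict decoder refuses the sequence instead of producing NUL.
try:
    payload.decode("utf-8")
    strict_decoder_rejects = False
except UnicodeDecodeError:
    strict_decoder_rejects = True
```

The point of the sketch is the mismatch between the two stages: the check and the decoder disagree about what C0 80 means, and that gap is what an attacker would exploit.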
Another example might be a parser that prohibits the octet sequence 2F 2E 2E 2F ("/../") yet permits the illegal octet sequence 2F C0 AE 2E 2F.

Acknowledgments

The following persons contributed to the drafting and discussion of this memo:

James E. Agenbroad, Andries Brouwer, Martin J. Dürst, Ned Freed, David Goldsmith, Edwin F. Hart, Kent Karlsson, Markus Kuhn, Michael Kung, Alain LaBonte, John Gardiner Myers, Murray Sargent, Keld Simonsen, Arnold Winkler

References

[Charset-reg] Freed, N., and J. Postel, "IANA Charset Registration Procedures", BCP 19, RFC 2278, January 1998.

[FSS_UTF] X/Open CAE Specification C501, ISBN 1-85912-082-2, 28cm, 22p, pbk., 172g, 4/95, X/Open Company Ltd.

[ISO-10646] ISO/IEC 10646-1:1993. International Standard - Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane. Five amendments and a technical corrigendum have been published up to now. UTF-8 is described in Annex R, published as Amendment 2. UTF-16 is described in Annex Q, published as Amendment 1. 17 other amendments are currently at various stages of standardization.

[MIME] Freed, N., and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045. Freed, N., and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", RFC 2046. Moore, K., "Multipurpose Internet Mail Extensions (MIME) Part Three: Message Header Extensions for Non-ASCII Text", RFC 2047. Freed, N., J. Klensin, and J. Postel, "Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures", RFC 2048. Freed, N., and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part Five: Conformance Criteria and Examples", RFC 2049. All November 1996.

[RFC2152] Goldsmith, D., and M. Davis, "UTF-7: A Mail-Safe Transformation Format of Unicode", RFC 2152, Taligent Inc., May 1997. (Obsoletes RFC 1642)

[Unicode] The Unicode Consortium, "The Unicode Standard - Version 2.0", Addison-Wesley, 1996.

[US-ASCII] Coded Character Set - 7-Bit American Standard Code for Information Interchange, ANSI X3.4-1986.

Author's Address

Francois Yergeau
Alis Technologies
100, boul. Alexis-Nihon
Suite 600
Montreal QC H4M 2P2
Canada

Phone: +1 (514) 747-2547
Fax: +1 (514) 747-2561
Email: fyergeau@alis.com

Full Copyright Statement

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards, in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

..............

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
