What is Unicode?

xiaoxiao2021-03-20  240

Analysis of Unicode and UTF-8

http://blog.9cbs.net/resterjames/archive/2005/09/28/491619.aspx

1. Dialects around the world

First, some encoding schemes in common use today:

1. In mainland China the most common encoding is GB18030, alongside the older GBK and GB2312. They developed as follows:

- The earliest Chinese character encoding was GB2312, which includes 6,763 Chinese characters and 682 other symbols.

- It was revised in 1995 and named GBK 1.0, which includes a total of 21,886 symbols.

- GB18030 was introduced after that, with a total of 27,484 Chinese characters; it also covers the main minority scripts such as Tibetan, Mongolian, and Uyghur. Current Windows platforms are required to support GB18030.

In the order GB18030, GBK, GB2312, the three encodings are backward compatible: the same Chinese character has the same code in all three schemes.

2. Taiwan, Hong Kong, etc. use BIG5 encoding

3. Japan: SJIS encoding

2. Unicode

If the various text encodings are dialects spoken around the world, then Unicode is a common language developed jointly by all countries.

In this language environment there are no encoding conflicts between languages. Text in any language can be displayed on the same screen, which is the biggest benefit of Unicode.

So how is Unicode coded? In fact, it is very simple.

The idea is to encode all the world's characters with 2 bytes. You may ask: 2 bytes can represent only 65,536 codes, is that enough?

Most of the Chinese characters used in Korea and Japan came from China, and their glyphs are exactly the same.

For example, the character "文" ("text") is the same Chinese character in GBK and in SJIS; only its code differs.

With characters unified in this way, 2 bytes are enough to hold most of the characters of all the world's languages.
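The point that one character has different codes in different national encodings, yet a single Unicode code, can be checked with a short Python sketch. It assumes the example character is "文" and relies on Python's built-in gbk and shift_jis codecs; the variable names are illustrative only.

```python
# The same character, encoded under two national encodings.
ch = "文"

gbk_bytes = ch.encode("gbk")           # code used in mainland China
sjis_bytes = ch.encode("shift_jis")    # code used in Japan

# The two byte sequences differ.
print(gbk_bytes.hex(), sjis_bytes.hex())

# Decoding either one yields the same Unicode character.
same = gbk_bytes.decode("gbk") == sjis_bytes.decode("shift_jis")
print(same, f"U+{ord(ch):04X}")
```

Both byte sequences map back to one code point, which is what lets Unicode serve as the shared representation.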

UCS-2 and UCS-4

Universal Multiple-Octet Coded Character Set is abbreviated as UCS.

UCS-2, the 2-byte encoding, is what is used now; UCS-4 was developed in case 2 bytes turn out not to be enough in the future. The range that UCS-2 covers is also known as the Basic Multilingual Plane (BMP).

Converting UCS-2 to UCS-4 simply means adding two zero bytes in front of each code.
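The widening step can be sketched in a few lines of Python. This is a minimal illustration, assuming big-endian byte order and input that contains only 2-byte codes; the function name `ucs2_to_ucs4` is made up for this example.

```python
def ucs2_to_ucs4(ucs2: bytes) -> bytes:
    """Widen big-endian UCS-2 to big-endian UCS-4 by prefixing two zero bytes per character."""
    if len(ucs2) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    out = bytearray()
    for i in range(0, len(ucs2), 2):
        out += b"\x00\x00" + ucs2[i:i + 2]
    return bytes(out)


you_ucs2 = bytes.fromhex("4f60")       # U+4F60 as a 2-byte code
you_ucs4 = ucs2_to_ucs4(you_ucs2)
print(you_ucs4.hex())
```

For a character in the basic plane, the result matches what Python's own `utf-32-be` codec produces.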

UCS-4 is mainly used to hold the supplementary planes, for example the second supplementary plane in Unicode 4.0:

24000-24FFF, 25000-25FFF, 26000-26FFF, 27000-27FFF, 28000-28FFF, 29000-29FFF, 2A000-2AFFF, 2F000-2FFFF

In total 16 supplementary planes were added, extending the original 65,536 codes to more than 1 million (17 planes x 65,536 = 1,114,112 code positions).
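The plane arithmetic is easy to verify. The sketch below assumes each plane holds 65,536 codes and that the plane number is simply the part of the code above the low 16 bits; `plane_of` is a hypothetical helper name.

```python
PLANE_SIZE = 0x10000        # 65,536 codes per plane
PLANE_COUNT = 1 + 16        # the basic plane plus 16 supplementary planes


def plane_of(code_point: int) -> int:
    """Return the plane number: the bits above the low 16."""
    return code_point >> 16


total_codes = PLANE_SIZE * PLANE_COUNT
print(total_codes)
print(plane_of(0x4F60), plane_of(0x24000))
```

A code such as 24000 therefore falls in plane 2, while every 2-byte code falls in plane 0.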

3. Compatibility through CodePage

Since Unicode is unified, how does it stay compatible with the text encodings each country already had?

This is where CodePage comes in.

What is a CodePage? A CodePage is a mapping table between a country's text encoding and Unicode.

For example, the mapping table between Simplified Chinese and Unicode is CP936, for which an official mapping table is published.

Here are a few common code pages; the table for each is found by changing the number in the address of the CP936 table.

CodePage = 936 Simplified Chinese GBK

CodePage = 950 Traditional Chinese BIG5

CodePage = 437 United States / Canadian English

CodePage = 932 Japanese

CodePage = 949 Korean

CodePage = 866 Russian

CodePage = 65001 Unicode UTF-8

The last one, 65001, is as far as I understand it only a virtual mapping table; in practice it is just an algorithm.

Take one line from the 936 table, for example:

0x9993 0x6ABD #CJK UNIFIED IDEOGRAPH

The first code is the GBK encoding; the second is the Unicode code.

By looking up this table, you can easily convert between GBK and Unicode.
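A table lookup of this kind might be sketched as follows. This is only an illustration: it assumes every data line has the shape "0xGGGG 0xUUUU #comment" as in the sample line, and the helper names `parse_mapping_line` and `build_tables` are invented for the example.

```python
def parse_mapping_line(line: str):
    """Parse '0xGGGG 0xUUUU #comment' into (gbk_code, unicode_code), or None."""
    fields = line.split("#", 1)[0].split()
    if len(fields) < 2:
        return None         # blank line, comment-only line, or unmapped entry
    return int(fields[0], 16), int(fields[1], 16)


def build_tables(lines):
    """Build lookup dictionaries for both conversion directions."""
    gbk_to_unicode = {}
    for line in lines:
        pair = parse_mapping_line(line)
        if pair is not None:
            gbk_to_unicode[pair[0]] = pair[1]
    unicode_to_gbk = {u: g for g, u in gbk_to_unicode.items()}
    return gbk_to_unicode, unicode_to_gbk


sample = ["0x9993 0x6ABD #CJK UNIFIED IDEOGRAPH"]
to_unicode, to_gbk = build_tables(sample)
print(hex(to_unicode[0x9993]), hex(to_gbk[0x6ABD]))
```

With the whole table loaded, converting in either direction is one dictionary lookup per character.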

4. UTF-8

Now that Unicode is clear, what is UTF-8, and why is it needed?

Converting ASCII to UCS-2 just inserts a 0x0 byte before each code. Text encoded this way contains bytes that look like control or special characters, such as '\0' or '/', which cause serious errors in UNIX and in some C functions. So it is certain that UCS-2 is not suitable as an external encoding of Unicode.
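The problem can be made visible with a small Python sketch. It uses the `utf-16-be` codec as a stand-in for big-endian UCS-2, which gives the same bytes for the characters used here; the character U+4E2F was picked only because its low byte happens to be 0x2F.

```python
ascii_text = "AB"
ucs2 = ascii_text.encode("utf-16-be")   # same bytes as big-endian UCS-2 here
print(ucs2)                             # every other byte is 0x00

# A C-style string ends at the first zero byte, so strlen() would stop here.
c_string_length = ucs2.find(b"\x00")

# Some non-ASCII characters contain the byte 0x2F, which reads as '/'.
tricky = chr(0x4E2F).encode("utf-16-be")
has_slash = b"/" in tricky
print(c_string_length, has_slash)
```

A function that treats a zero byte as the end of a string, or 0x2F as a path separator, would misread both samples.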

Therefore, UTF-8 was born. So how is UTF-8 encoded, and how does it solve the problem with UCS-2?

For example:

E4 BD A0 11100100 10111101 10100000

This is the UTF-8 encoding of the character "你" ("you").

4F 60 01001111 01100000

This is the Unicode encoding of "你".

Decomposing the UTF-8 bytes according to the UTF-8 encoding rules gives: xxxx0100 xx111101 xx100000

Joining the digits other than the x's, 0100 111101 100000, gives 01001111 01100000, which is the Unicode encoding of "你".
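The same decomposition can be written out in Python as a rough sketch. It handles only 3-byte sequences and does not reject every malformed input; the function name `decode_3byte` is hypothetical.

```python
def decode_3byte(b: bytes) -> int:
    """Strip the marker bits of a 3-byte UTF-8 sequence and join the payload bits."""
    if len(b) != 3 or b[0] >> 4 != 0b1110 or any(x >> 6 != 0b10 for x in b[1:]):
        raise ValueError("not a 3-byte UTF-8 sequence")
    # 4 payload bits from the first byte, 6 from each of the other two.
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


code_point = decode_3byte(bytes.fromhex("E4BDA0"))
print(f"{code_point:04X}")   # 4F60
```

The masks 0x0F and 0x3F discard exactly the positions shown as x in the decomposition.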

Note the three leading 1 bits of the first UTF-8 byte: they indicate that the whole UTF-8 sequence consists of 3 bytes.

After UTF-8 encoding, the sensitive characters can never appear inside a multi-byte sequence, because the highest bit of each of its bytes is always 1.
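A quick check of this property, as a sketch using Python's built-in `utf-8` codec (the helper name `all_high_bits_set` and the sample path are made up for illustration):

```python
def all_high_bits_set(data: bytes) -> bool:
    """True if every byte has its highest bit set (value >= 0x80)."""
    return all(byte & 0x80 for byte in data)


encoded = "你".encode("utf-8")           # E4 BD A0
print(all_high_bits_set(encoded))

# ASCII characters keep their single-byte codes, so a real '/' stays 0x2F
# and cannot be confused with part of a multi-byte character.
path = "目录/文件".encode("utf-8")
slash_count = path.count(b"/")
print(slash_count)
```

Only the genuine separator shows up as 0x2F; none of the bytes belonging to the Chinese characters can match it.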

The following table shows the conversion between Unicode and UTF-8:

U-00000000 - U-0000007F: 0xxxxxxx

U-00000080 - U-000007FF: 110xxxxx 10xxxxxx

U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

To convert a Unicode code to UTF-8, simply pour the bits of the Unicode value into the x positions of the matching pattern.
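The table can be turned into a small encoder. What follows is a sketch, not a production codec: the name `utf8_encode` is invented, and it follows the six-row table literally, including the 5- and 6-byte forms. Current UTF-8 as defined in RFC 3629 stops at 4 bytes (U+10FFFF), which is also the limit of Python's built-in codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code by filling the x positions of the matching pattern."""
    if cp < 0:
        raise ValueError("code must be non-negative")
    if cp <= 0x7F:
        return bytes([cp])              # 0xxxxxxx
    # (upper limit of the row, lead-byte marker, number of 10xxxxxx bytes)
    rows = ((0x7FF, 0xC0, 1), (0xFFFF, 0xE0, 2), (0x1FFFFF, 0xF0, 3),
            (0x3FFFFFF, 0xF8, 4), (0x7FFFFFFF, 0xFC, 5))
    for limit, lead, n in rows:
        if cp <= limit:
            out = [lead | (cp >> (6 * n))]          # top bits go into the lead byte
            for shift in range(6 * (n - 1), -1, -6):
                out.append(0x80 | ((cp >> shift) & 0x3F))   # 6 bits per following byte
            return bytes(out)
    raise ValueError("code out of range")


print(utf8_encode(0x4F60).hex())   # e4bda0
```

Within the range that Python supports, the result agrees with `chr(cp).encode("utf-8")`.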

