Characters in the Java system are stored as double-byte values using the Unicode (UTF-16) encoding. (It is expected that a later JDK version may extend the character encoding to 4 bytes, which would thoroughly solve the problem of East Asian character sets.)
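For example, a single char in Java already holds one 16-bit UTF-16 code unit, so a character outside the Basic Multilingual Plane occupies two chars (a surrogate pair). The snippet below is a minimal sketch of this (the class name is only illustrative):

```java
public class Utf16Demo {
    public static void main(String[] args) {
        String han = "\u4E2D";          // CJK character "中": one UTF-16 code unit
        String emoji = "\uD83D\uDE00";  // U+1F600: stored as a surrogate pair

        System.out.println(han.length());                             // 1
        System.out.println(emoji.length());                           // 2 code units
        System.out.println(emoji.codePointCount(0, emoji.length()));  // 1 code point
    }
}
```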
UTF-8 is a standard storage encoding format with good error detection and good compatibility. Encoding Unicode code points with UTF-8 (encode) loses no information, and decoding a UTF-8 encoded byte stream back into Unicode (decode) loses no information either. However, UTF-8 cannot correctly decode a byte stream that was not produced by UTF-8 encoding. In short, UTF-8 can encode any Unicode code point, but it can only decode byte streams that were actually encoded with UTF-8.
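The following sketch illustrates this round trip, assuming a JRE that ships the GBK charset (used here only to produce some non-UTF-8 bytes):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        String text = "Java \u7F16\u7801";   // "Java 编码"

        // Encode with UTF-8 and decode with UTF-8: lossless round trip.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(text.equals(back));             // true: no information lost

        // Decode non-UTF-8 bytes with UTF-8: the Chinese part becomes replacement chars.
        byte[] gbk = text.getBytes(Charset.forName("GBK"));
        System.out.println(new String(gbk, StandardCharsets.UTF_8));  // garbled
    }
}
```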
UTF-16 is used in the same way as UTF-8, with one difference: UTF-16 encodes characters in multiples of two bytes, while UTF-8 encodes them in multiples of one byte. For English text, the UTF-8 byte stream is identical to the ASCII byte stream, which lets systems in English-speaking countries upgrade smoothly to UTF-8 support. Upgrading to UTF-16, by contrast, would require converting all existing data, which is clearly unacceptable. Note: UTF-16 comes in two variants with different byte orders (big-endian and little-endian).
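A small sketch of both points, using the standard charsets that ship with Java (the exact byte counts printed are for the string used here):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompat {
    public static void main(String[] args) {
        String english = "Hello";

        // For plain English text, UTF-8 bytes are identical to ASCII bytes.
        byte[] ascii = english.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8  = english.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(ascii, utf8));    // true

        // UTF-16 doubles the size and exists in two byte orders.
        System.out.println(english.getBytes(StandardCharsets.UTF_16BE).length); // 10
        System.out.println(english.getBytes(StandardCharsets.UTF_16LE).length); // 10
        System.out.println(english.getBytes(StandardCharsets.UTF_16).length);   // 12 (2-byte BOM + 10)
    }
}
```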
ISO8859-1 is a character encoding format widely used in Western countries. Using ISO8859-1 to encode the East Asian portion of Unicode loses information; in other words, encoding Unicode code points with ISO8859-1 is lossy. However, decoding any byte stream (whether ISO8859-1 encoded or not) with ISO8859-1 loses no information, because all 256 single-byte values are legal characters in ISO8859-1. Some Linux systems and some application servers use ISO8859-1 as the default decoding, which is why most garbled text appears.
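Both properties can be seen in the sketch below. Because ISO8859-1 decoding never discards bytes, an accidental ISO8859-1 decode can sometimes be reversed; the repair pattern at the end is shown only as an illustration, not a recommendation:

```java
import java.nio.charset.StandardCharsets;

public class Iso88591Demo {
    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587";   // "中文"

        // Encoding: characters outside ISO8859-1 are replaced (typically with '?'), so data is lost.
        byte[] latin1 = chinese.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));  // "??"

        // Decoding: every byte 0x00-0xFF is legal, so nothing is rejected; the
        // result is garbled but the original bytes are still recoverable.
        byte[] utf8 = chinese.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        String repaired = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
        System.out.println(chinese.equals(repaired));                         // true
    }
}
```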
GB18030, GBK and GB2312 are encoding formats for Chinese characters. GB2312, GBK and GB18030 belong to the same series; the older ones cover fewer characters, but they work in the same way, so this article does not distinguish them and refers to them collectively as GB18030 (when encoding, characters are taken from their Unicode code points). That is, GB18030 only encodes the Chinese and English characters in Unicode; other characters are lost. Similarly, GB18030 can only decode byte streams that were encoded with GB18030.
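A sketch of the round trip, assuming a JRE that ships the GB18030 charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Gb18030Demo {
    public static void main(String[] args) {
        Charset gb18030 = Charset.forName("GB18030");
        String chinese = "\u6C49\u5B57";   // "汉字"

        // Chinese text survives a GB18030 encode/decode round trip.
        byte[] bytes = chinese.getBytes(gb18030);
        System.out.println(chinese.equals(new String(bytes, gb18030)));  // true

        // But the same byte stream decoded with another charset is garbled:
        // only GB18030 can decode GB18030-encoded bytes.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```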
The encoding declared in an XML file tells the browser to decode the file with the specified encoding format (on the client side), provided of course that the browser supports that encoding. The character set of a JSP page tells the JSP server to decode the JSP file with the specified encoding format (on the server side). In a servlet, however, response.setContentType("text/html; charset=GBK") tells the servlet container to encode the response with the specified encoding format (on the server side).
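A minimal servlet sketch of that call, using the standard javax.servlet API (the class name and content are illustrative only):

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class GbkResponseServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Tells the container which charset to use when encoding the response body;
        // the same charset name is sent to the browser in the Content-Type header.
        // Must be called before getWriter(), otherwise the default encoding is used.
        response.setContentType("text/html; charset=GBK");

        PrintWriter out = response.getWriter();   // this writer now encodes with GBK
        out.println("<html><body>\u4E2D\u6587</body></html>");
    }
}
```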
The basic idea of character set conversion is very simple: a byte stream encoded with a certain character encoding must be decoded with that same encoding. The deeper cause of the frequent problems is that Java byte streams carry no encoding information, which can be considered a serious design flaw. It is expected that a future version of Java may provide such information. …to be continued
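Because the byte stream itself cannot say how it was encoded, the program has to be told both charsets explicitly. A minimal conversion sketch (method and class names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetConvert {
    // Decode the input with 'from', then re-encode the characters with 'to'.
    public static byte[] convert(byte[] input, Charset from, Charset to) throws IOException {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(input), from);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, to);
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.close();   // flush the remaining encoded bytes into 'out'
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] gbkBytes = "\u4E2D\u6587".getBytes(Charset.forName("GBK"));
        byte[] utf8Bytes = convert(gbkBytes, Charset.forName("GBK"), StandardCharsets.UTF_8);
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));   // "中文"
    }
}
```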
The places where an encoding is involved are: editing the Java source file, compiling the Java source file into a class file, running the program, the encoding specified by the server, the encoding specified in the JSP file, the encoding specified in the servlet, and the encoding specified in the resource (e.g. database) connection configuration.
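At any of these points where no encoding is specified, Java falls back to the platform default charset, which is where many mismatches come from. A small sketch for inspecting it:

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset());             // e.g. UTF-8 or GBK, platform-dependent
        System.out.println(System.getProperty("file.encoding"));  // the underlying system property
    }
}
```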
There are two reasons why characters may fail to display correctly: the character set, and the font. For general-purpose software,