How to use character sets and encoders

zhaozj2021-02-16  61

When you write a program, you sometimes need to write characters to a file, as in the following example:

    import java.io.*;

    public class encode1 {
        public static void main(String args[]) throws IOException {
            // FileWriter encodes characters with the platform's default encoding
            Writer writer = new FileWriter("out");
            writer.write("Testing");
            writer.close();
        }
    }

When you run this program on Solaris or on Windows, the resulting text file out is 7 bytes long. This is the result you would expect.

But there is an important issue here. Java characters are 16 bits wide, which means that each character occupies 2 bytes. The encode1 program writes 7 characters to the file out, yet the result is a 7-byte file. You might ask: where did the other bytes go? Shouldn't 14 bytes have been written to the file?

This issue is known as "character encoding": how 16-bit Java characters are mapped to 8-bit bytes when they are written to a file. The mapping is not a simple matter of narrowing or widening between 8 and 16 bits, because hundreds of character encodings are in use around the world. The same issue arises in the other direction: a particular sequence of 8-bit bytes has to be reassembled into a Java string, and how that is done depends on the platform and locale.

The Java platform addresses this problem by letting you choose a specific encoding to suit your needs. It also provides a default character encoding that depends on your platform and locale. As in the example above, I/O operations use the default character encoding unless told otherwise. Alternatively, you can specify another encoding (character set). An encoding can be named by a string such as "UTF-8", or it can be represented by an instance of the java.nio.charset.Charset class. Charset is an abstract class, so the instance is in fact an instance of a subclass of Charset.
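The two ways of naming an encoding can be compared side by side. The following is a minimal sketch, not code from the original article: the class and method names (CharsetNaming, encodeByName, encodeByCharset) are hypothetical, and it uses String.getBytes rather than a file so that it has no side effects.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

class CharsetNaming {
    // Name the encoding with a string such as "UTF-8".
    static byte[] encodeByName(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException exc) {
            throw new IllegalArgumentException("unsupported encoding: " + charsetName, exc);
        }
    }

    // Pass a Charset object instead of a name.
    static byte[] encodeByCharset(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        byte[] a = encodeByName("Testing", "UTF-8");
        byte[] b = encodeByCharset("Testing", utf8);
        System.out.println("same bytes: " + Arrays.equals(a, b));
        // Charset is abstract, so the object is an instance of some subclass.
        System.out.println("runtime class: " + utf8.getClass().getName());
    }
}
```

Both calls should produce the same bytes; the name of the runtime class is implementation specific.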

In the encode1 example, one way to handle the encoding would be to split each character into two bytes and write both to the file. Such a file, however, might not be usable by programs that expect one byte per character. Another way is to discard the high byte of each Java character. This approach works for the example above, but it fails if you try to write a Greek or Japanese string.

The example in effect uses the second approach (discarding the high byte). If you change the output line in encode1 from writer.write("Testing"); to writer.write("Testing\u1234"); the output will be 8 bytes long instead of 7, even though the Unicode character \u1234 cannot be represented in a single byte.

"Discarding" has two meanings in the discussion above. If the high byte of a Java character is 0, as it is for characters that represent 7-bit ASCII, then discarding simply means dropping the high byte. The other meaning applies when a character has no mapping in the encoding being used. In that case the 2-byte character may be replaced by a single default byte, just as \u1234 in the example above is replaced by 0x3f (the ASCII question mark).
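The replacement behavior can be observed without relying on the platform default. The sketch below assumes US-ASCII as a stand-in for a single-byte default encoding; the class name ReplacementDemo and the helper toAscii are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // String.getBytes substitutes the charset's replacement byte
    // for any character the charset cannot represent.
    static byte[] toAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] bytes = toAscii("Testing\u1234");
        int last = bytes[bytes.length - 1] & 0xff;
        System.out.println("length = " + bytes.length);                   // length = 8
        System.out.println("last byte = 0x" + Integer.toHexString(last)); // last byte = 0x3f
    }
}
```

The seven ASCII characters pass through unchanged, and the unmappable eighth character becomes a single 0x3f byte.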

Let's look at how to use character sets, that is, mappings between characters and bytes. A basic question is: which character sets are available? The following program prints a list:

    import java.nio.charset.*;
    import java.util.*;

    public class encode2 {
        public static void main(String args[]) {
            // availableCharsets() maps canonical names to Charset objects
            Map<String, Charset> availcs = Charset.availableCharsets();
            Set<String> keys = availcs.keySet();
            for (Iterator<String> iter = keys.iterator(); iter.hasNext();) {
                System.out.println(iter.next());
            }
        }
    }

Its output might look like this (without the * characters):

ISO-8859-1 *

ISO-8859-15

US-ASCII *

UTF-16 *

UTF-16BE *

UTF-16LE *

UTF-8 *

windows-1252

The * marks the character sets that every implementation of the Java platform is required to support.
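As a quick check of the starred entries, the following sketch asks the runtime whether each of the six required character sets is supported. The class name RequiredCharsets and its members are hypothetical; Charset.isSupported is the only library call it relies on.

```java
import java.nio.charset.Charset;

class RequiredCharsets {
    // The six names marked with * in the list above.
    static final String[] REQUIRED = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
    };

    // True if the running JVM reports every required name as supported.
    static boolean allSupported() {
        for (String name : REQUIRED) {
            if (!Charset.isSupported(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String name : REQUIRED) {
            System.out.println(name + " supported: " + Charset.isSupported(name));
        }
    }
}
```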

Another basic question is: what is the default character set on your own system? The following program prints its name:

    import java.io.*;
    import java.nio.charset.*;

    public class encode3 {
        public static void main(String args[]) throws IOException {
            FileWriter filewriter = new FileWriter("out");
            // getEncoding() names the encoding this writer uses (the default here)
            String encname = filewriter.getEncoding();
            filewriter.close();
            System.out.println("Default charset is: " + encname);
            /*
            Charset charset1 = Charset.forName(encname);
            Charset charset2 = Charset.forName("windows-1252");
            if (charset1.equals(charset2)) {
                System.out.println("Cp1252/windows-1252 equal");
            }
            else {
                System.out.println("Cp1252/windows-1252 unequal");
            }
            */
        }
    }

When you run this program, you might see output like this:

    Default charset is: Cp1252

Note that this character set is not in the list of character sets that every Java implementation is required to support; the default character set does not have to be one of the required ones. The example also contains some commented-out logic that shows how you could check whether a given character set is the default one. It shows that "windows-1252" and "Cp1252" name the same character set. The logic is commented out because "Cp1252" is not required to be supported, so the code might not work on your system.
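The commented-out comparison can be made safe on any platform by checking support first. This is a sketch under that assumption; the class CharsetAliases and the helper sameCharset are hypothetical names, and the result for "Cp1252" depends on whether your JVM ships that character set.

```java
import java.nio.charset.Charset;

class CharsetAliases {
    // True only if both names are supported and resolve to the same charset.
    static boolean sameCharset(String nameA, String nameB) {
        if (!Charset.isSupported(nameA) || !Charset.isSupported(nameB)) {
            return false;
        }
        return Charset.forName(nameA).equals(Charset.forName(nameB));
    }

    public static void main(String[] args) {
        System.out.println("Cp1252 vs windows-1252: "
            + sameCharset("Cp1252", "windows-1252"));
        // Every Charset can also list the alternative names it answers to.
        System.out.println("ISO-8859-1 aliases: "
            + Charset.forName("ISO-8859-1").aliases());
    }
}
```

Charset names are not case-sensitive, so sameCharset("UTF-8", "utf-8") also returns true.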

You may have seen another way to get the name of the default character set: looking up the system property "file.encoding". This approach works, but the property is not guaranteed to be defined on every Java platform.
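For comparison, here is a small sketch that reads the property and also asks the Charset class directly. Charset.defaultCharset() is not mentioned in the original text; it is assumed here to be available (it was added in Java 5). The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

class DefaultCharsetName {
    // May return null: the property is not guaranteed to be defined.
    static String fromProperty() {
        return System.getProperty("file.encoding");
    }

    // Canonical name of the default charset as reported by the Charset API.
    static String fromApi() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding property: " + fromProperty());
        System.out.println("Charset.defaultCharset(): " + fromApi());
    }
}
```

The two values can differ in spelling, because the property may hold an alias rather than the canonical name.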

In the encode3 program, Charset.forName is used to look up a Charset object from a character set name given as a string (such as "US-ASCII").

Here is another example that uses this technique:

    import java.nio.charset.*;

    public class encode4 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.out.println("Missing Charset Name");
                System.exit(1);
            }
            String charsetname = args[0];
            Charset charset;
            try {
                // throws if the named charset is not available in this JVM
                charset = Charset.forName(charsetname);
                System.out.println("Charset Lookup Successful");
            }
            catch (UnsupportedCharsetException exc) {
                System.out.println("Unknown Charset: " + charsetname);
            }
        }
    }

If you run the program like this:

    $ java encode4 xyz

it checks whether "xyz" names a character set that is supported on this system. If it does, you get the corresponding Charset object.
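Charset.forName can also reject a name because its syntax is illegal (for example, a name containing a space), which raises IllegalCharsetNameException rather than UnsupportedCharsetException. The sketch below handles both cases; the class CharsetLookup and its lookup method are hypothetical, and returning null on failure is just one possible design.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

class CharsetLookup {
    // Returns the Charset for the given name, or null if the name is
    // malformed or no such charset is available in this JVM.
    static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException exc) {
            return null; // the name contains characters that are not allowed
        } catch (UnsupportedCharsetException exc) {
            return null; // the name is legal but nothing is registered for it
        }
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "xyz", "bad name"};
        for (String name : names) {
            Charset found = lookup(name);
            System.out.println(name + " -> " + (found == null ? "not available" : found.name()));
        }
    }
}
```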

With this background in place, how do you actually use these character sets? Here is the first example, encode1, again, this time rewritten to specify an encoding:

    import java.io.*;

    public class encode5 {
        public static void main(String args[]) throws IOException {
            FileOutputStream fileoutstream = new FileOutputStream("out");
            // wrap the byte stream in a writer that encodes characters as UTF-8
            Writer writer = new OutputStreamWriter(fileoutstream, "UTF-8");
            writer.write("Testing");
            writer.close();
        }
    }

The encode1 program is less portable than it looks, because it uses a default character set that depends on the platform and locale. By contrast, the encode5 program uses a standard character set (UTF-8). As described earlier, the default character set used by the encode1 program converts characters to bytes by discarding the high byte. Using the UTF-8 character set solves this problem. If you change the output line from:

    writer.write("Testing");

to:

    writer.write("Testing\u1234");

the program still works correctly, and the character \u1234 is encoded rather than replaced. UTF-8 also has the big advantage that 7-bit ASCII characters are each written as a single, unchanged byte.
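To see that nothing is lost, the text can be written and read back through UTF-8. The following sketch uses in-memory streams instead of the file out so that it leaves nothing behind; the class Utf8RoundTrip and its write and read helpers are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class Utf8RoundTrip {
    // Encode text to bytes through an OutputStreamWriter, as encode5 does.
    static byte[] write(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            Writer writer = new OutputStreamWriter(bytes, "UTF-8");
            writer.write(text);
            writer.close();
            return bytes.toByteArray();
        } catch (IOException exc) {
            throw new UncheckedIOException(exc);
        }
    }

    // Decode the bytes back into a string with the same encoding.
    static String read(byte[] data) {
        try {
            Reader reader = new InputStreamReader(new ByteArrayInputStream(data), "UTF-8");
            StringBuilder text = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            reader.close();
            return text.toString();
        } catch (IOException exc) {
            throw new UncheckedIOException(exc);
        }
    }

    public static void main(String[] args) {
        String original = "Testing\u1234";
        byte[] encoded = write(original);
        System.out.println("bytes written = " + encoded.length);                 // bytes written = 10
        System.out.println("round trip ok = " + original.equals(read(encoded))); // round trip ok = true
    }
}
```

The seven ASCII characters take one byte each and \u1234 takes three, which gives 10 bytes in total.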

Here is another example. It shows how to convert a Java string into a byte array using a specific encoding.

    import java.io.*;

    public class encode6 {
        public static void main(String args[]) throws UnsupportedEncodingException {
            String str = "Testing";
            // no argument: use the default character set
            byte bytevec1[] = str.getBytes();
            // explicit character set name
            byte bytevec2[] = str.getBytes("UTF-16");
            System.out.println("bytevec1 length = " + bytevec1.length);
            System.out.println("bytevec2 length = " + bytevec2.length);
        }
    }

The output on your system might look like this:

    bytevec1 length = 7
    bytevec2 length = 16

The first conversion uses the default character set. The second uses the UTF-16 character set, which accounts for the 16 bytes: a 2-byte byte-order mark followed by 2 bytes for each of the 7 characters.
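The byte-order mark is easy to see by comparing UTF-16 with UTF-16BE, which uses the same byte order but writes no mark. This is a small sketch with hypothetical names (Utf16Lengths, utf16, utf16be).

```java
import java.nio.charset.StandardCharsets;

class Utf16Lengths {
    // UTF-16: Java's encoder writes a big-endian byte-order mark first.
    static byte[] utf16(String text) {
        return text.getBytes(StandardCharsets.UTF_16);
    }

    // UTF-16BE: same byte order, but no byte-order mark.
    static byte[] utf16be(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] withMark = utf16("Testing");
        byte[] withoutMark = utf16be("Testing");
        System.out.println("UTF-16 length = " + withMark.length);      // UTF-16 length = 16
        System.out.println("UTF-16BE length = " + withoutMark.length); // UTF-16BE length = 14
        System.out.println("first two bytes = "
            + Integer.toHexString(withMark[0] & 0xff) + " "
            + Integer.toHexString(withMark[1] & 0xff));                // first two bytes = fe ff
    }
}
```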

One last point about character encoding remains to be discussed.

You might wonder what a specific mapping or encoding algorithm actually looks like. Here is some code from DataOutputStream.writeUTF, which maps a character array into a byte array:

    // strlen is the number of characters held in charr
    for (int i = 0; i < strlen; i++) {
        c = charr[i];
        if ((c >= 0x0001) && (c <= 0x007F)) {
            // 7-bit ASCII: one byte, value unchanged
            bytearr[count++] = (byte) c;
        }
        else if (c > 0x07FF) {
            // three bytes: 4 + 6 + 6 payload bits
            bytearr[count++] = (byte) (0xE0 | ((c >> 12) & 0x0F));
            bytearr[count++] = (byte) (0x80 | ((c >> 6) & 0x3F));
            bytearr[count++] = (byte) (0x80 | ((c >> 0) & 0x3F));
        }
        else {
            // two bytes: 5 + 6 payload bits
            bytearr[count++] = (byte) (0xC0 | ((c >> 6) & 0x1F));
            bytearr[count++] = (byte) (0x80 | ((c >> 0) & 0x3F));
        }
    }

Each character is taken from charr, converted into 1 to 3 bytes, and written to bytearr. Characters in the range 0x1-0x7f (7-bit ASCII) are mapped to themselves as a single byte. The character 0x0 and characters in the range 0x80-0x7ff are mapped to 2 bytes. All other characters are mapped to 3 bytes.
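The fragment can be wrapped into a complete method and checked against the library. The following is a sketch rather than the JDK source: the class ModifiedUtf8 and its methods are hypothetical, and the reference bytes come from DataOutputStream.writeUTF with its 2-byte length prefix skipped.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ModifiedUtf8 {
    // Apply the 1/2/3-byte mapping described above to every char of str.
    static byte[] encode(String str) {
        char[] charr = str.toCharArray();
        byte[] bytearr = new byte[charr.length * 3]; // worst case: 3 bytes per char
        int count = 0;
        for (int i = 0; i < charr.length; i++) {
            int c = charr[i];
            if ((c >= 0x0001) && (c <= 0x007F)) {
                bytearr[count++] = (byte) c;
            } else if (c > 0x07FF) {
                bytearr[count++] = (byte) (0xE0 | ((c >> 12) & 0x0F));
                bytearr[count++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                bytearr[count++] = (byte) (0x80 | (c & 0x3F));
            } else {
                bytearr[count++] = (byte) (0xC0 | ((c >> 6) & 0x1F));
                bytearr[count++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return Arrays.copyOf(bytearr, count);
    }

    // Reference result: what writeUTF emits after its 2-byte length prefix.
    static byte[] viaWriteUTF(String str) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(str);
            out.close();
            byte[] all = bytes.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException exc) {
            throw new UncheckedIOException(exc);
        }
    }

    public static void main(String[] args) {
        String text = "Testing\u1234";
        System.out.println("encoded length = " + encode(text).length);
        System.out.println("matches writeUTF = "
            + Arrays.equals(encode(text), viaWriteUTF(text)));
    }
}
```

Note that the character 0x0 takes two bytes here, which is one of the ways this mapping differs from standard UTF-8.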

When reposting, please cite the original address: https://www.9cbs.com/read-25086.html
