How to determine whether a passage of Chinese text is BIG5 encoded?

xiaoxiao2021-03-06  60

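// BIG5 detection can be implemented as follows
procedure TForm1.Button1Click(Sender: TObject);
var
  i: Integer;
  str: string;
begin
  // The sample strings stand in for Chinese text stored as double-byte characters.
  str := 'How to let him change to Traditional';
  // str := 'can support mutual Turn software is not more';
  // Inspect every second byte, i.e. the low byte of each double-byte character.
  for i := 1 to Length(str) do
    if not Odd(i) then
      if Byte(str[i]) < 127 then
      begin
        ShowMessage('Contains BIG5');
        Break; // one hit is enough
      end;
end;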

In this way, it can be determined in most cases whether the string is BIG5. Also in Delphi, when a function such as << function GBToBig5(Value: string): string; >> is used to convert GB to BIG5, the converted text looks garbled on screen. In fact the conversion is most likely correct, but Label1's Font.Charset is still set to GB2312, so the BIG5 bytes are drawn with the wrong character set. Pay attention to this and switch the font charset to the BIG5 one (CHINESEBIG5_CHARSET) before displaying the result.

Some background information

1, ===============================

When we get a character code, how do we determine whether it is a GBK code? And if it is a GBK code, how do we locate its position in the code table?

Judging whether a code is GBK is relatively simple: we check it against the valid range. For example:

if 0x81 <= CH1 <= 0xFE and (0x40 <= CH2 <= 0x7E or 0x80 <= CH2 <= 0xFE): # is a GBK char

Here, CH1 and CH2 are the high byte and the low byte of the character, respectively. The low-byte range is split in two because 0x7F is not used.
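The range check above can be sketched in Python as follows. This is only a minimal illustration: the function name is_gbk_char is made up for this example, and the low-byte range is taken to exclude 0x7F, as the code-table discussion states.

```python
def is_gbk_char(ch1: int, ch2: int) -> bool:
    """Return True if (ch1, ch2) lies in the GBK double-byte range.

    ch1 is the high byte, ch2 the low byte (hypothetical helper, not a library API).
    """
    lead_ok = 0x81 <= ch1 <= 0xFE
    # The low byte skips 0x7F, so its range is split into two parts.
    trail_ok = 0x40 <= ch2 <= 0x7E or 0x80 <= ch2 <= 0xFE
    return lead_ok and trail_ok


# Usage: inspect the two bytes that Python's own "gbk" codec produces.
ch1, ch2 = "中".encode("gbk")
print(hex(ch1), hex(ch2), is_gbk_char(ch1, ch2))
```

The check only says that a byte pair falls inside the GBK range; it does not prove that the text is GBK, because other double-byte encodings overlap with this range.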

How do we locate it (why we need to is explained later)? First, the code table. The code table stores all the codes together, and it can be kept in a file (the description here assumes the codes are stored in a file). When storing the codes, we store only the codes that actually exist (because some byte combinations do not exist), in ascending byte order. Given GBK's code range, we can imagine a two-dimensional grid in which the ordinate is the high byte, the abscissa is the low byte, and each intersection is one Chinese character occupying two bytes. The number of characters in one row is then 0xFE - 0x40 + 1 - 1 = 190 (the plus 1 counts 0x40 itself; the minus 1 removes the missing 0x7F). To locate a character, subtract 0x81 from the high byte to get the vertical offset, and subtract 0x40 from the low byte to get the horizontal offset. Multiply the vertical offset by the number of characters per row and add the horizontal offset to get the character's offset. Multiply by 2 to get the offset in bytes. The positioning algorithm is therefore:

INDEX = ((CH1 - 0x81) * 190 + (CH2 - 0x40) - (CH2 / 128)) * 2

The algorithm above contains the term - (CH2 / 128), where the division is integer division. It is there because GBK has no 0x7F low byte. When CH2 is less than 0x7F, CH2 / 128 = 0 and nothing is subtracted. When CH2 is greater than 0x7F, CH2 / 128 = 1, which removes the 0x7F slot that CH2 - 0x40 counted by mistake. Since one Chinese character occupies two bytes, the result is multiplied by 2. This gives the byte position of a GBK character in the code table.
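As a worked example of the positioning formula, here is a small Python sketch. The names gbk_index and GBK_ROW_SIZE are invented for this illustration, and the code table is assumed to be laid out exactly as described (two bytes per character, rows of 190 characters, no 0x7F column).

```python
GBK_ROW_SIZE = 190  # 0xFE - 0x40 + 1 - 1: low bytes per high byte, 0x7F excluded


def gbk_index(ch1: int, ch2: int) -> int:
    """Byte offset of the GBK code (ch1, ch2) in the assumed code table."""
    if not (0x81 <= ch1 <= 0xFE and (0x40 <= ch2 <= 0x7E or 0x80 <= ch2 <= 0xFE)):
        raise ValueError("not a GBK double-byte code")
    row = ch1 - 0x81                 # vertical offset
    col = ch2 - 0x40 - ch2 // 128    # horizontal offset, skipping the 0x7F slot
    return (row * GBK_ROW_SIZE + col) * 2


# 0x7E and 0x80 end up next to each other because 0x7F is skipped.
print(gbk_index(0x81, 0x7E), gbk_index(0x81, 0x80))  # 124 126
# Worked example for the code 0xD6D0: (85 * 190 + 143) * 2
print(gbk_index(0xD6, 0xD0))  # 32586
```

Python's // operator is used for the integer division in the formula; with ordinary / the term would not collapse to 0 or 1.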

BIG5 is the encoding used in Hong Kong and Taiwan. Its range is: high byte from 0xA1 to 0xFE; low byte from 0x40 to 0x7E and from 0xA1 to 0xFE. Whether a character is BIG5 encoded can be judged from this range. How do we locate it? Again imagine all the codes arranged on a two-dimensional grid, with the high byte on the ordinate and the low byte on the abscissa. The number of characters in one row is then (0x7E - 0x40 + 1) + (0xFE - 0xA1 + 1) = 63 + 94 = 157. The positioning algorithm is split into two cases:

if 0x40 <= CH2 <= 0x7E: # is a BIG5 char
    INDEX = ((CH1 - 0xA1) * 157 + (CH2 - 0x40)) * 2
elif 0xA1 <= CH2 <= 0xFE: # is a BIG5 char
    INDEX = ((CH1 - 0xA1) * 157 + (CH2 - 0xA1 + 63)) * 2

2, ========================================

GB and BIG5 are generally distinguished as follows:

1. Both bytes of a GB code's internal code are between A0H and FEH;

2. The first byte of a BIG5 code's internal code is between 80H and FFH, while the second byte can be anywhere between 00H and FFH;

You therefore have to scan the full text and see whether any Chinese character has a second byte that is less than 7FH. If one does, the text is generally BIG5. There are special cases, of course, but they are very rare.
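The BIG5 positioning formulas from section 1 and the scan described here can be sketched together in Python. The names big5_index and looks_like_big5 are hypothetical. The scan assumes the input is either GB-style text (both bytes high) or BIG5 and that bytes below 80H outside a double-byte character are plain ASCII; it is a heuristic, not a reliable detector.

```python
BIG5_ROW_SIZE = 157  # (0x7E - 0x40 + 1) + (0xFE - 0xA1 + 1) = 63 + 94


def big5_index(ch1: int, ch2: int) -> int:
    """Byte offset of the BIG5 code (ch1, ch2) in the assumed code table."""
    if not 0xA1 <= ch1 <= 0xFE:
        raise ValueError("high byte outside the BIG5 range")
    if 0x40 <= ch2 <= 0x7E:
        col = ch2 - 0x40
    elif 0xA1 <= ch2 <= 0xFE:
        col = ch2 - 0xA1 + 63        # 63 slots are taken by 0x40..0x7E
    else:
        raise ValueError("low byte outside the BIG5 range")
    return ((ch1 - 0xA1) * BIG5_ROW_SIZE + col) * 2


def looks_like_big5(data: bytes) -> bool:
    """Heuristic: True if any double-byte character has a second byte below 7FH."""
    i = 0
    while i + 1 < len(data):
        if data[i] < 0x80:           # single-byte ASCII, step over it
            i += 1
            continue
        if data[i + 1] < 0x7F:       # low second byte: cannot be a GB code
            return True
        i += 2
    return False


print(big5_index(0xA1, 0x7E), big5_index(0xA1, 0xA1))  # 124 126
print(looks_like_big5(b"\xa4\x40"))          # True: second byte is 0x40
print(looks_like_big5(b"\xd6\xd0\xce\xc4"))  # False: every byte is high
```

A False result does not prove the text is GB: a BIG5 passage whose characters all happen to have high second bytes passes through undetected, which is one of the rare special cases mentioned above.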

When reprinting, please cite the original address: https://www.9cbs.com/read-117552.html
