How to distinguish whether the text is BIG5 or GB?

zhaozj 2021-02-08

There is no 100% reliable way to tell GB from BIG5. However, the way Chinese characters are distributed across the two encodings does give some basis for judgment.

In GB, the first byte ranges over A1-F7 and the trail byte over A1-FE. First bytes A1-A9 form the symbol area, and AA-AF is undefined. In BIG5, the first byte ranges over A1-F9, and the trail byte falls into two ranges, 40-7E and A1-FE. First bytes A1-A3 form the symbol area, and A4-C5 holds the commonly used characters.
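
The ranges above can be written down as two small range checks. This is only a sketch: the function names are made up for illustration, and the byte ranges are taken directly from the text rather than from the official code tables.

```python
# Byte ranges as stated in the text; the function names are illustrative only.

def is_gb_pair(lead: int, trail: int) -> bool:
    """True if (lead, trail) lies in the GB range: lead A1-F7, trail A1-FE."""
    return 0xA1 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE


def is_big5_pair(lead: int, trail: int) -> bool:
    """True if (lead, trail) lies in the BIG5 range: lead A1-F9, trail 40-7E or A1-FE."""
    trail_ok = 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE
    return 0xA1 <= lead <= 0xF9 and trail_ok
```

Most byte pairs pass both checks, which is why the range test alone cannot decide between the two encodings.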

From this we can derive a few criteria:

1) A trail byte in 40-7E is specific to BIG5, so it can be used to discriminate. GBK now also defines characters in this code area, but those characters occur with low frequency, so the test remains usable, although it cannot be guaranteed 100% correct;

2) In GB, first bytes A4-A9 are Japanese kana, Greek letters, Russian letters and box-drawing characters, which rarely appear in normal text, and AA-AF is not defined at all. In BIG5, however, this range holds common Chinese characters, so if codes in this range occur frequently in the text, it can be considered BIG5. In particular, a first byte in AA-AF with a trail byte in A1-FE is almost certainly BIG5, because even GBK leaves this range undefined.
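
Criteria 1) and 2) could be counted over a byte string roughly as follows. This is a minimal sketch, not a tested detector: the names `iter_pairs` and `big5_hints` are hypothetical, and it assumes any byte below 0x80 is single-byte ASCII and any other byte starts a two-byte character.

```python
def iter_pairs(data: bytes):
    """Yield (lead, trail) for each two-byte character, skipping ASCII bytes."""
    i = 0
    while i < len(data) - 1:
        if data[i] < 0x80:      # assumption: bytes below 0x80 are plain ASCII
            i += 1
            continue
        yield data[i], data[i + 1]
        i += 2


def big5_hints(data: bytes) -> dict:
    """Count the byte patterns that criteria 1) and 2) treat as signs of BIG5."""
    counts = {"trail_40_7e": 0, "lead_a4_a9": 0, "lead_aa_af": 0}
    for lead, trail in iter_pairs(data):
        if 0x40 <= trail <= 0x7E:
            counts["trail_40_7e"] += 1      # criterion 1
        if 0xA4 <= lead <= 0xA9:
            counts["lead_a4_a9"] += 1       # criterion 2, weaker evidence
        elif 0xAA <= lead <= 0xAF and 0xA1 <= trail <= 0xFE:
            counts["lead_aa_af"] += 1       # criterion 2, strongest evidence
    return counts
```

For example, `big5_hints(bytes([0xA4, 0x40, 0xAA, 0xA1]))` reports one hit in each of the three counters, while text made up only of GB level-1 characters produces no hits at all.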

To improve the accuracy of the identification, it is best to apply several criteria at the same time. You can also analyze the frequency of Chinese characters, or look for some common phrases. Because that is more complicated, I will not go into it here.

A first byte in C6-D7 with a trail byte in A1-FE belongs to the level-1 character set in GB, which is in common use. In BIG5, C6-C7 has no clear definition but is usually used for Japanese kana and serial numbers, and C8-D7 belongs to the less commonly used character area. So if many codes fall in this range, the text can be judged to be GB.
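
Putting the byte-range criteria together, a simple vote between the two encodings might look like the sketch below. The function name `guess_encoding`, the equal weighting of every hit, and the "unknown" result on a tie are all assumptions made for illustration; the text does not prescribe a scoring scheme.

```python
def guess_encoding(data: bytes) -> str:
    """Return "big5", "gb" or "unknown" from a simple vote over byte pairs."""
    big5 = gb = 0
    i = 0
    while i < len(data) - 1:
        lead, trail = data[i], data[i + 1]
        if lead < 0x80:                      # assumption: ASCII byte, skip it
            i += 1
            continue
        if 0x40 <= trail <= 0x7E:
            big5 += 1                        # trail byte 40-7E points to BIG5
        elif 0xA1 <= trail <= 0xFE:
            if 0xA4 <= lead <= 0xAF:
                big5 += 1                    # rare or undefined in GB, common in BIG5
            elif 0xC6 <= lead <= 0xD7:
                gb += 1                      # GB level-1 area, little used in BIG5
        i += 2
    if big5 == gb:
        return "unknown"
    return "big5" if big5 > gb else "gb"
```

With this weighting, the bytes D6 D0 CE C4 vote for GB and the bytes A4 A4 A4 E5 vote for BIG5. Short or mixed input can easily end in a tie or a wrong answer, which matches the caveat that no method is 100% reliable.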

When reposting, please cite the original address: https://www.9cbs.com/read-1381.html