CMS Chinese Encoding Problems: Analysis and Solutions
Opening note: Some commenters believe that UTF-8 encoding solves all language encoding problems. I do not agree, for two reasons. First, among Chinese users, Simplified and Traditional users usually keep to their own encodings and seldom need both at once. Second, a large number of existing websites use GB/BIG5 encoding, and the GB/BIG5 to UTF-8 conversion support in mbstring/iconv is not ideal.
Any CMS development has to deal with language issues, mainly multi-byte processing, including character set (charset) conversion, string length calculation, and so on. In general, ordinary multi-byte issues can be solved with the mbstring/iconv modules, but for Chinese the two modules that ship with PHP are not perfect. The Chinese version of XOOPS uses the XCONV module (Iconv for Xoops). This module uses a lookup-and-replace method, which is inefficient, but there is no alternative. [XOOPS lets you freely choose the encoding; if you choose UTF-8, treat the following as casual reading.]
This article attempts to cover the following topics [ordered by how ready the existing material is, not by logic or chronology]: an introduction to the BIG5 code-collision problem, with analysis and solutions; an introduction to the XCONV module and its use; multi-byte issues and how to handle them.
BIG5 Code-Collision Problem: Analysis and Solution
Anyone processing BIG5 encoding has to face a well-known issue: the BIG5 code-collision problem, commonly called the "Xu Gong Gai" (許功蓋) problem. The excerpt below is taken from a Taiwanese source.
About "Xu Gong Gai"
What is "Xu Gong Gai"? Anyone who has built a site with PHP and MySQL knows these three characters, and has certainly suffered over them while building those sites ... (Excerpted from OSCommerce Shopping Website)
Look closely at the cause and you will find that the culprit is not really "Xu Gong Gai" but the Big Five code (BIG5). Accounts of the history of BIG5 vary, and we will not comment here on the rights and wrongs of the original encoding. Readers interested in this history may wish to search for "BIG5"; there is plenty of material to be found.
Since BIG5 seems to have problems, why is it the encoding used for nearly all Traditional Chinese today? Around 1983-1984, personal computers were gradually spreading in Taiwan and packaged software was becoming common. To solve the problem of processing Chinese on computers, a Chinese internal code was developed, which is the BIG5 code we all know. Anyone who lived through that period will remember the marketing approach of the ETen Chinese System (倚天中文). While foreign software vendors were only beginning to introduce the concept of licensed software, ETen went the other way and allowed campuses, and even ordinary users, to copy its Chinese system freely and without conditions. As a result, ETen almost became the de facto Chinese standard of its time, and because ETen used BIG5 encoding, this is the main reason BIG5 is still in use today.
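Before looking at where BIG5 went wrong, it can help to see what BIG5 bytes actually look like. The following is a minimal sketch, written in Python purely for illustration (the article's own context is PHP), that uses Python's built-in `big5` codec. The helper names `big5_hex` and `trail_byte_in_ascii_range` are hypothetical, and the sample characters are chosen here as examples: 中 plus three characters whose second byte happens to be 0x5C, the ASCII backslash.

```python
# Illustration only: inspect the raw BIG5 bytes of a few characters.

def big5_hex(ch):
    """Return the BIG5 encoding of one character as spaced hex, e.g. 'A4 A4'."""
    return " ".join("%02X" % b for b in ch.encode("big5"))

def trail_byte_in_ascii_range(ch):
    """True if the character's second BIG5 byte is below 0x80 (the ASCII range)."""
    data = ch.encode("big5")
    return len(data) == 2 and data[1] < 0x80

for ch in "中許功蓋":
    # Prints the character, its two bytes, and whether the second byte
    # collides with the ASCII range.
    print(ch, big5_hex(ch), trail_byte_in_ascii_range(ch))
```

Any byte-oriented tool that treats 0x5C as a backslash will see one inside each of the last three characters, even though it is only the second half of a two-byte character.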
Where did BIG5 go wrong? Its mistake was not excluding the code values that carry special meaning (the "control codes") in ASCII (American Standard Code for Information Interchange). Anyone who has studied computing knows that ASCII is byte-based, and 1 byte = 8 bits, so a single-byte code can hold at most 2^8 = 256 characters (standard ASCII itself defines only the lower 128). For English, with only 26 letters, that is more than enough, but it is nowhere near enough for tens of thousands of Chinese characters, so two bytes must be used to represent one Chinese character. For example, the character 中 is encoded as "A4A4". In the BIG5 design, to avoid conflicts with ASCII, the first byte of each Chinese character uses only high values (129-255), but the second byte also uses some low values (within 1-128, in practice 0x40-0x7E, which includes 0x5C, the backslash). This is the greatest weakness of BIG5 in later applications.
Why does BIG5 run into trouble with PHP and MySQL? There are three reasons. One, the SQL injection problem: we know that to fetch data from a MySQL database, the syntax looks like this:
SELECT * FROM administrator WHERE id = 'ABC' AND passwd = '1234'
Suppose I have a login.php page containing a form for entering the id and passwd values. If someone instead types the values directly into the URL:
login.php?id=ABC&passwd='%20OR%201=1%20OR%201='
Since %20 is decoded as a space, the SQL statement finally sent to MySQL is:
SELECT * FROM administrator WHERE id = 'ABC' AND passwd = '' OR 1=1 OR 1=''
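The substitution described above can be reproduced in a few lines. The following is a minimal sketch in Python (used only for illustration; the article's scenario is a PHP page talking to MySQL). It assumes the page builds its query by plain string concatenation, and it uses an in-memory SQLite table as a stand-in for the MySQL `administrator` table. The function names `build_login_query` and `run_login` are hypothetical.

```python
import sqlite3
from urllib.parse import parse_qs

def build_login_query(query_string):
    """Build the login SQL by naive string concatenation (the vulnerable pattern)."""
    # parse_qs URL-decodes the values, so %20 becomes a space.
    params = parse_qs(query_string, keep_blank_values=True)
    user_id = params["id"][0]
    passwd = params["passwd"][0]
    return ("SELECT * FROM administrator "
            "WHERE id = '" + user_id + "' AND passwd = '" + passwd + "'")

def run_login(sql):
    """Run the query against a throwaway in-memory table and return matching rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE administrator (id TEXT, passwd TEXT)")
    conn.execute("INSERT INTO administrator VALUES ('ABC', '1234')")
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows

injected = build_login_query("id=ABC&passwd='%20OR%201=1%20OR%201='")
print(injected)             # the statement that reaches the database
print(run_login(injected))  # the administrator row, without the real password
```

Because AND binds more tightly than OR, the injected condition reduces to "(id and passwd match) OR 1=1 OR 1=''", and the middle term alone is enough to match every row.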
The WHERE condition of this injected statement is always true, so it fools the verification and obtains Administrator privileges. The single quote therefore becomes the number one hidden killer. Two, PHP's escape character: 5C (the backslash, \) is used in PHP as the escape character. That is, when the text in a variable contains single or double quotes, you usually need to put an extra \ in front of them so that these special characters are handled correctly. A common example: <?php echo "