Chinese information processing

zhaozj, 2021-02-16

Computer processing of Chinese

Human life is immersed in an ocean of information, and not a moment of it can be separated from information. Language and writing are unique to human society, and every major innovation in information processing has carried society to a more civilized stage. In ancient times, the smoke of beacon fires carried news of war. The invention of paper and printing was a revolution in how information is represented and stored; the invention of the telegraph, telephone and television was a revolution in how information is processed and transmitted; the typewriter, the teleprinter and similar inventions brought the processing of language and text into the age of mechanization; and the electronic computer, a powerful information processing tool, has brought humanity into the information society.

In China, Chinese information processing is no longer a novelty. "Chinese" in the broad sense refers to the languages and scripts in common use in China, including Chinese characters and the scripts of the minority nationalities; in the narrow sense it refers to Chinese characters. "Information" means whatever can be obtained through sight, hearing, smell, taste or touch and serves some communicative function. "Processing" means the various operations carried out by computer, chiefly the recognition, simulation, analysis, conversion and transmission of image information and language information.

Research on Chinese information processing has generally taken the form of a wide variety of systems, such as Chinese character information processing systems, editing and typesetting systems, information retrieval systems, programmed instruction systems, machine translation systems, and various databases and expert systems. There are also speech recognition systems, Chinese speech synthesis systems, various communication systems, human-machine dialogue systems, and more.

The systems above share one feature: none can do without the computer. Chinese information processing here does not mean ordinary computer typing, but the handling and processing of Chinese by computer. China is a great and ancient country with five thousand years of civilization; it had oracle bone script three thousand years ago and stood in the front rank of the world. Faced with the new technology, however, Chinese characters could not enter the computer directly and came under great pressure. In 1880 a Dane compiled a telegraph code for Chinese characters, used to transmit them by telegraph. In 1956 the Chinese scientist Hui Wenhao put forward a "coding theory": compiling Chinese characters into 4-digit telegraph codes, and converting dots and dashes back into Chinese characters, are both coding processes, so Chinese characters can be transmitted and processed as information. Since then, people of insight have studied Chinese character information processing, overcome many difficulties, and created nearly 1,000 Chinese character input coding schemes, some twenty to thirty of which have seen practical use.

Chinese characters were not created for use on computers, and the computer was not invented to process Chinese characters. Merely assigning codes to Chinese characters is not the ultimate goal. The ultimate goal of research on Chinese information processing, to use a metaphor, is to let the computer grow a Chinese-style brain and Chinese-style ears, to become a highly intelligent Chinese robot that achieves intelligent processing, automated typesetting, office automation and so on, and thereby contributes to modernization.

How Chinese characters enter the computer

The computer is hailed as the emblem of the new technological revolution, and its power seems almost limitless. Yet the computer recognizes only two symbols, 0 and 1. Here 0 is the "space" signal and 1 is the "mark" signal; they are not the Arabic numerals 0 and 1. The computer uses the binary system, not the decimal system. In general, computer processing of language and text consists mainly of representing, recognizing, transmitting and copying the 26 Latin letters, the 10 Arabic numerals and some punctuation marks. The ASCII codes of ABC, abc and 123 are as follows:

A: 01000001 (41H) B: 01000010 (42H) C: 01000011 (43H)

a: 01100001 (61H) b: 01100010 (62H) c: 01100011 (63H)

1: 00110001 (31H) 2: 00110010 (32H) 3: 00110011 (33H)

To enter A, we tap the A key on the keyboard. The computer does not know the letter A, but it does know the code of A, namely 01000001. It transmits and processes that code, and on output converts it back into the letter. English is entered the same way: to enter "Book", tap the letter keys directly, and the computer receives the binary codes of B, o, o, k, namely 01000010, 01101111, 01101111, 01101011.
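The code table above can be reproduced with a few lines of Python. This is only an illustrative sketch; the function names `ascii_bits` and `from_bits` are invented for the example and do not come from the article.

```python
def ascii_bits(text: str) -> list[str]:
    """Return the 8-bit binary ASCII code of each character in text."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


def from_bits(codes: list[str]) -> str:
    """Convert a list of 8-bit binary strings back into text (the output step)."""
    return bytes(int(code, 2) for code in codes).decode("ascii")


# Print the same table as in the text: character, binary code, hexadecimal code.
for ch in "ABCabc123":
    print(ch, format(ord(ch), "08b"), format(ord(ch), "02X") + "H")

codes = ascii_bits("Book")
print(codes)             # the four codes the keyboard sends for "Book"
print(from_bits(codes))  # restored on output
```

The round trip through `ascii_bits` and `from_bits` mirrors the idea in the text: the machine handles only the codes, and the letters reappear when the codes are converted back on output.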

Chinese characters enter the computer differently. There are about 60,000 Chinese characters in all, built from more than 600 components, and such large numbers cannot be mapped directly onto the keys and binary codes of the computer. The solution is to turn each Chinese character into an external code expressed in letters or digits. Take the character "莉" (li). Encoded by its pinyin, the code is LI: press the L and I keys, a row of characters pronounced li appears, and you select the one you want by its serial number. Input this way is slow. To reduce duplicate codes and speed up input, the usual method is to append distinguishing letters after LI, for example the letters C and L taken from the components "艹" and "利", so that the code of "莉" becomes LICL, which has essentially no duplicates and can be entered directly. Another method appends the tone code 4 (fourth tone) and the radical code U (艹), giving LI4U, which is easy to encode and readable. With a shape code, "莉" is first split into roots (radicals or smaller components): 艹, 禾, 刂. Its Wubi ("five-stroke") code is ATJ; since this may collide with other characters, the last-stroke code 2 and the character-shape code 2 are combined into the identification code J (22), so the code of "莉" is ATJJ. Whether a phonetic code or a shape code is used, then, Chinese characters are converted into letters (or digits) to enter the computer, and on output the letters are converted back into Chinese characters. It is like taking a train: passengers cannot board with cash, so renminbi must first be exchanged for a ticket. After reaching the destination and returning to the work unit for reimbursement, the ticket is exchanged back into renminbi.

The national standard GB2312 character set contains 6,763 level-1 and level-2 Chinese characters in total and is the basic character set for information processing. At present most machines hold only this many Chinese characters, which is clearly not enough: when entering personal names, classical texts or Japanese, some characters cannot be typed. The extended GBK character set has 20,902 Chinese characters. Special-purpose character libraries require 60,000 Chinese characters. The more characters there are, the harder encoded input becomes, so research on Chinese character input encoding still needs to continue.

Early Chinese character encoding

The earliest Chinese character coding dates back more than 100 years. In 1880, when China founded its Telegraph Bureau, a Dane designed a 4-digit telegraph code for transmitting Chinese characters. The telegraph code replaces each Chinese character with 4 digits, assigned in the order in which the characters are listed in the dictionary. It has no connection with pronunciation, strokes or components; it is an arbitrary code that can only be learned by rote, and it is inefficient. Even so, a skilled telegraph operator can transmit 130 Chinese characters per minute, and some computers still offer telegraph-code input of Chinese characters.

In 1926 the Japanese invented a "universal Chinese typing keyboard": a 70 × 35 board held more than 2,000 Chinese characters and symbols, which were entered by pressing keys. Toshiba later changed this to stroke input. The advantage of the large keyboard is that it is intuitive; the disadvantages are that it is slow and the equipment is bulky. There was also the main-key plus auxiliary-key keyboard, designed and manufactured in both Japan and the United States. It carried about 5,000 characters arranged on 168 main keys, 30 characters per key, with 30 auxiliary keys corresponding to the 30 characters on each main key. A proficient operator could enter 2,000 characters per hour. These methods are no longer used.

The four-corner code appeared in 1928. The scheme has many duplicate codes: among 8,877 Chinese characters, codes that stand for two or more characters account for 88%. In 1959, when the Soviet Academy of Sciences was developing a Chinese-Russian translation machine, the original 10 stroke shapes were increased to 15. Each Chinese character used 5 digits: the first 4 represent the stroke shapes at the character's four corners, and the last distinguishes duplicates, being 0 for a character with no duplicate code and 1, 2, 3 and so on for characters that share a code. In 1963 IBM in the United States adopted Lin Yutang's top-and-bottom shape lookup method, coding each character by the stroke shape of its upper left corner and of its lower right corner. In 1970 Jiang De improved on these schemes, fixing 34 "first stroke" codes and 22 "last stroke" codes, with duplicates chosen from a displayed list. This first-and-last code is easy to learn but slow.
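The fifth-digit rule described above (0 for a unique code, 1, 2, 3 and so on for characters sharing a code) can be sketched as follows. The characters and 4-digit codes in the sample are placeholders invented for the illustration, not real four-corner codes, and the function name is an assumption.

```python
from collections import defaultdict


def add_disambiguation_digit(entries):
    """Append a fifth digit to 4-digit codes.

    entries: list of (character, four_digit_code) pairs in dictionary order.
    Returns a dict mapping each character to its 5-digit code:
    '0' is appended when the 4-digit code is unique, otherwise the
    characters sharing a code are numbered 1, 2, 3, ... in input order.
    A single digit only covers up to 9 characters per shared code.
    """
    groups = defaultdict(list)
    for char, code in entries:
        groups[code].append(char)

    result = {}
    for code, chars in groups.items():
        if len(chars) == 1:
            result[chars[0]] = code + "0"
        else:
            for n, char in enumerate(chars, start=1):
                result[char] = code + str(n)
    return result


# Placeholder characters with made-up codes, for illustration only.
sample = [("甲", "1234"), ("乙", "1234"), ("丙", "5678")]
print(add_disambiguation_digit(sample))
```

The extra digit makes every code unique at the cost of one more keystroke, which is the same trade-off that later schemes handle with identification letters.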

Shape-decomposition codes are generally used with medium-sized keyboards. In 1961 Du Duyou published "Root Research", which identified 504 roots from which all common characters can be composed for encoding. In the "three-corner numbering method" of Hu Lihe and others, the roots at three corners of each character are encoded; the roots number 300 and are merged into 99 groups laid out on a 100-key keyboard, so that each character is entered with three keystrokes. Wang An's company adopted this scheme. Yang Lianzhi proposed the "stroke letter" encoding method, which decomposes all Chinese characters into 21 strokes, each corresponding to a Latin letter, entered in the character's stroke order. The codes are of unequal length, but they can be entered on a standard keyboard with 26 letter keys. Li Jinzhi's stroke-shape code is similar to Wang Yongmin's five-stroke (Wubi) encoding.

Of the more than 100 Chinese character coding schemes produced from the 1960s to the 1970s, a very large proportion were stroke-decomposition codes. Chinese scientists in Hong Kong and Taiwan in particular mostly favored such schemes. The reason is that many people, influenced by dialect, have a poor command of Putonghua and cannot pronounce some characters correctly. Yet years of research and practice finally led to the conclusion that writing the strokes of ordinary Chinese characters correctly is much harder than pronouncing them correctly. Some researchers therefore turned to pinyin coding.

The middle period of Chinese character encoding

Stroke coding schemes disagree on how many basic strokes Chinese characters have: 4, 5, 6, 8, 10 to 21, 24 or 33 kinds. In root coding schemes the number of roots ranges from 100 or 200 to 400 or 500, and the splitting rules are hard to master. In 1958 China promulgated the Hanyu Pinyin scheme, which gives every character a prescribed pronunciation, and this is very favorable for coding.

Zhou Youguang's "Telegraph Pinyinization" was published in 1965. His code for a Chinese character is composed of three parts: (1) the pinyin syllable, spelled as in the Xinhua Dictionary; (2) a tone letter placed after the syllable, X for the second tone (yangping), V for the third tone (shangsheng) and H for the fourth tone (qusheng); (3) a radical letter: the radicals of Chinese characters are divided into 20 groups, each represented by 1 letter, so that, for example, the radicals romanized as "li, li, lao, hao, lu, long" form one group represented by L. The code of "站" (station) is thus ZHANHL (ZHAN is the pinyin, H the fourth tone, L the radical letter for "立", stand). Most characters take 1 radical letter and a few take 2. This is a full-spelling pinyin code with good readability and no duplicate codes within 10,000 characters.

There is also a plain full-spelling code that has only the initial and the final, with no tone letter and no radical letter. It has more homophones, which must be displayed for selection. The initials and finals can also be compressed, as in the double-pinyin compression scheme in current use: A-EN, B-IA/UA, C-UN, D-AO, F-AN, G-ANG, H-IANG/UNG, I-SH, J-IAN, K-IAO, L-IN, M-IE, N-IU, O-UO, P-OU, Q-ER, R-EN, S-AI, T-ENG, U-CH, V-ZH/ü, W-EI, X-UAI, Y-ONG/IONG, Z-UN, ;-ING. For example, qing ("please") is Q;, nin ("you") is NL, xin is XL and shang ("reward") is IG. Pinyin spellings of Chinese characters average 2.97 letters and run up to 6 letters; this encoding reduces them all to a uniform 2 letters, because two-letter initials and compound finals are each replaced by 1 letter. Since the initials and finals of Hanyu Pinyin have an ideal mathematical structure, double pinyin is regular and fast to type, and it has attracted wide attention; it is now installed and used quite widely. The earliest advocates of such double-pinyin schemes in China were Li Jinxi, Tang Yi and others. The pinyin codes designed by Yu Li Wen, Guo Shuzhen, Li Jin and others are based on initials and finals, and the Natural Code, which caused a sensation, uses a similar double-pinyin scheme.
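The double-pinyin compression above can be sketched in code. The sketch uses only a subset of the key assignments listed in the text, handles only syllables that begin with an initial consonant, and the names `INITIAL_KEYS`, `FINAL_KEYS` and `shuangpin` are invented for the example.

```python
# Subset of the key assignments listed in the text (not the full table).
INITIAL_KEYS = {"sh": "i", "ch": "u", "zh": "v"}
FINAL_KEYS = {
    "ing": ";", "in": "l", "ang": "g", "ao": "d", "an": "f",
    "ai": "s", "ou": "p", "eng": "t", "ei": "w", "uo": "o",
    "ie": "m", "iu": "n", "ian": "j", "iao": "k",
}


def shuangpin(syllable: str) -> str:
    """Compress one pinyin syllable to two keys: one for the initial, one for the final.

    Assumes the syllable starts with an initial consonant; zero-initial
    syllables and finals outside FINAL_KEYS are not handled in this sketch.
    """
    s = syllable.lower()
    if s[:2] in INITIAL_KEYS:            # two-letter initial: sh, ch, zh
        first, final = INITIAL_KEYS[s[:2]], s[2:]
    else:                                # single-letter initial keeps its own key
        first, final = s[0], s[1:]
    # A single-vowel final keeps its own key; compound finals are looked up.
    second = final if len(final) == 1 else FINAL_KEYS[final]
    return first + second


for word in ["qing", "nin", "xin", "shang", "hao"]:
    print(word, "->", shuangpin(word).upper())
```

Every syllable comes out as exactly two keystrokes, which is the point of the scheme; separating the homophones that share those two keys is left to extra letters or to selection.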

Chinese has more than 400 syllables. Taking the 6,763 Chinese characters as the base, each syllable has about 17 homophones on average (6,763 ÷ 400); once tone is added there are more than 1,200 syllables, and each still has about 6 homophones. The key to a double-pinyin scheme is therefore how to choose the third and fourth letters so as to separate the homophones properly and reduce duplicate codes. Guo Shuzhen and others grouped 189 radicals into 23 classes, each corresponding to 1 letter, and also divided the radicals into five major categories: nature, biology, physiology, daily life, and the rest. The third letter is that of the first radical; the fourth letter is read from a cross table of major category against stroke. In the cross table the five stroke types (horizontal, vertical, left-falling, dot and fold) form the rows and the radical categories the columns, giving 25 letters. For example, "怕" (afraid) is PAXM, where M is the letter at the intersection of the first stroke of "白" and the physiology category. Because the rules were troublesome, the scheme was improved into a "sound-sound" code in which "怕" is PAXB, X and B being the first letters of the pronunciations of the heart radical (忄) and "白". This is a typical phonetic code.

The peak period of Chinese character encoding

With the convening of the national science conference and the introduction and spread of microcomputer technology, Chinese character encoding entered its peak period in the 1980s. New schemes kept emerging, more than 700 in all. The comprehensive performance of the best schemes was far better than in the early days, and they were put into practical use. In March 1986 the relevant state departments held a national evaluation of Chinese character coding schemes. Thirty-three schemes took part, and 11 were rated category A: mass code, 50-character code, sound digital, macro-graphic code, hierarchical four-corner code, top three, one yard, tie, pencil code, joint 45-3 yards, CK code and JDL no-interval code. The average input speed of the 11 category A schemes was 43.16 characters per minute, mostly by single-character input. In October 1987, at the "Chinese Cup" Chinese character input competition, operators reached 70 characters per minute on the assigned text and 100 characters per minute on self-selected text. In 1990, at the cross-strait Chinese computer performance competition, professional operators reached 147.8 characters per minute with single-character input and 203.3 characters per minute with word input. Two trends can be seen from these two competitions: (1) at first shape codes won, and later phonetic codes took the lead; (2) word input became dominant.

Word input is an important sign that encoding had entered its peak period. For two-character words, a shape code takes root codes from each character and a phonetic code takes the initial and final codes of each character. For three-character words, a shape code takes the first root code of the first and second characters and further root codes of the third, and a phonetic code takes the initials of the first two characters and the initial and final of the third. For words of four or more characters, a shape code takes the first root code of the first, second, third and last characters, and a phonetic code takes the first letter of the same four characters.

Word input reduces the number of keystrokes and raises input speed. With a shape code, word input still requires mastering the root table and all the splitting rules; with a phonetic code, one need only know the initials and finals of the characters and can type from the sound held in memory after reading. This is probably why phonetic codes took the lead.

In word input mode, a word or phrase of anywhere from 2 to 20 or more characters is usually entered with 4 letters. For example, "Office of the State Council" is GWYT (the first letters of the first, second, third and last characters). In this way an 8-character phrase averages 0.5 keystrokes per character, and a 16-character phrase averages 0.25 keystrokes per character. Some people, entering whole sentences from a specially chosen text, claim they can enter 500 characters per minute, but that holds only for the specific text and cannot be done for ordinary text. We cannot judge the quality of a code by such special performances.
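The four-letter phrase code and the keystroke averages can be checked with a short sketch. It assumes that "Office of the State Council" stands for the six-character phrase 国务院办公厅, pinyin guo wu yuan ban gong ting; the function names are invented for the illustration, and only the rule for phrases of four or more characters is implemented.

```python
def phrase_code(syllables):
    """Code for a phrase of four or more characters in a phonetic scheme:
    the first letters of the first, second, third and last characters."""
    syllables = list(syllables)
    if len(syllables) < 4:
        raise ValueError("this sketch only covers phrases of four or more characters")
    picked = syllables[:3] + [syllables[-1]]
    return "".join(s[0].upper() for s in picked)


def keys_per_character(n_chars: int, code_length: int = 4) -> float:
    """Average keystrokes per character when a whole phrase takes code_length keys."""
    return code_length / n_chars


# Assumed pinyin of the example phrase (one syllable per character).
state_council_office = "guo wu yuan ban gong ting".split()
print(phrase_code(state_council_office))

for n in (8, 16):
    print(n, "characters:", keys_per_character(n), "keys per character")
```

Because the code length stays fixed at 4 while the phrase grows, the keystrokes per character fall in proportion to phrase length, which is why long fixed phrases make the headline speeds look so high.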

In this period input is generally done on a standard keyboard, mostly using the 26 letter keys. Scheme designs also added functions such as high-frequency-first ordering and word association. These make learning easier for beginners, but skilled operators have little use for them.

The great progress of Chinese character input encoding in this period drove rapid development in computer typing, laser phototypesetting and other fields. The printing industry became able to replace traditional lead-type printing completely, with better quality, higher efficiency and a more comfortable and compact working environment. People exclaimed with excitement that the printing industry was bidding farewell to lead and fire and entering the era of light and electricity.

How to develop Chinese character encoding technology

There are more than 700 Chinese character coding schemes. Some people say that this is too many and too bewildering, and that one or two should be chosen as a norm or standard to end the confusion. Others say that for all the many codes, none is ideal. Given the high standards that Chinese information processing demands, the latter opinion has some merit.

What is the ideal code? The answer is that there is none. It is like asking what the ideal state is: there is no ideal state, for the state is always a tool of class oppression. Chinese character encoding is like a set of shackles on Chinese information processing; only when they are broken can it speak and dance freely. Current Chinese character input codes basically accomplish only the task of copying text. Shape codes in particular cannot even solve sorting, let alone information retrieval or machine translation. Professor Qian Wei once said that a good Chinese character encoding has yet to appear, and that a good encoding should be a script, or a quasi-script. Japan's Chinese character input began with the large keyboard and in the end settled on entering romanized Japanese and converting it to Japanese kanji; a converter from romanized Japanese to kanji and kana has been developed successfully. Japanese telegraphy adopted romanized (phonetic) Japanese early on. Some scholars predict that China will take the path of a Hanyu Pinyin script and develop a "Hanyu Pinyin script to square Chinese character translation machine" to solve the problem of Chinese information processing once and for all. This is of course a long-term plan and goal. In the near term, the individual and comprehensive indicators of Chinese character encoding will see new research, new improvement and new development.

Some people have proposed dividing Chinese character keyboard input into three phases: character processing, word processing and sentence processing. For technical reasons, early systems could not handle words (strictly speaking, multi-character units) and could only enter single characters, so a division into a character processing phase and a word processing phase is reasonable and matches the facts. In the character processing phase, 100% of input is by single character; in the word processing phase, about 70% to 80% is by multi-character word and the rest by single character. In fact, the "words" already include some phrases and short sentences. Sentence processing should mean that 70% to 80% of an article is entered as whole sentences; we have not seen such an encoding or such a computer, and none is likely to appear in the future. Scientific research shows that the span of the human eye is 8 letters. Even taking that as 8 Chinese characters, a 32-character sentence takes four fixations to read, and by the end the reader has forgotten the beginning, so whole-sentence input is clearly inconvenient. Reading a sentence one segment at a time is in effect still word input. If sentence processing could be established as a stage, then by the same logic the fourth stage would be paragraph processing, the fifth article processing and the sixth book (volume) processing, which is impossible. Future coding will basically develop on the basis of word input.

Intelligence and Chinese character encoding input

Today's electronic computers use large-scale and very-large-scale integrated circuit chips and belong to the fourth generation. In 1981 Japan announced its strategic plan for developing a fifth-generation computer, which caused a stir worldwide. The fifth-generation computer is to be able to store knowledge, analyze, judge and reason, to process language, graphics and images, and to possess a variety of intelligent skills; in a word, it is a computer with artificial intelligence (AI). No such intelligent machine has yet been developed, and to claim otherwise would be unscientific. Many people hold that a computer can only simulate human intelligence under human control: what is put in determines what comes out, and nothing more.

In numerical computation every digit and arithmetic symbol has a single, unique meaning, so even after hundreds of millions of operations the result is accurate. If the plus sign and the minus sign were the same symbol, the result of even a few operations would be hard to determine. The same is true of Chinese character information processing. If each character or word corresponded to a single, unique code, the results of input and output would also be accurate. In most current schemes, however, single characters and phrases have duplicate codes, so the results of input and output are not fully accurate. In text processing, semantic information is carried by sound, shape and so on; if the sound and shape information is itself incomplete, it is hard to make up for it with semantic information, and the intelligence of the computer cannot make up for it either.

The earliest pinyin input method works by typing the initial and final and displaying a row of homophones (ordered by position or frequency). In double pinyin, for example, typing HD (hao) displays a numbered row of candidates beginning with "1 good", from which the user selects. Word association can be seen as an extension of this method: after the character for "good" is entered, a row of characters that form words with it is displayed (1 turn, 2 more, 3 to, 4 sense, 5 Han, 6 people, 7 things, ...), and selecting 6 gives "good people". Intelligence of this kind must be called very primitive. More recently there is "intelligent word association processing": "century" and "reagent", for example, share a code, yet the system automatically produces "20th century" and "chemical reagents". This is handling by linguistic context, and when the context is unclear such processing runs into difficulty, as when "new reagent" is wanted rather than "new century"; in general this has to be handled manually. The same code is shared by more than 15 other words, such as "actual", so it is impossible for association processing to solve the duplicate-code problem completely.
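The context-based choice among duplicate codes can be sketched with two small tables. The sketch assumes that "century", "reagent" and "actual" are 世纪, 试剂 and 实际, all typed as shiji without tone; the tables, the context words and the function name `choose` are hypothetical.

```python
# Hypothetical tables: one toneless code shared by several words,
# and a few preceding-word contexts that settle the choice.
HOMOPHONES = {"shiji": ["世纪", "试剂", "实际"]}   # century, reagent, actual
CONTEXT = {
    ("二十", "shiji"): "世纪",   # "20th" + shiji -> century
    ("化学", "shiji"): "试剂",   # "chemical" + shiji -> reagent
}


def choose(previous_word: str, code: str):
    """Pick a word for a duplicate code from the preceding word.

    Returns the chosen word when the context decides it; otherwise
    returns the full candidate list, which the user must pick from.
    """
    candidates = HOMOPHONES[code]
    preferred = CONTEXT.get((previous_word, code))
    if preferred in candidates:
        return preferred
    return candidates


print(choose("化学", "shiji"))  # context decides
print(choose("新", "shiji"))    # "new" fits both century and reagent: manual choice
```

The second call shows the limit named in the text: when the context does not discriminate, the system falls back to a candidate list, so association alone cannot remove duplicate codes.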

To resolve this problem completely, the codes of words such as "century", "reagent" and "actual" must first be made distinct, by adding identifiers for word meaning or shape information, or characters indicating tone, and the code length grows accordingly. The higher the intelligence required, the more strictly one-to-one the encoding must be, and in the end the encoding evolves into a script.

Computer reading: pattern recognition of text

Can a computer read text? It is fair to say that it can "read". In the 1950s experiments abroad on recognizing special typefaces met with initial success. By the end of the 1960s practical machines that recognize handwritten Arabic numerals had been built and commercialized, and research turned to handwritten Latin letters and printed Chinese characters. By the early 1980s more than 3,000 recognition machines were in use, and more than 1,000 in Japan. Input speed is generally 2,000 to 3,000 characters per second, and the highest is said to exceed 14,400 characters; the error rate and rejection rate are low, and recognition is 100 times faster than the human eye. These are machines that recognize Latin letters.

An optical character recognition machine consists mainly of three parts. First, the acquisition device for the text pattern: a paper feed mechanism delivers the material to be recognized to a photoelectric converter, which scans the characters and converts them into an analog video signal; this is then converted into a binary dot-matrix signal according to a given threshold. These steps resemble the signal processing of facsimile and black-and-white television. Second, the analysis device for the text pattern: the character signal from the previous unit is preprocessed to remove noise and compress information; the geometric features of each character are then extracted from its skeleton, endpoints and nodes, and the feature values are arranged from top to bottom and left to right, encoded, and fed to the next device. Third, the discrimination device for the text pattern: a standard pattern is prepared in advance for every character in the character set and stored as a dot-matrix code. The feature code from the previous unit is compared with the stored standard patterns, classified step by step from coarse to fine, and a final decision is reached. When the characters are numerous and differ only slightly, errors come easily; handwritten 1 and 7, or 3 and 5, are easily confused. Conversely, when the characters are few, differ greatly and have distinct features, they are easier to recognize. The recognition rate for printed Latin letters and Arabic numerals has reached 99.99%, a figure achieved in the United States and Japan, and China is close to it.
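The thresholding and comparison steps can be illustrated with a very small sketch. It is deliberately simpler than the machine described: instead of extracting skeleton, endpoint and node features and classifying from coarse to fine, it compares the binary dot matrix directly with stored templates by counting differing dots (Hamming distance). The 3 × 3 templates, the sample scan and the function names are invented for the illustration.

```python
def binarize(gray, threshold=128):
    """Turn grey levels (0 = black ink, 255 = white paper) into a dot matrix:
    1 where the pixel is darker than the threshold, 0 otherwise."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def hamming(a, b):
    """Number of positions where two equally sized dot matrices differ."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def classify(dots, templates):
    """Return the label of the stored template closest to the dot matrix."""
    return min(templates, key=lambda label: hamming(dots, templates[label]))


# Invented 3 x 3 standard patterns for two easily confused digits.
TEMPLATES = {
    "1": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "7": [[1, 1, 1],
          [0, 0, 1],
          [0, 0, 1]],
}

# A noisy scan of a "7": the last stroke is too faint to pass the threshold.
scan = [[10, 20, 30],
        [250, 240, 15],
        [200, 180, 140]]

dots = binarize(scan)
print(dots)
print(classify(dots, TEMPLATES))
```

With only two coarse templates the faint scan is still matched correctly, but the example also hints at why large character sets are hard: the more templates there are and the less they differ, the smaller the margin between the best and second-best match.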

Japan and the United States were early to begin developing Chinese character recognition machines. In the 1970s and 1980s Japan ran a "printed Chinese character recognition" program and proposed various schemes, but did not succeed. The researchers concluded that the difficulty is not so much one of principle as one of technology. A 16 × 16 grid is enough for the Latin alphabet, whereas even a 60 × 60 grid is not enough for square Chinese characters. Japan commonly uses about 2,000 Chinese characters, 75 times the 26 English letters, and to tell so many characters apart the amount of information to be processed is more than 500 times that for English letters. For the 6,763 Chinese characters of China's national standard, recognition is some 2,000 times as difficult as for English. At present the handwriting recognition systems on the market can generally recognize only isolated Chinese characters, with an effective recognition rate of only 94%, and even systems for printed Chinese characters reach a recognition rate of only about 98%.

Overview of machine translation

Machine translation means letting a machine take the place of a human translator. The process can be divided into four steps. 1. Source language input: the source text, in an alphabetic script, is entered on the computer keyboard or by optical recognition. 2. Recognition and analysis of the source language: the computer identifies each word (the minimal unit of meaning) by the spaces between words, then identifies syntax and semantics from punctuation and certain feature words. It then consults the dictionary, syntax tables and semantic tables stored in the machine, passes the processed information to the "rule system", and works from the surface structure down to the deep structure. 3. Generation and synthesis of the target language: the first two processes are reversed, moving from the deep structure back to the surface and generating the surface structure of the target language level by level. 4. Target language output: when translation processing inside the computer is complete, the result is a series of binary digital signals, which are then converted into text. If both languages use the Latin alphabet, input and output can share one terminal. If the two languages use different alphabets, a second set of terminal equipment is needed.
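The outline of these steps can be shown with a toy sketch. It is a word-for-word dictionary lookup only: it mirrors the input, word segmentation by spaces, dictionary lookup and output stages, and leaves out the syntactic and semantic analysis and the deep-structure stage entirely. The lexicon and the function name `translate` are hypothetical.

```python
# Hypothetical lexicon: space-separated pinyin words -> English glosses.
LEXICON = {"wo": "I", "ai": "love", "zhongguo": "China"}


def translate(source: str) -> str:
    """Toy word-for-word translation of a space-separated pinyin sentence."""
    words = source.lower().split()        # step 2: words are delimited by spaces
    glosses = [LEXICON.get(word, f"<{word}>") for word in words]  # dictionary lookup
    return " ".join(glosses)              # step 4: output as text


print(translate("wo ai Zhongguo"))
print(translate("wo ai yuyan"))  # a word missing from the lexicon is passed through, marked
```

The sketch depends on the source being written with spaces between words, which is the same reason the text gives for preferring an alphabetic, word-segmented input in machine translation.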

In the field of machine translation, the biggest headache is Chinese written in Chinese characters. Even if great effort were spent on developing terminal equipment for translating Chinese characters automatically, the efficiency of the machine would be very low, and the high cost of the equipment would cancel out the advantages of automatic translation. Experts believe that the most realistic and reliable way to translate Chinese by machine is to use a Hanyu Pinyin script with the Latin alphabet.

Machine translation began in the 1950s and by the 1970s had entered its second generation, which takes formal linguistics as its guiding theory and the sentence as its processing unit, translating sentence by sentence. In 1971 the output quality of a machine translation system into French proved far from ideal. According to the statistics, 50% of the translated sentences could be understood, 28% could barely be understood and 22% could not be understood. Coming after huge investment, this result drew severe criticism, and machine translation research entered a low ebb for a time; some people argued that fully automatic high-quality translation was unattainable in the near future. Research continued nonetheless. Scientists came to believe that machine translation must be linked with artificial intelligence, and began developing third-generation machine translation, which is based on intelligent simulation, takes natural language understanding as its guiding theory and semantic analysis as the basis of conversion, and uses the paragraph as its processing unit, translating each sentence in the context of its paragraph. This has attracted wide attention. In the United States many experts expect artificial intelligence to improve the quality of machine translation in the near future.

Machine-assisted translation is also called semi-automatic translation. A dictionary is first stored in the computer; the computer retrieves the vocabulary at a relatively shallow level, while people handle syntax, semantics and rhetoric at a deeper level. This improves the speed and quality of translation and reduces its cost, which gives it greater practical value.

When reposting, please cite the original address: https://www.9cbs.com/read-19977.html
