Full-text search system for machine translation

xiaoxiao2021-03-05  41

Abstract: This article describes the design and implementation of a full-text retrieval system for machine translation. In addition to the functions of a general full-text retrieval system, such as an inverted document base, Boolean retrieval, positional retrieval, and relevance ranking of results, the system provides multi-level retrieval and cross-language retrieval for machine translation. To handle fuzzy retrieval of chapters and paragraphs in machine translation, the article proposes a two-stage method: coarse retrieval, which narrows the candidate set, followed by refined retrieval. By analyzing the characteristics of the documents and choosing a suitable retrieval-expression model, it solves the problem of relevance judgment in machine translation retrieval. Sentence similarity is computed with dynamic programming.

Keywords: machine translation, full text search, paragraph search, chapter search

I. Introduction

With deeper research in linguistics and the development of computer technology, machine translation has advanced rapidly and a number of practical machine translation systems have appeared. The growth of the Internet in particular has given rise to online machine translation systems. Machine translation, however, is a complex interdisciplinary field involving linguistics, computational mathematics, computer technology, and cognitive science. Because of the intrinsic complexity of language and the limits of current artificial intelligence, the quality of machine translation still falls short of practical needs and often requires manual post-editing. Translation speed also falls short of what users expect, because translation relies on dictionaries and rules to perform a large amount of grammatical and semantic analysis. Improving the correctness of machine translation is therefore an extremely arduous task.

Translation requests are often repeated, and web pages on the Internet in particular inherit much of their content from one another. We therefore propose building a base of texts that have already been translated and post-edited, or whose translations are otherwise of high quality, so that existing translation experience can be reused to keep improving the speed and quality of machine translation. Starting from the design principles of general full-text retrieval systems and the characteristics of machine translation systems, the authors designed and implemented a full-text retrieval system for machine translation. The system not only offers complete full-text retrieval functions, but also provides multi-level retrieval and cross-language retrieval for machine translation.

II. Function and overall structure

The system provides information retrieval both for users and for the machine translation system. User-oriented retrieval offers the basic functions of an ordinary full-text retrieval system; users can make full use of the collected bilingual information, and cross-language retrieval is supported. Machine-translation-oriented retrieval assists the machine translation system. If a user submits to the machine translation system a document (or paragraph) similar to one that has already been translated, the system can directly fetch the translation stored in the bilingual information base, which increases processing speed. In addition, since the translations stored in the information base have been post-edited to varying degrees, the result returned to the user is more accurate.

The design and implementation of the system follow these main principles: (1) inherit the functions of a general full-text retrieval system and, on that basis, provide a relevance feedback mechanism and add the retrieval functions needed by the machine translation system; (2) keep the model open, supporting extension to multiple languages; (3) keep the system easy to maintain, with consistent index structures for Chinese and English; (4) meet the demands of translation in a network environment, namely strict real-time requirements and large volumes of information.

Built on an inverted document base, the system uses a Boolean retrieval model for user queries and provides fast, accurate results for machine translation retrieval. The system structure is shown below:

[Figure: system structure]

The functions of each module are as follows:

* Document preprocessing module

The preprocessing module filters the formats of non-plain-text documents from different sources. The system stores both the original document and the corresponding plain-text version, so that users can retrieve text that exists in different formats.

* Index Module

The index module analyzes the documents in the document base and builds the index information on which retrieval depends. Its main tasks are: creating inverted records of document feature information; establishing the correspondence between bilingual documents and between their internal paragraphs; and analyzing the text to extract external features.

* User-oriented retrieval module

According to the user's query, this module reads the feature records of the document information and returns the required information to the user. Its main tasks include retrieval-expression processing, retrieval processing, query expansion, relevance ranking, and relevance feedback.

User retrieval is also the basis of machine translation retrieval. The system first parses the input retrieval expression and checks it for errors, then searches for each individual search term, then combines the results according to the operators in the retrieval expression, and finally obtains the search results and outputs them in ranked order.

* Machine translation retrieval module

In response to chapter and paragraph queries from the machine translation system, this module retrieves matching chapters or paragraphs and their translations from the bilingual information base, within the degree of fuzziness the system allows, or reports that the queried object does not exist in the bilingual base. This is the core module of the system.

III. Retrieval for machine translation

For chapters and paragraphs, exact matching is easy to implement but has a low probability of success. How to find "similar" chapters and paragraphs quickly and accurately, as machine translation requires, is the focus and key problem of the whole retrieval task. This article solves it step by step. For chapter retrieval, external features are matched first; if a match exists, refined retrieval is applied directly to the matching results. If no match exists, topic words are extracted from the chapter and assembled into a retrieval expression for coarse retrieval, and refined retrieval with fuzzy matching is then carried out on the coarse results to obtain the final result. Paragraph retrieval goes directly through coarse retrieval followed by refined retrieval.

3.1 Coarse retrieval

Coarse retrieval first extracts the topic word (or set of topic words) that characterizes the chapter or paragraph to be retrieved. It then uses a retrieval expression built from these topic words to look up related chapters and paragraphs in the inverted file, quickly narrowing the range that refined retrieval has to process.

3.1.1 Topic word extraction

Network information retrieval has strict real-time requirements, and topic words are extracted here in order to construct the retrieval expression and speed up retrieval. The extraction therefore cannot rely on detailed syntactic and semantic analysis, and the inverse word-frequency method is not suitable either, so the system uses the following statistical method. When determining topic words, the system gives priority to these features: 1) keywords that appear in a title or subtitle sentence, weighted by heading level, with higher-level headings receiving greater weight (heading levels are taken from the hierarchical analysis performed for machine translation); 2) words in the abstract, the keyword list, and similar sets of key words; 3) keywords that appear in the opening and closing parts of a section; 4) other things being equal, words with higher frequency and greater length receive larger weights.

The weighting function for topic words is:

[formula missing in the source]

Here PW is the accumulated position weight; FREQ is the frequency of the word; LEN is the word length and LMIN is the lower limit on word length; C is a constant. In Chinese, longer words carry more weight, so C can be larger; in English the differences in length matter less than in Chinese, so C is smaller.

PW is initialized to 0 and updated for each occurrence of the keyword in the cases listed above: 1) in the title, PW = ( (value garbled in the source); in a subtitle, PW = PW + 10 * i (i is the heading level); 2) PW = PW + 5; 3) PW = PW + 1. When the keyword occurs in any other sentence, PW = PW + 1 / (total number of words in the sentence).
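
The accumulation of the position weight PW can be sketched in Python as follows. This is only an illustration under stated assumptions: the update rules are read as additions (the plus signs are lost in the source), the title case is left out because its value is garbled, and the names `Occurrence` and `position_weight` and the position labels are hypothetical.

```python
from dataclasses import dataclass

# Labels for where a keyword occurs; the names are illustrative only.
SUBTITLE, SUMMARY, EDGE, OTHER = "subtitle", "summary", "edge", "other"


@dataclass
class Occurrence:
    kind: str                 # one of the labels above
    level: int = 1            # heading level i, used only for SUBTITLE
    sentence_length: int = 1  # words in the sentence, used only for OTHER


def position_weight(occurrences):
    """Accumulate PW over all occurrences of a single keyword."""
    pw = 0.0  # PW starts at 0
    for occ in occurrences:
        if occ.kind == SUBTITLE:
            pw += 10 * occ.level           # case 1: PW = PW + 10 * i
        elif occ.kind == SUMMARY:
            pw += 5                        # case 2: abstract or keyword list
        elif occ.kind == EDGE:
            pw += 1                        # case 3
        else:
            pw += 1 / occ.sentence_length  # any other sentence
    return pw
```

For example, a keyword that occurs once in a level-2 subtitle, once in the abstract, and once in an ordinary four-word sentence accumulates 20 + 5 + 0.25.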

3.1.2 Relevance Retrieval

Since this full-text retrieval system supports positional retrieval within the same paragraph, constructing the retrieval expression for a paragraph is relatively simple: place the same-paragraph positional operator between the extracted topic words, then use this expression to look up the related paragraphs in the inverted file.
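
A minimal Python sketch of same-paragraph retrieval over an inverted file is given below. It assumes that postings record (document, paragraph) pairs and that the same-paragraph operator amounts to intersecting those postings; the function names and the data layout are illustrative and are not taken from the system described here.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of (doc_id, paragraph_no) pairs where it occurs.

    `documents` maps a document id to a list of paragraphs,
    and each paragraph is a list of words.
    """
    index = defaultdict(set)
    for doc_id, paragraphs in documents.items():
        for para_no, words in enumerate(paragraphs):
            for word in words:
                index[word].add((doc_id, para_no))
    return index


def same_paragraph(index, topic_words):
    """Return the (doc_id, paragraph_no) pairs that contain every topic word."""
    postings = [index.get(word, set()) for word in topic_words]
    if not postings:
        return set()
    return set.intersection(*postings)
```

Because only paragraphs containing all topic words survive the intersection, the candidate set handed to refined retrieval stays small.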

Chapter retrieval is a matter of relevance judgment. At present, most of the systems that perform well at relevance judgment use the vector space model, such as the SMART experimental system developed under Salton, but this retrieval model has not been applied in practical systems. Some systems join all the topic words with the OR operator, retrieve from the inverted file to narrow the range, and then build vectors for all documents within that range to determine their degree of relevance to the query document. We consider this approach inefficient: its response time is too long for the real-time requirements of our system.

The coarse retrieval of this system uses weighted retrieval expressions. This avoids the Boolean model's inability to express how important a feature word is, and it is easy to implement on top of the chosen model. Each topic word in the retrieval expression is given a weight, and whether a document satisfies the retrieval condition is decided by the degree of relevance between the document and the query.

The similarity measure is:

[formula missing in the source]

Here TF * IDF is used to weight document keywords: M is the total number of documents in the database, n_t is the number of documents that contain word t, and f_d,t is the frequency of t in document d. The document length is obtained by computing the norm of the document vector.
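
Since the formula itself is not available, the following Python sketch only shows one common form of TF * IDF weighted similarity built from the quantities named above. The weight f_d,t * log(M / n_t), the normalization by the vector norm, and the function name `tfidf_similarity` are assumptions, not the formula used by the system.

```python
import math
from collections import Counter


def tfidf_similarity(query_weights, doc_words, doc_freq, total_docs):
    """Score one document against a weighted query.

    query_weights: dict mapping topic word -> weight in the retrieval expression
    doc_words:     list of words in the document
    doc_freq:      dict mapping word -> number of documents containing it (n_t)
    total_docs:    total number of documents in the database (M)
    """
    term_freq = Counter(doc_words)  # f_d,t for every word t of the document
    weights = {
        term: freq * math.log(total_docs / doc_freq[term])
        for term, freq in term_freq.items()
    }
    # Document length taken as the Euclidean norm of its weight vector.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return 0.0
    score = sum(q * weights.get(term, 0.0) for term, q in query_weights.items())
    return score / norm
```

A word that occurs in every document gets weight log(1) = 0 and therefore contributes nothing to the score.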

3.2 Refined retrieval

Refined retrieval performs further matching within the candidate set produced by coarse retrieval and yields the final result.

The system compares the most important features first, so that non-matching documents are eliminated as early as possible and the amount of later processing is reduced.

A chapter to be retrieved is first decomposed into paragraphs, which are then handled by paragraph-level refined retrieval. Paragraph-level refined retrieval allows a certain amount of fuzziness. When the structural features of two paragraphs roughly match, the paragraphs are divided into sentences and sentence similarity is computed, which finally determines whether the paragraphs match.

The system uses dynamic programming to compute sentence similarity. The words of the sentence to be translated are laid out along the I axis of an I-J plane, and the words of the example sentence along the J axis; the value at grid point (i, j) is the similarity between word i and word j. The similarity of two sentences corresponds to a path from the origin to (i, j), and its value is the sum of the matching points on the path. Computing sentence similarity thus becomes the problem of finding the optimal path in the I-J plane, the one that maximizes the similarity of the two sentences. For the sake of speed and accuracy, similarity queries currently perform no synonym expansion or similar operations, so the word similarity d(i_k, j_k) at the node of stage k can be defined simply as 1 if word i_k and word j_k are identical, and 0 otherwise. The state transition equation is (i_k, j_k) = u_k(i_{k-1}, j_{k-1}).

The path that matches two similar sentences is subject to the following constraints: (1) Monotonicity: the path must proceed from the starting point to the right or upward. (2) Global path constraint: a diagonal path is preferred over a vertical or horizontal one. (3) Local path constraint: the successor of node (i_k, j_k) can only be one of (i_k + 1, j_k), (i_k, j_k + 1), or (i_k + 1, j_k + 1), so the path never turns at a right angle.
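
The dynamic-programming matching described above can be sketched in Python as follows. This is a sketch under assumptions: the recursion takes the best of the three allowed predecessor steps, word similarity is exact equality (1 or 0), and the path score is divided by n, the number of words in the sentence to be translated. The function name `sentence_similarity` is illustrative.

```python
def sentence_similarity(source_words, example_words):
    """Similarity of two sentences as the best monotone path score, divided by n."""
    n, m = len(source_words), len(example_words)
    if n == 0:
        return 0.0
    # best[i][j]: largest number of matching points on any monotone path
    # from the origin to grid point (i, j).
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if source_words[i - 1] == example_words[j - 1] else 0
            best[i][j] = max(
                best[i - 1][j],              # step along the I axis
                best[i][j - 1],              # step along the J axis
                best[i - 1][j - 1] + match,  # diagonal step through (i, j)
            )
    return best[n][m] / n
```

With exact word matching, the best path score equals the length of the longest common subsequence of the two word lists, so a sentence compared with itself scores 1.0 and two sentences with no words in common score 0.0.
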
The source gives three formulas here that are missing from this copy: the similarity S of the full path from the origin to (i, j), the optimal recursion of the dynamic program, and the definition of the similarity between sentences, in which n is the number of words in the sentence to be translated. The example with the greatest similarity is returned as the search result; if no similarity exceeds the threshold, a query-failed flag is returned. In this way the relevance of paragraphs can be defined from the relevance of their sentences, so that the required paragraphs, and even chapters, can be retrieved.

3.3 Performance analysis of coarse retrieval

The principle of weighted retrieval is first illustrated with an example. To query for literature on network machine translation within natural language processing, the weighted retrieval expression is: natural language processing (1), machine translation (3), network (2). A document that contains all three terms has weight 1 + 3 + 2 = 6; a document that contains "natural language processing" and "machine translation" has weight 1 + 3 = 4; and so on. If the lower threshold is set so that a weight of 4 qualifies but 3 does not, the hits are the documents that contain all three terms, or any two of them except the combination of "natural language processing" and "network".

We now compare this with the vector space model. In the vector space model, both documents and queries are represented as vectors. Suppose the document collection contains M distinct terms t1, t2, ..., tM; each document in the collection can then be represented by weights on some of these M terms. Any document can be represented as a weighted vector in the vector space: D = (t11, t12, ..., t1M). A query q can also be represented [text truncated in the source]
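
The weighted retrieval example above (natural language processing 1, machine translation 3, network 2) can be sketched in Python as follows. The threshold value of 4 is an assumption chosen to be consistent with the combinations the text lists as hits, and the function names and document ids are hypothetical.

```python
def weighted_score(query_weights, doc_terms):
    """Sum the weights of the query terms that occur in the document."""
    return sum(weight for term, weight in query_weights.items() if term in doc_terms)


def weighted_search(query_weights, documents, threshold):
    """Return (doc_id, score) pairs with score >= threshold, best first."""
    hits = []
    for doc_id, terms in documents.items():
        score = weighted_score(query_weights, terms)
        if score >= threshold:
            hits.append((doc_id, score))
    # Sort by descending score, then by document id for a stable order.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


query = {"natural language processing": 1, "machine translation": 3, "network": 2}
documents = {
    "d1": {"natural language processing", "machine translation", "network"},
    "d2": {"natural language processing", "machine translation"},
    "d3": {"machine translation", "network"},
    "d4": {"natural language processing", "network"},
}
hits = weighted_search(query, documents, threshold=4)
```

Unlike a plain Boolean AND, the weighted form still admits documents that miss a low-weight term, and the scores give a ranking at no extra cost.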

How search engines work behind the scenes

Author: Zhu Jie, Institute of Software, Chinese Academy of Sciences

The amount of data handled by computers keeps growing exponentially. As data and topics accumulate in data stores, how to retrieve all the information on a given topic quickly, effectively, and economically has become a hot issue. One way to address it is intelligent search technology. This article presents the structural framework of full-text retrieval technology based on natural language processing, which ultimately helps network users find information.

Information retrieval is concerned with the representation, storage, organization, and access of information; that is, with retrieving from an information base the information relevant to a user's query. Information retrieval began with manually built keyword indexes, progressed to automatic computer indexing with full-text retrieval, automatic summarization, and automatic classification, and is now moving toward natural language processing.

In this field, English information retrieval has developed faster. For example, the SMART information retrieval system developed by Salton et al. uses the vector space model to represent information content and applies natural language processing to retrieval, which greatly improves query accuracy. Chinese information retrieval systems have developed more slowly. At present, most Chinese retrieval systems are still keyword based, and many are still at the stage of indexing individual characters. They are inefficient and their retrieval accuracy is poor. The reason is that Chinese has characteristics of its own; for example, there are no spaces between Chinese words, so word segmentation is required before indexing.
On the other hand, syntactic analysis and semantic understanding are harder for Chinese than for English, which makes Chinese information retrieval more difficult.

Information retrieval models

The core of an information retrieval system is the search engine, which must filter out the information that meets the user's needs from a large and complex body of information. For example, if a user queries the information base for information about sales of computer network products and the result is about computer software products, the user's needs are not met. According to how the search engine finds relevant information, retrieval models can be divided into the Boolean logic model, the fuzzy logic model, the vector space model, and the probabilistic model.

The Boolean model is the simplest information retrieval model. The user submits a query based on the Boolean relationships among search terms in documents, and the search engine determines the result from an inverted file structure built in advance. The standard Boolean model is two-valued: a document is either relevant to the query or not, and the results are generally not ranked by relevance. For the query "computer", for example, a document is included in the result as long as the keyword "computer" appears in it.

To overcome the unordered results of the Boolean model, fuzzy logic is introduced into result processing: the retrieved documents are compared fuzzily against the user's query, and the results are ranked by relevance. For the query "computer", for example, documents in which "computer" appears more often are ranked higher.

Unlike the Boolean model, the vector space model represents both the user's query and the documents in the database as vectors in the space of search terms.
Query results are ranked by similarity in the vector space. The vector space model not only makes it easy to produce effective query results, but can also provide information about the relevant documents and classify the query results, giving users accurate positioning information.

The probabilistic model, based on Bayesian probability, differs from the Boolean and vector space models in that it uses relevance feedback to induce the matching function.

Although the retrieval models differ, their goal is the same: to provide the information the user needs according to the user's request. In practice, most retrieval systems combine several of these models to achieve the best retrieval performance.

Information retrieval system structure

The search engine is the core of an information retrieval system. The system also includes several other stages, such as document format preprocessing, information indexing, and user retrieval of the indexed information.

Information preprocessing

Information preprocessing works at two different levels: information format conversion and filtering.

Format conversion, as the mechanism for accessing different information sources, can reach data organized in different ways, such as various databases, different file systems, and web pages. Preprocessing can also filter documents in different formats, such as Microsoft Word, WPS, plain text, and HTML. This allows the search engine to retrieve not only plain-text documents but also document information kept in its original format.

Information indexing

Information indexing creates feature records of document information so that users can easily retrieve what they need. Building an index involves the following steps.

Word segmentation and lexical analysis

Words are the smallest units of information expression. Chinese differs from Western languages in that there is no delimiter (space) between the words of a sentence, so word segmentation is required. Chinese segmentation is ambiguous: the sentence meaning "make the user satisfied" can be segmented as "make / user / satisfied", but it can also be cut incorrectly into "use / household / satisfied". Contextual knowledge of various kinds is therefore needed to resolve segmentation ambiguity. In addition, lexical analysis is needed to identify the stems of words, so that the index can be built on word stems.

Part-of-speech tagging and related natural language processing

Part-of-speech tagging can be based on rules or on statistics (Markov chains). Statistical analysis methods based on the Markov-chain stochastic process have been shown to achieve high precision in part-of-speech tagging. On this basis, important phrase structures also need to be identified with grammar rules.

Building the retrieval index

The retrieval index records the information related to each search term in inverted form, as shown in Table 1.
The related information generally includes the search term, the position of the term within the documents, and the weight of the term. For example, the position information of the search term "computer" might be "the first word of the n-th paragraph of document D". At retrieval time, the user can therefore require that search term T1 and search term T2 occur in the same sentence or in the same paragraph. The guiding principle for building the index is that it should make document information easy to process.

Table 1: Typical inverted list of search terms

Term_1    Doc_i, Wt_i1; Doc_j, Wt_j1; ...; Doc_m, Wt_m1
Term_2    Doc_i, Wt_i2; Doc_k, Wt_k2; ...; Doc_n, Wt_n2
...
Term_s    Doc_j, Wt_js; Doc_m, Wt_ms; ...; Doc_p, Wt_ps

Query expansion

Information retrieval is evaluated by precision and recall. Precision is the ratio of the number of relevant documents in the result to the total number of documents in the result. Recall is the ratio of the number of relevant documents in the result to the total number of relevant documents in the information base. To improve recall, query expansion is needed: the search terms of the query are expanded using a synonym dictionary and a dictionary of semantic implications. In synonym expansion, two synonymous terms for "computer" denote the same concept, so a query for one also retrieves the other, and vice versa. Subject implication expansion means querying not only the search term itself but also the narrower concepts it contains. For example, the keyword "art" includes "film", "dance", "painting", and so on, and "film" in turn includes "feature film", "documentary", and so on. A query for "art" therefore also covers "film", "dance", "painting", and the narrower concepts beneath them.
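
The two kinds of query expansion described above can be sketched in Python as follows. The dictionaries are toy data; the synonym pair "computer" and "PC" is a hypothetical stand-in for the two synonymous terms, and the function name `expand_query` is illustrative.

```python
def expand_query(term, synonyms, narrower):
    """Return the term together with its synonyms and all narrower concepts.

    synonyms: dict mapping a term to a list of terms with the same meaning
    narrower: dict mapping a term to the list of concepts it directly includes
    """
    expanded = set()
    pending = [term]
    while pending:
        current = pending.pop()
        if current in expanded:
            continue  # already handled; also guards against cycles
        expanded.add(current)
        pending.extend(synonyms.get(current, []))
        pending.extend(narrower.get(current, []))
    return expanded


synonyms = {"computer": ["PC"], "PC": ["computer"]}
narrower = {
    "art": ["film", "dance", "painting"],
    "film": ["feature film", "documentary"],
}
art_terms = expand_query("art", synonyms, narrower)
```

Expansion raises recall at the cost of precision, since every added term can bring in documents the user did not have in mind.
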
To improve precision, relevance feedback based on the vector space model can be used. From the first query result, the user selects the documents or passages whose content matters, and the search engine queries again according to the features of the selected documents, thereby increasing precision.

Information classification and summarization

To help the user pick out the required information from the query result, the search engine can classify the documents returned to the user according to their content and generate a short summary for each document. The search engine classifies and summarizes the query results according to the statistical features of the search terms in the text.

When reposting, please cite the original address: https://www.9cbs.com/read-31989.html
