Full-text search system for machine translation

xiaoxiao2021-03-05 86

Abstract: This article describes the design and implementation of a full-text search system for machine translation. The system provides the functions of a general full-text retrieval system (an inverted-file storage structure, Boolean retrieval, positional retrieval, and relevance ranking) and, on that basis, adds multi-level retrieval and cross-language retrieval for machine translation. For fuzzy retrieval of documents and paragraphs on behalf of the translation system, the article proposes a two-step method: narrowing retrieval followed by refined retrieval. It addresses relevance judgment by analyzing the characteristics of the documents and choosing a suitable retrieval model, and it uses dynamic programming to compute sentence similarity.

Keywords: machine translation, full-text search, paragraph retrieval, document retrieval

1. Introduction

As the understanding of linguistics has deepened and computer technology has advanced, machine translation has developed rapidly, and a number of practical machine translation systems have appeared. The growth of the Internet in particular has given rise to online machine translation systems. Machine translation is an interdisciplinary field involving linguistics, computational mathematics, computer technology, and cognitive science. Because of the inherent complexity of language and the current limits of artificial intelligence, translation quality still falls short of practical needs, and the output often requires manual post-editing. Translation speed, which depends on dictionary and rule lookups, also lags behind what users expect. Improving the accuracy of machine translation directly is therefore extremely difficult.
Because much translation work is repeated, and web pages on the Internet in particular carry over a great deal of earlier content, we propose storing translations that were previously edited by hand or are otherwise of high quality, and reusing that translation experience to keep improving the speed and quality of machine translation. Following the design principles and rules of general full-text search systems, the authors designed and implemented a full-text search system for machine translation. The system not only offers complete full-text search functions but also provides multi-level retrieval and cross-language retrieval for machine translation.

2. Functions and Overall Structure

The system provides information retrieval both for users and for the machine translation system. User-oriented retrieval offers the basic functions of a general full-text search system, lets users make full use of the collected bilingual material, and supports cross-language retrieval. Machine-translation-oriented retrieval assists the translation system: when a user submits a translation request and a similar document (or paragraph) has already been translated, the system can fetch the translation directly from the bilingual information library, which speeds up processing. In addition, because the translations stored in the library have been manually edited to varying degrees, the result returned to the user is more accurate.
The design and implementation of the system follow these main principles:

(1) Inherit the functions of a general full-text search system and, in addition to a relevance feedback mechanism, add the retrieval functions used by the machine translation system.
(2) Keep the model open so that it can be extended to more languages.
(3) Keep the system easy to maintain by using a consistent index structure for Chinese and English.
(4) Meet the demands of translation in a network environment, where both the query volume and the amount of information are high.

Built on an inverted file, the system uses a Boolean retrieval model that matches users' query habits and returns fast, accurate results for both user retrieval and machine translation. The system structure is shown in a figure (not preserved in this copy).

The functions of each module are as follows:

* Document pre-processing module. Pre-processing filters the formats of documents that come from different sources. The system stores both the original document and the corresponding plain-text document, so users can retrieve text that exists in different formats.
* Index module. The index module analyzes the documents in the document library and builds the index information that retrieval depends on. Its main tasks are: creating the inverted file that records document feature information; establishing the correspondence between bilingual documents and their internal paragraphs; and analyzing the text to extract external features.

* User retrieval module. This module reads the feature records of the documents according to the user's query and finds the information the user needs. Its main tasks are query parsing, retrieval processing, query expansion, relevance ranking, and relevance feedback. User retrieval is also the basis for machine translation retrieval. The system first parses the input search expression and checks it for errors, then retrieves each search term on its own, combines the per-term results according to the operators in the expression, and finally sorts and outputs the results.
* Machine translation retrieval module. On behalf of the machine translation system, and at the level of fuzziness the system allows, this module searches the bilingual information library for the same or a similar document or paragraph together with its translation, or reports that the library contains no matching object. This is the core module of the system.

3. Retrieval for Machine Translation

In retrieval for machine translation, exact matches of whole documents and paragraphs are rare, and they are also easy to implement. The key and most difficult problem of the whole search is how to find "similar" documents and paragraphs quickly and accurately. This article solves it step by step. For a document search, the system first matches external features. If a match exists, refined retrieval is applied directly to the matching result. If not, the system extracts a set of keywords, assembles them into a search expression, and performs narrowing retrieval; it then runs refined retrieval, which supports fuzzy matching, on the narrowed result set to obtain the final results. Paragraph retrieval goes directly through narrowing retrieval and then refined retrieval.

3.1 Narrowing Retrieval

Narrowing retrieval is the process of first extracting the keywords (a keyword set) that characterize the document or paragraph to be retrieved, and then using those keywords to find related documents in the inverted file.
3.1.1 Subject Term Extraction

Network information retrieval has strict real-time requirements, and subject terms are extracted here only to build the search expression and speed up retrieval. The extraction therefore cannot rely on detailed semantic analysis, and inverse document frequency is not suitable either, so the system uses the following statistical method. When determining subject terms, the system gives priority to these factors:

1) Keywords that appear in the title or in subtitles, together with the level of the heading. Higher-level headings receive greater weight. Heading levels are taken from the hierarchical analysis performed for machine translation.
2) Keywords that appear in the abstract, the keyword list, and similar designated fields.
3) Keywords that appear at the beginning and end of a paragraph.
4) Other things being equal, words with higher frequency and greater length receive greater weight.

The formula of the subject term weighting function is not preserved in this copy. Its variables are: PW, the cumulative position weight; freq, the frequency of the word; len, the word length; LMIN, the lower limit of word length; and C, a constant. Because long words in Chinese are considerably more specific, C can be set larger for Chinese; in English the difference is less pronounced, so C is set smaller.

The initial value of PW is 0. Each occurrence of a keyword in the cases above updates it as follows:

1) In the title, the increment is illegible in this copy; in a hierarchical heading, PW = PW + 10 * i, where i is the heading level.
2) PW = PW + 5.
3) PW = PW + 1.

When a keyword appears once in any other sentence, PW = PW + 1 / (total number of words in the sentence).

3.1.2 Relevance Retrieval

Because this full-text retrieval system supports queries restricted to the same paragraph, building the search expression for paragraph retrieval is relatively simple: the extracted keywords are joined with the same-paragraph operator, and the resulting expression is used to find the relevant paragraphs in the inverted file. Document retrieval, in contrast, is a relevance-judgment problem.
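The position-weight rules of Section 3.1.1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the garbled update rules are read as additions, the increment for the main title is illegible in the source and is therefore left out, and the names `Occurrence` and `position_weight` are hypothetical. The final weighting function that combines PW with frequency and word length is not reproduced, because its formula is missing.

```python
from dataclasses import dataclass


@dataclass
class Occurrence:
    """One occurrence of a keyword (hypothetical helper type)."""
    kind: str              # "heading", "abstract", "paragraph_edge", or "other"
    level: int = 0         # heading level i, used only when kind == "heading"
    sentence_len: int = 1  # words in the sentence, used only when kind == "other"


def position_weight(occurrences):
    """Accumulate the position weight PW over all occurrences of one keyword."""
    pw = 0.0  # the source states that PW starts at 0
    for occ in occurrences:
        if occ.kind == "heading":
            pw += 10 * occ.level          # rule 1: hierarchical heading
        elif occ.kind == "abstract":
            pw += 5                       # rule 2: abstract, keyword list, etc.
        elif occ.kind == "paragraph_edge":
            pw += 1                       # rule 3: start or end of a paragraph
        else:
            pw += 1 / occ.sentence_len    # any other sentence
    return pw


example = [
    Occurrence("heading", level=2),
    Occurrence("abstract"),
    Occurrence("paragraph_edge"),
    Occurrence("other", sentence_len=4),
]
pw_example = position_weight(example)
```

Keeping the rules in one small function makes it easy to adjust the increments, which the source presents as tunable constants rather than fixed values.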
At present, most systems of this kind use the vector space model, for example the SMART experimental system developed under Salton's leadership, but this retrieval model has not yet been adopted in practical systems.

Some systems connect all the extracted subject terms with OR operators, search the inverted file to narrow the range, and then build space vectors for every document in that range to judge its relevance to the query document. In the authors' view this method is inefficient, its response time is too long, and it does not meet the real-time requirements of our system. The narrowing retrieval of this system instead uses a weighted search expression, which avoids a weakness of the Boolean model, namely that it cannot express how important a feature word is, and is easy to implement on the chosen model. A document is judged to satisfy the query according to whether the total weight of the subject terms it contains reaches the value given in the search expression.

The weight formula itself is not preserved in this copy. According to the surviving description, TF*IDF is used to assign weights to the keywords of a document: $M$ is the total number of documents in the database, $n_t$ is the number of documents containing the term $t$, and $f_{d,t}$ is the frequency of the term. A further quantity, whose symbol is lost, is the length of the document, obtained by a normalization computation.

3.2 Refined Retrieval

Refined retrieval is the process of further matching within the set of candidate documents produced by narrowing retrieval in order to obtain the final result. The system compares important features first, eliminates non-matching documents as early as possible, and so reduces the amount of later processing.
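Since the TF*IDF formula of Section 3.1.2 did not survive, the following sketch uses a common textbook form, $f_{d,t} \cdot \log(M / n_t)$, divided by the Euclidean length of the document vector. That choice is an assumption and may differ from the authors' formula; the function name `tfidf_vectors` is hypothetical, while `M`, `n_t`, and `f_dt` follow the symbols described in the text.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    M = len(docs)  # total number of documents
    # n_t: number of documents that contain term t
    n_t = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        f_dt = Counter(doc)  # frequency of each term in this document
        raw = {t: f * math.log(M / n_t[t]) for t, f in f_dt.items()}
        # Document length: Euclidean norm of the raw weights (assumed normalization)
        length = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / length if length else 0.0) for t, w in raw.items()})
    return vectors


vectors = tfidf_vectors([["x", "y"], ["x"], ["z"]])
```

In this small collection the term "y" occurs in fewer documents than "x", so it receives the larger weight in the first document.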

A document to be retrieved is first decomposed into paragraphs, which are then retrieved with the paragraph-level refined retrieval method. Paragraph-level refined retrieval allows a certain degree of fuzziness. When the structural features of two paragraphs roughly match, the paragraphs are further divided into sentences, the sentence similarity is computed, and the system finally decides whether the paragraphs match.

The system uses dynamic programming to compute sentence similarity. The words of the sentence to be translated are laid out along the I axis of an I-J plane and the words of the example sentence along the J axis; the value at grid point $(i, j)$ is the similarity between word $i$ and word $j$. A match between the two sentences corresponds to a path from the origin to $(i, j)$, and the value of the sentence similarity is the sum of the matching values along that path. Computing the similarity between sentences thus becomes the problem of finding an optimal path in the I-J plane that maximizes the similarity of the two sentences. For the sake of speed and accuracy, the current similarity query does not perform synonym expansion or similar operations, so the morphological similarity $d(i_k, j_k)$ at state node $k$ can be defined simply as 1 if word $i_k$ and word $j_k$ are identical and 0 otherwise. The state transfer equation is $(i_k, j_k) = u_k(i_{k-1}, j_{k-1})$. The path that matches similar sentences is subject to these constraints:

(1) Monotonicity: the path must move rightward or upward from the starting point.
(2) Global path constraint: a diagonal path is preferred to a vertical or horizontal one.
(3) Local path constraint: the successor of node $(i_k, j_k)$ is one of only three cases, $(i_k + 1, j_k)$, $(i_k, j_k + 1)$, or $(i_k + 1, j_k + 1)$, so no right-angle turns occur.
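The grid formulation and the path constraints above can be sketched as a small dynamic program. Because the source's own recursive formula is not preserved, this sketch swaps in the classic longest-common-subsequence recurrence, which credits a match only on a diagonal step; that substitution, the division by the length of the sentence to be translated, and the function names are all assumptions made for illustration.

```python
def word_similarity(a, b):
    """Morphological similarity: 1 if the two words are identical, else 0."""
    return 1 if a == b else 0


def sentence_similarity(source, example):
    """Best monotone path score between two word lists, divided by len(source)."""
    n, m = len(source), len(example)
    if n == 0:
        return 0.0
    # best[i][j]: best score over the first i source words and first j example words
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal step: consume one word from each sentence and add its similarity
            diagonal = best[i - 1][j - 1] + word_similarity(source[i - 1], example[j - 1])
            # Horizontal or vertical step: skip a word in one sentence, no credit added
            best[i][j] = max(diagonal, best[i - 1][j], best[i][j - 1])
    return best[n][m] / n


sim = sentence_similarity(
    "the system uses an inverted file".split(),
    "the system builds an inverted file".split(),
)
```

Here five of the six words of the first sentence can be aligned in order with the second sentence, so `sim` is 5/6. Crediting matches only on diagonal steps keeps a repeated word from being counted twice along one path.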
The similarity $S$ of the full path from the origin to $(i, j)$ is the sum of the values $d(i_k, j_k)$ over the nodes of the path. The optimal recursive formula of the dynamic program and the definition of the similarity between sentences are not preserved in this copy; in the latter, $n$ is the number of words in the sentence to be translated. The candidate with the highest similarity is returned as the search result. If no similarity exceeds the threshold, a query-failed flag is returned. In this way the relevance of paragraphs can be defined from the relevance of their sentences, which makes it possible to retrieve the required paragraphs or even whole documents.

3.3 Relevance Performance Analysis of Narrowing Retrieval

The principle of weighted retrieval is first illustrated with an example. Suppose we query for literature on network machine translation within natural language processing, using the following weighted search expression:

natural language processing (1), machine translation (3), network (2)

A document that contains all three terms has a weight of 1 + 3 + 2 = 6; a document that contains natural language processing and machine translation has a weight of 1 + 3 = 4; and so on. If the lower threshold requires a weight of at least 4, then documents that contain all three terms, or any two of them except the combination of natural language processing and network (weight 3), are hits.

Below we compare this with the vector space model. In the vector space model, documents and queries are each represented as a vector. Suppose the document collection contains $M$ distinct index terms $t_1, t_2, \ldots, t_M$; each document in the collection can then be represented by some of these $M$ index terms. Any document can be represented as a vector in the space of index terms, $D = (t_{11}, t_{12}, \ldots, t_{1M})$, and a query $Q$ can also be represented ...
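The weighted retrieval example at the start of Section 3.3 can be sketched in a few lines. The term weights come from the article; the threshold of 4 is inferred from the cases the article lists as hits, and the function name `weighted_hit` is hypothetical.

```python
def weighted_hit(doc_terms, weighted_query, threshold):
    """Sum the weights of the query terms present in the document and test the threshold."""
    score = sum(weight for term, weight in weighted_query.items() if term in doc_terms)
    return score, score >= threshold


# Weights as given in the article's example
query = {"natural language processing": 1, "machine translation": 3, "network": 2}
THRESHOLD = 4  # assumed lower limit, inferred from the listed hits

all_three = weighted_hit(set(query), query, THRESHOLD)
nlp_and_mt = weighted_hit({"natural language processing", "machine translation"}, query, THRESHOLD)
nlp_and_network = weighted_hit({"natural language processing", "network"}, query, THRESHOLD)
mt_and_network = weighted_hit({"machine translation", "network"}, query, THRESHOLD)
```

With this threshold the combination of natural language processing and network scores 3 and is rejected, while every other pair and the full set of three terms are accepted, which matches the behavior described in the text.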

How Do Search Engines Work Behind the Scenes

Author: Zhu Jie, Institute of Software, Chinese Academy of Sciences

The amount of data handled by computers keeps growing exponentially. As data and topics accumulate in data stores, how to retrieve all the information about a given topic quickly, effectively, and economically has become a hot issue. One way to solve this problem is intelligent search technology. This article outlines the framework of natural language processing and the full-text search technology that ultimately helps network users find information.

Information retrieval concerns the representation, storage, organization, and access of information; that is, it retrieves from an information database the information relevant to a user's query. Information retrieval began with manually built keyword indexes, progressed to automatic computer indexing with full-text retrieval, automatic abstracting, and automatic classification, and is now moving toward natural language processing.

English information retrieval has developed faster. For example, the SMART information retrieval system developed by Salton et al. uses the vector space to represent the content being retrieved and applies natural language processing to information retrieval, which greatly improves the accuracy of queries. Chinese information retrieval systems have developed more slowly. At present most Chinese retrieval systems still rely on keyword retrieval, and many are still at the stage of indexing single characters. This is not only inefficient; the precision and recall of retrieval are also poor. The reason is that Chinese information retrieval has its own characteristics: for example, there are no spaces between Chinese words, so word segmentation is required before indexing.
On the other hand, Chinese syntactic analysis and semantic understanding are harder than in English, which has also held back Chinese information retrieval.

Information Retrieval Models

The core of an information retrieval system is the search engine, which must pick out, from a large and complex body of information, the information that meets the user's needs. For example, if a user wants information about sales of computer network products and the query returns information about computer software products, the user's need is not met. According to how the search engine finds relevant information, information retrieval models can be divided into the Boolean logic model, the fuzzy logic model, the vector space model, and the probabilistic model.

The Boolean model is the simplest information retrieval model. The user submits a query expressed as Boolean relations among the terms in a document, and the search engine determines the results from an inverted file structure built in advance. The standard Boolean model is two-valued: a document is either relevant to the query or not, and the results are generally not ranked by relevance. If the query is "computer", any document in which the keyword "computer" appears is included in the results.

To overcome the unordered results of the Boolean model, fuzzy logic is introduced into result processing: the retrieved documents are fuzzily compared with the user's query, and the results are ordered by relevance. For example, for the query "computer", documents in which "computer" appears more often are ranked higher.

Unlike the Boolean model, the vector space model represents both the user's query and the documents in the database as vectors in the space of search terms.
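As a rough illustration of the vector space model, the sketch below compares a query vector with document vectors and orders the documents by similarity. The article does not name a similarity measure; cosine similarity is used here as a common choice, and the function names and sample weights are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_vec, doc_vecs):
    """Return the ids of documents with non-zero similarity, most similar first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by document id
    return [doc_id for score, doc_id in scored if score > 0]


docs = {
    "d1": {"computer": 2.0, "network": 1.0},
    "d2": {"computer": 1.0, "software": 3.0},
    "d3": {"dance": 1.0},
}
ranking = rank({"computer": 1.0, "network": 1.0}, docs)
```

In contrast to the two-valued Boolean model, every document receives a graded score, so a document about computer networks is placed ahead of one that is mostly about software.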
Query results are ordered by similarity in the vector space. The vector space model not only makes it easy to produce effective query results but can also provide summaries of the relevant documents and classify the results, giving users accurate positioning information.

The probabilistic model, based on Bayesian probability, differs from the Boolean and vector space models in that it uses the inductive method of relevance feedback to derive the matching function.

Although the retrieval models differ, their goal is the same: to provide the information users need according to their requests. In practice, most retrieval systems combine several of these models to achieve the best search results.

Structure of an Information Retrieval System

The search engine is the core of the information retrieval system. The system as a whole, however, also includes several other stages: pre-processing of document formats, information indexing, and user retrieval.

Information Pre-processing

Information pre-processing comprises two distinct levels: information format conversion and filtering.

As a mechanism for reaching different sources, information format conversion provides access to data held in different organizations, such as various databases, different file systems, and web pages. Pre-processing can also filter documents in different formats, such as Microsoft Word, WPS, plain text, and HTML. This allows the search engine to retrieve not only plain-text documents but also document information in its original format.

Information Indexing

Information indexing creates the feature records of documents, which let users retrieve the information they need easily. Building an index involves the following steps.

Sentence segmentation and lexical analysis. The sentence is the smallest unit of information expression. Unlike Western languages, Chinese has no separator (space) between the words of a sentence, so word segmentation is required. Chinese segmentation is ambiguous: a sentence meaning "make the user satisfied" can be segmented correctly as "make / user / satisfied" or cut incorrectly as "use / household / satisfied". Various kinds of contextual knowledge are therefore needed to resolve segmentation ambiguity. In addition, morphological analysis is needed to identify the stem of each word so that the index can be built on stems.

Part-of-speech tagging and related natural language processing. Part-of-speech tagging can be rule-based or statistical (Markov chains). Statistical methods based on the random process of a Markov chain have been shown to achieve high precision in part-of-speech tagging. On this basis, various grammar rules are also used to identify important phrase structures.

Building the retrieval index. The information related to each search term is stored in inverted form, as shown in Table 1.
The information for each search term generally includes the search term itself, the positions of the term within documents, and the weight of the term. For example, the position of the search term "computer" may be "the first word of the n-th paragraph in document D". At retrieval time the user can therefore require that search term T1 and search term T2 appear in the same sentence or in the same paragraph. The guiding principle for building the search term index is that document information should be easy to process.

Table 1: Typical inverted list of search terms

Term_1    Doc_i, Wt_i1; Doc_j, Wt_j1; ...; Doc_m, Wt_m1
Term_2    Doc_i, Wt_i2; Doc_k, Wt_k2; ...; Doc_n, Wt_n2
...
Term_s    Doc_j, Wt_js; Doc_m, Wt_ms; ...; Doc_p, Wt_ps

Query Expansion

Information retrieval is evaluated by precision and recall. Precision is the ratio of the number of relevant documents in the search results to the total number of documents in the search results. Recall is the ratio of the number of relevant documents in the search results to the total number of relevant documents in the information library. To improve recall, query expansion is needed: the query terms are expanded with the help of a synonym dictionary and a subject implication dictionary.

Synonym expansion covers different words that denote the same concept, such as two different words for "computer": a query for one of them also searches for the other, and vice versa. Subject implication expansion means querying not only the search term itself but also the narrower concepts it includes. For example, the keyword "art" includes "film", "dance", "painting", and so on, and "film" in turn includes "feature film", "documentary", and so on. A query for "art" therefore also covers "film", "dance", "painting", and the concepts beneath them.
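The two kinds of query expansion described above can be sketched as a simple traversal over two dictionaries. The "art" hierarchy is taken from the article; the synonym pair "computer" / "PC", the dictionary layout, and the function name `expand_query` are assumptions made for illustration.

```python
def expand_query(term, synonyms, narrower):
    """Collect the term, its synonyms, and all narrower concepts, transitively."""
    expanded = set()
    pending = [term]
    while pending:
        current = pending.pop()
        if current in expanded:
            continue  # already visited; also protects against cyclic dictionaries
        expanded.add(current)
        pending.extend(synonyms.get(current, ()))
        pending.extend(narrower.get(current, ()))
    return expanded


# Subject implication dictionary from the article's example
narrower = {
    "art": ["film", "dance", "painting"],
    "film": ["feature film", "documentary"],
}
# Hypothetical synonym dictionary
synonyms = {"computer": ["PC"], "PC": ["computer"]}

art_terms = expand_query("art", synonyms, narrower)
pc_terms = expand_query("PC", synonyms, narrower)
```

Expanding "art" yields the term itself plus "film", "dance", "painting", "feature film", and "documentary". A larger term set raises recall, usually at some cost in precision, which is why the article pairs expansion with relevance feedback.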
To improve precision, relevance feedback can be implemented with the vector space model. The user selects, from the first set of query results, the documents or document passages whose content is important, and the search engine queries again according to the features of the selected documents, which increases precision.

Information Classification and Summarization

To help users pick out the information they need from the query results, the search engine can classify the returned documents by content and generate a short summary for each document. The search engine classifies and summarizes the query results according to the statistical features of the text.

When reposting, please credit the original address: https://www.9cbs.com/read-37832.html
