An In-Depth Analysis of Full-Text Search Technology
Xiao Shibin
Full-text search takes every text sequence in a document as a search target and finds the documents that contain a given term. By field of use, full-text search technology falls into two categories: Internet search engines and enterprise content retrieval systems. Both are built on the same core full-text technology, but their emphasis differs.
Internet search engines
The main purpose of an Internet search engine is to sift through the huge number of disorderly web pages on the Internet, which mix useful reference information with harmful information, and to rank the useful pages as near the top as possible. Usefulness can be measured in several ways, for example by the importance of a page as derived from its links, or by a bid-ranking system based on how much is paid. Google is representative of the former and Baidu of the latter. A common trait of Internet search engines is that they do not pursue the high precision and high recall that enterprise applications require. In addition, because the data volume is so large, the back-end index and retrieval databases need a powerful server cluster of hundreds or thousands of machines, even upward of ten thousand, together with a sound architecture and server hardware chosen to suit full-text search. This is why successful Internet search engines deliver their service as a hosted offering under the ASP model, and why full-text search systems from Internet search engine vendors tend to fail when deployed inside enterprises. The power of an Internet search engine comes from the server cluster and the architecture behind it, not from the search technology alone. Moreover, such engines struggle to meet the high precision, high recall, and real-time information updates that enterprise applications demand.
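The two quality measures contrasted here are commonly called precision (the share of returned results that are relevant) and recall (the share of relevant documents that are returned). As a minimal sketch, assuming both sets of document ids are known, they can be computed as follows. The function name `precision_recall` and the sample ids are illustrative assumptions, not part of any particular system.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: ids the search system gave back (hypothetical sample data)
    relevant: ids that are actually relevant to the query
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four results returned, three documents truly relevant, two in common.
precision, recall = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision, recall)  # precision is 2/4, recall is 2/3
```

A system tuned only for speed can keep precision acceptable on the first page of results while recall stays low, which is the gap the enterprise requirements point at.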
Enterprise content retrieval systems
An enterprise content retrieval system requires query results with high precision and high recall, and requires information updates to take effect in real time. Enterprise information is useful information that has already been organized, so high precision and high recall are expected at the same time. High precision lets users save time and reach useful information early; high recall allows the information to be analyzed completely so that no business opportunity is missed. For this reason, the estimation techniques that Internet search engines commonly use to speed up queries are hard to accept in enterprise applications. With those techniques, the engine does not search all servers; it searches only some of them and, when it returns the first page of results, estimates the total number of hits from experience. An enterprise that has collected information wants its users to be able to retrieve it right away, not after a long delay.
Internet search engines and enterprise content retrieval systems also differ in where their data comes from. An Internet search engine draws its information from HTML files in a file system, including some dynamic web pages. Enterprise content, besides HTML files stored in a file system, also resides in various relational databases, or is stored directly in the full-text search system itself. This requires the enterprise content retrieval system either to have a good interface to relational databases or to manage all kinds of data itself, as a relational database management system does. These shortcomings make it difficult to apply an Internet search engine system successfully in an enterprise.
Full-text retrieval systems
On the surface, a full-text search system indexes all the text sequences in documents so that documents containing a given term can be found.
A full-text search system first divides the content to be searched into shorter text sequences and then builds an index of the strings contained in each sequence. When a search statement is entered, it is segmented in the same way and compared against the index. Consequently, if the indexed content and the search statement are segmented differently, a match cannot be found correctly even when both contain the same text. There are two main methods of segmenting text: morphological analysis and N-gram. Morphological analysis decomposes the text into the smallest meaningful units according to a dictionary. N-gram, by contrast, ignores meaning and simply cuts the text into units of a fixed length n. When text is segmented by morphological analysis, it can be searched by meaningful words. Sequences that coincide with only part of a word are excluded, so retrieval noise is reduced. However, words that are not in the dictionary are not segmented correctly, so relevant documents may be missed. Conversely, N-gram avoids missed documents but increases retrieval noise. Each method has advantages and disadvantages. Generally only one of the two is used, and morphological analysis is the more common choice.
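The trade-off between the two segmentation methods can be illustrated with a small sketch. It rests on several assumptions that are not from the article: a greedy longest-match lookup against a toy vocabulary stands in for real morphological analysis, the N-gram side uses character bigrams (n = 2), and a search simply requires every query token to appear in a document, without checking positions. All names (`bigrams`, `dictionary_tokens`, `build_index`, `search`) and the sample documents are hypothetical.

```python
from collections import defaultdict

# Toy dictionary (an assumption); a real analyzer would use a full lexicon.
VOCABULARY = {"南京市", "长江大桥", "大桥", "市长", "讲话"}

def bigrams(text, n=2):
    """Cut text into overlapping character n-grams, ignoring meaning."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def dictionary_tokens(text, vocabulary=VOCABULARY):
    """Greedy longest-match segmentation; unknown characters become
    single-character tokens."""
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens

def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query, tokenize):
    """Segment the query the same way as the documents and return the ids
    of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(token, set()) for token in tokens))

docs = {
    1: "南京市长江大桥",  # "Nanjing Yangtze River Bridge"
    2: "市长讲话",        # "the mayor's speech"
}
ngram_index = build_index(docs, bigrams)
dict_index = build_index(docs, dictionary_tokens)

# Query "市长" (mayor): the bigram index also matches document 1, where the
# two characters merely straddle a word boundary (retrieval noise).
print(search(ngram_index, "市长", bigrams))           # {1, 2}
print(search(dict_index, "市长", dictionary_tokens))  # {2}

# Query "大桥" (bridge): document 1 was indexed under the longer word
# "长江大桥", so the dictionary-based index misses it (retrieval omission).
print(search(ngram_index, "大桥", bigrams))           # {1}
print(search(dict_index, "大桥", dictionary_tokens))  # set()
```

The omission in the second query comes from the document and the query being segmented differently, which is the same failure that occurs when a word is missing from the dictionary and the text around it is cut in the wrong place.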