Summary of Web Search Engine Technology

xiaoxiao2021-03-06 26

Summary

With the rapid development of network and communication technology, the explosive growth of web information has created a huge information space. How to obtain information quickly, accurately, and conveniently from such a huge information store is an important problem facing Internet users. Web search engines provide users with a service for finding the resources they need, and this has become the second largest service on the Internet. This article first introduces the principles and implementation techniques of search engines, then discusses the latest frontier technologies in search engine development, and finally, drawing on the author's research in this area, outlines the likely near-term direction of search engine development.

Key words

Web, search engine, network technology

I. Introduction

With the application and development of network technology, the Internet has become an important source of information. By the end of 1999, at least 16 million hosts were connected to the Internet, and the total number of web pages had reached 1 billion and was growing by nearly 10 million pages per month [1]. A search engine collects and discovers information, understands, extracts, organizes, and processes it, and provides users with information navigation. For Internet users who rely on the network to acquire information, the search engine has become an essential tool. Surveys show that among all Internet applications, information search is second only to email, and most of these searches are carried out through specialized, highly complex search engines.

According to how they collect information and provide service, search engine systems can be divided into three major categories. Directory search engines collect information manually or semi-automatically: editors review the information, write a summary by hand, and place the entry in a predefined classification framework. Because the volume of information is massive while manual processing capacity and economic cost are limited, the completeness of this type of search engine is hard to guarantee. Its leading representative is Yahoo [3]. Robot search engines use a robot program called a spider to collect and discover information on the Internet automatically, build an index of what has been collected, search that index according to the user's query input, and return the query results to the user. Such search engines are more complicated to implement, but they can achieve comprehensive coverage and prompt updating of information. Their leading representative is Google [4]. Unless otherwise stated, "search engine" in the rest of this article refers to this type. Meta search engines have no data of their own. Instead, they submit the user's query to multiple search engines, remove duplicates from the returned results, re-rank them, and return the final result to the user. This type combines the strengths of multiple search engines and adds new ranking and information filtering, which can improve user satisfaction. Its leading representative is Vivisino [5].
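The merging step of a meta search engine can be illustrated with a minimal Python sketch. It assumes each underlying engine returns an ordered list of URLs, and it scores each URL by the sum of its reciprocal ranks across engines. The function name `merge_results` and the reciprocal-rank scoring are illustrative assumptions, not the method of any particular meta search engine.

```python
def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one ranked list.

    Duplicates are collapsed into a single entry, and each URL is scored
    by the sum of 1/rank over the engines that returned it (an assumed,
    illustrative scoring rule).
    """
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists from two engines.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4"]
merged = merge_results([engine_a, engine_b])
print(merged)
```

A URL returned by several engines accumulates score from each, so agreement between engines pushes a result toward the front.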

The rest of this article is organized as follows: Part II introduces the principles and implementation techniques of the web search engine; Part III introduces its latest developments and frontier technologies; Part IV offers a basic outlook; finally, conclusions are given.

II. Principle, implementation and evaluation of the web search engine

A web search engine usually works as follows. First, a spider crawls the web and automatically fetches web pages. Next, the fetched pages are indexed, and the attributes relevant to retrieval are recorded; a Chinese search engine must also first segment the Chinese text into words. Finally, the engine accepts user queries, searches the index files, performs complex calculations according to various parameters, and returns the results to the user.

Based on this principle, the implementation of a web search engine is described below.

1. Use the web spider to get network resources.

This is a semi-automatic way of acquiring resources (at this stage the resources have not yet been analyzed or understood, so they cannot be called information, only resources). Semi-automatic means that the starting URL (Uniform Resource Locator) must be specified manually; the spider then fetches the resource the URL points to, analyzes it to find the other resources it links to, and fetches those in turn. The basic flow is shown in Figure 1, the basic spider flow chart.
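The loop described above can be sketched in Python as a breadth-first traversal from a manually chosen seed URL. This is a minimal sketch under stated assumptions: `fetch` is a hypothetical callable that returns a page's HTML, or None when the page is unavailable, and the in-memory `fake_web` dictionary stands in for the real network. A production spider would also need politeness rules, error handling, and URL normalization.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Fetch the seed, extract its links, and fetch those in turn."""
    frontier = deque([seed_url])   # URLs waiting to be fetched
    visited = set()                # URLs already taken from the frontier
    pages = {}                     # url -> raw HTML of fetched pages
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:           # resource could not be retrieved
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the current page's URL.
            frontier.append(urljoin(url, href))
    return pages


# Hypothetical in-memory "web" used instead of real HTTP requests.
fake_web = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": '<a href="http://example.com/missing">x</a>',
}
pages = crawl("http://example.com/", fake_web.get)
print(sorted(pages))
```

Passing `fetch` as a parameter keeps the traversal logic separate from the transport, so the same function could be driven by a real HTTP client.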

The process by which a spider accesses resources is the process of traversing the information on the Internet. In real spider programs, to ensure that information is collected comprehensively and promptly, multiple spiders usually cooperate under a complex control mechanism. For example, when Google uses spider programs to obtain network resources, a management program handles task distribution and result processing; multiple distributed spider programs obtain tasks from the management program, return the fetched resources as results, and then obtain new tasks [6].
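The manager-and-spiders arrangement can be approximated on a single machine with a shared task queue and worker threads. The sketch below is only an illustration of the task distribution pattern, not Google's actual design: `run_spiders`, `fetch_links`, and `link_graph` are hypothetical names, and `fetch_links` is assumed to return the list of URLs a page links to.

```python
import queue
import threading


def run_spiders(seed_urls, fetch_links, num_spiders=3):
    """Distribute crawl tasks from one shared queue to several spiders."""
    tasks = queue.Queue()          # the manager's list of pending tasks
    seen = set(seed_urls)          # URLs already handed out as tasks
    lock = threading.Lock()        # protects `seen` and `results`
    results = {}                   # url -> links found on that page
    for url in seed_urls:
        tasks.put(url)

    def spider():
        while True:
            url = tasks.get()
            if url is None:        # sentinel: no more work
                tasks.task_done()
                break
            links = fetch_links(url)
            with lock:
                results[url] = links
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        tasks.put(link)   # report newly found tasks
            tasks.task_done()

    threads = [threading.Thread(target=spider) for _ in range(num_spiders)]
    for t in threads:
        t.start()
    tasks.join()                   # wait until every queued task is done
    for _ in threads:
        tasks.put(None)            # tell each spider to stop
    for t in threads:
        t.join()
    return results


# Hypothetical link graph standing in for the real web.
link_graph = {"seed": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
results = run_spiders(["seed"], lambda url: link_graph.get(url, []))
print(sorted(results))
```

New tasks are queued before the parent task is marked done, so the queue's unfinished count never reaches zero while work is still being discovered.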

2. Use the indexer to extract information from the resources acquired by the spider and build an index table for retrieval:

The resources obtained by the spider must be filtered: markup and control codes and other useless content are removed, useful information is extracted, and the information is represented with some model so that query results are more accurate. The models generally used to represent information are the Boolean model, the vector model, the probabilistic model, and the neural network model [7]. Information on the web generally takes the form of web pages. Each page needs a summary, which appears on the query results page and tells the user what the page is about. The modeled information is stored in a temporary database. Because the volume of web data is extremely large, an index must be built according to certain rules to improve search efficiency. Different search engines make different choices when building an index, such as whether to build a full-text index, whether to filter out stop words, and whether to use META information. Building the index involves: parsing, which handles possible errors in the documents; document indexing, in which fully parsed documents are encoded into storage buckets (some search engines index in parallel); and sorting, in which the buckets are sorted to produce the full-text buckets. The final index is usually stored as an inverted file.
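To make the inverted file concrete, here is a minimal Python sketch that maps each term to the documents and positions where it occurs. The stop-word list, the tokenizer, and the function names are illustrative assumptions; real indexers also compress postings and, for Chinese text, perform word segmentation first.

```python
import re
from collections import defaultdict

# A tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "to", "in"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_inverted_index(documents):
    """Build term -> sorted list of (doc_id, [positions]) postings.

    `documents` maps a document id to its text. Positions are counted
    after stop-word removal.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return {term: sorted(postings.items()) for term, postings in index.items()}


documents = {
    1: "The spider crawls the web",
    2: "An index of the web",
}
index = build_inverted_index(documents)
print(index["web"])
```

Looking up a query term is then a single dictionary access, which is why the inverted layout suits retrieval over large collections.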

3. Search and user interaction:

The first two parts form the back-end support of the search engine. On the basis of the index built above, this part accepts user queries, retrieves the relevant content from the index, and returns it to the user. Its main components are: user query understanding, which tries to understand as fully as possible what the user means by the query string and converts the query into the information model used for back-end retrieval; retrieval, which uses that model to retrieve the result set from the index; and result ranking, which orders the result set with a specific ranking algorithm. Ranking factors in current use include query relevance, Google's PageRank, and Baidu's paid ranking. Because the amount of information is massive and users' initial queries are often vague, the result set is generally very large, and users do not have the patience to view all the results one by one. It is therefore very important to design a ranking algorithm that places the results users are interested in at the front.
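As an illustration of a link-based ranking factor, the following Python sketch computes a simplified PageRank by power iteration. It is a textbook-style approximation, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict page -> rank, with ranks summing to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank to all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page link graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In a real engine such a score would be combined with query relevance rather than used alone.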

The evaluation metrics of a search engine are response time, recall, precision, and user satisfaction. Response time is the interval between the user submitting a query and the search engine returning the results; it must stay within what users will tolerate. Recall refers to the completeness of the information in the result set. Precision is the ratio of the number of results that meet the user's needs to the total number of results returned. User satisfaction is a harder concept to pin down: besides the service quality of the search engine itself, it also depends on the user population and the network environment. For the search engine itself, the core issue is result ranking, that is, how to put the most suitable results at the front.

III. The latest developments of the web search engine

At present, search engine technology has largely matured, and user satisfaction remains at an acceptable level. In information collection, indexing, retrieval, and result ranking, there has been essentially no breakthrough in recent years apart from PageRank, the ranking technique that Google creatively proposed. Search engine research is gradually merging with research on information integration, and work is concentrated mainly on two topics: query expansion and dynamic classification of the result set.

1. Query expansion:

Users of a search engine often cannot express exactly what they want to find in the standard form the search engine provides, so query expansion is needed before the user's query is processed. Query expansion consists of two steps: extending the initial query string with new keywords, and reassigning weights to the keywords in the extended query string. Approaches to query expansion fall into three categories: those based on interests the user has registered; those based on feedback from the user's actions on the result set [7]; and those based on global information from the set of retrieved documents [8]. These approaches extend the initial query in different ways in order to bring the results closer to what the user wants. Registered interests are the most accurate and the easiest to implement, but the user must register, and must accept and trust this approach. Mining user feedback requires no extra action from the user, but it increases the search engine's workload, its accuracy is hard to control, and the mining itself raises questions of user privacy and consent. Approaches based on the retrieved document set already have good implementations, but because they are not tailored to specific users, while personalized search is what search engines need most, they offer users little direct help in improving search results.
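The two steps of query expansion can be sketched in Python for the feedback-based case. This is a simplified illustration, assuming the feedback consists of the text of results the user opened; the function name `expand_query`, the weights `alpha` and `beta`, and the use of document frequency to pick new keywords are assumptions made for the example, not the method of the cited works.

```python
from collections import Counter


def expand_query(query_terms, feedback_docs, num_new_terms=2,
                 alpha=1.0, beta=0.5):
    """Expand a query from feedback documents and reweight its terms.

    Step 1 adds the terms that occur in the most feedback documents.
    Step 2 assigns a weight to every term in the expanded query.
    """
    # Count how many feedback documents contain each term.
    doc_freq = Counter()
    for doc in feedback_docs:
        doc_freq.update(set(doc.lower().split()))

    original = [term.lower() for term in query_terms]

    # Step 1: pick new keywords that are not already in the query.
    candidates = [(t, c) for t, c in doc_freq.items() if t not in original]
    candidates.sort(key=lambda tc: (-tc[1], tc[0]))  # frequent first
    new_terms = [t for t, _ in candidates[:num_new_terms]]

    # Step 2: reassign weights. Original terms keep a base weight alpha,
    # and every term gains beta times its share of feedback documents.
    n = max(len(feedback_docs), 1)
    weights = {}
    for term in original:
        weights[term] = alpha + beta * doc_freq[term] / n
    for term in new_terms:
        weights[term] = beta * doc_freq[term] / n
    return weights


# Hypothetical feedback: snippets of results the user opened.
feedback = ["jaguar car dealer", "jaguar car price", "jaguar car review"]
weights = expand_query(["jaguar"], feedback)
print(weights)
```

Here the ambiguous query "jaguar" is steered toward the automotive sense because every opened result mentions "car".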

2. Dynamic classification of search results:

Because the result set is usually very large, how to organize and present it so that users can quickly find the information they need becomes a critical issue. Improving the page ranking algorithm can bring "important" pages to the front of the returned results, but because users differ in occupation, interests, age, and so on, it is hard for all users to accept the ordering chosen by the service provider. In addition, statistics show that users generally do not look beyond the first five pages of a result set. Organizing the query results by category therefore lets users easily pick a category, which narrows the result set so that they can find information faster.
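One way to group results into categories on the fly is to cluster them by textual similarity. The Python sketch below uses a simple single-pass clustering with Jaccard similarity over the words of each result snippet. The similarity measure, the threshold of 0.2, and the function names are assumptions chosen for illustration; practical systems use more robust features and algorithms.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_results(snippets, threshold=0.2):
    """Group result snippets into clusters in a single pass.

    Each snippet joins the most similar existing cluster if the
    similarity reaches `threshold`; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of snippet indices.
    """
    clusters = []  # each cluster: {"terms": set of words, "members": [indices]}
    for i, text in enumerate(snippets):
        terms = set(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(terms, cluster["terms"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            best["terms"] |= terms      # grow the cluster's vocabulary
        else:
            clusters.append({"terms": terms, "members": [i]})
    return [cluster["members"] for cluster in clusters]


# Hypothetical snippets returned for the query "jaguar".
snippets = [
    "jaguar car dealer",
    "jaguar big cat habitat",
    "used jaguar car prices",
    "big cat conservation",
]
groups = cluster_results(snippets)
print(groups)
```

The user could then pick the "car" group or the "big cat" group to narrow the result set.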

Research in this area centers on two questions: how to determine the category hierarchy, and how to determine which category a page belongs to. Existing solutions fall roughly into two groups. (a) A static category hierarchy is fixed in advance, based on experience or some computational model, and the category of each web page is then determined by semantic analysis. The main problems are these: because users differ unpredictably in region, occupation, religion, and education, it is hard to settle on a classification hierarchy that all users accept; because the hierarchy is static, classification is inevitably constrained, and some pages have no suitable category to be placed in; in addition, the semantic analysis techniques that currently achieve high accuracy rely on natural language understanding, which consumes a great deal of system resources, and given the huge number of web pages, such expensive processing of every page is unacceptable. (b) The query result set is clustered on the fly: between retrieving the result set and displaying it, a similarity algorithm is first applied to the results, the results are then clustered, and the user can choose among the dynamically generated cluster categories to narrow the result set and quickly find the required information. This approach is simple, flexible, and easy to implement. However, the dynamically generated categories do not form a well-structured system, and hierarchical relationships are hard to reflect, so the design of the clustering algorithm directly determines the quality of the clustering results.

IV. Development trends of the web search engine

Given the speed at which information technology is developing, anyone who tries to predict technology trends risks ridicule. For web search engine technology specifically, the author can only offer a personal opinion on development in the near future, based on current research results and research directions.

Current search engine services face challenges mainly in two respects. The first is the imprecision of the user's initial query: users often cannot express their information needs accurately. Therefore, for now and for some time to come, how to understand the user's actual information needs more accurately and comprehensively from a vague query will be an important aspect of search engine research. Current research focuses mainly on mining information from the documents in the query result set, and this work has produced many results. On the other hand, users' personalized information is a promising route to breakthroughs: background information about the user, such as knowledge level, field of specialization, occupation, and interests, would greatly improve the accuracy and comprehensiveness with which the initial query is understood. In addition, analysis of the logs of a user's network activity can provide an accurate picture of the user's online behavior and an important basis for analyzing information needs. In summary, the implementation of personalized search engines will become a focus of, and a point of breakthrough for, research in the near future.

The other challenge to search engine service quality is that the query result set is too large, so users often cannot find the information they want among the results. The fundamental solution here is to organize the result set according to a classification system so that users can select the scope of results. Implementing it requires determining the classification system and determining the category to which each page finally belongs. Current research relies mainly on understanding the collected information; because natural language understanding is still limited and its overhead is large, this line of research is unlikely to achieve a major breakthrough. Another option is to let information publishers supply the category of their information, describing category and semantic information according to a unified specification, so that the search engine only needs to fetch it. Category and semantic information obtained this way is undoubtedly the most accurate. For example, most websites today are built around a navigation bar, and the site content is organized hierarchically according to that bar, so the navigation information can serve as the category hierarchy, and the final pages under a navigation item belong to that item's category. The author's own work focuses on how to extract a website's column (section) information and how to partition the website by column, so that each column corresponds to a scope and all content within that scope is assigned to the category denoted by the column item. In the future, if all information publishers describe category and semantic information in a standard way, research in this area is bound to make a breakthrough. From the above discussion it is not hard to see that search engine research will focus on how to provide personalized services [9] and on how to achieve category and semantic understanding based on information supplied by publishers.
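The idea of assigning pages to categories by website column can be sketched in Python as a longest-prefix match on the URL path. This is only an illustration under stated assumptions: it supposes that the navigation bar has already been extracted into a mapping from URL path prefixes to category names, and `categorize_by_navigation` and `nav_columns` are hypothetical names, not part of the author's system.

```python
from urllib.parse import urlparse


def categorize_by_navigation(page_url, nav_columns):
    """Return the category of the deepest navigation column containing the page.

    `nav_columns` maps a URL path prefix (ending in "/") to a category
    name. Returns None when no column covers the page.
    """
    path = urlparse(page_url).path
    best_prefix = ""
    for prefix in nav_columns:
        # Prefer the longest matching prefix, i.e. the deepest column.
        if path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix = prefix
    return nav_columns.get(best_prefix)


# Hypothetical columns extracted from a site's navigation bar.
nav_columns = {
    "/news/": "News",
    "/news/sports/": "News > Sports",
    "/products/": "Products",
}
category = categorize_by_navigation(
    "http://example.com/news/sports/2004/match.html", nav_columns)
print(category)
```

Matching on the longest prefix lets a nested column override its parent, which preserves the hierarchy expressed by the navigation bar.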

V. Conclusion

This article has described the principles and implementation techniques of the web search engine, discussed the latest developments in current web search engine research, and considered the direction in which web search engines are likely to develop in the near future.

References:

[1] "World Wide Web Search Technologies" shi nansi, Idea Group Publishing

[2] "Search Engine Technology and Trends" Li Xiaoming, Liu Jianguo. 2003.6

[3] http://www.yahoo.com.

[4] http://www.google.com.

[5] http://www.vivisino.com.

[6] "Search Engine and Information Acquisition Technology" P107. Xu Baowen, Zhang Weifeng; Tsinghua University Press.

[6] http://www.searchenginewatch.com.

[7] Conceptual Retrieval Based on Feature Clustering of Documents. Youjin Chang, Ikkyu Cho.

[8] Modern Information Retrieval, P117. Addison Wesley. 1999

[9] Microsoft Unveils Its New Search Engine - At Last, Chris Sherman, 2004.11

Author blog: http://blog.9cbs.net/hwalk/

Please cite the original address when reposting: https://www.9cbs.com/read-45522.html
