According to statistics from the China Internet Network Information Center (CNNIC), search engines are currently the second most widely used Internet application, after email. Research on and application of search engine technology is in full swing, and it has become a focus of competing research and development in both the computer industry and academia.
Although search engines come in many varieties with different features, their overall structure and basic working principles are the same.
Each search engine has three parts: (1) a "robot" that collects information from the Internet; (2) an indexer that indexes the collected information and builds an index database; (3) a web searcher that handles the query requests users submit.

Figure 1.1 Overall structure of a search engine

1. The "robot"

The "robot" is actually a web-based program that collects HTML pages by requesting them from web sites. It traverses the entire web space within a specified range, continually moving from one page to another and from one site to another, and adds the collected pages to the web page database. Each time it encounters a new page, it must search all the links inside it. In theory, therefore, if an appropriate initial set of pages is created for the "robot", then by starting from that set and traversing all links the "robot" will be able to collect pages from the entire web space.

2. The indexer

The indexer is responsible for indexing the pages collected by the "robot" and storing the index in an index database. The index database can use a general-purpose large database such as Oracle or Sybase, or a self-defined file format. To keep the index database synchronized with web content, it must be updated periodically, and the update frequency determines how timely the search results are. The index database is updated by starting the "robot" to search the web space again.

3. The web searcher

When a user uses the search engine to find information, the web searcher receives the query conditions the user submits and responds to the query request. The web searcher is a CGI program running on a web server. It first receives the query conditions submitted by the user, then searches the index database according to those conditions and returns the query results to the user.
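The three parts described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular engine is built: the `TOY_WEB` dictionary is a hypothetical in-memory stand-in for real HTTP requests, the index is a simple inverted index held in memory rather than in Oracle or Sybase, and the function names `crawl`, `build_index`, and `search` are invented for this example.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real robot would fetch pages over HTTP and parse links out of the HTML.
TOY_WEB = {
    "a.example/": ("search engines collect pages", ["a.example/about", "b.example/"]),
    "a.example/about": ("about the robot program", ["a.example/"]),
    "b.example/": ("the indexer builds an index of pages", ["c.example/"]),
    "c.example/": ("users submit query requests", []),
    "d.example/": ("no page links here", []),
}


def crawl(seeds, fetch):
    """Robot: breadth-first traversal from the initial page set,
    following every link found on each newly encountered page."""
    seen = set(seeds)
    queue = deque(seeds)
    pages = {}
    while queue:
        url = queue.popleft()
        result = fetch(url)
        if result is None:  # page could not be retrieved
            continue
        text, links = result
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Indexer: map each word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Searcher: return the URLs that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


pages = crawl(["a.example/"], TOY_WEB.get)
index = build_index(pages)
print(search(index, "index pages"))
```

Note that `d.example/` is never collected because no page reachable from the seed links to it, which is why the choice of the initial page set matters. Re-running `crawl` and `build_index` corresponds to the periodic index update described above.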
Some systems calculate and evaluate the relevance of each web page before returning the results and sort by relevance, placing highly relevant pages first and less relevant pages last. Other systems calculate a page rank for every page before the user ever submits a query; when the query results are returned, pages with a high rank are placed first and pages with a low rank last. The well-known Google (http://www.google.com) is a typical example of this strategy, and its PageRank method is patented. Because PageRank is objective, the results are fair: a company cannot use tricks to push its own pages to the front of the results, because each page is placed according to its calculated rank value. In China, Baidu (www.baidu.com) is currently the best search engine; its speed is comparable to Google's, although its information coverage does not match Google's scale. (Note: Google was first developed by several doctoral students in the computer science department of Stanford University. It currently covers about 2 billion web pages and supports most of the world's widely used languages; in this respect Baidu does not yet match it.)
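The strategy of computing a rank for every page ahead of query time can be illustrated with a small power-iteration sketch. The text above does not give the formula, so this follows the commonly published PageRank formulation as an assumption; the damping factor of 0.85, the iteration count, the toy link graph, and the names `pagerank` and `order_results` are all choices made for this example, not details of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Compute a rank for each page from the link graph alone.

    links: dict mapping each page to the list of pages it links to.
    The damping factor 0.85 is a commonly quoted value (an assumption here).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            targets = [t for t in outgoing if t in links]
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def order_results(hits, rank):
    """Place pages with a high precomputed rank first."""
    return sorted(hits, key=lambda p: rank[p], reverse=True)


# Hypothetical link graph: "c" is linked to by three other pages.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(toy_links)
print(order_results(["d", "a", "c"], rank))
```

Because the rank depends only on the link structure and is computed before any query arrives, the ordering of results does not depend on what an individual page says about itself, which is the fairness property the text attributes to PageRank.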