A Brief Discussion of the Search Engine Workflow
The Internet is a treasure house, and the search engine is the key that opens it. However, the vast majority of netizens lack knowledge and skill in using search engines. One survey conducted abroad found that about 71% of respondents had been disappointed with their search results. For the second most widely used service on the Internet, this situation ought to change. The rapid development of the Internet has brought explosive growth of online information: the Web now holds more than 2 billion pages, and some 7.3 million are added every day. Finding information in such a vast ocean is as hard as finding a needle in a haystack. Search engines are the technology that emerged to solve exactly this problem. A search engine's work comprises the following three processes:
1. Discovering and collecting web information on the Internet;
2. Extracting and organizing that information to build an index library;
3. Searching the index library according to the query the user enters, evaluating the relevance of each document to the query, sorting the results, and returning them to the user.
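Before looking at each process in turn, here is a minimal, self-contained sketch of the whole workflow in Python. The "web" is just an in-memory dictionary of pages, and all names are illustrative assumptions, not any particular engine's implementation:

from collections import defaultdict

# Stage 1: "collected" web pages; a real spider would fetch these.
pages = {
    "http://example.com/a": "search engines index the web",
    "http://example.com/b": "spiders crawl the web for pages",
}

# Stage 2: build a simple inverted index mapping each term to the
# set of URLs whose page text contains it.
index = defaultdict(set)
for url, text in pages.items():
    for term in text.split():
        index[term].add(url)

# Stage 3: retrieve the documents matching every term of the query.
# A real engine would also rank them by relevance before returning.
def search(query):
    terms = query.split()
    if not terms:
        return []
    hits = set.intersection(*(index[t] for t in terms))
    return sorted(hits)

print(search("web pages"))   # -> ['http://example.com/b']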
Discovering and collecting web information
A high-performance "web spider" program (Spider) is needed to gather information from the Internet automatically. A typical web spider works by viewing one page and extracting the relevant information from it, then following every link on that page and repeating the process, and so on until the links are exhausted. A web spider must be both fast and comprehensive. To browse the entire Internet quickly, web spiders usually rely on preemptive multithreading to accumulate information. With preemptive multithreading, the spider can index a page reached from one URL, start a new thread for each new URL link it finds there, and use each new URL as a fresh starting point for indexing. Of course, the number of threads running on a server cannot grow without bound; a balance must be struck between keeping the server operating normally and collecting pages quickly. The algorithms used by the various search engine technology companies may differ, but the goal is the same: to browse web pages quickly and feed the subsequent processing stages. Among domestic search engine companies, Baidu's web spider, for example, uses a customizable, highly scalable scheduling algorithm that lets it collect the maximum amount of Internet information in a very short time and save what it gathers for index building and user retrieval.
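As a concrete illustration of this idea, the following is a minimal sketch of a multithreaded spider using only the Python standard library. The seed URL, the thread count, and the crude regex-based link extraction are all assumptions for the example; a production spider would use a real HTML parser, obey robots.txt, and index each page it fetches:

import queue
import re
import threading
from urllib.parse import urljoin
from urllib.request import urlopen

NUM_THREADS = 8                     # cap on threads: balance speed vs. server load
frontier = queue.Queue()            # URLs waiting to be visited
visited, visited_lock = set(), threading.Lock()

def fetch_links(url):
    """Download one page and return the absolute URLs it links to."""
    html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
    return [urljoin(url, href) for href in re.findall(r'href="([^"]+)"', html)]

def worker():
    while True:
        url, depth = frontier.get()
        try:
            with visited_lock:
                if url in visited or depth == 0:
                    continue
                visited.add(url)
            try:
                links = fetch_links(url)    # a real spider would also index the page here
            except (OSError, ValueError):
                continue                    # unreachable page or non-HTTP link
            for link in links:              # each new URL becomes a new unit of work
                frontier.put((link, depth - 1))
        finally:
            frontier.task_done()

for _ in range(NUM_THREADS):
    threading.Thread(target=worker, daemon=True).start()

frontier.put(("http://example.com/", 2))    # seed URL and crawl depth
frontier.join()                             # block until the frontier is exhausted
print(f"crawled {len(visited)} pages")

The fixed-size pool of worker threads is one simple way to realize the balance point mentioned above: each new link spawns a new unit of work rather than a new thread, so the server never runs more than NUM_THREADS threads at once.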
Building the index library
Whether users can quickly find the most accurate and comprehensive information depends on the index library, which must be built quickly and keep its information current. Web pages are ranked objectively by combining analysis of page content with hyperlink analysis, which goes a long way toward ensuring that the search results match the user's query string. The Sina search engine, for example, builds its index from website data such as the site title, site description, site URL, and a quality rating, so that the results returned stay consistent with the user's query string. While building the index library, the Sina search engine indexes all newly gathered data in parallel, so that the index can be constructed quickly and the data kept up to date. It also tracks user searches during index building and creates cached result pages (Cache pages) for query strings that occur with high frequency.
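The caching idea described here can be sketched in a few lines. This is a minimal illustration, not Sina's implementation: the frequency threshold and the names search_index and answer_query are assumptions for the example:

from collections import Counter

CACHE_THRESHOLD = 3          # assumed cutoff for a "high-frequency" query
query_counts = Counter()     # how often each query string has been seen
result_cache = {}            # query string -> cached result page

def search_index(query):
    """Placeholder for the real index lookup and relevance ranking."""
    return [f"result for {query!r}"]

def answer_query(query):
    query_counts[query] += 1
    if query in result_cache:                    # serve the pre-built cache page
        return result_cache[query]
    results = search_index(query)
    if query_counts[query] >= CACHE_THRESHOLD:   # query is now "frequent": cache it
        result_cache[query] = results
    return results

for _ in range(4):
    print(answer_query("news"))   # the fourth call is served from the cache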
User retrieval process