Source: e800.com.cn
Search Engine System Architecture

This article describes the system architecture of full-text search engines; unless otherwise noted, "search engine" below means a full-text search engine. A search engine works in four steps: crawl web pages from the Internet → build an index database → search the index database → process and rank the search results.

1. Crawling web pages from the Internet. A web spider program automatically fetches pages from the Internet, follows every URL found in each page to reach further pages, and repeats this process until all reachable pages have been collected onto the server.

2. Building the index database. An indexing program analyzes the collected pages and extracts the relevant information from each one: the URL, the encoding type, the keywords contained in the page content and their positions, the generation time, the size, the link relationships to other pages, and so on. Based on a relevance algorithm it then performs a large amount of computation to obtain, for every keyword appearing in the page content and its hyperlinks, the relevance of each page to that keyword, and stores this information in a web index database.

3. Searching the index database. When a user submits a keyword query, the system decomposes the query and retrieves from the index database all pages that match the keywords.

4. Ranking the search results. Because the relevance of every matching page to each keyword is already recorded in the index database, the system only needs to combine that relevance information with each page's rating into a single relevance value and sort by it; the higher the relevance, the higher the ranking. Finally, a page-generation system assembles the result links and content summaries into a page and returns it to the user.

The figure below shows a typical search engine system architecture, in which the parts interlock with one another. The processing flow is as follows.

The "network spider" fetches pages from the Internet and deposits them in the "page database"; the "URL extraction" module pulls the URLs out of each page and sends them to the "URL database"; the "spider control" module takes page URLs from that database and directs the network spider to fetch further pages, and this cycle repeats until all pages have been fetched.

The system then takes the text of each page from the page database, and the "text indexing" module builds the index that forms the "index database". At the same time, "link information extraction" runs, sending the link information (anchor text, the links themselves, and so on) to the "link database", where it provides the basis for the "web rating".

A "user" submits a query request to the "query server"; the server searches the index database, and the web rating module combines the query request with the link information to score the search results by relevance. The query server then sorts the results by relevance, extracts content summaries around the keywords, and returns the final page to the user.
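To make the four-step pipeline concrete, here is a minimal, self-contained sketch in Python. The in-memory WEB data, the function names (crawl, build_index, search), and the TF-IDF-style scoring are all illustrative assumptions, not any particular engine's implementation; a real spider fetches pages over HTTP and a real engine uses far more elaborate relevance algorithms.

from collections import defaultdict, deque
import math
import re

# A tiny in-memory "web": URL -> (page text, outgoing links).
# Hypothetical data standing in for pages a real spider would fetch.
WEB = {
    "http://a.example": ("search engines crawl the web", ["http://b.example"]),
    "http://b.example": ("an index database maps keywords to pages",
                         ["http://a.example", "http://c.example"]),
    "http://c.example": ("results are ranked by relevance to the keywords", []),
}

def crawl(seed):
    """Step 1: follow links from a seed URL until no new pages remain."""
    pages, queue, seen = {}, deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]          # a real spider would fetch over HTTP
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Step 2: build an inverted index: keyword -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query, total_pages):
    """Steps 3-4: look up each keyword, then rank by a toy TF-IDF score."""
    scores = defaultdict(float)
    for word in query.lower().split():
        postings = index.get(word, {})
        if not postings:
            continue
        idf = math.log(total_pages / len(postings))  # rarer words weigh more
        for url, tf in postings.items():
            scores[url] += tf * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])

pages = crawl("http://a.example")
index = build_index(pages)
for url, score in search(index, "index keywords", len(pages)):
    print(f"{score:.3f}  {url}")

Running the script crawls the three toy pages, indexes them, and prints the pages matching "index keywords" in descending order of score, mirroring the crawl → index → search → rank flow described above.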
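The article does not say how the "web rating" is computed from the link database. One well-known link-based rating is PageRank, so the sketch below uses a simplified PageRank-style iteration over a hypothetical link graph purely as an illustration of how link information can feed a page rating.

# Hypothetical link graph: url -> list of urls it links to.
GRAPH = {
    "http://a.example": ["http://b.example"],
    "http://b.example": ["http://a.example", "http://c.example"],
    "http://c.example": [],  # dangling page: links to nothing
}

def rate_pages(graph, damping=0.85, iterations=20):
    """Simplified PageRank-style rating: a page is rated highly
    when highly rated pages link to it."""
    urls = list(graph)
    n = len(urls)
    rank = {u: 1.0 / n for u in urls}
    for _ in range(iterations):
        new = {u: (1.0 - damping) / n for u in urls}
        for u, outs in graph.items():
            targets = outs if outs else urls   # dangling: spread evenly
            share = damping * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank

for url, score in sorted(rate_pages(GRAPH).items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {url}")

In a full system such a rating would be precomputed from the link database and combined with the per-keyword relevance at query time, as the architecture description above outlines.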