Search engine principle

xiaoxiao, 2021-03-05

Search engines do not actually search the Internet itself; what they search is a pre-built index database of web pages. A search engine in the true sense usually means a full-text search engine that collects tens of millions to billions of web pages from the Internet, indexes every word (i.e., keyword) in those pages, and builds an index database. When a user searches for a keyword, every page whose content contains that keyword is returned as a search result. After being sorted by a complex algorithm, the results are listed in order of their relevance to the search keyword.
Current search engines generally use hyperlink analysis: in addition to the content of the indexed page itself, they also analyze the URL and anchor text of every link pointing to that page, and even the text surrounding the link. So even if web page A does not contain a phrase such as "Devil Satan", if another web page B points to page A with a link named "Devil Satan", a user searching for "Devil Satan" can still find page A. Moreover, the more web pages (C, D, E, F ...) that point to page A with a link named "Devil Satan", or the better the source pages (B, C, D, E, F ...) that give this link, the more relevant page A is considered when a user searches for "Devil Satan", and the higher it ranks.
The principle of a search engine can be viewed as three steps: crawl web pages from the Internet → build an index database → search the index database.
Crawling web pages from the Internet: a spider program that collects web pages automatically visits the Internet, follows the URLs found in each web page to other web pages, repeats this process, and brings back all the pages it has crawled.
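The crawling step and the anchor-text idea described above can be sketched roughly as follows. This is only a minimal illustration, not how any particular engine works: `FAKE_WEB`, the `*.example` URLs, `LinkExtractor`, and `crawl` are all hypothetical names, and the "web" is an in-memory dictionary rather than real HTTP fetching.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" (URL -> HTML). A real spider would fetch over HTTP.
FAKE_WEB = {
    "http://a.example/": "<p>Welcome to page A.</p>",
    "http://b.example/": (
        '<a href="http://a.example/">Devil Satan</a> and '
        '<a href="http://c.example/">page C</a>'
    ),
    "http://c.example/": '<a href="http://b.example/">back to B</a>',
}


class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl(seed, web=FAKE_WEB):
    """Breadth-first crawl from `seed`; returns collected pages and inbound anchor texts."""
    pages = {}      # URL -> HTML of every page reached
    anchors = {}    # target URL -> anchor texts of links pointing to it
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        html = web.get(url)
        if html is None:        # dead link: nothing to collect
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href, text in parser.links:
            anchors.setdefault(href, []).append(text)
            if href not in seen:
                seen.add(href)
                queue.append(href)
    return pages, anchors


pages, anchors = crawl("http://b.example/")
```

Here page A never contains the phrase "Devil Satan", yet `anchors["http://a.example/"]` records it because page B links to A with that text, which is the information a hyperlink-analysis step would later index.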
Establishing the index database: an analysis and indexing program analyzes the collected web pages and extracts relevant information from each one (including the page's URL, encoding type, the keywords contained in the page content, keyword positions, generation time, size, and link relationships with other web pages). Using a relevance algorithm, it then performs a large number of complex calculations to obtain the relevance of each web page to every keyword in its page content and in its hyperlinks, and uses this information to build the web index database.
Searching and sorting in the index database: when a user enters a keyword, the engine finds all web pages in the web index database that match the keyword. Because the relevance of each of these pages to the keyword has already been calculated, the engine only needs to sort by the existing relevance values: the higher the relevance, the higher the ranking. Finally, the page generation system assembles the link addresses of the search results, the page content summaries, and so on, and returns them to the user.
A search engine's spider generally revisits all web pages periodically (the cycle differs between search engines and may be a few days, weeks, or months; pages of different importance may also be updated at different frequencies) and updates the web index database to reflect changes in web content: adding new page information, removing dead links, and re-ranking according to changes in page content and link relationships. In this way, the specific content of a web page and its changes are reflected in the results of user queries.
Although there is only one Internet, each search engine has different capabilities and preferences, so the web pages they crawl differ, and so do their ranking algorithms.
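The indexing and search steps above might look something like the sketch below. It assumes a deliberately simple relevance score (word count in the page plus a fixed bonus for each inbound anchor-text occurrence); real engines use far more complex calculations, and `PAGES`, `INBOUND_ANCHORS`, `ANCHOR_WEIGHT`, `build_index`, and `search` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical crawled data: page text, plus anchor texts of links pointing at each URL.
PAGES = {
    "http://a.example/": "welcome to page a",
    "http://b.example/": "devil satan fan page",
    "http://c.example/": "links about satan",
}
INBOUND_ANCHORS = {
    "http://a.example/": ["devil satan", "devil satan"],
}
ANCHOR_WEIGHT = 2.0  # assumed bonus per anchor-text occurrence, not a real engine's value


def build_index(pages, inbound_anchors, anchor_weight=ANCHOR_WEIGHT):
    """Build keyword -> {url: relevance}, computing relevance once at index time."""
    index = defaultdict(lambda: defaultdict(float))
    for url, text in pages.items():
        for word in text.split():
            index[word][url] += 1.0               # keyword appears in page content
    for url, texts in inbound_anchors.items():
        for text in texts:
            for word in text.split():
                index[word][url] += anchor_weight  # keyword appears in an inbound link
    return index


def search(index, keyword):
    """Look up one keyword and sort by the precomputed relevance, highest first."""
    postings = index.get(keyword, {})
    return sorted(postings.items(), key=lambda item: (-item[1], item[0]))


index = build_index(PAGES, INBOUND_ANCHORS)
results = search(index, "satan")
```

Because the scores are stored in the index, `search` does no relevance calculation of its own; it only looks up the keyword and sorts. In this toy data, page A ranks first for "satan" purely through the anchor text of links pointing to it.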
The databases of large search engines store index entries for billions of web pages on the Internet, with data volumes reaching several thousand gigabytes or even tens of thousands of gigabytes. But even when the largest search engine builds an index database of more than 2 billion web pages, it covers less than 30% of the web pages on the Internet, and the overlap in web page data between different search engines is generally below 70%. An important reason to use different search engines is that each can find content the others cannot. In addition, a far larger amount of content on the Internet cannot be crawled and indexed by search engines at all, and so cannot be found through them.
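The coverage and overlap figures above can be made concrete with a little arithmetic. The sketch below is only illustrative: the two toy index sets are made up, and since the text does not define "overlap rate", the definition used here (shared pages relative to the smaller index) is an assumption.

```python
# Lower bound on the size of the web implied by the figures in the text:
# 2 billion indexed pages is said to be less than 30% of all pages.
indexed_pages = 2_000_000_000
max_coverage = 0.30
implied_min_web_size = indexed_pages / max_coverage  # roughly 6.67 billion pages

# Toy indexes (page IDs are made up): each engine has crawled a different slice.
engine_a = set(range(0, 10))    # pages 0..9
engine_b = set(range(4, 14))    # pages 4..13


def overlap_rate(a, b):
    """Shared pages relative to the smaller index (an assumed definition)."""
    return len(a & b) / min(len(a), len(b))


overlap = overlap_rate(engine_a, engine_b)   # 6 shared pages out of 10
only_in_b = engine_b - engine_a              # pages engine A would never return
combined = engine_a | engine_b               # what a user of both engines can reach
```

With a 60% overlap in this toy case, using both engines reaches 14 pages instead of 10, which is the practical argument for consulting more than one search engine.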

Please credit the original address when reposting: https://www.9cbs.com/read-31993.html
