A Technical Introduction to Search Engines

xiaoxiao, 2021-03-06

1. System Structure

A search engine works in stages: crawled web pages are used to build an index library, queries are answered by searching that index, and the matching results are ranked. The inverted index, the data structure at the heart of this pipeline, is widely used in today's information retrieval systems, including web search engines.

2. The PageRank Algorithm

The PageRank algorithm rests on two premises.

Premise 1: A page that is referenced many times may be very important; a page that is referenced only a few times, but is referenced by an important page, may also be important. The importance of a page is passed on, divided evenly, to the pages it references. Important pages of this kind are called authority pages.

Premise 2: Assume a user starts at a random page and keeps following links forward to browse the web, never going back. The probability that this random surfer is viewing a given page is that page's PageRank value.

First, PageRank is not a ranking of an entire website: it is computed for each individual page. Second, the PageRank value of a page A depends recursively on the PageRank values of the pages that link to A. The value PR(Ti) of a linking page Ti is not passed to PR(A) in full; it is divided by C(Ti), the number of outbound links on Ti. In other words, the more outbound links Ti has, the less A benefits from its single link on Ti. PR(A) is the sum of these weighted contributions, so every additional inbound link increases PR(A). Finally, the sum of the weighted PR(Ti) is multiplied by a damping factor d, whose value lies between 0 and 1; the damping factor reduces the contribution other pages make to the rank of the current page A.

PageRank algorithm 1:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

PageRank algorithm 2 (a revision of algorithm 1):

PR(A) = (1 - d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where N is the total number of pages on the web. Under algorithm 2 the ranks of all pages form a probability distribution that sums to 1: a page's rank is the probability that the random surfer is currently visiting it. Under algorithm 1 that probability is scaled by the total page count, so a page's rank is the expected number of visits, and the ranks of all pages sum to N.

For a small set of pages the rank equations can be solved directly as a system of linear equations; for the billions of pages on the real web this is impossible. As an example, take three pages A, B, and C, where A links to B and C, B links to C, and C links to A. The damping factor d is set to 0.5 here, although Lawrence Page and Sergey Brin set it to 0.85:

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A)/2)
PR(C) = 0.5 + 0.5 (PR(A)/2 + PR(B))

Solving this system gives:

PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615

so that PR(A) + PR(B) + PR(C) = 3, the total number of pages.

Iterative computation of PageRank. Google uses an approximate, iterative method to compute page ranks: each page is given an initial value, and the formula above is applied in a loop for a bounded number of rounds. According to the paper published by Lawrence Page and Sergey Brin, about 100 iterations are needed to obtain satisfactory rank values for the entire web; the small example here needs only about 10. During the iteration, the sum of the ranks of all pages converges to the total number of pages in the network.
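To make the loop concrete, here is a minimal Python sketch (an illustration, not code from the original article) that iterates the d = 0.5 example above. Values are updated in place, page by page, which is what reproduces the table that follows.

```python
# Minimal sketch of iterative PageRank (algorithm 1) for the three-page
# example: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.5                               # damping factor; Page and Brin used 0.85
pr = {page: 1.0 for page in links}    # initial rank 1 for every page

for step in range(12):
    for page in ("A", "B", "C"):
        # Sum PR(Ti)/C(Ti) over every page Ti that links to `page`,
        # reading ranks in place so later pages see the updated values.
        incoming = sum(pr[t] / len(links[t]) for t in links if page in links[t])
        pr[page] = (1 - d) + d * incoming
    print(step + 1, pr["A"], pr["B"], pr["C"])

# Converges to PR(A) = 14/13, PR(B) = 10/13, PR(C) = 15/13, summing to 3.
```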

Therefore, the average rank of a page is 1, and actual rank values lie between (1 - d) and (dN + (1 - d)). The iteration for the example above proceeds as follows:

Iteration    PR(A)         PR(B)         PR(C)
0            1             1             1
1            1             0.75          1.125
2            1.0625        0.765625      1.1484375
3            1.07421875    0.76855469    1.15283203
4            1.07641602    0.76910400    1.15365601
5            1.07682800    0.76920700    1.15381050
6            1.07690525    0.76922631    1.15383947
7            1.07691973    0.76922993    1.15384490
8            1.07692245    0.76923061    1.15384592
9            1.07692296    0.76923074    1.15384611
10           1.07692305    0.76923076    1.15384615
11           1.07692307    0.76923077    1.15384615
12           1.07692308    0.76923077    1.15384615

3. HITS and the Hub/Authority Method

In the PageRank algorithm, a page's rank contribution is divided equally among its outgoing links; the differing importance of individual links is not taken into account. Real web links, however, have the following characteristics:

1. Some links carry annotations, while others exist only for navigation or advertising; only an annotated link is useful for judging authority.
2. For commercial or competitive reasons, very few pages link to the authoritative pages of their competitors.
3. Authoritative pages rarely describe themselves explicitly; the Google homepage, for example, does not explicitly describe itself as a web search engine.

It follows that distributing weight evenly across links does not match how links are actually used.

The HITS algorithm proposed by J. Kleinberg introduces a second kind of page, the hub page. A hub page provides a collection of links to authoritative pages. The hub itself may not be important, and few pages may point to it, but it gathers links to the most important sites on a topic; the list of recommended references on a course home page is an example. In general, a good hub page points to many good authority pages, and a good authority page is pointed to by many good hub pages. This mutually reinforcing relationship between hub and authority pages can be exploited to discover authoritative pages and to automatically discover web structures and resources; that is the basic idea of the hub/authority method.

HITS (Hyperlink-Induced Topic Search) is the search method built on the hub/authority idea. The algorithm runs as follows. Submit the query q to a traditional keyword-based search engine, which returns many pages; take the top n of them as the root set S. S satisfies three conditions:

1. The number of pages in S is relatively small.
2. Most pages in S are relevant to the query q.
3. S contains many authoritative pages.

S is then expanded into a larger set T by adding the pages that pages in S reference and the pages that reference pages in S. The hub pages in T form the vertex set V1 and the authority pages form the vertex set V2; the links from pages in V1 to pages in V2 form the edge set E, yielding a bipartite graph SG = (V1, V2, E). For each vertex v in V1, h(v) denotes the hub value of page v; for each vertex u in V2, a(u) denotes the authority value of page u.
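The whole procedure can be sketched in a few lines of Python. This is an illustrative sketch, not code from the original article: the search results and the link structure are assumed to be available as plain in-memory data, and the iteration implements the I and O operations defined by equations (1) and (2) below.

```python
import math

def build_base_set(results, outlinks, n=200):
    """results: ranked list a keyword engine returned for query q;
    outlinks: dict mapping each known page to the pages it links to."""
    root = set(results[:n])                    # root set S: top-n results
    base = set(root)                           # expanded set T
    for page in root:
        base.update(outlinks.get(page, ()))    # pages referenced by S
        base.update(p for p, outs in outlinks.items() if page in outs)
    return base

def hits(results, outlinks, n=200, rounds=50):
    base = build_base_set(results, outlinks, n)
    # Edge set E of the bipartite graph SG = (V1, V2, E); one page may
    # act as both a hub and an authority.
    edges = [(v, u) for v in base for u in outlinks.get(v, ()) if u in base]
    h = {v: 1.0 for v, _ in edges}             # h(v) = 1 at the start
    a = {u: 1.0 for _, u in edges}             # a(u) = 1 at the start
    for _ in range(rounds):                    # fixed rounds approximate convergence
        for u in a:                            # I operation, equation (1) below
            a[u] = sum(h[v] for v, u2 in edges if u2 == u)
        for v in h:                            # O operation, equation (2) below
            h[v] = sum(a[u] for v2, u in edges if v2 == v)
        na = math.sqrt(sum(x * x for x in a.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in h.values())) or 1.0
        a = {u: x / na for u, x in a.items()}  # normalize a(u)
        h = {v: x / nh for v, x in h.items()}  # normalize h(v)
    return h, a
```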

At the beginning, h(v) = a(u) = 1 for every page. The I operation recomputes each authority value a(u), and the O operation recomputes each hub value h(v); the values a(u) and h(v) are then normalized. The operations I and O are repeated until a(u) and h(v) converge:

a(u) = sum of h(v) over all edges (v, u) in E    (1)
h(v) = sum of a(u) over all edges (v, u) in E    (2)

Equation (1) reflects the fact that if a page is pointed to by many good hubs, its authority value rises accordingly (the authority value of a page is the sum of the current hub values of all pages that point to it). Equation (2) reflects the fact that if a page points to many good authority pages, its hub value rises accordingly (the hub value of a page is the sum of the authority values of all pages it links to).

PageRank and HITS, described above, are the most widely used link analysis algorithms. Both rest on two assumptions:

1. A link conveys the author's endorsement: if there is a link from page A to page B, and A and B have different authors, then the author of A found B useful. The importance of a page can therefore propagate to the pages it links to.
2. Pages that are referenced together on the same page are likely to concern the same topic.

In practice, however, these assumptions often fail. A typical example is the page at http://news.yahoo.com: it contains regions with several different semantics (marked in the original illustration with rectangles of different colors) as well as many links that serve only advertising and navigation. As a result, the importance of many pages is likely to be miscalculated by PageRank, and the HITS algorithm may suffer from topic drift. The root cause of both problems is that a single page often carries many different semantics, and different parts of a page generally have different levels of importance. From the perspective of semantic partitioning, the page should not be the smallest unit: the links in different semantic blocks typically point to pages on different topics, so these semantic blocks, rather than whole pages, should be treated as the smallest units of information.

4. VIPS and Block-Level Link Analysis

The following introduces a novel vision-based page segmentation method, VIPS (Vision-based Page Segmentation). The web pages people browse are rendered by a browser, and the rendering contains many visual cues that help people distinguish the parts of a page, such as lines, blank areas, images, and colors. For ease of reading and understanding, each visually closed block generally carries a single semantics. Note that traditional semantic analysis is content-based; it is slow, difficult, and often inaccurate. VIPS sets content-based semantic analysis aside and divides the page into blocks using the page's visual cues, a process similar to a person's visual understanding of the page layout. This enables two new link analysis algorithms: block-level PageRank (BLPR) and block-level HITS (BLHITS).

First, the page is decomposed into blocks represented as a tree structure. Each node represents a block and is assigned a value, DoC (degree of coherence), which measures how visually consistent the content of the block is. The segmentation method first builds the HTML DOM tree and then finds the separators (such as horizontal and vertical lines) that divide the DOM tree into blocks.
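The block tree that segmentation produces can be modeled with a small data structure. This is an illustrative sketch: the field names, the DoC scale, and the stopping threshold (the permitted degree of coherence, PDoC, from the VIPS report) are assumptions in the spirit of the method, not code from it.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Block:
    """A node in the VIPS block tree; `doc` is the block's degree of
    coherence (DoC). Field names here are illustrative."""
    html: str                                  # DOM fragment the block covers
    doc: float                                 # visual consistency of the block
    children: List["Block"] = field(default_factory=list)

def leaf_blocks(node: Block, pdoc: float) -> List[Block]:
    # Collect the final semantic blocks: a block whose DoC reaches the
    # permitted degree of coherence is treated as one semantic unit and
    # is not split further.
    if node.doc >= pdoc or not node.children:
        return [node]
    out: List[Block] = []
    for child in node.children:
        out.extend(leaf_blocks(child, pdoc))
    return out
```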

For the details, see D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, "VIPS: A Vision-based Page Segmentation Algorithm," Microsoft Technical Report MSR-TR-2003-79, 2003.

From the segmentation, the page-to-block and block-to-page relations can be expressed as two matrices, and based on these two matrices two graph models can be built: the page graph GP = (VP, EP, WP) and the block graph GB = (VB, EB, WB). In each graph, V is the vertex set (pages and blocks, respectively), E is the set of edges between the vertices, and W is the weight matrix defined on those edges.

Block-level PageRank takes the weight matrix of the page graph GP and first normalizes each row so that its entries sum to 1, obtaining a jump-probability matrix M. A surfing model can then be defined on M: every visitor arriving at a page follows, with probability (1 - ε), a randomly picked link on the current page, and with probability ε jumps instead to another URL picked at random from the favorites.
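A minimal sketch of this surfing model follows, assuming the page-graph weight matrix is given as a dense list of lists; the uniform random jump here stands in for the pick from the favorites, and eps = 0.15 is an illustrative choice.

```python
def block_level_pagerank(WP, eps=0.15, iters=100):
    """WP: n x n page-graph weight matrix (list of lists), assumed given."""
    n = len(WP)
    # Normalize each row of WP to sum to 1, giving the jump-probability
    # matrix M; a row with no outgoing weight jumps uniformly.
    M = []
    for row in WP:
        s = sum(row)
        M.append([w / s for w in row] if s > 0 else [1.0 / n] * n)

    # Random-surfer model: with probability (1 - eps) follow a link chosen
    # by M; with probability eps jump to a URL chosen uniformly at random.
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [eps / n + (1 - eps) * sum(r[i] * M[i][j] for i in range(n))
             for j in range(n)]
    return r

# Example: three pages with the link structure from the PageRank section.
print(block_level_pagerank([[0, 1, 1], [0, 0, 1], [1, 0, 0]]))
```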

