Google's Secret - PageRank thoroughly explained Chinese version

xiaoxiao2021-03-05  48

Google's Secret - PageRank thoroughly explained Chinese version of the original:

Google Secret - PageRank thoroughly explained hajime baba /

Race

translation:

KRENY / Yuan Huanglin

Dalouis.com>

Created in: 2003/12 Last Update: October 28, 2004 3:53 Keywords: PageRank, Google, Link

Translation Description: The translation of some statements uses the translation, making it as possible to comply with Chinese understanding and explanation.

Copyright Notice: You can reprint anything, please be sure to indicate the original source and author information and this statement by hyperlink.

Http://linux.dalouis.com/pagerantan_cn.htm

Back to home page

This article is a search engine as a high evaluation.

One of the basic concepts and evaluation principles of one of the core technologies of PageRank (web level).

index

Foreword PageRank's basic concept How to find the problem of PageRank actual application Namazu on the actual installation experiment on PageRank's personal insights reference: "gunguru? / Gouguru?"

★ (2003/7/1) The name of the "Namazu system is constructed and used" has been revised. Please see the introduction page for details.

★ (2003/5/20) The online news report (Japanese) related to Google has been separated to another page (Googlenews.html).

★ (2001/2/28) Namazu's index used to calculate PERANK's Perl script prNMZ-1.0.tar.gz public download.

1 Introduction

Recently, the search engine google (http://www.google.com/) is very eye-catching. Google is a retrieval service based on the Search Engine developed by the Search BRIN (February 2001) based on the CEO-based Larry Page (February 2001). Google starts service from September 1998, but Netscape Communications starts with its cooperation in Google's test phase. The Inktomi, which is originally collabted into Google. The Japanese version Google officially debuted in September 2000 and is now adopted by Biglobe (NEC). (Note: In April 2001 Yahoo! japan and @ nifty, Sony in July, January 2002 Excite also successively established collaborative relationship with Google).

The advantage of Google's evaluation is not only to remove useless (advertisement) slogans constitute a single page of the function, but a single Cache system, dynamically made a summary information, a dispersion system set by high-speed retrieval (thousands of LINUX cluster ), Etc., the biggest advantage is the correctness of its search results. A technique that automatically judges the importance of web page "PageRank is (web page level)" is a technology designed for this. The purpose of this article is to explain the summary and principles of the PageRank system in terms of easy-to-understand language as much as possible.

The following is a basic article of PageRank. Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, 'The PageRank Citation Ranking: Bringing ORDER TO The Web', 1998,

http://www-db.stanford.edu/~Backrub/pageRanksub.ps

In order to calculate PageRank more efficiently, the following is a paper after improvement. Taher H. Haveliwala, 'Efficient Computation of PageRank', Stanford Technical Report, 1999, http://dbpubs.stanford.edu: 8090 / Pub / 1999-31

In addition, the following is a PageRank's demonstration information. Larry Page, 'PageRank: Bringing Order to the Web',

http://hci.stanford.edu/~page/papers/pageRank/ (has been invalid)

Next, the two articles (additional information) will be basically described. First, use a simple example to explain the concept of PageRank, and then attribute to the sequencing system using hyperlink relationships to solve the problem of the characteristic value of large-scale loose matrices. Then we will contact some questions and corresponding methods that occur when applying the basic model in the real world. Next, in order to explore whether it can be used as "personal PageRank", the installation experiment of the free full-text retrieval system Namazu is performed and the results are elaborated. Finally, I published my personal opinion on PageRank.

In addition, in order to understand the following description, mathematical knowledge (especially linear algebra) is required (especially linear algebra). However, in order to make the liberal arts can also read the problem as much as possible to explain the problem as much as possible. At the same time, in order to join the author's personal opinion, there is no more algorithms and numbers like the original text, and there are many not strict and correct places. , In advance, in advance. For details, please refer to the original text.

PageRank (TM) is a registered registered trademark of Google, USA.

2. Basic concept of PageRank

PageRank is based on the return relationship between "web pages, must still be a high-quality web page" from many high-quality web links, to determine the importance of all web pages.

In the following lengthy descriptions, many parts of a large number of professional terms will cause understanding. Although this chapter is ready to focus on qualitative and simple explanation, even if so, when you don't understand, you can understand how you can understand "web pages from many high-quality web links, must be a high quality page" Thinking methods are also very valuable. This is the most important thinking method because of all the points.

From Google's own introduction "Google's popular secrets (http://www.google.co.jp/intl/ja/why_use.html)" is the same below.

About PageRank PageRank, effectively utilizes the characteristics of the huge link constructs owned by the Web. The link from the page A-oriented web B is constructed as a support for page A. The Google determines the importance of the page based on this voter. However, Google does not only look at the number of votes (ie the number of links), and the page of the voting is also analyzed. The value of the ticket voted by the "Important" high, because this voting page will be understood as "important items". According to such an analysis, it is given a high evaluation of the important page that will be given higher Page Rank, which will increase in the retrieval result. PageRank is a comprehensive indicator of the importance of webpage in Google, and will not be affected by various retrieval (engines). It is better to say that PageRank is based on the analysis of the analysis of "link constructs" with complex algorithms, thereby the characteristics of each page itself. Of course, there is no meaning of the page with high importance if there is no meaning of the search words. To this end, Google uses a refined text matching technique that enables it to retrieve important and correct pages.

Through the following graphs, we have to see the algorithms just explained. The specific algorithm is, divides the PageRank of a page to the forward link existing in this page, and the resulting value is added to the PageRank of the page pointed to by the page pointed outward, that is, the PageRank of the page is obtained. PageRank concept illustration. (Cited from Page et al. (1998) Figure 2 'Simplified Page Calculation')

Let us look at it in detail. Improve PageRank's key points, there are three.

Reverse link number (simple sense indicator in a simple sense) reverse link from the highly recommended page (with a popular indicator) reverse link source page (selected probability indicator)

First of all, the most basic is that many page links will increase the recommendation. That is to say, "Many page links), the popular page, must be high quality pages." So the number of reverse links as a popularity of the popularity is a natural idea. This is because "link" is a recommended behavior that is seen as "can look at this page / page." However, it is worthy of pride that PageRank's thinking method does not stay in this place.

That is, it is not only through the number of reverse links, but also the reverse link of the recommended higher page with a higher evaluation. At the same time, the link from the total number of pages from the total link is given higher evaluation, and the link from the number of pages from the total link is given lower evaluation. In other words, "Collecting a number of recommended pages recommended by the page, must also be the same as the same page", "Compared with the linked link by the chaos, it is definitely a high quality link. These two judgments are carried out. On the one hand, a regular link from others High level web page will be clarified, on the other hand, links from the bookmarks that have completely do not have access to a bookmark page will be "almost no value (although it is not linked) It is better to be sighful.

Therefore, if you are linked from the very high site similar to Yahoo!, only this page will rise at once; vice versa, no matter how much reverse link number, if all are all from those do not make much meaningless If the page link came over, PageRank will not rise easily. Not only Yahoo !, in a certain area can be called a reverse link from the authoritative (or fixed) page. It is very beneficial. However, it is just a link to make some of my own companions, such as the fact that "simple internal care" is difficult to see what value. That is, it is truly valuable to judge (your webpage) from the viewpoint of watching all the web pages worldwide.

These indicators are combined, and eventually form a search structure that will be relatively retrofitted by the higher evaluation of the page.

The past practices simply use the number of reverse links to evaluate the importance of the page, but the advantage of the way PageRank is unacceptable to the mechanical link. That is, in order to improve the reverse link of the high quality page to improve the PageRank. For example, if you delegate Yahoo! to log in to your own website, you will have a sudden rise in PageRank. But for this purpose, you must work to make (web pages) enrichment. In this way, it is made substantially not to improve the proximity of PageRank (or the back door). Not only limited to PageRank (Clever and Hits, etc.), in the sorting system with link constructs, previously pure SPAM methods will not be common. This is the biggest advantage, and the biggest reason for Google is convenient for use. (Although it is the biggest reason, it is not the only reason.)

Note here, PageRank itself is quantified by Google, and the expression of the user retrieved content is completely independent. Just like the back is about to be explained, the search statement does not present in PageRank's own calculation formula. No matter how much search statement, PageRank is also a certain amount of score inherent.

The qualitative statement of PageRank is roughly such a few. However, in order to actually calculate the order, the comparison level requires a more quantitative discussion. The following chapter will make a detailed description. 3. How to get PageRank

What we are interested is that when there is a cross-link configuration, it is quantified which page is "important". In other words, this is the process of strictly calculating this indicator of "which page should be read from which page should be started". Even if you don't read the small page, there is no way.

Then, generally, in order to make the hyperlink configuration like the web can be reflected in the order of the order, there is a need to establish a digital model of the hyperlink configuration on the computer. How to model how to depends on the policy of the installer, so that if the chart theory is applied to observe the hyperlink structure, it will eventually return to the line algebra. This is also the same for PageRank.

Calculation method

As the most basic consideration, it is to express link relationships in the form of row array. When you link from the page i to another page J, the component is defined as 1, and it is defined as 0. That is, the ingredient Aij of the row array A can be used,

Aij = 1 if (from page i to page J "with" link) 0 IF (from page i to page J "No" link)

To represent. If the number of files is represented by n, this row line becomes N × N 's square. This is equivalent to the "adjacency ranks" in the chart theory. That is, the link between the web can be seen as an adjacency of the chart S. All in all, as long as the link is established, there should be an adjacency.

(* Note) The graphic composed of the line connected by the point and point is referred to as "Graph". These points are called "vertex" or "node"; these lines are called "Edge" or "Arc (ARC)". The chart is divided into two categories, "edge" no direction chart is called "undirected graph", "edge" with direction charts are called "Directed Graph". The road to the chart is imagined into one-way traffic. The chart can be represented by a variety of methods, but generally used in the data structure "Adjacency Matrix" and "Adjacency List". It should be noted that if it is a non-map chart, the adjacent row column A is a symmetrical row, and if there is a chart, a will become asymmetric rows.

The following is an adjacency row of the online manual (128 pages) of the Apache represented by a bitmap. When the black dot is arranged laterally, it means that this page has a lot of forward links (ie, a link to which to export); in turn, when the black store is arranged in longitudinally, it means that this page has a lot of reverse links.

Example of adjacent rows (using Apache online manual)

The ranks of PageRank is backward (rows and columns to each other), in order to turn the sum of the column vectors into 1 (full rate), divide each column vector (non-zero) Number of elements). The ranks such as the "Promotion Probability Rows" containing N probability variables, each row vector represents the probability between the state. The reasons for inversion is that PageRank is not paying attention to "How many places" "is" "is". "

The calculation of PageRank is to seek an intrinsic vector (preferably a vector) of the maximum characteristic value of this probability row.

This is because when the linear transformation line T → ∞ is getting up, we can simply describe it fundamentally based on the "absolute value maximum value" and "inherent vector belong to it" depending on the transform line. In other words, the probability process represented by the promotion probability row is a process of multiplying the multiplication of this row, and the probability of the front state can be calculated.

Furthermore, although it sounds difficult, the value of the characteristic value and the inherent vector is a basic mathematical means capable of strictly analyzing. We are able to freely assign a value to the initial value of the vector, but the resulting vector will be concentrated in a combination of some specific values. We refer to the combination of stable values ​​as inherent vectors, refer to the characteristic scalar (scalar) in the intrinsic vector, and the problem of the calculation method is generally referred to as the decomposition characteristic value, referring to the problem of the problem value of the characteristic value is called the characteristic value. problem. (* Note) The number of squares that satisfy AX = = λx is called A, which is the intrinsic vector belonging to λ. If you can't adapt to the concept of the ranks, you can also consider the N × N binary arrangement. At the same time, the vector can be considered as a common (one dollar) arrangement of the length N.

Simple example

Let us use a simple example to try to calculate PageRank by times. First consider 7 HTML files that are like the link relationships shown below. Also, the link relationship between these HTML files is only closed in this 1-7 file. That is, there is no other link in addition to these documents. Also note that all pages have a forward and reverse link (ie there is no end point), which is also an important assumption that will be proposed later, which is not discussed here.

Transfer map showing mutual link relationship between page

First, the adjacent list of this transmissive graph chart is represented as an arrangement, and there is the following form. That is, the ID of the link target is listed in accordance with each link source ID.

Link source I D Link Targets ID1 2, 3, 4, 5, 72 13 1, 2 4 2, 3, 55 1, 3, 4, 6 6 1, 57 5

The adjacent row column A of the link relationship represented in this abutment list is the following 7 × 7 square rows. A Bitmap Matrix is ​​only features (Bitmap Matrix). The horizontal view line represents the file ID from the file I forward link.

A = [0, 1, 1, 1, 1, 0, 1; 1, 0, 0, 0, 0, 0, 0; 1, 1, 0, 0, 0, 0; 0, 1, 1 , 0, 1, 0, 0; 1, 0, 1, 1, 0, 0; 1, 0, 0, 0, 1, 0, 0; 0, 0, 0, 0, 1, 0, 0 ;]

The PageRANK-style brushed probability row M is obtained by dividing the A inverted each numerical value after the respective non-zero elements. That is, the following 7 × 7 square rows. Landscape View 第 第 非 零 要 表示 表示 表示 表示 文件 文件 文件 文件 文件 文件 文件 文件 文件 文件 文件 文件 文件 文件 指.... Note that the value of each column is added to 1 (probably).

M = [0, 1, 1/2, 0, 1/4, 1/2, 0; 1/5, 0, 1/2, 1/3, 0, 0, 0; 1/5, 0, 0 , 1/3, 1/4, 0, 0; 1/5, 0, 0, 0, 1/4, 0, 0; 1/5, 0, 0, 1/3, 0, 1/2, 1 ; 0, 0, 0, 0, 1/4, 0, 0; 1/5, 0, 0, 0, 0, 0, 0;]

Indicates the vector R of PageRank (the queue of the number of points of each page), there is a relationship of R = CMR (C to Quate). In this case, R corresponds to the inherent vector in the line algebra, and C corresponds to the countdown of the corresponding characteristic value. In order to find R, as long as the characteristic value of this square row M is decomposed.

There is a variety of numerical analytics methods when decomposing characteristic values, but this article will not be described in detail here. Please read a proper textbook (there must be such a book in your summer vacation). The embedded textbook). Here, we are temporarily using the gnu octave this calculation program actually calculates the characteristic value and inherent vector.

(* Note) Gnu Octave, is a supporting numerical calculation, similar to the programming language of the descriptive excellent MATLAB. The extended processing language is more suitable for rantrifier calculations, but is basically composed to the C language, so the readability is high. For details, please refer to http://www.octave.org/. Of course, in addition to Octave, Matlab and Scilab are also very good languages, but according to GPL, Octave is the easiest. Actual example

Let's take a practical example. If you don't understand what the following example is doing, just think we can use the Octave this program to solve the feature value problem.

First, create the following Octave scripts using the appropriate editor. (In the tail plus the semicolon, you can eliminate the output of excess results, but this time is dedicated to the explanation.)

% CAT PageRank.m #! / usr / bin / octave ## PageRank.m - Computing PageRank (TM) Simple GNU Octave Script ## Sets Timer. Tic (); ## According to the definition of PageRank, the extended probability row column links from the file i to the file J is defined as m (i, j) m = [0, 1, 1/2, 0, 1/4) , 1/2, 0; 1/5, 0, 1/2, 1/3, 0, 0, 0; 1/5, 0, 0, 1/3, 1/4, 0, 0; 1/5 , 0, 0, 0, 0, 0, 1/3, 0, 1/2, 1; 0, 0, 0, 0, 1/4, 0, 0 1/5, 0, 0, 0, 0, 0, 0;] ## Computing the characteristic value of all M and the combination of the inherent vector column. [V, D] = EIG (M) ## Save the inherent vector to Eigenvector corresponding to the absolute value maximum value value. Eigenvector = V (:, Find (ABS (D)) == Max (ABS (DIAG (D))))) ## pageRank is the value obtained after the EigenVector is standardized on probability vector. PageRank = eigenvector. / NORM (Eigenvector, 1) ## output calculation time. ELAPSED_TIME = TOC ()

(2003/7/23: Fix the error of the above script.)

Mistake: Eigenvector = V (:, Find))))): Eigenvector = V (:, Find (ABS (D)) == Max (ABS (DIAG (D)) )))))))

Run this PageRank.m script with Octave After getting the following results in the standard output.

% Octave PageRank.m Gnu Octave, Version 2.0.16 (i586-redhat-linux-gnu). Copyright (C) 1996, 1997, 1998, 1999, 2000 John W. Eaton. this is Free Software with Absolutely No Warranty. for details, type `warranty '. M = 0.00000 1.00000 0.50000 0.00000 0.25000 0.50000 0.00000 0.20000 0.00000 0.50000 0.33333 0.00000 0.00000 0.000000.20000 0.00000 0.00000 0.33333 0.25000 0.00000 0.00000 0.20000 0.00000 0.00000 0.00000 0.25000 0.00000 0.00000 0.20000 0.00000 0.00000 0.33333 0.00000 0.50000 1.00000 0.00000 0.00000 0.00000 0.00000 0.25000 0.00000 0.00000 0.20000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 V = Columns 1 through 3: 0.69946 0.00000i 0.63140 0.00000i 0.63140 0.00000i 0.38286 0.00000i -0.28715 0.15402i -0.28715 - 0.15402i 0.32396 0.00000i -0.07422 - 0.10512 i -0.07422 0.10512i0.24297 0.00000i 0.00707 - 0.24933i 0.00707 0.24933i 0.41231 0.00000i -0.28417 0.44976i -0.28417 - 0.44976i 0.10308 0.00000i 0.22951 - 0.13211i 0.22951 0.13211i 0.13989 0.00000i - 0.2 2243 - 0.11722i -0.22243 0.11722i Columns 4 through 6: 0.56600 0.00000i 0.56600 0.00000i -0.32958 0.00000i 0.26420 - 0.05040i 0.26420 0.05040i 0.14584 0.00000i -0.10267 0.14787i -0.10267- 0.14787i 0.24608 0.00000i -0.11643 0.02319i -0.11643 - 0.02319i -0.24398 0.00000i -0.49468 - 0.14385i -0.49468 0.14385i 0.42562 0.00000i -0.14749 0.38066i -0.14749 - 0.38066i -0.64118 0.00000i 0.03106 - 0.35747 I 0.03106 0.35747i 0.39720 0.00000i Column 7: 0.00000 0.00000i -0.00000 0.00000i 0.00000 0.00000i -0.00000 0.00000i 0.81650

0.00000i-0.40825 0.00000i D = Columns 1 through 3: 1.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i -0.44433 0.23415i 0.00000 0.00000i0.00000 0.00000i 0.00000 0.00000i -0.44433 - 0.23415i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000 i Columns 4 through 6: 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.02731 0.31430i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.02731 - 0.31430i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i -0.16595 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i Column 7: 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i 0.00000 0.00000i -0 .00000 0.00000i EigenVector = 0.69946 0.382860.32396 0.24297 0.41231 0.10308 0.13989 PageRank = 0.303514 0.166134 0.1405750.105431 0.178914 0.044728 0.060703 elapsed_time = 0.063995Octave the output characteristic value is expressed as a diagonal line D is a diagonal component, each characteristic value The corresponding inherent vector is represented as a column vector of the raceful V. That is to say M * V = D * m is established. If the characteristic value here is 7, the characteristic value of the absolute value is λ = 1. Solid vector with it is solid vector:

Eigenvector = 0.69946 0.38286 0.32396 0.24297 0.41231 0.10308 0.13989

That is, the column V. Note that this probability vector (the n-element non-negative vector of the element is equal to 1) in the intrinsic vector (the n-element) of the element is not standardized, except that the vector is equal to 1. Expression is used to express, σpi 1, σ (pi) 2 = 1. Here, standardization of probability vectors PageRank = 0.303514 0.166134 0.140575 0.0431 0.178914 0.044728 0.060703

PageRank is ranking. Note that all added and 1. The calculation is only 0.064 seconds.

Ask of PageRank evaluation

Arrange the evaluation of PageRank (PageRank decimal point 3).

Name PageRank file ID issued a link ID Link ID 1 0.304 1 2, 3, 4, 5, 7 2, 3, 5, 6 2 0.179 5 1, 3, 4, 6 1, 4, 6, 7 3 0.166 2 1 1, 3, 4 4 0.141 3 1, 2 1, 4, 5 5 5 0.05 4 2, 3, 5 1, 5 6 0.061 7 5 1 7 0.045 6 1, 5 5

The first thing to pay attention is that the number of PageRank's rankings and reverse links is basically consistent. Regardless of the positive links, how many reverse links have fundamentally determine the size of PageRank. However, only these significant differences between the first and second bits (the same, the difference between the third and 4th, 6th and 7th bits). In short, the wonderful place is that PageRank is not just based on the number of reverse links.

Let us look at it in detail. ID = 1 file PageRank is 0.304, accounting for one-third of the whole, which has become 1st. In particular, it is necessary to get a considerable effect is that all PageRank (0.166) numbers are obtained from the ID = 2 pages of the third bit. ID = 2 The page has a reverse link from 3 places, and only one link for the ID = 1 page, so the link to the ID = 1 page) will get all the PageRank number. However, because the ID = 1 page is the positive link and the most reverse link, you can also understand it is the most popular page.

Conversely, the last ID = 6 page is only 15% weak evaluation of ID = 1, which can be understood because there is no link from PageRank's Id = 1 to make it greatly affected. In summary, even if there is the number of reverse links, the high and low of link source page evaluation also affects the high low of PageRank.

Refiguring the chart of link relationship between the page mutual (add PageRank)

Try to calculate the payering of PageRank. Because λ = 1, the calculation is simple, as long as the inflight from each page is simply added. Such as ID = 1 inflow is,

Infusion = (Rank) (ID = 3 emitted by ID = 2) (Id = 5 Rank) (Id = 6 Rank) = 0.166 0.141 / 2 0.179 / 4 0.045 / 2 = 0.30375

The payment of PageRank is in accordance with the error range. The same thing in other page IDs. The above PageRank pursuant is a revenue. PageRank emitted along the respective links is equal to this page of the original PageRank divided by the value of the link, and the PageRank balance of the respective pages. However, this is a wonderful balanced itself, and it is of course not surprising to understand the number of people who understand the number of lines. Because this is the "characteristic value and inherent vector", in short, the group selected is inherent vector. But even if this is the case, it is actually confirmed that it has been able to use PageRank's way to consider it well.

The above is the basic principle of PageRank. Google is doing a large-scale issue of this very characteristic value.

4. Problems when practical

The basic consideration of PageRank is not very difficult. The huge ingredients in the practical effect are not complex bizarre algorithms, but simple linear transformations, it is better to belong to a concise intuitive category. However, actually use the web hyperlink configuration to calculate the PageRank, it is not simply able to illustrate what to explain with your mouth. There are two main difficulties. First, the numerical model and the real world of pure hypotheses; Second, in the actual numerical calculation (expertise).

Preparation: The explanation of mathematics (main probability process)

There is a deep relationship between the Markov process in the laundering probability ranks and probability. This chapter first leaving the description of the PageRank itself, pre-explain several mathematics uses in the probability process. Because it is a quite difficult part, you can skip this if you can't understand it. (It may also be my explanation method.) At the same time, please note that it is almost no proven. For detailed explanation, please read the textbook.

From the state I of the chart S, the probability of replying to the status i will be replied to the probability of the status i after the limited time, that is, when the path of the position is returned to the original position in the direction of the (with direction) chart. I is "regression". The state that cannot be returned is called "non-return". From the state I, when the probability of the state J is reached by the limited number of times is not negative, we say that "from state I reaches state J is possible". When the reverse direction may be reached, we call "I and J may arrive." When any of the other states cannot be reached from the status i, it is called "absorption state".

From any vertex of the chart (Graph) determined by the adjacent row column A, the path to other vertex graphs is called "strong coupling" (also known as "decomposition") when arriving like arrows. Strengthening, equivalent to any state to any state can be reached with each other. There are a lot of 0 when there are many ingredients of adjacent rows A, and there is a problem. Note that if all the ingredients are Aij ≠ 0, they are strongly coupled. Because the corresponding Markov chain sample path represents any two points of the S, in the positive probability of positive probability.

We can divide all status in equivalents (or regression classes). Here, the regression class refers to the range enclosed in the link. The state that belongs to a equivalent class can be reached with each other. The possibility of entering other classes from a class is also present. However, it is obvious that it is impossible to reply to the original class in this case. Otherwise, these two classes are equivalent to equivalents. The following figure shows that when T is the equivalence class of non-retractability, when R is used as a represented equivalent, although there is a case where the Markov chain is neither comes from the return class, it is not from the non-regression class, but if it is from The first two words, no longer returned to the non-regression class.

Returning, non-return schematic (modified the Valley (1997) Figure 11)

In this equivalent relationship, only one regression class, the Markov chain is called "the simplest". In other words, all states can be reached when they arrive. The most backless time is strong.

There is no associated adjacent row (or the probability row) of each other, multiplied by the appropriate replacement row (switched rows and columns)

P = | P1 0 | | 0 P2 |

Such a relationship. This means that there is no direct link relationship between the regression classes P1 and P2.

Returns, non-regression-cated adjacent rows (or launched probability ranks), by multiplying proper replacement ranks,

P = | p1 0 | q p2 |

This relationship (q ≠ 0). At this time, P1 is a non-regression class, and P2 is a regression class.

The probability ranks are sometimes referred to as Markov ranks. The observations of the Test Route called the Markov process are Markov Chain. The Malchetcia chain will tend to a balanced state when it has been considerable. For any status i, if j is a non-regression state, pij (n) → 0. Conversely, when I is non-regression, J is a return, the probability of staying on the state i is 0. If i, J belongs to the same non-periodic regression class, PIJ (N) → PJ ≥ 0.

Theorem: If P is a limited Markov line, the reproduction of the characteristic value 1 of P is equal to the number of regression classes determined by P. (Proof too long, omitting).

The maximum strong joint component (the set of states corresponding to the corresponding state) is referred to as the ERGODIC portion (all over the part), and the strong joint component is referred to as a dissipated portion. Since the initial state probability x (0) begins, after x (n) = p (n) x (0) after time n, the state probability belongs to the dissipated portion is almost close to 0. Regarding the Ellgoth section, along with a class corresponding to each of the associations, like an independently simple Markov chain, where the state probability (ie, from the previous mean) value and initial status probability Non-related, in other words, it is similar to the inherent vector proportional to the most simple component of the corresponding P. The dispensing between the probability between classes is dependent on the probability of the initial state.

The unchangeable distribution of the discrete timing Markov chain belongs to the limit distribution, from that distribution, is no change in the distribution of changes over time. The probability distribution of the state is called a fixed distribution when the time change changes. PageRank uses the Markov process that PageRank is a fixed distribution that users access to each page at a certain period of time.

Imaginary model and different real world

So, let's take a look at the probability process (ie chart principle) and the actual web link constructed.

For the imaginary web pages just now, as long as the links are moving with each other, they must have a relationship between each other. That is, there is a row of the graph to be strong than returning and the simplest. Like many of the probability process of the probability process, many proves are proven to return the return and the simplest, and if it is the skilled, various nature become easy.

But the real web page is not a strong link. That is to say, the adjacency row is not the simplest. Specifically, if the link is moved, sometimes it will go to a web page that is not linked out. Usually, only the "return" feature of the web browser is used. If people just browse, everything is over, but the calculation of PageRank cannot be over. Because PageRank is not returned once it is introduced. PageRank said this page is for "Dangling Page". The same reason is that only outward links without reverse links are also present. But PageRank does not consider such a page, because PageRank that does not flow in PageRank, will be very strange from the symmetry.

At the same time, sometimes there are phenomena that links only in one set inside and not to the outside world. This is a problem that may occur when a non-periodic returning class multiple exists. (Please consider the case where the R and T will not be moved in one R that falls into the figure below. PageRank is called "Rank Sink". In the real page, no matter how to move forward, only the link is definitely unable to enter the page group, that is, these page groups are formed from most similar value classes (regression classes) that are not associated with each other. of.

In short, it is not the sure of the latter probability that the real web page consists is not the simplest. When not the most on the simplest, the maximum feature value (ie 1) is repeated and cannot avoid the problem of most of the existence of most of the vectors. In other words, PageRank is not determined in a sense. Here, in order to solve such problems, PageRank considers a "user although in many occasions, the links in the current page are moving forward, but often jump into the completely unrelated page", such a browsing model. Furthermore, "time" is fixed to 15%. The user moves forward in 85%, but in 15% case, it will suddenly jump into the unrelated page. (Note: The original technique of PageRank is 87% (= 1 / 1.15) and 13% (= 0.15 / 1.15).)

The following formula is represented by this formula.

M '= C * m (1-c) * [1 / n]

Among them, [1 / n] is all the n times square rows of 1 / N, C = 0.85 (= 1-0.15). M 'is of course equally the probability of the lapse. That is, according to the deformation of PAGERANK, the characteristic value problem in which the ranked M 'is originally determined into the privacy vector characteristic value problem of the row column M'. M is fixed without a memory source (I.I.), M 'is called "mixed information source", which is a typical example of fixed but non-Ellgoth information sources.

If you look from a mathematical perspective, another statement of "The Sector of the Non-Sleeter Transfer Row" is the transformation operation of "the chart that is not strongly coupled into strong joint". The so-called all elements consider the migration probability of 0.15, means that the original non-simplest launch probability row column (of course, the situation is also presented in cases). In response to the original transformation probability, such a transformation operation can be defined in a sense, it is to ensure that the repetition of the maximum characteristic value can be 1. If such a transformation operation is considered, since the number of regression classes of the probability row becomes 1, it is also the most simplified, according to the previous theorem, the excellent vector (ie, PageRank) is defined in a sense.

Problem points on numeric values ​​(1)

Here, as long as it is probably understood that the concept of PageRank is, it is not necessary to have a deep fade in the problem of numerical calculation techniques (in fact, the author is unclear even if there is confidence). However, because characteristic value analysis and connecting an equation analysis are used in one of a variety of statistical calculation techniques, so we simply touch some analytical methods.

The problem in the main memory field is one of the problems in numerical calculation.

Assume that n is the ORDER of 104. Typically, the internal row and vector of the numerical calculation program is used in double precision record, the memory field of N times square row A is SIZEOF (double) * n * n = 8 * 104 * 104 = 800MB. The 800MB main memory field is not that often there will often be, although this is not that impossible number. However, if N is turned into 105 or 106, it becomes 80GB, 8TB. This kind of thing is not used to even say that the hard disk is also very difficult. Google has been able to know that this rule is not applicable to the process of handling more than 1 billion pages (2001).

However, a is just a sparse ranking. Because even some of the page is desperately linked, the page that is linked to the entire web is not very rare, even if it is also extremely rare. On average, each page has 10-20 or so links (according to the statistics of the IBM Almaden Research 'Graph Structure In THE Web', the average is around 16.1). Therefore, we can compress A with an appropriate compression method. Even if the average link number is 10, the final memory field is only 80MB, from the scale, it can be stored in a reasonable number. The hull ranks have been fully studied (the solution to the limited elements method, etc.) can be learned in the professional book of the appropriate numerical calculation. Although this is said, because it is quite difficult to solve or need a very complex technique. But what you want to point out is that if you can solve it, the collapsed high speed calculation (perhaps) will become possible. Because how to arrange and accommodate non-zero factors, computing performance and parallel performance will be greater.

Questions in the numerical calculation point (2)

The other is a convergence problem.

Fixed equation

Xi = σAijxi

It is the end of the N-element, which is generally not an analytical, so it can only solve its value. In the example, in order to seek characteristic and inherent vectors, the Octave's EIG () function is used, but this is not applicable when the problem is small. Speaking, do not need to calculate all of the characteristic value / inherent vectors.

Ask the maximum characteristic value and a numerical calculation method belonging to its inherent vector (excellent vector), the "power multiplication" (also called repetition method) is generally used. This means that the initial vector X0 is taken, when X (N 1) = a y (n) (where Y (N) = X (N) / c (n)) is N → ∞, X-directed The solid vector convergence of the maximum characteristic value C converges the calculation method of utilizing the line algebraic properties (proves to refer to the textbook of the line algebra). Power Multiplication (Recurrent Method) The approximate method of repeatedly calculated the approximate method of successive repeatedly, can improve the problem of solutions. Its advantage is that because as long as the row and vectors are repeated, the multiplication of the number of ranks and vectors is repeated, as long as the program can be simply solved by the program, it is also possible to perform large-scale analysis of the direct method cannot be resolved by the direct method. This is the starting point of many practical algorithms.

Here, please note that the maximum value of the absolute value of the probability row from the line algebra is 1. If this is used, it will make the calculation of the repetition PageRank easier. That is, because the maximum characteristic value is aware, it is more simple to be more simpler than the vector X satisfying the AX = X, it is made. Although this is a very small place but it is very important. First, the division calculation of the cost can be removed (Y (n) = x (n) / c (n)) is not completed. If it is repeated, it cannot be highly accurate, and if it is wrong to accelerate the method, the calculated is not the maximum characteristic value but the second larger value value and the inherent vector belonging to it (although this is very Less, but it is not necessarily a value that is fundamentally wrong). But if you know the maximum feature value, you can check it. In the first papers of PageRank, they did not seem to notice this, but in the second paper of Haveliwala added amendment is added.

Repeated number of times depends on the accuracy of the required requirements. In other words, the higher the accuracy you want to require, the more times it is repeated. However, the convergence ratio of the power multiplication (repetition method) has a strong dependency between the spectral segment characteristics of the coefficient row (the absolute value distribution of the characteristic value). Specifically, the maximum characteristic value of the absolute value is represented by λ1, and the second bit is represented by λ2, and the superior rate (convergence probability of dominance) is D = λ1 / λ2, and it is possible to know that the closing 1 is closer. slow. During the case, D is of course very close to 1. This is because the maximum characteristic value of the absolute value is 1, while the absolute values ​​of all other N-1 characteristic values ​​are less than 1. However, the N-1 characteristic value is very crowded, so there is little difference between λ1 and λ2. Therefore, in general, convergence will slow down. The so-called convergence is slow, tailoring, is that no matter how much time has been completed. In this regard, in order to make the appropriate acceleration method that converges accelerates, it is necessary to understand the numerical calculation techniques, so if an expert is not a numerical calculation, it is difficult to introduce.

5. Actual installation experiment on Namazu

In order to make it easier to speculate on the above description, PageRank is not a non-world WEB page without using considerations, even if a personal utilization method can be implemented. In order to achieve "Personalized PageRank", the actual installation experiment is performed for the full-text search system Namazu for small and medium-sized websites operating on various UNIX and Windows. (About NAMAZU can refer to Japanese full-text search engine software list.)

Since the experiment can simply control the amount of memory, the maximum feature value is considered, the idea of ​​HAVE LIWALA (1999) is basically considered. But there is a little different from DANGLING PAGES. The calculation kernel of the inherent vector uses a numerical calculation script GNU Octave. So the basic code to write itself to solve it. In addition, from the index written by MKNMZ, PageRank cannot be calculated directly, and it is necessary to prepare an index of the neighbor relationship (adjacency list). This is also possible to be encapsulated in the main part of the indexer.

The actual calculation time (unit: second) is shown below. The configuration of the running machine is Pentiumii 400MHz X 2, memory 512MB, Kondara Mnu / Linux 1.2 (kernel-2.2.17-15ksmp), Octave-2.0.16 (general state distribution). The convergence accuracy (L1 specification of the remaining difference vector) took the 1.0e-10, perhaps some excessive accuracy.

Number of documents N MKNMZ time preparation time PageRank calculation time ======================================== ==================== 128 58 2 62, 301 1, 575 46 21449, 604 15, 975 478 5, 872

Because there is no test of some huge web page groups, the experiment only stays on a small scale. Although there is this difficult point, it can calculate the tendency of PageRank in a short period of time, which can basically understand the time spent with the index.

Because Namazu itself has a lot of problems, it does not have a big luxury, but at least 105 degrees (as 106) Web page groups are used to experiment. From the trend, it is expected that the computation time of n = 106 will be divergent, so when n = 106, if it is possible to discuss how the MKNMZ time is changed to Comparable, it is very practical for Personalized PageRank. . As a reference, according to Page et al. (1998), Google's actual PageRANK calculation time for 75 million URL is about 5 hours. (Unknown in February 2001). From this perspective, it is necessary to study more efficient accelerated methods. Calculate the actual operation of the most memory is also around 10 MB. If it is the "Warfare" such as Haveliwala (1999), the maximum memory usage of about only o (3N 2) is done, but N is 104-5 degree and the memory usage is also invented. If other things can only be emphasized, so the code is written in O (5n α) (α) (α is a non-zero component number, typical is 5-20n or so) degree. When the additional N is about 103, it can be confirmed that the non-compressed loose ranks are used to calculate the power multiplication on memory, which is very advantageous from the speed surface. The measured time speed is about 6-7 times of the above numbers. But unfortunately, this method is from memory limitations, and only 2-3 k-free uses as much as possible.

This time we used Octave distribution of "Tsurushi", but, as you know, if you tune the Octave, it will dramatically improve the speed. The combination of Octave-2.1.x and Atlas sometimes increases the calculation speed of the large-scale row multiplication by more than 10 times depending on the situation.

Detailed results of the experiment Refer to the documentation in prNMZ-1.0.Tar.gz.

Basic nature of Personalized PageRank

People often use tools such as MHONARC, LATEX2HTML or PowerPoint to turn documents into HTML, and for such artificial HTML links PageRank, most of the page scores are almost the same (~ 1 / n). If the adjacency row is considered, most of the components are 1, or all of the opposite angle components are all 1. Since such an inherent vector of such a probability row is (1, 1, ..., 1).

Or when SiteMap.html is like a tree, the score will focus in SiteMap.html. Even if you occupy the 90% of the whole, it is not a novel.

From now on, in order to calculate meaningful PageRank, it is necessary to exclude the mechanical generation link relationship. If you see a link relationship as a recommended relationship, it is easier to agree.

6. Explain for PageRank's individual

(Reader) There should be no room to suspect that PageRank is used to use a hyperlink to determine the order effective technique.

However, after reading these papers, the author also considered many questions. Here, there are several personal opinions on PageRank. Although it is an insight, it is a methodology, and there may be many mistakes.

What is the reason for Dangling Page?

Just because when considering a certain variant probability, is it possible to become the least sequential? Still sometimes you have seen? I don't quite understand it.

Improve the possibility of the probability row, in order to ensure the single meaning of PageRank (one), as long as it guarantees the brigadus probability ranking is the simplest (with the chart is strong), there is no need to all the elements Aij is non-zero Elements. In fact, after browsing the Toyota car website on the Web, it will be jumped to the porn website, then continue to jump to the weird people who browse to the White House. (Note Here is a continuous form that changes over time). Therefore, from the practical sense, the extent to which the use of how much is used, should be left to the algorithm. Considering how the "stay probability" will be based on PageRank's consideration, you must move to other pages after a certain time, or suddenly weird, distorted jump to other pages. But if the real web browsing model, you should also consider a certain stay probability. Specifically, it is too small to take only (1-c) / n in the diagonal component of the probability row. What happens in the case of all the probability of all changes? Because for boring pages (viewers) must try to go to another page, it will stay longer for important pages.

If you think about probability theory, you will consider many other problems.

Even if we will achieve the degree of sex, we will try to further consider this idea. In the probability theory, there is a probability of a destruction probability or a fixed probability. In the same way, the introduction of this consideration will be more expected, so it is considered by everyone. Everyone knows the approach to the branching process in the Markov chain. This is a model for considering genetic genetic mutations, that is, a model of the possibility of changing the elimination after a certain amount of time. Many people think this consideration may be adopted. So what happens to the probability (taboo probability) with restrictions? That is, when the introduction of the introduction passes from state I moves from state I, the probability of the state K is not passed. If you consider the nature of web browsing, isn't it possible to be assumed?

Can you consider as a non-Markov process (or multiple Markov procedure of M times)?

The so-called Markov process is not related to the past experience, only the probability process of the future probability rule only from the present state. The Markov process is only in the process of 1 step. This process does not have any elements that have been derived from the past. PageRank is the result of the simultaneous Markov process that is fixed in a state of time. However, human rational actions must be expressed in the process of non-Markov. The complex process is always implicated in some forms and past. Therefore, it is not only to analyze which page connection from which a page is connected, but to analyze how the path is connected. Such an analysis will make it possible to be a more useful sorting system. It is also interesting to introduce non-Marcov processes in the range of calculation of the quantity explosion.

In considering and seeing a lot of things, there is something less difficult as actual installations, and because it is just the right to talk, I don't know how to install it, no matter what, quantitatively evaluate its effect is extremely difficult. Is it really something that can't be realized?

How much is PageRank's technology?

Even if it is just PageRank technology with high evaluation, it is only used as a basic idea that the depleted numerical analysis is used. However, things I have explained here if I have a matter of course from a professional researcher. Just overcome the scale, you can build a professional research area. It can also be considered that there is not so deep in the professional field. In fact, I am doing things, I am not intended to "if it is extremely small problem, even if it is a textbook's technique, it can also get the results of the calculation."

Although this is the case, in order to touch the surface of the summary, it is not understanding that "there is nothing, it is the extent of such a simple technique". It emphasizes this here: this shallow view is fundamentally completely wrong. Of course, PageRank skills are very nice places that "the page connected from many high-quality pages is or high-quality page", if you understand, you will feel that it is a simple idea. It is further, the real place is that it is not only to think of an idea, but the probability distribution of the idea to the fixed state change, in order to actually install the experiment, and prove it in reality The field can also work well. It is really worthy of praise at all these stages.

Indeed, not only has a new and smart idea, plus the technological combination of textbooks, and it is also possible to make a search engine that can be able to make a google match (or over). It can also be said that Google is also doing this. However, people actually completed are rare. There is a difference between the "definitely capable of completing" in the imaginary model and the actual operation. On practical problems, handling large-scale row of row, and is also quite difficult by general techniques, and it is necessary to highly expertise. It should be remembered that there will be a gap between things that can be understood between things and realizations that can be understood in the mind and implementation. Do not be considered too much.

7. References

The following is listed below to associate papers other than the basic papers described in "Foreword". (The translator has dropped a lot of useless connections)

S. Brin, L. Page, 'The anatomy of a large-scale hypertextual web search engine', http://www-db.stanford.edu/~ backrub/google.html Mountain famous early, near the vine show, "explanation: Search Engine Google (Search), Information Process 42 Volume 8 (August 2001), PP.775-780 (PDF) Jianchang Ji, "Road Sign: WWW Search Engine Establishment Method" (Summary), Information Processing 41 Volume 11 (November 2000), PP.1280-1283 Ji Tian Ji, "Search Engine Search Results", Bit 2000 (Vol.32), PP.8-14 US Clever Project, " Smart use of super link "(Summary), Nikkei Science, September 1999, PP.28-35 Dell Zhang, Yisheng Dong, 'An Efficient Algorithm to Rank Web Resources', http://www9.org/w9cdrom/ 251 / 251.html Jon M. Kleinberg, 'Authoritative sources in a hyperlinked environment', Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. http://www.cs.cornell.edu/home/kleinber/auth .ps ibm almaden research center, 'Clever Searching', http://www.almaden.ibm.com/cs/k53/clever.html

The reference books associated with mathematics are listed below.

S. Carlin, Sato, Sato, Sato translated by the body, "Probability Process Lecture" (Mathematical Analysis and Peripheral 3) Year, Industrial Book Isozheng, "The Lines and Application of Economics, Engineering", 1987, Jii Guoda Bookstore, ISBN4-314-00477-0 Lvatokinson, Pj Harry, JD Hudson, Shengu Birth, Dano believes in, Zoudui Feng, Bei Rong supple translated, "numerical calculation and its application - Fortran77-", 1993, Science, ISBN4-7819-0690-7 Palace Zezheng Qing, "Probability and Probability Process (Modern Mathematics Research Group 17), 1993, modern scientific society, ISBN4-7649-1034-9 Ichezi, "Line Algebra II" (Rock Lecture Applied Mathematics 11), 1994, Rock Bookstore, ISBN4-00 -010521-3 Han Taizhen, Xiaolin Xinwu, "Information and Symbolic Digital" (Rock Lecture Application Mathematics 13) Applied Mathematics and CG - (Information & Computing = 86), 1995, Science, ISBN4-7819-0763-6 Changguichuan, Changguchukawa ), In 1996, ISBN4-254-11401-X Tale really, "Test each and probability 2" (Rock lecture modern mathematics foundation 10), 1997, Rock Bookstore, ISBN4-00-010640-6 Fujiino, "The foundation of numerical calculations - as a center with numerical solution" (the foundation of new information engineering 9), 1998, Science, ISBN4-7819-0861-6 with online news report on Google (Japanese) News) has been separated to another page (Googlenews.html). (2003/5/20)

Others, specifically lists a few pages that are considered associated.

Interview with Google's Sergey Brin (Translation) Business Model and Search Technology Trend - Take Google as an Example - (JCOT Report) Smart Separately | The 21st Century Search Engine (InternetWatch) "Map" The research results are announced. 10% Not Link Site Research Results "Search Engine Retrieving Some" Retrieval Results of the Search Engine inequality - Stop the Age, you are beautiful - (Yomoyomo clan ) Google Weblog (The Cluetrain Weblog) Google's PageRank: Calculator (Web Workshop)

Thanks for reprint! All other personal sites and BBS have introduced this article.

How to improve the list of websites in Google (2003/1/6 report) in the zdnet china. Can't read ... :-) ZDNet China how to evaluate the popularity of a website (2002/8/5 report) or Chinese. Don't understand ... :-) 中 村 正 三 "Bravo! Linux" Linux Japan May 16th, 2001 INTERNETWATCH Summary INTERNETWATCH Summary News 2001 Number of Google World > Japanese> Computer> Internet> WWW> Home Search Directory LYCOS / Computer, Internet / Internet / Site Search, Link Set / Search Engine / Robot Search / Google Directory Yahoo! JAPAN Business Economics> Enterprise> Internet Services> Inter-Enterprise Transactions ( Btob)> Retrieve, Navigation> Google Directory 8. Appendix: "Guguru? / Goguru?"

English (American English) is impossible to read Google as "GoGuru". As the Noodle pronunciation or marked with no one pull surface is "Nodoru", if you want to use a slice of pseudonym, you should write "グ グ グ グ."

However, there is an English word with oo this spelling.

,, ..., ...

These are simple general English words, but no matter which one is "u:" this pronunciation. At least, it is like this for many typical Japanese people. English (American English), OO spelling basically reads "U". Of course, GOO is read into "Gu:". Guangdong, the son is not also said in the TV advertisement of the Chinese ancient car information magazine. "If you want to say the car, gu-"? In addition, spelling using swimming glasses is Goggle during swimming.

Of course, if Google is not English (American English), it is another matter. However, the origin of the Google name is from the English word "Googol" that represents 10, maybe it is more suitable for English pronunciation (Google). Needless to say, Googol's pronunciation is also "Guguru".

In addition, one of the entrepreneurs is Sergey Brin. From his name, he will understand that he is Russia, and it is possible to have his English pronunciation with his own dialect. If you are pulling there, it is already awkward. Moreover, I don't know how to pronounce Google's local accents in Russia. If you know someone, please tell me.

Supplement (2001/4): A reader who sent "GOGURU or GuGuru?" To the Google's support center, and forwarded this email passion. The other party said that although Google's own pronunciation is "gunguru", however, you will never mind with your own favorite call.

Date: WED, 31 Jan 200116: 12: 01-0800From: "Googletech" Subject: Re: {Google # 034-917} PRONUNCITIONTO: Transfer Mail (THANKS)! WE GO BY: "GU Gul "But you are welcome! Regards, The Google Team

Supplement 2 (2001/10/29): Please see how Google's page "Google" pronounces.

Hajime Baba / Racecourse Copyright (C) 2001-2003 Hajime Baba. All Rights Reserved. $ ID: PageRank.html, V 1.113 2003/07/23 00:38: 48 Baba Exp $ Translation: KRENY / Yuan Huanglin Created: 2003/12 Last updated: October 28, 2004 3:53

Back to home page

转载请注明原文地址:https://www.9cbs.com/read-32876.html

New Post(0)