Copyright notice: This article may be reprinted freely, provided that the original source, the author information, and this notice are indicated by hyperlink. http://linux.dalouis.com/pagerant_cn.htm
1. Introduction
Recently the search engine Google (http://www.google.com/) has been attracting a great deal of attention. Google is a search service built on the search engine developed by Larry Page (CEO as of February 2001) and Sergey Brin. Google began service in September 1998, and Netscape Communications had already started cooperating with it during Google's test phase. In June 2000, Yahoo! in the US also changed its default search engine (the additional engine used when a search within US Yahoo! itself finds nothing) from Inktomi, its original partner, to Google. The Japanese version of Google officially debuted in September 2000 and is now used by Biglobe (NEC). (Note: Yahoo! Japan and @nifty in April 2001, Sony in July, and Excite in January 2002 also entered into partnerships with Google in succession.)
Google is highly regarded for several reasons: simple pages free of useless (advertising) banners, its own cache system, dynamically generated summaries, and a distributed system built for high-speed retrieval (a Linux cluster of several thousand machines). Its biggest advantage, however, is the accuracy of its search results. "PageRank", a technique that automatically determines the importance of web pages, was designed for exactly this purpose. The aim of this article is to explain the outline and principles of the PageRank system in language that is as easy to understand as possible.
The following is the basic paper on PageRank.
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, 'The PageRank Citation Ranking: Bringing Order to the Web', 1998, http://www-db.stanford.edu/~Backrub/pageRankSub.psub.ps
The following paper improves on it so that PageRank can be calculated more efficiently.
Taher H. Haveliwala, 'Efficient Computation of PageRank', Stanford Technical Report, 1999, http://dbpubs.stanford.edu:8090/pub/1999-31
In addition, the following is presentation material on PageRank.
Larry Page, 'PageRank: Bringing Order to the Web', http://hci.stanford.edu/~page/papers/pageRank/ (no longer available)
The explanation below is based mainly on these two papers (plus the supplementary material). First, a simple example is used to explain the concept of PageRank, and then it is shown that a ranking system based on hyperlink relationships reduces to an eigenvalue problem for a large sparse matrix. After that, we touch on some problems that arise when the basic model is applied to the real world and on how they are handled. Next, to explore whether the idea can be used as a "personal PageRank", an implementation experiment with the free full-text search system Namazu is carried out and its results are described. Finally, I give my personal opinion on PageRank.
In addition, some mathematical knowledge (especially linear algebra) is needed to follow the description below. However, I have tried to explain things as plainly as possible so that readers with a liberal arts background can also follow. At the same time, because the author's personal views have been mixed in, there are fewer algorithms and formulas than in the original papers, and there are many places that are not rigorous or not entirely correct. Please bear this in mind in advance. For details, refer to the original papers.
PageRank (TM) is a registered trademark of Google, USA.
2. Basic concept of PageRank
PageRank determines the importance of all web pages on the basis of a recursive relationship: "a web page that is linked to from many high-quality pages must itself be a high-quality page."
The lengthy explanation that follows uses a good deal of technical terminology, which may make parts of it hard to follow. This chapter aims at a qualitative and simple explanation, but even if some parts remain unclear, it is well worth grasping the idea that "a web page linked to from many high-quality pages must itself be a high-quality page". Of all the points made here, this way of thinking is the most important.
The following is quoted from Google's own introduction, "The Secret of Google's Popularity" (http://www.google.co.jp/intl/ja/why_use.html).
About PageRank: PageRank makes effective use of the characteristics of the vast link structure that the Web possesses. A link from page A to page B is interpreted as a vote of support by page A for page B, and Google judges the importance of a page based on the votes it receives. However, Google does not look only at the number of votes (that is, the number of links); it also analyzes the page that casts the vote. A vote cast by a page of high "importance" is valued more highly and helps the page that receives it to be regarded as "important" as well. Important pages that earn a high evaluation through this analysis are given a higher PageRank, and their position in the search results rises. PageRank is Google's overall indicator of a web page's importance and is not affected by individual searches. Rather, PageRank is a property of each page itself, derived by analyzing the link structure with a sophisticated algorithm. Of course, even a page of high importance is meaningless if it has nothing to do with the search terms. For this reason, Google combines PageRank with refined text-matching techniques so that it can retrieve pages that are both important and relevant.
Let us look at the algorithm just described with the help of a diagram. Concretely, the PageRank of a page is divided by the number of forward links on that page, and the resulting value is added to the PageRank of each page that those links point to. Summing the values received in this way gives the PageRank of a page.
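The division-and-addition step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-page web (pages "A", "B", "C"), the equal starting values, and the function name distribute_rank are all hypothetical, and every page is assumed to have at least one forward link.

```python
def distribute_rank(links, rank):
    """One round of voting: each page splits its current rank equally
    among the pages it links to, and every page sums what it receives."""
    new_rank = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = rank[page] / len(targets)  # rank divided by number of forward links
        for target in targets:
            new_rank[target] += share      # added to the linked page's rank
    return new_rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

new_rank = distribute_rank(links, rank)
# C receives half of A's rank plus all of B's rank: 1/6 + 1/3 = 1/2
```

Page C ends up with the largest value after this round because it receives votes from two pages, one of which (B) gives it an undivided vote.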
Let us look at this in more detail. There are three key points for raising PageRank.
The number of reverse links (a popularity indicator in the simple sense)
Whether the reverse links come from highly recommended pages (a well-founded popularity indicator)
The number of links on the page that is the source of the reverse link (an indicator of the probability of being chosen)
First of all, the most basic point is that being linked from many pages raises the degree of recommendation. In other words, "a popular page that is linked from many pages must be a high-quality page." Using the number of reverse links as an indicator of popularity is therefore a natural idea, because a link can be seen as an act of recommendation that says "this page is worth a look". What is admirable about PageRank, however, is that its way of thinking does not stop there.
That is, PageRank does not rely only on the number of reverse links: a reverse link from a highly recommended page is given a higher evaluation. At the same time, a link from a page that contains few links in total is given a higher evaluation, and a link from a page that contains many links in total is given a lower evaluation. In other words, two judgments are made: "a page recommended by many well-recommended pages must itself be a good page", and "compared with links scattered indiscriminately, a carefully selected link must be a higher-quality link". On the one hand, a proper link from someone's high-ranking page is valued highly; on the other hand, a link from a bookmark-like page that almost nobody visits is treated as having almost no value (although it is still better than not being linked at all).
Therefore, if you are linked from a very highly ranked site such as Yahoo!, that alone will raise your page's rank at once; conversely, however many reverse links you have, if they all come from pages of little significance, PageRank will not rise easily. It need not be Yahoo!: a reverse link from a page regarded as authoritative (or established) in a particular field is very beneficial. On the other hand, links merely exchanged among one's own circle of friends, a kind of "mutual favor among insiders", are hard to regard as having much value. What counts is whether your page is judged valuable from a viewpoint that takes in all the web pages in the world. These indicators are combined, and in the end a structure emerges in which pages with higher evaluations are displayed nearer the top of the search results.
Earlier approaches simply used the number of reverse links to evaluate the importance of a page, but the advantage of PageRank's approach is that it is hardly affected by mechanically generated links. To raise PageRank, you need reverse links from high-quality pages. For example, if you ask Yahoo! to register your site and it is accepted, your PageRank will rise sharply. To achieve that, however, you must make the effort to enrich the content of your pages. In this way there are essentially no shortcuts (or back doors) for raising PageRank. This is not limited to PageRank: in ranking systems that use link structure in general (Clever, HITS, and so on), the simple spam methods of the past no longer work. This is the biggest advantage, and the biggest reason Google is convenient to use. (Although it is the biggest reason, it is not the only one.)
Note that PageRank itself is a quantity that Google computes for each page, and it is completely independent of the search expression the user enters. As will be explained later, the query does not appear anywhere in the formula for PageRank. Whatever the query may be, PageRank is a fixed score inherent to each page.
That is roughly the qualitative description of PageRank. However, to actually compute a ranking, a more quantitative discussion is needed. The next chapter describes it in detail.
3. How to get PageRank
What we are interested in is quantifying which pages are "important" when there is a structure of mutual links. In other words, it is the process of rigorously computing an indicator of "which page should be read first". It cannot be helped if minor pages go unread.
In general, for a hyperlink structure like that of the Web to be reflected in a ranking, a mathematical model of the hyperlink structure has to be built on the computer. How to model it depends on the policy of the implementer, but if graph theory is applied to the hyperlink structure, the problem ultimately comes down to linear algebra. This is true of PageRank as well.
Calculation method
The most basic idea is to express the link relationships in the form of a matrix. When there is a link from page i to another page j, the corresponding element is defined as 1; otherwise it is defined as 0. That is, the element Aij of the matrix A is given by
Aij = 1 if there is a link from page i to page j
Aij = 0 if there is no link from page i to page j
If the number of documents is N, this matrix is an N × N square matrix. It corresponds to the "adjacency matrix" of graph theory. That is, the link relationships of the Web can be regarded as the adjacency relationships of a graph S. In short, wherever a link exists, there is an adjacency.
(* Note) A figure made up of points and the lines connecting them is called a "graph". The points are called "vertices" or "nodes"; the lines are called "edges" or "arcs". Graphs fall into two categories: a graph whose edges have no direction is called an "undirected graph", and a graph whose edges have a direction is called a "directed graph". The edges of a directed graph can be pictured as one-way streets. A graph can be represented in various ways, but the data structures generally used are the "adjacency matrix" and the "adjacency list". Note that for an undirected graph the adjacency matrix A is symmetric, whereas for a directed graph A is in general asymmetric. The following is the adjacency matrix of the Apache online manual (128 pages), drawn as a bitmap. Where black dots line up horizontally, the page has many forward links (links leading out of it); conversely, where black dots line up vertically, the page has many reverse links.
Example of an adjacency matrix (using the Apache online manual)
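The properties mentioned in the note (symmetry for undirected graphs, rows as forward links, columns as reverse links) can be checked on a small case. The following is a minimal sketch; the three-node edge list and all function names are hypothetical and chosen only for illustration.

```python
def adjacency_matrix(edges, n, directed=True):
    """Build an n x n adjacency matrix from (source, target) pairs, 1-based IDs."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i - 1][j - 1] = 1
        if not directed:
            a[j - 1][i - 1] = 1  # an undirected edge counts in both directions
    return a


def is_symmetric(a):
    n = len(a)
    return all(a[i][j] == a[j][i] for i in range(n) for j in range(n))


def forward_link_counts(a):
    """Row sums: how many pages each page links to."""
    return [sum(row) for row in a]


def reverse_link_counts(a):
    """Column sums: how many pages link to each page."""
    return [sum(col) for col in zip(*a)]


# Hypothetical graph: 1 -> 2, 1 -> 3, 2 -> 3
edges = [(1, 2), (1, 3), (2, 3)]
directed = adjacency_matrix(edges, 3)
undirected = adjacency_matrix(edges, 3, directed=False)
```

Here page 1 has the longest horizontal run of ones (two forward links) and page 3 has the longest vertical run (two reverse links), which is how the bitmap described above should be read.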
The matrix used by PageRank is obtained by transposing this adjacency matrix (swapping its rows and columns) and then, so that each column vector sums to 1 (the total probability), dividing each column vector by the number of its non-zero elements. A matrix built in this way is called a "transition probability matrix"; it contains N random variables, and its entries represent the probabilities of transition between states. The reason for transposing is that PageRank focuses not on "how many places a page links to" but on "how many places link to it".
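As a minimal sketch of this construction, the function below transposes an adjacency matrix and normalizes each column so that it sums to 1. The three-page matrix is a hypothetical example, and the function name transition_matrix is an assumption made here; the sketch also assumes every page has at least one forward link (a page with none would simply leave an all-zero column).

```python
def transition_matrix(a):
    """Transpose the adjacency matrix and scale each column to sum to 1.

    a[i][j] == 1 means page i+1 links to page j+1. In the result,
    m[j][i] is the probability of moving from page i+1 to page j+1.
    """
    n = len(a)
    out_degree = [sum(row) for row in a]  # forward links per page
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if a[i][j]:
                m[j][i] = 1.0 / out_degree[i]  # transposed and normalized
    return m


# Hypothetical web: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
a = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
m = transition_matrix(a)
column_sums = [sum(col) for col in zip(*m)]
```

After the transposition, row i of m lists the contributions arriving at page i, which matches the emphasis on reverse links.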
Calculating PageRank amounts to finding the eigenvector that belongs to the largest eigenvalue of this transition probability matrix (the dominant eigenvector).
This is because, when a linear transformation is applied repeatedly and t → ∞, the behavior can be described essentially in terms of the "eigenvalue of largest absolute value" of the transformation matrix and "the eigenvector belonging to it". In other words, the stochastic process represented by the transition probability matrix is a process of repeatedly multiplying by this matrix, and in this way the probabilities of states further ahead can be calculated.
Furthermore, although it may sound difficult, eigenvalue and eigenvector analysis is a basic mathematical tool that allows rigorous analysis. We are free to assign any values to the initial vector, but the resulting vector converges to a particular combination of values. This stable combination of values is called an eigenvector, and the characteristic scalar associated with the eigenvector is called an eigenvalue. This kind of calculation is generally called eigenvalue decomposition, and the problem of finding eigenvalues is called an eigenvalue problem.
(* Note) For a square matrix A, a number λ that satisfies Ax = λx for some non-zero vector x is called an eigenvalue of A, and x is called an eigenvector belonging to λ. If you are not comfortable with the concept of a matrix, you can think of it as an N × N two-dimensional array. Likewise, a vector can be thought of as an ordinary (one-dimensional) array of length N.
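The "repeated multiplication" described above is commonly known as the power method. The sketch below applies it to a small column-stochastic matrix to show the vector settling on a fixed combination of values. It is an illustration only: the three-page web (page 1 links to pages 2 and 3, page 2 links to page 3, page 3 links to page 1), the tolerance, and the iteration limit are assumptions made here, and the method is not necessarily the one Google uses in practice.

```python
def mat_vec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def power_iteration(m, tol=1e-12, max_iter=1000):
    """Repeatedly multiply by m until the vector stops changing.

    Starts from a uniform vector and rescales so the entries sum to 1.
    """
    n = len(m)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        y = mat_vec(m, x)
        total = sum(y)
        y = [v / total for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Column-stochastic transition matrix of the hypothetical three-page web
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]

x = power_iteration(M)
# x is approximately [0.4, 0.2, 0.4], and M x = x (eigenvalue 1)
```

Pages 1 and 3 share the highest score: page 3 is linked from both other pages, and page 1 receives the whole of page 3's vote.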
Simple example
Let us actually try calculating PageRank using a simple example. Consider 7 HTML files whose link relationships are as shown below. The link relationships among these HTML files are closed within files 1 to 7; that is, there are no links to or from any other documents. Note also that every page has both forward and reverse links (there are no dead ends). This is an important assumption that will be taken up later and is not discussed here.
State transition diagram showing the link relationships between the pages
First, the adjacency list of this transition graph, written as an array, has the following form. That is, for each link source ID, the IDs of its link targets are listed.
Link source ID    Link target IDs
1                 2, 3, 4, 5, 7
2                 1
3                 1, 2
4                 2, 3, 5
5                 1, 3, 4, 6
6                 1, 5
7                 5
The adjacency matrix A for the link relationships expressed in this adjacency list is the following 7 × 7 square matrix. It is a bitmap matrix whose elements are only 0 and 1. Read horizontally, the non-zero elements of row i show the IDs of the files that file i links to (its forward links).
A = [
0, 1, 1, 1, 1, 0, 1;
1, 0, 0, 0, 0, 0, 0;
1, 1, 0, 0, 0, 0, 0;
0, 1, 1, 0, 1, 0, 0;
1, 0, 1, 1, 0, 1, 0;
1, 0, 0, 0, 1, 0, 0;
0, 0, 0, 0, 1, 0, 0;
]
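As a cross-check, the adjacency list of the seven-page example can be converted to a matrix programmatically and compared with the matrix A written out above. This is a minimal sketch; the function and variable names are chosen here for illustration only.

```python
# Adjacency list of the seven-page example: source ID -> target IDs
adjacency_list = {
    1: [2, 3, 4, 5, 7],
    2: [1],
    3: [1, 2],
    4: [2, 3, 5],
    5: [1, 3, 4, 6],
    6: [1, 5],
    7: [5],
}


def to_adjacency_matrix(adj_list):
    """Convert a 1-based adjacency list into a 0/1 matrix (list of rows)."""
    n = len(adj_list)
    matrix = [[0] * n for _ in range(n)]
    for source, targets in adj_list.items():
        for target in targets:
            matrix[source - 1][target - 1] = 1
    return matrix


A = to_adjacency_matrix(adjacency_list)

# The matrix as written out in the text
A_from_text = [
    [0, 1, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
]

matches = A == A_from_text                     # True
forward_links = [sum(row) for row in A]        # [5, 1, 2, 3, 4, 2, 1]
reverse_links = [sum(col) for col in zip(*A)]  # [4, 3, 3, 2, 4, 1, 1]
```

Every entry of forward_links and reverse_links is at least 1, which confirms the assumption that no page in this example is a dead end.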
The PageRank-style transition probability matrix M is obtained by transposing A and then dividing each value by the number of non-zero elements in its column. It is the following 7 × 7 square matrix. Read horizontally, the non-zero elements of row i show the files that link to file i (its reverse links). Note that the values in each column add up to 1 (the total probability).