An Outline for a Book on Search Engine Theory
This introduction lays out the skeleton of the book; the flesh and blood will be added later. The core material is mathematical, essentially my own work from the past several years.
The content is therefore mostly theoretical, though of course some practical examples will appear. If your question is "How do I improve my website's ranking?" or "How do I get my site found more often?", then sorry, those questions are not answered here. What I write about is the theory of search: methods that search engines have used or not used, whether disclosed or undisclosed.
Chapter 1: An Overview of Digital Information
Covers the history and characteristics of digital information... (omitted)
Chapter 2: Correlation
When there is no good way to compute the correlation between two pieces of information directly, the information-space difference method can be used.
For simplicity, suppose that searching for A returns A_N results, searching for B returns B_N results, and searching for A and B combined returns AB_N results. The correlation of A and B can then be defined as:

Correlation(A, B) = AB_N / (A_N + B_N - AB_N)

Chapter 3: The Expression of Information
This chapter covers two topics: the angle between information and the expression of information.
1] Computing the angle
theta(i_a, i_b) = arccos(Relation(i_a, i_b))
In this expression each piece of information is a vector, and the angle between two pieces of information is characterized by their dot product: the dot product of the unit vectors is the correlation between the two pieces of information (see the definition in the previous chapter). The angle between two pieces of information therefore lies between 0 and 90 degrees:

0 degrees: the vectors are parallel; parallel information is fully correlated.
90 degrees: the vectors are orthogonal; orthogonal information has no correlation at all.
Calculated this way, the angle between UNIX and Linux comes out to roughly 73 degrees.
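Putting the last two chapters together, the angle can be computed directly from search-result counts. The counts below are hypothetical, chosen only to illustrate how a correlation near 0.29 yields an angle of roughly the 73 degrees quoted above:

```python
import math

# Hypothetical result counts (not real search data): searching for A
# ("UNIX") returns A_N results, B ("Linux") returns B_N results, and
# both together return AB_N results.
A_N, B_N, AB_N = 1000, 900, 430

# Chapter 2: Correlation = AB_N / (A_N + B_N - AB_N)
relation = AB_N / (A_N + B_N - AB_N)

# Chapter 3: theta = arccos(Relation), converted to degrees
theta = math.degrees(math.acos(relation))

print(round(relation, 3), round(theta, 1))   # prints: 0.293 73.0
```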
2] Expression of information:
The concept of the information basis: for any information vector i_a, its length m_a = ||i_a|| can be obtained, and the corresponding unit information vector is: i_a = i_a / m_a = i_a / ||i_a||
By selecting the information basis vectors appropriately and normalizing them to unit length, any information vector can be obtained as a combination of the unit basis vectors.
We first assume that a set of information basis vectors has been established:
i_1, i_2, i_3, ..., i_n
The information space is n-dimensional and is spanned by the basis (i_1, ..., i_n). Any information vector A in this space can then be written in the following form:

A = a_1 * i_1 + a_2 * i_2 + ... + a_n * i_n

where a_j = A · i_j (the projection of A onto i_j), for j from 1 to n.

Chapter 4: The Shape of Information
This chapter introduces the information space and the shape of information.
Keywords: information space, information shape
1] What is information space?
The information space is a multi-dimensional vector space spanned by a group of information basis vectors; it completely contains all the information that needs to be expressed.
As elaborated in the previous chapter, any information vector can be obtained as a combination of the basis vectors; that is, every piece of information has a linear expression in the information space.
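As a concrete sketch of this linear expression, take a toy three-dimensional information space with an orthonormal basis (the basis and the vector below are invented for illustration):

```python
def dot(u, v):
    # Dot product of two vectors.
    return sum(a * b for a, b in zip(u, v))

# A toy 3-dimensional information space with an orthonormal basis (assumed).
basis = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [0.6, 0.8, 0.0]

# Coordinate along each basis vector: a_j = A . i_j
coords = [dot(A, i_j) for i_j in basis]

# Reconstruct A as the linear combination a_1*i_1 + ... + a_n*i_n
rebuilt = [sum(a_j * i_j[k] for a_j, i_j in zip(coords, basis))
           for k in range(len(A))]

print(coords, rebuilt)
```

With an orthonormal basis the coordinates recovered by the dot products rebuild the original vector exactly.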
2] What is the shape of information?
The shape of information is an overall characterization of a piece of information within the information space, and it is related to the content the information expresses. I classify the shapes of information as follows. Take information A = a_1 * i_1 + a_2 * i_2 + ... + a_n * i_n as an example, where a_j = A · i_j, for j from 1 to n.

(1) Line-type information. The information projects essentially onto a single basis vector; it is parallel to one basis direction of the information space: A = i * ||A||

(2) Flat-type information. The information can be expressed by two principal basis vectors and is flat in shape: A = ||A|| * (sin(theta) * i + cos(theta) * j)

(3) Cone-type information. The information projects onto three or more basis vectors of the space.

(4) Spherical information. The information projects onto all of the basis vectors and is distributed roughly evenly among them.
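One mechanical way to assign these shape classes is to count the basis directions that receive a significant share of the vector's length. The threshold below is an arbitrary assumption, and the even-distribution check for the spherical class is omitted for brevity:

```python
import math

def shape(coords, significant=0.1):
    # Classify a vector's shape by how many basis directions receive a
    # significant projection (threshold chosen arbitrarily for illustration).
    norm = math.sqrt(sum(c * c for c in coords))
    k = sum(1 for c in coords if abs(c) / norm >= significant)
    if k == 1:
        return "line"
    if k == 2:
        return "flat"
    if k < len(coords):
        return "cone"
    return "spherical"

print(shape([1, 0, 0, 0]), shape([1, 1, 1, 0]))
```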
Information in reality is a combination of the four types above. Information that is purely of type (1), (2), or (4) is quite uncommon, while type (3) information is relatively common.

Chapter 5: The Classification of Information
Information is classified according to its features. There are usually two classification methods:

1] Manual classification, according to humanly defined categories such as education, entertainment, business, and so on.
2] Automatic classification, according to the distribution of the information in the information space, grouping correlated information together.

An example of an automatic classification method:
A commonly used method is the "sliding window" approach: move a window of suitable size across the spherical surface of the information space, and when the window contains a large number of information vectors, take the set it encloses to be an information cluster, or Info-Jet.
The window must be moved to the position where this count reaches a maximum; scanning in this way locates the center of the information cluster, and extending outward from the center recovers the entire cluster.

Chapter 6: The Clustering of Information
While testing automatic clustering, I found that the sliding window method can be simplified into a degenerate algorithm. This is a breakthrough, so I am rewriting this chapter.
LL 2004/11/22
With a good classification mechanism, effective clusters of information can be obtained. There are basically three automatic clustering methods:

1] Fuzzy pattern recognition with a neural network (NNET) trained on a large number of samples. The strength of this method is fast and accurate automatic clustering; the drawback is that a large number of samples is needed for pre-training, and issues such as preventing over-training and handling errors are very important. Similar mechanisms include KNN, SVM, and Bayesian statistics; see the detailed introductions to each.
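As a minimal stand-in for this sample-trained family (NNET, KNN, SVM, Bayes), here is a 1-nearest-neighbour sketch: a new information vector takes the label of the most similar training sample, with similarity measured by the dot product as in Chapter 3. The sample data are invented:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nn_classify(samples, labels, query):
    # 1-nearest-neighbour: give the query the label of the training
    # sample with the largest dot product (i.e. the smallest angle).
    best = max(range(len(samples)), key=lambda i: dot(samples[i], query))
    return labels[best]

# Invented unit-vector samples in a 2-dimensional information space.
samples = [[1.0, 0.0], [0.0, 1.0]]
labels = ["technology", "entertainment"]

print(nn_classify(samples, labels, [0.9, 0.1]))   # closest to "technology"
```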
2] The translation algorithm, also called the convolution (autocorrelation) algorithm:

Corr(t) = integral( f(x) * f(x - t) dx )
Clusty's automatic clustering uses the translation algorithm. Its strengths are fast computation and ease of use; its drawback is that the number of computations grows with the square of the number of information items: N^2 / 2.
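A discrete sketch of the autocorrelation formula above (the input sequence is arbitrary; real input would be derived from the information vectors):

```python
def autocorrelation(f, t):
    # Discrete form of Corr(t) = integral( f(x) * f(x - t) dx )
    return sum(f[x] * f[x - t] for x in range(t, len(f)))

signal = [1.0, 2.0, 3.0, 2.0, 1.0]
print([autocorrelation(signal, t) for t in range(3)])
# prints: [19.0, 16.0, 10.0]
```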
3] The sliding window and merge algorithm. The sliding window method discovers cluster information based on the angles between information vectors, and the window size can be controlled to tune how cluster information is captured. The advantage of the sliding window is that it is very accurate and adjustable. The disadvantage is that it is very cumbersome: the number of computations grows with the dimensionality of the information space; in a 1000-dimensional space, for example, it is on the order of 10^1000 steps, an astronomical number.
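A crude sketch of the sliding window idea, using each vector in turn as a candidate window centre rather than scanning the whole sphere (the exhaustive scan is what makes the full method so expensive); the window size and the data are assumptions:

```python
import math

def angle_deg(u, v):
    # Angle between two unit vectors, from their dot product (Chapter 3).
    dot = sum(a * b for a, b in zip(u, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

def sliding_window(vectors, window_deg):
    # Try each vector as the window centre; the centre enclosing the most
    # vectors marks the densest information cluster.
    best_centre, best_members = None, []
    for centre in vectors:
        members = [v for v in vectors if angle_deg(centre, v) <= window_deg]
        if len(members) > len(best_members):
            best_centre, best_members = centre, members
    return best_centre, best_members

# Unit vectors at 0, 5, 10 and 80 degrees in a 2-dimensional space.
vecs = [[math.cos(math.radians(d)), math.sin(math.radians(d))]
        for d in (0, 5, 10, 80)]
centre, members = sliding_window(vecs, 15)
print(len(members))   # the three vectors near 0-10 degrees form the cluster
```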
The degenerate algorithm: after the sliding window analysis, a simpler algorithm can be used to converge quickly. The degenerate algorithm first finds the pair of information vectors with the smallest angle between them, then merges the pair into a single information vector. After several rounds of such simplification, the set shrinks to very few information vectors. The angles between these remaining vectors are relatively large; they represent the different categories of information, and the clustering has thus been performed automatically. For example, degenerating 1000 pieces of information can shrink them into 15 categories in six rounds, and the computation required for the six rounds is approximately 600,000 operations, which poses no difficulty at all.
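A minimal sketch of the degenerate algorithm as described above: repeatedly find the pair with the smallest angle and merge it into one unit vector, stopping once every remaining pair is widely separated. The stopping threshold and the data are assumptions:

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def angle_deg(u, v):
    dot = sum(a * b for a, b in zip(unit(u), unit(v)))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

def degenerate(vectors, stop_angle_deg):
    # Merge the closest pair (smallest angle) into one unit vector until
    # every remaining pair is separated by more than stop_angle_deg.
    vecs = [unit(v) for v in vectors]
    while len(vecs) > 1:
        a, i, j = min((angle_deg(vecs[i], vecs[j]), i, j)
                      for i in range(len(vecs))
                      for j in range(i + 1, len(vecs)))
        if a > stop_angle_deg:
            break
        merged = unit([p + q for p, q in zip(vecs[i], vecs[j])])
        vecs = [v for k, v in enumerate(vecs) if k not in (i, j)] + [merged]
    return vecs

# Two assumed groups: vectors near 0 degrees and vectors near 90 degrees.
vecs = [[math.cos(math.radians(d)), math.sin(math.radians(d))]
        for d in (0, 4, 8, 84, 88)]
clusters = degenerate(vecs, 20)
print(len(clusters))   # collapses to two category vectors
```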
Reposted from: Java Road Blog