Every Internet user has used a search engine: AltaVista, Infoseek, HotBot, Network Compass, Northern Daily Net, Hua Net Landscape, and so on. The index databases of the largest of these (AltaVista and HotBot) cover more than 100 million pages on the Internet, and Peking University's Tianwang has collected 320,000 domestic WWW pages. Building an index database requires visiting all of these pages and then indexing them. How can so many pages be visited? Today's search engines, whether for English or for Chinese, use an online robot to search the Web (Yahoo! is an exception).
Online robot
Online robots are also referred to as Spiders, Worms, or Randoms; their core task is to get information from the Internet. A robot traverses the Web by following the hypertext links embedded in pages: each URL is a reference from one HTML document to another. The information an online robot collects can serve many purposes, such as building an index, verifying HTML files, verifying URL links, tracking updated information, and mirroring sites.
How the robot finds WWW documents
The robot crawls the Web, and therefore needs to maintain a URL list that records the trajectory of its visits. The hypertext links pointing to other documents are embedded within each document, so the URLs must be extracted from the pages themselves; robots are generally used to build an index database. All WWW search programs follow similar steps (a minimal code sketch follows the list):
1) The robot takes a URL from the starting list and reads its content from the Internet;
2) It extracts some information from each document and puts it into an index database;
3) It extracts the URLs pointing to other documents from the document and adds them to the URL list;
4) It repeats the above three steps until no new URLs are found or some limit (time or disk space) is exceeded;
5) A query interface is added to the index database, which is then published online.
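These five steps map directly onto a short program. The following is a minimal sketch in Python, not the implementation of any particular engine: the starting URL, the MAX_PAGES limit, and the regular-expression link extraction are illustrative assumptions (a real robot would use a proper HTML parser and index database).

    # Minimal sketch of the five-step robot loop described above.
    from urllib.request import urlopen
    from urllib.parse import urljoin
    import re

    MAX_PAGES = 100                      # step 4: an arbitrary resource limit
    url_list = ["http://example.com/"]   # step 1: the starting list of URLs
    index = {}                           # stands in for the index database
    seen = set(url_list)                 # the trajectory of visited URLs

    while url_list and len(index) < MAX_PAGES:
        url = url_list.pop(0)            # step 1: take a URL from the list
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                     # unreachable or malformed URL
        index[url] = html[:200]          # step 2: extract information, store it
        for link in re.findall(r'href="([^"]+)"', html):  # step 3: extract URLs
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                url_list.append(absolute)
    # step 5: a query interface over `index` would now be published online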
This algorithm has two basic search strategies: depth-first and breadth-first.
The robot's search strategy is determined by how it maintains the URL list (a short sketch follows the list):
1) First in, first out, which produces a breadth-first search. When the starting list contains a large number of web server addresses, breadth-first search produces good initial results but has difficulty going deep into any single server.
2) Last in, first out, which produces a depth-first search. This yields a better document distribution and makes it easier to discover the structure of a document collection, that is, to find the maximum number of cross-references.
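The difference lies only in which end of the list the next URL is taken from. A minimal illustration in Python, using collections.deque (the frontier contents are invented):

    # The same URL list yields different traversals depending on which
    # end the robot takes the next URL from.
    from collections import deque

    frontier = deque(["serverA/page1", "serverB/page1", "serverC/page1"])
    frontier.append("serverA/page2")   # newly discovered links are appended

    frontier.popleft()   # first in, first out -> breadth-first search
    frontier.pop()       # last in, first out  -> depth-first search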
Result processing technology
Main factors in choosing a search site
A search engine should be able to find the sites that match a search request and sort the results by relevance. Relevance here means the frequency with which the keyword appears in a document, up to a maximum score of 1: the higher the frequency, the more relevant the document is considered to be. However, because current search engines are not intelligent, the first result is not necessarily the "best" one unless you already know the exact title of the document you want. Some documents, despite a high relevance score, are not necessarily what the user needs most.
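A minimal sketch of such a frequency-based score, capped at a maximum of 1 as described above (the saturation point of 10 occurrences is an invented assumption; the text does not give the exact formula):

    # Hedged sketch: relevance as keyword frequency, capped at 1.
    def relevance(keyword: str, document: str) -> float:
        hits = document.lower().split().count(keyword.lower())
        return min(1.0, hits / 10)   # assume 10 occurrences saturate the score

    print(relevance("robot", "the robot follows links; the robot indexes pages"))
    # -> 0.2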
A search engine is a computer network application with high technical content, drawing on network technology, database technology, retrieval technology, intelligent technology, and more. Because many advanced foreign technologies are built on a Western-language kernel, we cannot simply import and copy them. The question for a Chinese search engine is how to exploit our own strengths in Chinese and develop core technology with our own copyright, so as to gain a favorable position in the competition among Chinese search engines.
The four main factors in choosing a search site:
a. The size of the web database, assessed mainly by manual browsing.
b. The retrieval response time, measured mainly by a program: the program records the time when the query is sent to the search engine and the time when the results come back, then subtracts the two to obtain the retrieval response time (see the timing sketch after this list).
c. The quality grading of the web pages, assessed mainly by manual rating.
A search engine always returns its results to the user, and how those results are displayed directly reflects the engine's quality. Therefore the way results are presented, how they are sorted, and whether sufficient information is given (character encoding, file size, file date, etc.) all have a great impact on the user's judgment of the search results.
d. The relevance of each site, which depends on the following factors, and whether the relevance of the search results can be distinguished:
- a manually assigned relevance coefficient for each site, such as 1.0 for Yahoo and 0.94 for GoYoYo;
- the number of keywords appearing in the link and in the summary;
- the recorded return time, i.e., the retrieval response time.
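The timing measurement in item b can be sketched as follows; the engine query URL is a placeholder to be replaced with a real one:

    # Sketch of measuring retrieval response time: record the time the
    # query is sent, the time the results return, and subtract the two.
    import time
    from urllib.request import urlopen

    def response_time(query_url: str) -> float:
        start = time.monotonic()           # time the query is sent
        urlopen(query_url).read()          # wait for the full result page
        return time.monotonic() - start    # elapsed retrieval response time

    print(response_time("http://example.com/"))  # substitute a real query URL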
Sorting the results
(1) Sorting by frequency
Generally speaking, if a page contains the keywords more times, its relevance to the search target is assumed to be higher. This is a very common ranking scheme.
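A minimal sketch of this scheme: count how many times the keyword occurs in each candidate page and return the pages in descending order (the pages and their contents are invented):

    # Sketch of frequency-based ordering: pages containing the keyword
    # more often are returned first.
    docs = {
        "page1": "robot robot spider index",
        "page2": "spider index",
        "page3": "robot spider robot robot",
    }

    def rank_by_frequency(keyword, documents):
        counts = {url: text.split().count(keyword) for url, text in documents.items()}
        return sorted(counts, key=counts.get, reverse=True)

    print(rank_by_frequency("robot", docs))   # -> ['page3', 'page1', 'page2']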
(2) Sorting by page popularity
In this method, the search engine records how often each page it indexes is accessed. Pages that people visit more often usually contain more information, or are otherwise more attractive. Since most users of search engines are not professionals, this scheme suits general search needs.
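A sketch under the assumption that the engine keeps a per-page access counter (the counts are invented):

    # Sketch of popularity-based ordering: the engine records how often
    # each indexed page is accessed and returns heavily visited pages first.
    access_counts = {"page1": 120, "page2": 15, "page3": 870}

    def rank_by_popularity(result_urls, counts):
        return sorted(result_urls, key=lambda u: counts.get(u, 0), reverse=True)

    print(rank_by_popularity(["page1", "page2", "page3"], access_counts))
    # -> ['page3', 'page1', 'page2']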
(3) Further refinement (Refine)
Optimize the search results according to certain conditions, for example by choosing a category, related words, and so on.
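Refinement can be sketched as a post-filter over an existing result set; the category labels and summaries below are invented:

    # Sketch of "refine": narrow an existing result set by extra conditions,
    # here a category label and a required related word.
    results = [
        {"url": "page1", "category": "news",    "summary": "robot spider index"},
        {"url": "page2", "category": "science", "summary": "spider web index"},
        {"url": "page3", "category": "science", "summary": "robot traversal"},
    ]

    def refine(result_set, category=None, related_word=None):
        out = result_set
        if category is not None:
            out = [r for r in out if r["category"] == category]
        if related_word is not None:
            out = [r for r in out if related_word in r["summary"]]
        return out

    print(refine(results, category="science", related_word="robot"))  # page3 only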