Key crawler technologies

xiaoxiao · 2021-03-06

1. Distributed, multi-threaded crawling. Task scheduling is the crucial problem here: with a vast number of web pages to fetch, avoiding fetching the same page twice is essential, so a crawler needs a good distribution strategy and a good task-scheduling mechanism (a minimal sketch follows below).

There is no single named algorithm for this, because it is a specialized field, but research on crawler scheduling generally builds on traditional distributed scheduling algorithms.
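To make the de-duplication idea concrete, here is a minimal sketch of a thread-safe URL frontier in Python. The class and method names are illustrative assumptions, not from any particular crawler; a production system would use a distributed queue and a disk-backed seen-set or Bloom filter rather than an in-memory set.

```python
# Minimal sketch: a thread-safe URL frontier that never schedules
# the same URL twice. Names here are illustrative assumptions.
import threading
from collections import deque

class URLFrontier:
    def __init__(self, seeds):
        self._lock = threading.Lock()
        self._queue = deque(seeds)
        self._seen = set(seeds)  # in-memory dedup; a Bloom filter scales better

    def push(self, url):
        # Enqueue the URL only if it has never been scheduled before.
        with self._lock:
            if url not in self._seen:
                self._seen.add(url)
                self._queue.append(url)

    def pop(self):
        # Return the next URL to fetch, or None when the frontier is empty.
        with self._lock:
            return self._queue.popleft() if self._queue else None
```

Worker threads then loop on pop(), fetch the page, and push() any links they extract; the shared seen-set is what keeps the crawl from repeating itself.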

2. Assessing the importance of web pages. This is very important because a crawler does not fetch every page, only about 20% of the web, so it must evaluate how important each page is, and how to evaluate this well is a key question.

The evaluation is generally done with the PageRank algorithm, an algorithm invented by Google. (It determines the rank value of page A from the number of pages linking to A and their weights; Matrix's PageRank of 5.0, for example, is a medium score.)
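To illustrate the idea behind PageRank, here is a minimal power-iteration sketch over a toy three-page link graph. The damping factor 0.85 is the commonly cited value from the original formulation; the graph and function names are made up for the example.

```python
# Minimal sketch: PageRank by power iteration over a toy link graph.
# `graph` maps each page to the pages it links to.
def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_ranks = {}
        for page in graph:
            # A page's rank is fed by the ranks of pages linking to it,
            # each share divided by the linking page's out-degree.
            incoming = sum(ranks[src] / len(targets)
                           for src, targets in graph.items()
                           if page in targets)
            new_ranks[page] = (1 - damping) / n + damping * incoming
        ranks = new_ranks
    return ranks

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))  # C accumulates the most rank in this toy graph
```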

3. Updating web pages. Pages keep changing after they are crawled, so the crawler must refresh them so that the indexed pages stay current. The simplest update policy is to re-download all pages, but that takes on the order of a month and carries a heavy cost, which is unacceptable. An excellent update algorithm is therefore the foundation of a good crawler (a sketch of one selective policy follows below).
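One common alternative to re-downloading everything is a selective refresh policy: revisit a page only when its age exceeds how often it has been observed to change. The sketch below assumes hypothetical per-page bookkeeping fields (last_crawled, avg_change_interval); it illustrates the idea rather than any specific crawler's policy.

```python
# Minimal sketch: refresh a page only when it is "due" according to its
# observed average change interval. Field names are assumptions.
import time

def due_for_refresh(page, now=None):
    # A page is due when its age exceeds its average change interval.
    now = now or time.time()
    age = now - page["last_crawled"]
    return age >= page["avg_change_interval"]

pages = [
    {"url": "https://example.com/news", "last_crawled": time.time() - 7200,
     "avg_change_interval": 3600},        # changes hourly -> overdue
    {"url": "https://example.com/about", "last_crawled": time.time() - 7200,
     "avg_change_interval": 30 * 86400},  # changes monthly -> not due yet
]
print([p["url"] for p in pages if due_for_refresh(p)])
```

This way fast-changing pages are refreshed often while static ones are left alone, which is what lets the index stay fresh without the heavy cost of a full re-download.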

4. Compression algorithms. Whatever the crawler fetches is stored locally, and because the data volume is large, the storage layer needs a compression mechanism to reduce the total storage capacity required. The same applies to data transmitted between the various data servers: a good compression algorithm reduces the bandwidth burden of that communication (see the sketch below).
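As a small illustration, the sketch below compresses a fetched page with zlib from Python's standard library before storage. Real crawlers typically batch many pages into one compressed block so the compressor can exploit redundancy across pages; the page content here is a stand-in.

```python
# Minimal sketch: compress page content before writing it to storage.
import zlib

html = b"<html><body>" + b"repetitive page content " * 200 + b"</body></html>"
compressed = zlib.compress(html, level=6)  # level 6 trades speed for ratio
print(f"raw: {len(html)} bytes, compressed: {len(compressed)} bytes")
assert zlib.decompress(compressed) == html  # the round trip is lossless
```

The same technique applies on the wire: compressing payloads exchanged between data servers cuts the bandwidth their communication consumes.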

