How to write a distributed search-engine crawler for a graduation design project --- Notes

xiaoxiao · 2021-03-06

Writing a distributed crawler in Python

1. Network connections: use persistent connections (HTTP keep-alive); DNS resolution is a bottleneck (check a local DNS cache first).

Implementation: build connections on Python's httplib, which fully supports HTTP/1.1, and control them through the underlying socket object (if HTTP/1.1 is not needed, urlopen can be used instead). The key points are to add a local DNS cache for lookups (in my design this is not the main problem and can be ignored for now) and to set timeouts: use socket.setdefaulttimeout() for a global default, as IGLOO does (or run your own DNS server and optimize resolution there), and call settimeout() on each socket object so a thread never waits indefinitely on a web server that may never respond. (Test how much time the connection module and the DNS resolution module spend, under the default timeout, when accessing a URL that does not exist.) After a site's IP address has been resolved, connect to the IP directly to avoid repeated DNS lookups. Example: socket.gethostbyname("www.163.com").

The network connection/download module is very important and needs careful testing, because the crawler may encounter non-conforming web servers; if this is not handled, it can crash the whole thread.
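As a concrete illustration, here is a minimal Python 2 sketch of such a download routine built on the httplib module the text refers to. The cache dictionary, timeout value, and function names are illustrative assumptions, not the author's actual code.

```python
# Minimal sketch (Python 2): DNS caching, timeouts, and defensive error
# handling around httplib, as described above. Names and values are
# illustrative, not taken from the original project.
import socket
import httplib

socket.setdefaulttimeout(10)   # global default so no thread blocks forever
_dns_cache = {}                # hostname -> resolved IP address

def resolve(host):
    """Resolve a hostname once and reuse the cached IP afterwards."""
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)
    return _dns_cache[host]

def fetch(host, path="/"):
    """Download one page over HTTP/1.1; return (status, body) or None."""
    try:
        ip = resolve(host)
        conn = httplib.HTTPConnection(ip, 80, timeout=10)
        # Send the real hostname even though we connect by IP.
        conn.request("GET", path, headers={"Host": host,
                                           "Connection": "keep-alive"})
        resp = conn.getresponse()
        return resp.status, resp.read()
    except (socket.error, httplib.HTTPException):
        # An ill-behaved server must not bring down the whole thread.
        return None
```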

2. Multi-threading: allocation of machine tasks and allocation of site tasks.

Implementation: implemented per machine; machine tasks are allocated after measuring the machine's CPU consumption, and site tasks are allocated after judging the connection conditions of each site.

Allocation of machine tasks: adjust the number of threads running on each machine to control that machine's load. (Note: when shutting a thread down, let it finish the task it is currently running.)

Allocation of site tasks: divide the threads opened on one machine among the sites it crawls. (Again, a thread being shut down must be allowed to complete its current task; a minimal worker-pool sketch follows below.)
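The following is a minimal Python 2 sketch of such a worker pool with graceful shutdown, using the standard threading and Queue modules; the class name, the queue layout, and the fetch_and_parse placeholder are illustrative and not taken from the original project.

```python
import threading
import Queue   # named "queue" in Python 3

class SiteWorker(threading.Thread):
    """One crawl thread; it finishes its current task before stopping."""
    def __init__(self, tasks):
        threading.Thread.__init__(self)
        self.tasks = tasks           # Queue of URLs for one site
        self._stop_flag = False

    def stop(self):
        # Only request the stop; the task already running is completed.
        self._stop_flag = True

    def run(self):
        while not self._stop_flag:
            try:
                url = self.tasks.get(timeout=1)
            except Queue.Empty:
                continue
            fetch_and_parse(url)     # placeholder for the download module above
            self.tasks.task_done()

def resize_pool(workers, tasks, target):
    """Grow or shrink the number of threads for one machine or site."""
    while len(workers) < target:
        w = SiteWorker(tasks)
        w.start()
        workers.append(w)
    while len(workers) > target:
        workers.pop().stop()         # thread exits after its current URL
```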

3. Careful control of the traversal of the web file tree; the tree is traversed breadth-first. (The whole web is a graph, while the model of a single site is closer to a tree.)

Implementation: tag each address with its layer (depth) number when it enters the queue; when the traversal reaches layer N+1, stop reading further.
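A minimal Python sketch of this depth-tagged breadth-first traversal; the function names, the fetch_and_extract_links placeholder, and the depth limit are illustrative assumptions.

```python
from collections import deque

def crawl_site(start_url, max_depth):
    """Breadth-first traversal; URLs are queued together with their depth."""
    seen = set([start_url])
    queue = deque([(start_url, 0)])        # (url, layer number)
    while queue:
        url, depth = queue.popleft()
        if depth > max_depth:              # layer N+1 reached: stop reading
            break
        links = fetch_and_extract_links(url)   # placeholder for modules 1 and 7
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
```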

4. Use robotparser to parse robots.txt.
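A minimal sketch of the standard-library robotparser usage (Python 2 module name, as the text assumes); the example host is the one used elsewhere in these notes and the page path is illustrative.

```python
import robotparser   # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url("http://www.163.com/robots.txt")
rp.read()
if rp.can_fetch("*", "http://www.163.com/some/page.html"):
    pass   # this URL may be crawled under the site's robots.txt rules
```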

5. The role of a single-machine spider:

a) Multi-threaded crawling as in (2) and file-tree traversal as in (3).

b) Send the external URLs it discovers back to the central controller, and retrieve new external URLs from the central controller.

6. The role of the central controller:

a) Monitor the status of each machine: CPU, memory, threads, sites, and network traffic.

b) Monitor overall network traffic and connection conditions, and adjust the timeout according to network conditions.

c) Accept the external URLs sent by each machine and count how many times each URL is repeated, then allocate them to the machines. (A crawl-policy controller sorts the external URLs before allocation; IGLOO uses PageRank, while we can use the simplest measure, the repetition count: the more often a URL is referenced, the higher its importance factor and the earlier it is sorted. A sketch of this counting, together with the assignment in (d), follows after the comparison below.)

d) Distributed URL assignment algorithm: IGLOO 1.2 uses a two-level hash mapping algorithm (with a centralized distribution algorithm, the central controller easily becomes a system bottleneck). Also reconsider the hashing used to judge whether a URL has already been visited (IGLOO uses a URL trie with a lazy merge policy). Berkeley DB can be used as an alternative to the URL trie. Comparison of two implementations:

i. Current idea (site-oriented, coarse information granularity): for external links, save only hostnames such as www.163.com; within a site, crawl using the resolved IP address and follow relative links to reach its pages. This means maintaining one external-link list plus a list per site. Advantages: saves memory, gathers comprehensive information about each site, and supports statistics, sorting, and identifying important sites. Disadvantages: the comprehensiveness of the link graph is not guaranteed, the more important individual pages cannot be singled out, and not every site contains many important pages.

ii. Old scheme (page-oriented, fine information granularity): treat every link alike. Disadvantages: wastes resources, and coverage of any single site is not necessarily comprehensive. Advantages: a comprehensive link graph can be obtained, so PageRank can be used to sort the list and the more important pages come first.
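To make (c) and (d) concrete, here is a minimal Python 2 sketch of a central controller that counts URL repetitions, sorts by that simple importance factor, and assigns hostnames to machines with a hash function. IGLOO 1.2's two-level hash mapping is simplified here to a single modulo step, and all class and method names are illustrative assumptions.

```python
from collections import defaultdict

class CentralController(object):
    """Counts external URLs, sorts them by repetition, assigns by hash."""
    def __init__(self, num_machines):
        self.num_machines = num_machines
        self.url_counts = defaultdict(int)   # URL -> how often it was reported

    def accept(self, urls):
        """Receive external URLs reported by one spider machine."""
        for url in urls:
            self.url_counts[url] += 1

    def assign(self):
        """Return {machine_id: [urls...]}, most-repeated URLs first."""
        ranked = sorted(self.url_counts,
                        key=lambda u: self.url_counts[u], reverse=True)
        allocation = defaultdict(list)
        for url in ranked:
            host = url.split('/')[2] if '://' in url else url
            machine = hash(host) % self.num_machines   # simplified hash mapping
            allocation[machine].append(url)
        return allocation
```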

7. Parsing HTML (hyperlink extraction). Implementation: use Python's sgmllib. Disadvantage: it is too slow and may become a bottleneck, so it is best to wrap it behind an interface and replace it later when there is a chance.
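A minimal sgmllib link extractor along these lines (Python 2 module name, as in the text); the class name and the wrapper function are illustrative.

```python
from sgmllib import SGMLParser   # removed in Python 3; shown as the text assumes

class LinkExtractor(SGMLParser):
    """Collects the href value of every <a> tag in a page."""
    def reset(self):
        SGMLParser.reset(self)
        self.links = []

    def start_a(self, attrs):
        for name, value in attrs:
            if name == "href":
                self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```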

