Larbin, an Efficient Search Engine Crawler Tool :: [Search Engine]

xiaoxiao2021-03-06 27

It has been ten days since I left Dallas. Apart from meeting people everywhere, I have had basically no time to learn anything new, and no chance to get any work finished.

Niu.la, booso, luliang.dhs.org and WESPOKE are all down; it seems nobody is maintaining them at the end of the year.

The ITSeek developers have asked me about Larbin many times, so here is a brief introduction. Compared with more complex systems, Larbin is highly configurable and works very efficiently.

1] Introduction to Larbin

Larbin is an open source web crawler (web spider), developed independently by a young Frenchman, Sébastien Ailleret. Larbin's goal is to follow the URLs found on each page in order to extend the crawl, and ultimately to provide a broad data source for search engines.

Larbin is only a crawler: it just fetches web pages, and parsing them is left to the user. Storing pages in a database and building an index are not provided either.
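Since parsing is left to the user, here is a minimal sketch of what that user-side step might look like. It is not part of Larbin; it uses only the Python standard library, and the class name, function name and example URLs are my own assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the list of absolute link targets found in html."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, standing in for a page the crawler fetched.
page = '<a href="/docs/">Docs</a> <a href="http://example.org/a.mp3">Song</a>'
print(extract_links("http://example.com/index.html", page))
```

A real setup would feed the pages saved by the crawler into something like `extract_links` and pass the results on to whatever storage or indexing step you build yourself.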

Larbin was designed from the start to be simple but highly configurable, which is why a single, simple Larbin crawler can fetch 5 million web pages a day. That is very efficient.

2] Larbin's performance characteristics

This is my own evaluation of Larbin. In April this year I ran a performance test. Luliang.dhs.org is the server I normally use: a 1 GHz CPU, 512 MB of memory, and otherwise average specs, since it was bought three years ago.

I used my own web page, six-wing, as the entry point and crawled URLs up to 5 levels deep.
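A crawl limited to a fixed number of levels from an entry page can be sketched as a breadth-first walk. This is only an illustration of the idea, not Larbin's code: the function name, the toy link graph, and the convention that the entry page sits at depth 0 are all assumptions.

```python
from collections import deque


def crawl_within_depth(seed, get_links, max_depth):
    """Breadth-first walk from seed, returning {url: depth} for every URL
    reachable in at most max_depth link hops."""
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # do not expand pages that sit at the depth limit
        for link in get_links(url):
            if link not in depth_of:  # skip URLs that were already seen
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of


# Toy link graph standing in for real pages (hypothetical names).
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": ["seed"],
}

print(crawl_within_depth("seed", lambda u: TOY_WEB.get(u, []), max_depth=2))
# prints {'seed': 0, 'a': 1, 'b': 1, 'c': 2}
```

In a real crawl, `get_links` would fetch the page and extract its links, and `max_depth` would be 5 for the test described above.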

Some data I recorded:

Network I/O: 500-700K per second (probably limited by my download bandwidth)
CPU (top): 5%-15%
Disk consumption: about 1 MB/s, roughly 3 GB of web pages crawled per hour
Pages: almost 200,000
URLs analyzed: 2-3 million per hour
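As a rough sanity check, these figures can be compared with the 5 million pages a day mentioned earlier. The arithmetic below assumes that the "almost 200,000" pages correspond to one hour of crawling and that 3 GB means 3 x 1024 x 1024 KB; neither is stated explicitly in my notes, so treat the results as estimates.

```python
# Figures taken from the test notes; the per-hour reading of the page
# count and the GB-to-KB conversion are assumptions.
DISK_GB_PER_HOUR = 3
PAGES_PER_HOUR = 200_000
PAGES_PER_DAY_CLAIM = 5_000_000

# Average size of one stored page, in KB.
avg_page_kb = DISK_GB_PER_HOUR * 1024 * 1024 / PAGES_PER_HOUR

# Throughput measured in the test, per second and extrapolated to a day.
pages_per_second = PAGES_PER_HOUR / 3600
pages_per_day = PAGES_PER_HOUR * 24

# Throughput implied by the 5 million pages a day figure.
claim_pages_per_second = PAGES_PER_DAY_CLAIM / 86_400

print(f"average page size: {avg_page_kb:.1f} KB")
print(f"measured: {pages_per_second:.1f} pages/s, {pages_per_day:,} pages/day")
print(f"claimed:  {claim_pages_per_second:.1f} pages/s")
```

Under these assumptions the test works out to about 15.7 KB per page and 4.8 million pages a day, which is close to the 5 million figure.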

3] Larbin's role

Many people who see Larbin for the first time do not know where to start, so here is a brief overview of Larbin's functions and practical applications.

1. Larbin can fetch all the links of a single, specified website, or even mirror the whole site.
2. Larbin can build URL lists. For example, after retrieving the URLs from all the pages, you can collect the links to XML files, or to MP3 files.
3. Larbin can be customized to serve as the information source for a search engine (for example, the fetched pages can be stored in a series of directories, 2000 pages per directory).
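Points 2 and 3 above can be sketched in a few lines. This is a hypothetical illustration rather than Larbin's actual output layout or API: the `save/dNNNNN/fNNNN.html` naming, the function names and the example URLs are assumptions, and only the figure of 2000 pages per directory comes from the text.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

PAGES_PER_DIR = 2000  # from the text; the path layout below is hypothetical


def page_path(index, root="save"):
    """Map the index-th saved page (0-based) to a bucketed file path."""
    bucket = index // PAGES_PER_DIR
    return f"{root}/d{bucket:05d}/f{index % PAGES_PER_DIR:04d}.html"


def filter_by_extension(urls, extensions):
    """Keep only URLs whose path ends with one of the given extensions."""
    wanted = {ext.lower() for ext in extensions}
    return [u for u in urls
            if PurePosixPath(urlparse(u).path).suffix.lower() in wanted]


urls = [
    "http://example.com/feed.xml",
    "http://example.com/song.MP3?x=1",
    "http://example.com/index.html",
]
print(filter_by_extension(urls, [".xml", ".mp3"]))
print(page_path(1999), page_path(2000))
```

Keeping a fixed number of files per directory avoids ending up with one huge directory, which many file systems handle poorly.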

In summary, Larbin is a product that deserves the attention of search engine enthusiasts. Although its functionality is gradually being absorbed and replaced by Nutch, its elegant crawler design is worthy of praise.

When reposting, please credit the original address: https://www.9cbs.com/read-41063.html
