For a search engine, once the index size and the search volume grow large, index updates become steadily less efficient and the load on the servers keeps rising, so the whole engine comes under growing strain. Add the difficulties of storing massive amounts of data, and designing a good distributed search engine becomes a key factor in whether the search engine can cope with the future.
So what are the core issues of a distributed search engine?
1. Distributed information acquisition, computation, and data unification
This covers the distribution of crawlers (or the equivalent data acquisition mechanisms) and the processing of the acquired information.
2. Distributed storage and management of the processed data
Mainly the mechanisms for locating data precisely and for updating, adding, deleting, and moving it.
3. Distribution of the front-end search service
Mainly the mechanism for spreading out massive numbers of concurrent requests.
Based on these three basic needs, a distributed search engine can be built in roughly the following ways:
1. Distributed meta search engine
2. Semi-distributed search engine
3. P2P distributed search engine
4. Local traversal search engine
The four types of scalable search engine are introduced in turn below:
1. Distributed meta search engine
Several standalone search engines run as units, and a central search engine assembles the complete result from the results these units return. This design requires every unit to use the same ranking algorithm and essentially the same output data structure, so that the central engine can merge the results.
The key design point for this kind of engine is that the indexes held by the units must not overlap. Data acquisition (crawling), however, can run as an independent system that distributes the data to the units according to rules.
Advantages: simple design, fast, and any unit can be removed at any time without much impact.
Disadvantages: not a good solution for large-scale concurrency.
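The merge step performed by the central engine can be sketched in a few lines of Python. This is only an illustration under assumptions: the names assign_unit and merge_results are hypothetical, the distribution rule is assumed to be CRC32 modulo the number of units, and each unit is assumed to return (score, doc_id) pairs already sorted by descending score.

```python
import heapq
import zlib
from itertools import islice


def assign_unit(doc_id: str, num_units: int) -> int:
    """Pick the unit that indexes a document.

    Assumed rule (not from the article): CRC32 of the id, modulo unit count.
    A fixed rule like this keeps the unit indexes from overlapping.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_units


def merge_results(per_unit_results, limit=10):
    """Merge result lists that each unit has sorted by descending score.

    Because every unit uses the same ranking algorithm, the scores are
    comparable and the lists can be merged without re-ranking.
    """
    merged = heapq.merge(*per_unit_results, key=lambda r: r[0], reverse=True)
    return list(islice(merged, limit))


unit_a = [(0.92, "doc-7"), (0.40, "doc-2")]
unit_b = [(0.81, "doc-5"), (0.33, "doc-9")]
unit_c = []  # a unit that was removed or returned nothing

top = merge_results([unit_a, unit_b, unit_c], limit=3)
print(top)
```

An empty list from one unit does not disturb the merge, which matches the point that a unit can be dropped without much impact.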
2. Semi-distributed search engine
The index servers and document servers are sharded according to the query, so that for any index term the specific index server, and from it the correct document server, can be located.
Advantages: copes well with load, simple design.
Disadvantages: needs more dynamic adjustment, for example when a single index server or document server reaches its capacity.
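A minimal sketch of this kind of routing, assuming simple hash sharding. The server names and the functions shard, route_query, and locate_document are hypothetical and only illustrate the idea; the article does not say which sharding rule is used.

```python
import zlib

# Hypothetical server names, for illustration only.
INDEX_SERVERS = ["index-0", "index-1", "index-2"]
DOC_SERVERS = ["docs-0", "docs-1"]


def shard(key: str, servers):
    """Map a key to one server with CRC32 modulo (an assumed rule)."""
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


def route_query(query: str):
    """Return the index server responsible for each term of the query."""
    return {term: shard(term, INDEX_SERVERS) for term in query.lower().split()}


def locate_document(doc_id: str):
    """Return the document server that stores the given document."""
    return shard(doc_id, DOC_SERVERS)


routes = route_query("distributed search")
print(routes)
print(locate_document("doc-42"))
```

With plain modulo hashing, changing the number of servers remaps most keys, which is one concrete form of the dynamic adjustment cost listed under the disadvantages.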
3. P2P distributed search engine
The famous Napster used this kind of design: a centralized index that brought together the file sources held on individual computers around the world, making it one of the world's largest P2P search engines.
The central index server in this design records only the most essential information, such as location (IP, serial number), song title, and artist. Other information can be obtained from any online node that holds the complete information for that item. P2P can also build intermediate routing caches based on searches, so that some search results are available from a single node or from nearby nodes, which speeds up searching.
Advantages: can scale to a very large size, with almost no maintenance cost.
Disadvantages: the central server is slow to update, and the information sources are unstable.
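The central index can be pictured as a small lookup table from song title to the peers that hold the file. The sketch below is an assumption-laden illustration, not Napster's actual protocol: the class CentralIndex, its methods, and the peer addresses are all hypothetical.

```python
from collections import defaultdict


class CentralIndex:
    """Toy central index: stores only titles and peer locations, never files."""

    def __init__(self):
        self._peers_by_title = defaultdict(set)

    def register(self, peer_addr, titles):
        """A peer comes online and announces the titles it can serve."""
        for title in titles:
            self._peers_by_title[title.lower()].add(peer_addr)

    def unregister(self, peer_addr):
        """A peer goes offline; its entries must be removed from the index."""
        for peers in self._peers_by_title.values():
            peers.discard(peer_addr)

    def search(self, title):
        """Return the peers that currently offer the title."""
        return sorted(self._peers_by_title.get(title.lower(), ()))


index = CentralIndex()
index.register("10.0.0.1:6699", ["Song A", "Song B"])
index.register("10.0.0.2:6699", ["Song A"])
before = index.search("song a")

index.unregister("10.0.0.1:6699")
after = index.search("song a")
print(before, after)
```

Every peer that joins or leaves forces an update on the central server, and a result is only useful while the listed peer stays online, which mirrors the two disadvantages above.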
4. Local traversal search engine
This type of search engine can be designed in many ways; a fairly feasible one is to build an information tree once the information has been acquired. A search then only has to follow one branch of the tree. Local traversal needs clear rules, and each index entry that is added must be given a fairly precise position in the initial design, so that searching stays efficient.
Advantages: copes well with load, high search accuracy, high search efficiency.
Disadvantages: complex design, and moving an index to a different node is not easy.
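One possible shape for such a tree is a prefix tree keyed on the characters of each index term, so that a lookup walks a single branch and never touches the rest. This is an assumption made for illustration; the article does not say what the tree is keyed on, and the names Node and InfoTree are hypothetical.

```python
class Node:
    """One node of the tree: child branches plus the documents stored here."""

    def __init__(self):
        self.children = {}
        self.doc_ids = set()


class InfoTree:
    """Toy prefix tree; the position of an entry is fixed by its term."""

    def __init__(self):
        self.root = Node()

    def add(self, term, doc_id):
        """Place a document under the branch spelled out by the term."""
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, Node())
        node.doc_ids.add(doc_id)

    def search(self, term):
        """Walk only the branch for this term; other branches are untouched."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.doc_ids)


tree = InfoTree()
tree.add("index", "d1")
tree.add("index", "d2")
tree.add("info", "d3")
print(tree.search("index"))
```

Since an entry's position follows directly from its term, relocating entries means rebuilding part of the tree, which is in line with the disadvantage noted above.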
Overall, there are many ways to design a search engine. This post is only meant to start the discussion, and I believe cleverer designs will appear in the future.
Reposted from:
http://www.wespoke.com/archives/001020.html
Http://blog.9cbs.net/aresky/archive/2006/06/19/812669.aspx