For a search engine, once the index size and the search volume grow large, index updates become steadily less efficient and the pressure on the servers keeps rising, so the utilization of the whole system keeps climbing. Add the difficulties of storing massive amounts of data, and a well-designed distributed architecture becomes a key factor in whether a search engine can cope with the future.
So what are the core issues of a distributed search engine?
1. Distributed information acquisition and computation, and keeping the resulting data consistent. This covers distributing the crawler (or an equivalent data acquisition mechanism) and distributing the processing of the collected information.
2. Distributed storage and management of the processed data, mainly the mechanisms for precisely locating, updating, adding, deleting, and moving files.
3. Distribution of the front-end search service, mainly how requests are spread out under massive concurrency.
Based on these three basic needs, a distributed search engine can be built in roughly the following ways:
1. Distributed metasearch engine
2. Hash-distributed search engine
3. P2P distributed search engine
4. Local traversal search engine
The four types of scalable search engine above are introduced in turn below.
1. Distributed metasearch engine: There are multiple independent search engines (units), and a central search engine assembles the complete result from the results returned by these units. This design requires every unit to use the same ranking algorithm and substantially the same output structure, so that the central engine can merge the results. The key design requirement is that the indexes held by the units do not overlap. Data acquisition (the crawler) can run as an independent system that distributes documents to the units according to fixed rules.
Advantages: simple design, fast, and any unit can be removed at any time without much impact.
Disadvantages: not a good solution for large-scale concurrency.
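The merge step performed by the central engine can be sketched as follows. This is only a minimal illustration, not a reference design: the names `UnitEngine` and `central_search` are hypothetical, each unit is assumed to hold an in-memory mapping from a term to scored hits, and the scores are assumed to be comparable because every unit uses the same ranking algorithm.

```python
import heapq
from itertools import islice
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document id)


class UnitEngine:
    """One independent search unit holding its own, non-overlapping index slice."""

    def __init__(self, postings: Dict[str, List[Hit]]):
        # Keep every hit list sorted by descending score.
        self.postings = {t: sorted(h, reverse=True) for t, h in postings.items()}

    def search(self, term: str, k: int) -> List[Hit]:
        # Return at most k best hits for the term.
        return self.postings.get(term, [])[:k]


def central_search(units: List[UnitEngine], term: str, k: int) -> List[Hit]:
    """Ask every unit for its top k hits and merge them into a global top k."""
    partials = [unit.search(term, k) for unit in units]
    # Each partial list is already sorted by descending score, so a k-way
    # merge on the negated score yields one globally sorted stream.
    merged = heapq.merge(*partials, key=lambda hit: -hit[0])
    return list(islice(merged, k))
```

Because every unit answers every query, removing a unit only removes its slice of the results, but each unit also sees the full query load, which matches the concurrency weakness noted above.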
2. Hash-distributed search engine: Both the index servers and the document servers are partitioned by hash, so that for any index term in a query the system can locate the specific index server that holds it, and then locate the correct document server.
Advantages: copes well with heavy load, simple design.
Disadvantages: it is difficult to adjust an individual index server or document server.
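A minimal sketch of this kind of routing is shown below. It assumes plain modulo hashing, which the article does not specify, and the class `HashRouter`, the helper `stable_hash`, and the server names are all hypothetical. The helper `moved_keys` is included only to illustrate why adjusting the set of servers is hard under this assumption.

```python
import hashlib
from typing import List


def stable_hash(key: str) -> int:
    """Hash that gives the same value in every process (unlike built-in hash())."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRouter:
    """Routes index terms and document ids to servers by modulo hashing."""

    def __init__(self, index_servers: List[str], doc_servers: List[str]):
        self.index_servers = index_servers
        self.doc_servers = doc_servers

    def index_server_for(self, term: str) -> str:
        return self.index_servers[stable_hash(term) % len(self.index_servers)]

    def doc_server_for(self, doc_id: str) -> str:
        return self.doc_servers[stable_hash(doc_id) % len(self.doc_servers)]


def moved_keys(keys: List[str], before: int, after: int) -> int:
    """Count keys whose server slot changes when the server count changes."""
    return sum(1 for k in keys if stable_hash(k) % before != stable_hash(k) % after)
```

With modulo hashing, going from 4 to 5 servers reassigns most keys, so adding or replacing one server means moving a large share of the index. Schemes such as consistent hashing exist to reduce this, but they are outside what the article describes.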
3. Peer-to-peer (P2P) search engine: The famous Napster is such a design. It uses a centralized index that combines the file sources held on individual computers around the world into one of the largest P2P search engines in the world. The central index server in this design records only the relatively critical information, such as location (IP, serial number), song title, and author. All other information can be obtained from any online node that holds the complete information for that item. In addition, a P2P system can build intermediate routing caches based on searches, so that some search results are available on a single node or on nearby nodes, which speeds up searching.
Advantages: can scale to a very large size, with almost no maintenance cost.
Disadvantages: the central server is inefficient, and the information sources are unstable.
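The centralized index idea can be illustrated with a toy sketch. It is not Napster's actual protocol: the class `CentralIndex`, its methods, and the peer addresses are hypothetical, and the index stores only the critical fields mentioned above (peer location, song title, author).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class CentralIndex:
    """Central server that knows who has what, but stores no files itself."""

    def __init__(self) -> None:
        # Lower-cased title -> set of (peer address, author).
        self.by_title: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
        self.online: Set[str] = set()

    def register(self, peer: str, songs: Iterable[Tuple[str, str]]) -> None:
        """A peer comes online and announces its (title, author) pairs."""
        self.online.add(peer)
        for title, author in songs:
            self.by_title[title.lower()].add((peer, author))

    def disconnect(self, peer: str) -> None:
        """A peer goes offline; its files become unreachable."""
        self.online.discard(peer)

    def search(self, title: str) -> List[str]:
        """Return the addresses of online peers that hold the title."""
        entries = self.by_title.get(title.lower(), set())
        return sorted(peer for peer, _ in entries if peer in self.online)
```

The client then downloads the file directly from one of the returned peers. The sketch also shows the instability noted above: a result disappears as soon as the peer holding it disconnects.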
4. Local traversal search engine: A search engine of this kind can be designed in several ways. The more feasible one is to cluster the information into a tree. Once the tree has been built, a search only needs to traverse one branch of it. Local traversal must follow definite rules, and the initial design must place each added index entry at a reasonably accurate position, so that search efficiency is guaranteed.
Advantages: handles heavy load easily, high search accuracy, high search efficiency.
Disadvantages: complex design, and it is not easy to adjust the node where an index entry is located.
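One way to picture the single-branch traversal is the sketch below. It rests on assumptions the article does not make: documents are represented as small numeric feature vectors, each tree node carries a cluster centroid, and the names `Node`, `add_document`, and `search` are hypothetical. Placement and search use the same descent rule, which is one possible reading of the "accurate position" requirement.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = Tuple[float, ...]


def dist2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass
class Node:
    centroid: Vector
    children: List["Node"] = field(default_factory=list)
    docs: List[Tuple[str, Vector]] = field(default_factory=list)  # leaves only


def descend(root: Node, point: Vector) -> Tuple[Node, int]:
    """Follow the nearest-centroid child at every level; count visited nodes."""
    node, visited = root, 1
    while node.children:
        node = min(node.children, key=lambda c: dist2(c.centroid, point))
        visited += 1
    return node, visited


def add_document(root: Node, doc_id: str, vec: Vector) -> None:
    """Place a new entry in the leaf that a search for it would reach."""
    leaf, _ = descend(root, vec)
    leaf.docs.append((doc_id, vec))


def search(root: Node, query: Vector, k: int) -> Tuple[List[str], int]:
    """Search one branch only; return the top k ids and the nodes visited."""
    leaf, visited = descend(root, query)
    ranked = sorted(leaf.docs, key=lambda d: dist2(d[1], query))
    return [doc_id for doc_id, _ in ranked[:k]], visited
```

Only one leaf is examined per query, which is where the efficiency comes from. The cost is that a badly placed entry, or a cluster that has drifted, can only be fixed by moving entries between nodes.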
Overall, there are many ways to design a search engine. This article is only meant to get the discussion started, and I believe more ingenious designs will appear in the future.