Research and Design of a Meta Search Engine

xiaoxiao2021-03-06 51

Research Institute of Computational Technology Li Rui COLIN719@126.com

Abstract: This paper briefly introduces the basics of meta search engines and proposes a design for a meta search engine system. The system uses a feedback mechanism to learn and adjust online while users browse the results. As part of the design we propose a search syntax, a user-oriented scheduling mechanism for member search engines, and support for personalized services, and we present the key technologies for building a meta search engine system. Finally, we discuss the significance of the system and the problems that remain to be solved.

Keywords: Internet; search engine; meta search engine; information retrieval; search syntax

1. Introduction

In the early days of the Internet there were relatively few websites and web pages, so finding information was fairly easy. With the rapid development of the Internet, people rely more and more on the network to find the information they need. But with the explosive growth of the Internet, an ordinary user looking for a particular piece of information is searching for a needle in the ocean and is easily lost in a sea of information. This is the situation often described as "rich in information, poor in knowledge". Search engines are the technology that arose to solve this problem of getting lost.

A search engine (SE) collects, discovers, organizes, and processes information on the Internet according to a certain strategy, and provides users with retrieval services so as to act as an information navigator. There are many search engines online today; the better-known ones include Google, Yahoo, AltaVista, Dogpile, and Baidu. According to how they collect information and provide services, search engine systems fall into three major categories: directory search engines, represented by Yahoo (which recently switched to full-text search technology); full-text search engines, represented by Google; and meta search engines, represented by Dogpile.

2. Overview of Meta Search Engines

According to studies, a single search engine covers only 30-50% of the resources on the Internet [3], so recall cannot be guaranteed. In addition, every search engine is designed with its own database index range, its own functions and usage, and its own expected user group, so for the same search request the overlap between the results of different search engines is less than 34% [5], and precision cannot be guaranteed either. To get more comprehensive and accurate results, one must therefore call several search engines in turn and compare, filter, and cross-check the returned results. The meta search engine came into being to meet this need.
2.1 Definition

A meta search engine (MSE) is an engine built on top of independent search engines that works by calling them; it is also known as "the mother of search engines". Here "meta" carries the sense of "overall" and "transcending": a meta search engine integrates, calls, controls, and optimizes the use of multiple independent search engines. An independent search engine that a meta search engine can draw on is called a "source search engine" or "component search engine". Functionally, a meta search engine is a filtering channel: it takes the output of several independent search engines as input, processes it into a final result, and returns that final result to the user.

2.2 Typical Workflow

The typical workflow of a meta search engine can be summarized as follows:
(1) The user enters a query through a unified query interface, and the meta search engine preprocesses the query.
(2) The meta search engine selects a number of member search engines according to its member search engine scheduling mechanism.
(3) The meta search engine localizes the original query for each selected member search engine, converting it into the query string format that the engine requires.
(4) It sends the formatted query to each member search engine and waits for the results.
(5) It collects the results returned by each independent search engine.

(6) It merges the returned results, for example by eliminating duplicate links and dead links, to form the final result.
(7) It returns the final result to the user in a certain format.

2.3 Features of Meta Search Engines

A meta search engine differs from an independent search engine mainly in the following ways:
(1) It does not build a huge web page database, which saves storage.
(2) It provides a unified external interface and submits queries to multiple independent search engines.
(3) It performs secondary processing on the results of the independent search engines.
(4) It records the source search engine and the local relevance of each result, and provides a global relevance.

3. Development Trends of Meta Search Engines

Research on meta search engines is currently very active. It is a comprehensive and challenging topic that draws on information retrieval, artificial intelligence, databases, data mining, and natural language understanding. Because search engines have a very large number of users, they also have considerable economic value: the global market is estimated at billions of dollars. This has attracted close attention from the computer science community, the information industry, and the business world, which have invested substantial manpower and resources and have achieved good results.

An ideal meta search engine should meet the following functional requirements:
(1) Cover as many search resources as possible, allow independent search engines to be chosen and called freely, and schedule them automatically according to a scheduling policy.
(2) Offer as many options as possible, such as resource type (website, web page, news, software, FTP, MP3, Flash, image, film, etc.), wait time control, control of the number of results returned, result time range, filter selection, and result display mode.
(3) Provide powerful query processing (such as support for Boolean matching, phrase search, and natural language search) and conversion between the search operators and characters of different search engines (for example, automatically converting the "NEAR" operator to "AND" for search engines that do not support "NEAR").
(4) Give detailed and complete descriptions of search results (such as page title, URL, abstract, source search engine, and how well the result matches the user's query).
(5) Support retrieval in multiple languages, such as both Chinese and English.
(6) Classify results automatically, for example by domain name, country, resource type, or region.
(7) Provide personalized services for different users.

There are many meta search engines on the Internet, and their quality is uneven. In terms of functionality, most fall well short of the "ideal". Some meta search engines do well in certain respects but have defects or room for improvement in others: most do not support natural language retrieval or Chinese retrieval. The functionality of a meta search engine is constrained both by its source search engines and by meta search technology itself. On the one hand, the powerful features of a source search engine are restricted inside the meta search engine and cannot be fully used; on the other hand, no meta search technique can discover and exploit all the features of every independent search engine.

As new technologies continue to emerge, meta search engines will improve and user satisfaction will rise. The relevant directions are:
1. Improve the intelligent understanding of user queries, in particular support for natural language queries.
2. Define the scope of information collection and improve the focus of the search engine, as in topic-specific search and multimedia search.
3. Provide information filtering and personalized services based on intelligent agents.
4. Pay attention to research and development in cross-language retrieval [9], supporting multi-language retrieval and localized search services.
5. Improve the accuracy of query results and the effectiveness of search.

4. Design Concept of a Meta Search Engine

Based on the studies above, we propose a design for a meta search engine.

In this design we adopt a feedback mechanism. We do not refine every detail but give an overall framework; for each functional module in the architecture we analyze its function and implementation techniques and offer several candidate technologies. An implementation may select only some of the modules to reduce system complexity, or add further modules to extend the system's functionality; in other words, the design is highly scalable.

4.1 System Structure Framework

4.2 Functional Module Description

4.2.1 Graphical User Interface (GUI)

This module is the interface between the program and the user. It accepts the user's original query and displays the final results. An implementation can offer several interfaces, such as a command line and a graphical interface. Because this module does no data processing, it is easy to provide multiple views of the same data. Various human-computer interaction techniques can be considered for submitting the user's query to the system. On the interface the user can set the list of member search engines, the maximum wait time for each member search engine, the number of results to return, and so on, as well as the result display mode, sort policy, and classification method. These settings can be saved in a client-side cookie so that users do not have to re-enter their custom settings every time, which provides a personalized service. Search records can also be saved in cookies, so that search history and search habits can be mined for pattern discovery.

4.2.2 Query Preprocessor

This module accepts the original query from the GUI and preprocesses it, providing cross-language retrieval and natural language support. It relies on a query syntax and operator rules, so we briefly introduce the ones we designed.

• Boolean logic operations, including AND, OR, NOT, and (), are the most basic and most commonly used rules:
AND means that all keywords must appear in a search result; it can be replaced by '+' (plus sign) or a space.
OR means that at least one of the keywords must appear in a search result; it can be replaced by ',' (comma).
NOT means that the keyword following it must not appear in a search result; it can be replaced by '-' (minus sign) or '!' (exclamation mark). For example, a search for JFC NOT MFC returns only results that contain JFC and do not contain MFC.
() limits the scope and precedence of operators, in the same way as parentheses in mathematical expressions.

• Other simple and fairly common rules:
"" supports phrase search: the search engine treats the keywords inside the quotation marks, or their combination, as a single phrase. For example, to find information about search engines, enter "Search Engine" and the words are searched as one phrase. Without the quotation marks, the search returns every page that contains both Search and Engine, many of which are clearly not wanted.
Wildcards stand in for characters, much as in regular expressions. '*' represents any number of characters, and '?' represents exactly one arbitrary character at its position.

• Common advanced search rules:
NEAR requires the keywords to appear within a certain distance of each other. The keywords need not be adjacent; the smaller the interval, the higher the result is ranked. The interval can be controlled with NEAR/N, where N is a specific number meaning that the keywords are at most N words apart.

INTITLE restricts the search for a keyword to page titles.
INURL restricts the search for a keyword to URLs.
INSITE restricts the search to a given site.

A search request can therefore be described in terms of the following parts: keywords that must be included (INCLUDE), keywords that must not be included, keywords of which any one may appear (ANY), phrases or sentences that must appear in full (ALL), and the field, subject, region, location, and so on of the query.

The original query string from the GUI is processed as follows:
1. Perform natural language parsing and query the database; if a suitable answer is found, return it to the user.
2. Scan the query string according to the search operator rules and build a formatted query string, that is, determine which parts must all be included, which parts must not be included, and so on.
3. Read the stop words from the database and compare them with the formatted query string, eliminating keywords that clearly do not need to be searched.
4. Apply stemming to the keywords in the formatted query string. This step can be left to the member search engines to reduce processing complexity.
5. From the keyword information, derive the field, subject, region, location, and similar information of the query.

4.2.3 Member Search Engine Scheduler

When the program starts, it sets a default list of member search engines based on the user's previous search history and habits. Users who are not satisfied can set the member list themselves. In addition, the program has its own automatic scheduling mechanism: from the subject, field, region, and similar information of the user's query, together with the performance of the member search engines in previous searches (response time, number of results returned, user satisfaction, strength in particular fields, which advanced retrieval functions are supported, etc.), it generates a list of suitable member search engines.
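As an illustration of steps 2 and 3 of the query preprocessing in Section 4.2.2, the following Java sketch scans a query string into required keywords, excluded keywords, and phrases, and drops stop words. The names `ParsedQuery` and `QueryPreprocessor` and the tiny stop-word set are assumptions made for this example; the design itself reads stop words from the database. Only the space, '+', '-', '!', AND, NOT, and quotation-mark rules are handled here; OR, parentheses, wildcards, and NEAR are left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical holder for a formatted query: what must and must not appear. */
final class ParsedQuery {
    final List<String> required = new ArrayList<>();
    final List<String> excluded = new ArrayList<>();
    final List<String> phrases = new ArrayList<>();
}

/** Sketch of preprocessing steps 2 and 3: scan the operators, then drop stop words. */
final class QueryPreprocessor {
    // Illustrative stop words only; an assumption, not the paper's list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "to", "in"));

    static ParsedQuery parse(String query) {
        ParsedQuery parsed = new ParsedQuery();
        boolean negateNext = false;
        int i = 0;
        while (i < query.length()) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c) || c == '+') {
                i++;                                    // space and '+' both mean AND
            } else if (c == '-' || c == '!') {
                negateNext = true;                      // '-' and '!' both mean NOT
                i++;
            } else if (c == '"') {
                int close = query.indexOf('"', i + 1);
                if (close < 0) {
                    close = query.length();             // unterminated quote: take the rest
                }
                String phrase = query.substring(i + 1, close).trim();
                if (!phrase.isEmpty()) {
                    (negateNext ? parsed.excluded : parsed.phrases).add(phrase);
                }
                negateNext = false;
                i = Math.min(close + 1, query.length());
            } else {
                int end = i;
                while (end < query.length()
                        && !Character.isWhitespace(query.charAt(end))
                        && query.charAt(end) != '+'
                        && query.charAt(end) != '"') {
                    end++;
                }
                String word = query.substring(i, end);
                i = end;
                if (word.equalsIgnoreCase("AND")) {
                    continue;                           // explicit AND adds nothing
                }
                if (word.equalsIgnoreCase("NOT")) {
                    negateNext = true;
                    continue;
                }
                String keyword = word.toLowerCase(Locale.ROOT);
                if (!STOP_WORDS.contains(keyword)) {
                    (negateNext ? parsed.excluded : parsed.required).add(keyword);
                }
                negateNext = false;
            }
        }
        return parsed;
    }
}
```

For the example from the syntax description, `QueryPreprocessor.parse("JFC NOT MFC")` places `jfc` in `required` and `mfc` in `excluded`. Keywords are lower-cased for comparison, while phrases are kept as typed.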
Because information about the member search engines (especially the format of their query strings) changes often, hard-coding it in the main program of the meta search engine would clearly be unreasonable. We therefore use member search engine description files, written in XML as a formal description. A new member search engine can be added to the system simply by writing a description file for it in this form.

4.2.4 Query Distributor

The query distributor receives the list of member search engines produced by the member search engine scheduler, connects to the database, and reads the information for these engines, including host information, connection information, and the format of the query parameter string. Using this information it starts a number of threads at the same time, each querying the corresponding member search engine, and passes them the query produced by the query preprocessor. Much of this module's work is database access. Some of this information could be read from the database by the query agents themselves, but to reduce the number of database connections the work is concentrated here: the data is read once and used many times.

4.2.5 Query Agent

The query agent is the interface between the meta search engine and one specific member search engine. It first receives the formatted query string sent by the query distributor. It then requests its own query parameterization information from the query distributor and uses it to localize the formatted query string, converting it into the form its member search engine requires. One detail needs handling here: some member search engines do not support some of the advanced retrieval functions of this meta search engine, such as phrase retrieval or wildcards. In that case the agent removes the unsupported part of the request from the original query string.
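The localization step can be sketched in Java as follows. This is a minimal illustration under stated assumptions: `EngineCapabilities` and `QueryLocalizer` are hypothetical names, and the capability flags stand in for what would be read from the XML description file. For an unsupported phrase search the sketch drops only the quotation marks and keeps the words, and for unsupported wildcards it drops the wildcard characters; both are one possible reading of "removing" the unsupported part. The NEAR to AND conversion is the one mentioned in Section 3.

```java
/** Hypothetical capability flags, as they might be read from an engine's description file. */
final class EngineCapabilities {
    final boolean phrase;
    final boolean wildcard;
    final boolean near;

    EngineCapabilities(boolean phrase, boolean wildcard, boolean near) {
        this.phrase = phrase;
        this.wildcard = wildcard;
        this.near = near;
    }
}

/** Sketch of the query agent's localization: downgrade what the engine cannot handle. */
final class QueryLocalizer {
    static String localize(String query, EngineCapabilities caps) {
        String localized = query;
        if (!caps.near) {
            // NEAR and NEAR/N (upper case only) fall back to a plain AND.
            localized = localized.replaceAll("\\bNEAR(/\\d+)?\\b", "AND");
        }
        if (!caps.phrase) {
            // Assumption: keep the words of the phrase, drop only the quotation marks.
            localized = localized.replace("\"", "");
        }
        if (!caps.wildcard) {
            // Assumption: drop the wildcard characters and keep the literal stem.
            localized = localized.replaceAll("[*?]", "");
        }
        // Collapse whitespace left behind by the removals.
        return localized.trim().replaceAll("\\s+", " ");
    }
}
```

With all three capabilities switched off, `"search engine" NEAR/3 meta*` becomes `search engine AND meta`; with all three switched on, the query is passed through unchanged.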
Next, the localized query is sent to the member search engine, and the agent waits for the result.

Because some services may be unavailable, the agent can first use a program similar to the ping command to test whether the server is reachable and decide whether the query can be sent. It then opens the connection with a wait time threshold and gives up when the timeout expires. After receiving the response, it uses an HTML parser to extract the search results from the result page. Each result needs to include the following information: the link, the member search engine that returned it, its rank in that member search engine, the site of the target page, the description, the anchor text, and so on.

4.2.6 Integrated Processing Module

This is the core module of the meta search engine; the efficiency and quality of a meta search engine depend closely on how it is implemented. It consists of several functional modules (see Section 4.3 for implementation details):
• The result collection module gathers the results returned by the member search engines and presents the results of the first engine to respond to the user right away, to reduce the user's waiting time.
• The web page filter module removes duplicate links from the returned results according to the duplicate evaluation standard, and removes superfluous links according to the user's resource requirements, time limits, domain restrictions, and so on.
• The web page ranking module merges and orders the search results using a result fusion technique.
• The integrated processing module submits the final result to the GUI, which presents it to the user. It is also responsible for evaluating the search using the search evaluation mechanism and recording the search in the client's cookie.
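The interplay of the query distributor (4.2.4), the query agents (4.2.5), and result collection (4.2.6) can be sketched with the standard `java.util.concurrent` classes. The `QueryAgent` interface, the `QueryDistributor` class, and the `stub` helper are hypothetical names introduced for this example, and results are reduced to lists of URL strings. The sketch waits for all engines up to one shared timeout and skips engines that fail or are too slow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical query agent: one per member search engine. */
interface QueryAgent {
    String engineName();
    List<String> search(String query) throws Exception;   // returns result URLs
}

final class QueryDistributor {
    /** Queries all agents in parallel; engines that fail or exceed the timeout are skipped. */
    static Map<String, List<String>> dispatch(List<QueryAgent> agents, String query,
                                              long timeoutMillis) {
        Map<String, List<String>> resultsByEngine = new LinkedHashMap<>();
        if (agents.isEmpty()) {
            return resultsByEngine;
        }
        ExecutorService pool = Executors.newFixedThreadPool(agents.size());
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (QueryAgent agent : agents) {
                tasks.add(() -> agent.search(query));
            }
            // invokeAll returns futures in task order and cancels tasks still running at timeout.
            List<Future<List<String>>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (int i = 0; i < futures.size(); i++) {
                Future<List<String>> future = futures.get(i);
                if (future.isCancelled()) {
                    continue;                             // this engine timed out
                }
                try {
                    resultsByEngine.put(agents.get(i).engineName(), future.get());
                } catch (ExecutionException e) {
                    // this engine failed; leave it out of the result
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();           // preserve the interrupt status
        } finally {
            pool.shutdownNow();
        }
        return resultsByEngine;
    }

    /** Stub agent for demonstration only: sleeps, then returns fixed URLs. */
    static QueryAgent stub(String name, long delayMillis, String... urls) {
        return new QueryAgent() {
            public String engineName() { return name; }
            public List<String> search(String query) throws InterruptedException {
                Thread.sleep(delayMillis);
                return Arrays.asList(urls);
            }
        };
    }
}
```

Because `invokeAll` returns only when every task has finished or the timeout has expired, this sketch does not show results from the fastest engine early. To present the first engine's results immediately, as the result collection module describes, an `ExecutorCompletionService` could be used to consume results in completion order instead.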
4.2.7 Database

The database here is a general concept that covers both an actual database and configuration files, settings, and the like that store the data the system needs. This information includes: information for natural language questions, member search engine information (host information, function information, parameterization information, retrieval performance information), user information (search history, personalized settings, personal information, etc.), stop words, and vocabularies (synonyms, antonyms, translation information, field information, subject information, etc.). In a concrete implementation, some of this information can be kept on the client to reduce storage pressure on the server.

4.3 Key Technologies in Implementation

• Duplicate Evaluation Standard [12]

The hyperlink, anchor, description, and similar fields of search results can be used to decide whether two results are duplicates. We judge according to the following strategy:
1. First check whether the hyperlinks of the two results are identical; if so, they are considered the same result.
2. Compare the similarity of the URLs; if the host IP address, path, and file name are all identical, the two are also considered the same result.
3. Compare metadata such as title, author, abstract, and size; results whose similarity exceeds a threshold are considered the same. To keep the system responsive, this step may be left unimplemented.

• Result Fusion Technology [6, 13, 14]

As the working principle of the meta search engine shows, result fusion is critical, and many methods have been proposed. The simpler ones are to present the results of the fastest search engine to the user, or to display the results returned by each search engine without any processing. More complex methods merge the results according to some strategy. In [13], Zhang Weifeng et al. give four merging algorithms, and [14] gives four typical merging algorithms for different situations; please refer to those papers for details. In essence, result fusion is the process of re-ordering the retrieved results.
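A small Java sketch can make the re-ordering concrete. It keys results by a normalized URL, in the spirit of steps 1 and 2 of the duplicate evaluation standard above, and accumulates for each result a sum of rank score times engine weight, which is the form of the weight P = Σ (R_i × W_i) proposed in the paragraph that follows. Several points are assumptions of this sketch rather than statements from the paper: the class name `ResultMerger`, comparing host names instead of resolved IP addresses, ignoring scheme and port, and taking the rank score as the reciprocal of the 1-based position so that better ranks contribute more.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class ResultMerger {
    /** Reduces a URL to host + path + query so trivially different spellings compare equal. */
    static String normalize(String url) {
        try {
            URI uri = new URI(url.trim()).normalize();    // also resolves "." and ".." segments
            String host = (uri.getHost() == null) ? "" : uri.getHost().toLowerCase(Locale.ROOT);
            String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
            String query = (uri.getQuery() == null) ? "" : "?" + uri.getQuery();
            return host + path + query;
        } catch (URISyntaxException e) {
            return url.trim();                            // fall back to exact comparison
        }
    }

    /**
     * Merges ranked URL lists from several engines into one list ordered by
     * score = sum over engines of (1 / position) * engineWeight.
     */
    static List<String> merge(Map<String, List<String>> rankedByEngine,
                              Map<String, Double> engineWeight) {
        Map<String, Double> score = new HashMap<>();
        Map<String, String> firstSeenUrl = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : rankedByEngine.entrySet()) {
            double weight = engineWeight.getOrDefault(entry.getKey(), 1.0);
            List<String> urls = entry.getValue();
            Set<String> seenInThisEngine = new HashSet<>();
            for (int pos = 0; pos < urls.size(); pos++) {
                String key = normalize(urls.get(pos));
                if (!seenInThisEngine.add(key)) {
                    continue;                             // count each engine once per result
                }
                firstSeenUrl.putIfAbsent(key, urls.get(pos));
                score.merge(key, weight / (pos + 1), Double::sum);
            }
        }
        List<String> keys = new ArrayList<>(firstSeenUrl.keySet());
        // List.sort is stable, so results with equal scores keep first-seen order.
        keys.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
        List<String> merged = new ArrayList<>();
        for (String key : keys) {
            merged.add(firstSeenUrl.get(key));
        }
        return merged;
    }
}
```

A result returned by several engines collects one term per engine, so it tends to rise above results found by a single engine, which matches the first of the three factors the weight is meant to capture. Step 3 of the duplicate standard (metadata similarity) is not covered.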

We start from the following view: the importance of a search result depends on three things: how many member search engines retrieved it, its sort position in each of those member search engines, and the performance evaluation of the member search engines that retrieved it. If m member search engines retrieved the result, its sort position in the i-th of them is R_i, and the performance evaluation of the i-th member search engine is W_i, then the final weight P of the result is:

P = Σ (R_i × W_i),  i = 1 … m

According to the user's settings, the search results can be processed further: check whether the target page exists in order to eliminate dead links; retrieve the target page and analyze its text to give a better relevance judgment and to provide a page snapshot; and classify the processed results by field, subject, site, and so on.

• Effective Information Extraction Technology

After the results of a member search engine have been received, extracting the required search results from the result page is an important step. Because member search engines use different techniques and their result pages have very different structures, extracting the results correctly is a tricky problem. We rely on the following observation: result pages are generated dynamically, so the results we need must be wrapped between a head and a tail, and the content between the head and the tail is what we need. The current method is manual: a person finds the head and the tail and records them in the configuration information, and the query agent extracts the required results based on that information. Another approach is based on statistical methods and artificial intelligence techniques: the system learns by itself, with human intervention where needed, to build the result extraction information for each member search engine.
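The head-and-tail idea can be sketched in a few lines of Java. The class name `ResultExtractor` and the marker strings in the usage note are assumptions for illustration; in the design the markers would come from the configuration information of each member search engine. The sketch pulls links out of the wrapped region with a regular expression, which is a simplification: the design calls for an HTML parser, and a regular expression will miss links whose markup deviates from the simple double-quoted `href` form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ResultExtractor {
    // Simplified link pattern: an <a> tag with a double-quoted href attribute.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*href=\"([^\"]+)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text between the first head marker and the next tail marker, or "" if absent. */
    static String between(String page, String head, String tail) {
        int start = page.indexOf(head);
        if (start < 0) {
            return "";
        }
        start += head.length();
        int end = page.indexOf(tail, start);
        return (end < 0) ? "" : page.substring(start, end);
    }

    /** Extracts {url, anchor text} pairs from the region wrapped by head and tail. */
    static List<String[]> extractLinks(String page, String head, String tail) {
        List<String[]> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(between(page, head, tail));
        while (matcher.find()) {
            links.add(new String[] { matcher.group(1), matcher.group(2).trim() });
        }
        return links;
    }
}
```

If a result page wraps its hits between hypothetical markers such as `<!--results-->` and `<!--/results-->`, then `ResultExtractor.extractLinks(page, "<!--results-->", "<!--/results-->")` returns only the links inside that region and ignores navigation links elsewhere on the page.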
Google now provides Web Services through which the corresponding information (retrieval results, response time, number of results, document relevance, etc.) can be obtained directly, but only registered users can use them without restriction. This may be a better solution, because the independent search service providers know their own systems and techniques best and can supply the result information we need more directly.

• Member Search Engine Scheduling Mechanism and Performance Evaluation Mechanism [3, 12]

Member search engine scheduling is the technical core of a meta search engine: it decides which member search engines a query should be sent to in order to get good search results. Each member search engine under the meta search engine has its own text database made up of a collection of documents. For each query, scheduling produces the list of member search engines most likely to contain relevant documents, which is critical to the efficiency of the meta search engine. Four kinds of methods are mentioned in [12]: naive algorithms, qualitative methods, quantitative methods, and learning-based methods.

To schedule member search engines automatically, we use user feedback to implement a learning-based scheduling mechanism, which requires a performance evaluation mechanism for member search engines. The evaluation is based on data recorded over many retrievals, including response time and number of results returned. The most important part is document relevance, which can be submitted by the user (this is the best way, but users generally only use the system and do not give feedback) or obtained by tracking the links the user clicks.

Evaluation is hierarchical: evaluation of a single search result, evaluation of one retrieval activity, evaluation by search term, evaluation by search field, and evaluation of overall retrieval performance.
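One way to store such layered evaluations and use them for scheduling is sketched below in Java. The sketch keeps a running mean of relevance feedback per engine at three levels (search term, field, overall) and selects engines by falling back from term to field to overall, which is the order of the selection rule the text gives next. The class name `EngineEvaluator`, the scope key prefixes, and the use of a simple mean of feedback values in [0, 1] (for example 1.0 for a clicked result) are assumptions of this example; response time and result counts are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EngineEvaluator {
    /** Running mean of relevance feedback for one engine in one scope. */
    private static final class Stat {
        double sum;
        int count;
        double mean() { return (count == 0) ? 0.0 : sum / count; }
    }

    // scope -> engine -> stat; a scope is "kw:<keyword>", "field:<field>", or "overall"
    private final Map<String, Map<String, Stat>> stats = new HashMap<>();

    /** Records one feedback signal for an engine at all three levels. */
    void record(String engine, String keyword, String field, double relevance) {
        add("kw:" + keyword, engine, relevance);
        add("field:" + field, engine, relevance);
        add("overall", engine, relevance);
    }

    private void add(String scope, String engine, double relevance) {
        Stat stat = stats.computeIfAbsent(scope, k -> new HashMap<>())
                         .computeIfAbsent(engine, k -> new Stat());
        stat.sum += relevance;
        stat.count++;
    }

    /** Picks up to n engines using keyword data if present, else field data, else overall data. */
    List<String> select(String keyword, String field, int n) {
        String[] scopes = { "kw:" + keyword, "field:" + field, "overall" };
        for (String scope : scopes) {
            Map<String, Stat> byEngine = stats.get(scope);
            if (byEngine == null) {
                continue;                                 // no data at this level, fall back
            }
            List<String> engines = new ArrayList<>(byEngine.keySet());
            engines.sort((a, b) ->
                    Double.compare(byEngine.get(b).mean(), byEngine.get(a).mean()));
            return new ArrayList<>(engines.subList(0, Math.min(n, engines.size())));
        }
        return new ArrayList<>();                         // nothing recorded yet
    }
}
```

The means computed here could also serve as the per-engine performance evaluation W_i used in result fusion, although the paper does not say how W_i is derived, so that link is only a suggestion.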

For a search request given by the user: if the member search engines have evaluation data for the requested keywords, the engines with the best evaluation for those keywords are selected; otherwise, the system determines which field the request belongs to, and if the member search engines have evaluation data for that field, the best engines in that field are selected; otherwise, the engines with the best overall retrieval performance are selected. If the user has asked for the fastest response, the largest number of results, or the like, the engines that rank highest on the corresponding indicator are selected.

5. Application Analysis

Search engines also have a role to play in e-commerce. Many websites now conduct business through them, which is one of the revenue channels of some search service providers. In addition, from a user's registration information, search history, the fields their search keywords belong to, search habits, access records, and so on, one can discover the user's latent desire to buy and the goods they are interested in. E-commerce sites can use this to find potential customers and, for example, regularly send them lists of updated goods.

6. Summary

The Internet is a huge source of information that will keep expanding for a long time, and people increasingly turn to the web to find the information they need. To make the rich resources of the Internet easier to use, people have drawn on research results from related disciplines to develop a variety of tools, of which the search engine is the most commonly used. Building an efficient meta search engine on top of existing independent search engines extends their processing capabilities, improves recall, and may also improve precision. However, the autonomy of each member search engine makes integration difficult. The difficulties come mainly from differences in retrieval interfaces, document indexing methods, relevance functions, and query parameters, and from the uneven strength of retrieval functions.

Our system absorbs the strengths of several successful systems and also has characteristics of its own: it defines its own search syntax; it has an evaluation mechanism for search engine results; it has an automatic scheduling mechanism for member search engines; it uses search engine description files, which makes the system highly scalable; it gives its own result fusion algorithm; and it can use user feedback to learn and adjust autonomously, which makes the system adaptive.

7. Subsequent Work

We have chosen Java as the programming language for implementing the system. We have analyzed the system structure using object-oriented software engineering methods and have implemented some of the classes in Java, with some of the interfaces defined. The next step is to implement the entire system and to further optimize and improve some of the algorithms. Further functional modules can also be added as needed to make the system more powerful and more robust.

