A Survey of Search Engine Development
Li Rui (Lirui@nic.ac.cn)
(Institute of Computational Technology, Chinese Academy of Sciences, Beijing 100080, China)
Abstract: This article briefly describes the origin and development of search engines and introduces the state of research in China and abroad. It discusses their classification, performance evaluation, and key technologies, and on that basis ventures some predictions about development trends.
Key words: search engine; web mining; information retrieval
The Internet has grown continuously since its birth. Its content keeps getting richer, and the network as a whole has gradually accumulated into an unprecedented, extremely large information repository. The Internet plays an increasingly important role in people's daily lives and work, and people obtain more and more of their information from it. In the early days of the Internet there were relatively few websites and few web pages, so finding information was fairly easy. With the explosive growth of the Internet, however, an ordinary user looking for a specific piece of information is like someone fishing for a needle in the sea. Users get lost in the ocean of information, which has produced the strange phenomenon of being "information-rich but knowledge-poor". The search engine is the technology that emerged to solve this problem of getting lost in information.
A search engine (Search Engine, abbreviated SE) is an information processing system that collects and discovers information on the Internet according to a certain strategy, then understands, extracts, organizes, and processes it, and provides users with a retrieval service, thereby serving the purpose of information navigation. It generally consists of three parts: information collection, information organization, and user query. From the user's point of view, it is a tool that helps people perform information retrieval.
1. The ancestor of the search engine in the modern sense is Archie, invented in 1990 by Alan Emtage and others at McGill University in Montreal. Archie was the first program to index the files on anonymous FTP sites on the Internet, but it was not a real search engine. Archie was a searchable list of FTP file names: the user had to enter an exact file name, and Archie would report which FTP address the file could be downloaded from.
Because a Robot program built to retrieve information crawls across the network like a spider, the Robot program of a search engine is also called a Spider program. The world's first Spider program was the World Wide Web Wanderer, written by Matthew Gray of MIT to track the growth of the Internet. At first it only counted the number of servers on the Internet; later it was extended to capture URLs.
The first real search engine appeared in July 1994, when Michael Mauldin combined John Leavitt's spider program with his own index program and created Lycos, which is now familiar to everyone. In April of the same year, two Stanford doctoral students, David Filo and the Chinese American Jerry Yang (Yang Zhiyuan), founded the super directory index Yahoo and succeeded in making the concept of the search engine widely known. Yahoo is also known as the first Active search engine. With this, the development of search engines entered its golden age.
In September 1998, two other Stanford doctoral students, Larry Page and Sergey Brin, successfully developed a new generation of search engine, Google, with funding from venture capital firms. It finds the required information faster and more accurately than Yahoo and is regarded as the representative of the second generation of search engines.
There are now thousands of sites on the Internet that provide retrieval services. They differ in scope, content, and retrieval method, and each has its own characteristics.
The better-known ones are Google, Yahoo, Altavista, Dogpile, Baidu, and others. Search engine research and development is currently very active. The major search engine companies are investing heavily in their systems, new and distinctive search engine products keep emerging, and search has become one of the industries of the information field. It is an integrated and challenging subject that draws on information retrieval, artificial intelligence, databases, data mining, and natural language understanding. Because search engines have a very large number of users, they have also given rise to many business opportunities and have good economic value. According to the 2003 China Search Engine Research Report, China's search engine market reached RMB 520 million in 2003, an increase of 127% over 2002, which shows the strong growth of the search engine market. Yahoo has said that the global search market will grow from $3 billion this year to $11 billion over the next five years. As a bridge between people and the Internet, the search engine is receiving more and more attention from the public, and it has attracted close interest from computer science, the information industry, and business around the world.
2. Classification
According to the technical principles they use, search engines can be divided into the following three categories:
2.1. Catalog (directory) search engine: Information is collected manually or semi-automatically. After editors review the information, they write a summary by hand and place it in a classification framework defined in advance. Most of the information is about websites, and the engine provides directory browsing and direct retrieval services. The advantage of this type of search engine is that the information is accurate and the navigation quality is high. The disadvantages are that it requires human intervention, the maintenance workload is large, the amount of information is small, and updates are not timely. Typical representatives are Yahoo (which now also uses Robot technology), Looksmart, Open Directory, etc.
2.2. Robot-based search engine: A Robot-based search engine provides more retrieval coverage and is sometimes called a full-text search engine. A Robot collects information from the Internet to build an index database. The engine retrieves the records that match the user's query conditions and returns the results to the user in a certain order. International representatives of this type are Google, Fast / AllTheweb, Altavista, InkTomi, Teoma, Wisenut, etc.; representatives in China are Baidu, "Tianwang", OpenFind, etc.
2.3. Meta search engine: This type of search engine has no database of its own. It submits the user's query to several search engines at the same time, then deduplicates and re-sorts the returned results and presents them to the user as its own result. The service it provides is a full-text search of web pages. The advantage of such search engines is that they return more information and the results are more complete; the disadvantage is that the user has to do more filtering. Well-known meta search engines include InfoSpace, Dogpile, and Vivisimo, and a representative Chinese meta search engine is "Search Search Engine".
In addition to the three major classes above, there are several non-mainstream forms: aggregated search engines, portal search engines (AOL Search, MSN Search, etc.), free-for-all link lists (Free For All Links, FFA), etc.
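The merging step of the meta search engine described in 2.3 can be sketched as follows. This is only a minimal illustration, not the method of any particular engine: the function name `merge_results`, the example URLs, and the choice of summing reciprocal ranks as the merged score are all assumptions made for this sketch. The result lists stand in for what the underlying engines would return.

```python
def merge_results(result_lists, limit=10):
    """Merge ranked URL lists from several engines into one ranked list.

    Assumed scoring rule: each engine contributes 1/position for a URL,
    so URLs ranked highly by several engines rise to the top.
    """
    scores = {}
    for results in result_lists:
        seen = set()  # ignore duplicates inside a single engine's list
        for position, url in enumerate(results, start=1):
            if url in seen:
                continue
            seen.add(url)
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return ranked[:limit]


# Hypothetical result lists returned by two underlying engines.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "w.example"]
print(merge_results([engine_a, engine_b]))
```

Here "y.example" is returned by both engines, so it is listed only once and moves ahead of results that only one engine returned.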
3. Performance Index
A search engine is a tool for searching Internet information, so its evaluation can borrow the quality criteria of traditional literature retrieval tools, combined with the characteristics of search engines in organizing and processing information and providing retrieval services. Since its purpose is to track network information and serve the users of that information, its evaluation should be based on the user's interests. Put simply, a search engine that satisfies most network users is a good search engine. We can usually measure the performance of a search engine in the following respects:
3.1. Recall: The ratio of the number of relevant documents in the search results to the total number of relevant documents on the network. Because the search results are the set of documents matched against the engine's index database, this indicator also reflects how much of the network's information the search engine covers.
3.2. Precision: The degree to which the results provided by the search engine match the user's information need, that is, the ratio of the number of relevant documents in the search results to the number of all documents returned by the search engine.
3.3. Search speed: Also known as response time. Retrieval speed generally depends on two factors, the network speed determined by bandwidth and the speed of the search engine itself. An ideal retrieval speed can only be ensured when both have reliable technical support.
For a retrieval system, it is difficult to achieve the best recall and the best precision at once: when recall is high, precision tends to be low, and when precision is high, recall tends to be low. For search engines, because no system can cover all network resources, recall is difficult to calculate.
Current search engine systems are therefore mainly concerned with precision. The measures above have shortcomings, because other factors are not taken into account. Reference [7] builds an evaluation model using a user-oriented system analysis method, and reference [8] proposes some good measurement methods. Many factors affect a search engine system. The most important is the information retrieval model, which includes the representation of documents and queries, the matching strategy for evaluating the relevance of documents to the user's query, the ordering of query results, and the mechanism for user relevance feedback. In addition, we can evaluate a search engine from the perspective of its functional requirements. In the author's view, an ideal search engine system should meet the following functional requirements:
1 Cover as many Internet resources as possible, with a resource update cycle that is not too long and real-time updates for some special information. This is one of the basic guarantees.
2 Offer as many options as possible, such as resource type (website, web page, news, software, ftp, mp3, flash, image, film, etc.), wait time control, control of the number of returned results, selection of the result time period, selection of filtering functions, and selection of how results are displayed.
3 Provide powerful query processing (such as support for logical matching, phrase search, natural language search, etc.).
4 Give a detailed and comprehensive description of each search result (such as web page name, URL, abstract, and how well the result matches the user's query).
5 Support retrieval in multiple languages, such as Chinese and English search.
6 Classify the results automatically, for example by domain name, country, resource type, region, etc.
7 Provide personalized services for different users.
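The recall and precision ratios defined in 3.1 and 3.2 can be written as a short calculation. The sketch below is illustrative only: the function name `precision_recall` and the document identifiers are hypothetical, and it assumes the full set of relevant documents is known, which, as noted above, is rarely the case for a real search engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: documents returned by the engine
    relevant:  all relevant documents that exist (assumed to be known)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(
    ["d1", "d2", "d3", "d9"],
    ["d1", "d2", "d3", "d4", "d5", "d6"],
)
print(p, r)  # 0.75 0.5
```

Returning more documents can only keep recall the same or raise it, but each additional irrelevant document lowers precision, which is the trade-off described in section 3.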
4. Principles and Key Technologies
Today's mainstream search engines are Robot-based web search systems, so this section mainly describes that technology. The working principle of a search engine can be seen as three steps: grab web pages from the Internet → build an index database → search in the index database.
1 Discover and collect web information from the Internet. A high-performance Spider program automatically searches for information on the Internet. A typical "web spider" views a page and extracts the relevant information from it, then follows all the URLs in that page to the pages they point to, and repeats this process until all reachable web pages have been collected. A search engine's Spider usually revisits all web pages at regular intervals and updates the web index database to reflect changes in page content.
2 Organize the collected information and build an index database. The engine analyzes the collected web pages and extracts the relevant information (including the URL of the page, its encoding type, the keywords contained in the page content, keyword positions, generation time, size, link relationships with other pages, etc.). It then performs a large amount of computation according to a certain relevance algorithm to obtain the relevance (or importance) of each web page to each keyword in the page text and in its hyperlinks, and uses this information to build the web index database.
3 Process user searches. When the user enters a keyword, the engine finds all web pages that match the keyword in the web index database. Because the relevance of each page to the keyword has already been computed, the engine only needs to sort the pages by relevance value; the higher the relevance, the higher the ranking. Finally, the page generation system assembles the link addresses of the search results and summaries of the page content and returns them to the user.
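The three steps above (grab pages, build an index, search the index) can be illustrated with a very small in-memory sketch. Everything here is an assumption made for illustration: `TOY_WEB` is a hypothetical three-page "web" used instead of real network access, and the relevance score is simply the summed term frequency, which is far simpler than the relevance algorithms real engines use.

```python
from collections import defaultdict, deque

# Hypothetical toy "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/": ("search engine history", ["a.example/robots", "b.example/"]),
    "a.example/robots": ("robot spider crawler", ["a.example/"]),
    "b.example/": ("search engine index spider", []),
}


def crawl(seed, fetch):
    """Step 1: breadth-first crawl. fetch(url) returns (text, links) or None."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        result = fetch(url)
        if result is None:  # dead link: skip it
            continue
        text, links = result
        pages[url] = text
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """Step 2: inverted index, keyword -> {url: term frequency}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """Step 3: URLs containing every query word, best score first."""
    words = query.lower().split()
    if not words:
        return []
    postings = [index.get(w, {}) for w in words]
    common = set(postings[0]).intersection(*postings[1:])

    def score(url):
        return sum(p[url] for p in postings)

    return sorted(common, key=lambda u: (-score(u), u))


pages = crawl("a.example/", TOY_WEB.get)
index = build_index(pages)
print(search(index, "search engine"))
```

The index is built once, ahead of time, so answering a query is a lookup and a sort rather than a scan of every page. This is why step 3 can be fast even when step 2 is computationally expensive.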
From this brief description of how a search engine works, it is not hard to see its basic components: the searcher, the indexer, the retriever, and the user interface. The related key technologies are as follows:
1 Robot technology. A network robot (Robot, often also called a Spider or Crawler) can be used on the Internet for data statistics, data search, link maintenance, and so on. In a search engine the robot mainly performs two functions: obtaining the links on the Internet and reading the content that each link points to. The Robot starts from a predefined list of URLs. After visiting a web page it analyzes the page, extracts the new URLs, adds them to the access list, and traverses the Web in this way. Whether the Robot is designed well affects how efficiently it traverses the Web and, in turn, the quality of the search database. Robots commonly use distributed and parallel computing techniques to speed up the discovery and updating of information.
2 Indexing technology. The document information collected by the Robot is used to build the index database. How the index is built has a large impact on the search engine: a good index improves both the efficiency of the system and the quality of the search results. Indexing can be said to be a core technology of search engines and a concentrated expression of search technology. Text analysis, the main supporting technology of the indexer, is especially important. Text analysis includes index term extraction, automatic summarization, automatic classification, text clustering, etc. It is based mainly on the vocabulary, hypertext tags, and hyperlinks contained in the text.
3 Information retrieval and sorting technology. The ultimate goal of a search is to obtain the required information, but finding it among a large amount of information is tedious, and satisfactory results are hard to achieve.
Even in real life, when you face a large amount of material, you will often find that not all of the information is useful. Current search engines commonly rely on relevance-based lookup. Commonly used relevance methods include similarity function methods, classification (clustering) methods, etc. Two methods have been especially influential in this research area, the PageRank method and the Authority and Hub method. Both use the links between pages to determine the importance of a document.
4 User interface design. The role of the user interface is to accept the user's query, display the query results, and provide a relevance feedback mechanism. Its main purpose is to make the search engine convenient to use, so that users can obtain useful information from it efficiently, promptly, and in multiple ways. The design of the user interface should apply the theory and methods of human-computer interaction so that it fully fits human habits of thought. Search engines generally provide a basic search interface and an advanced retrieval interface. The basic interface only offers a text box for entering keywords. Some engines accept complex query expressions there, but this is only suitable for expert searchers. The advanced interface lets users restrict queries by logical operations (AND, OR, NOT), proximity relations (adjacent, NEAR), domain name, location, content, time, length, and more.
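The link-based ranking idea mentioned under item 3 above (the PageRank method) can be sketched with a simple power iteration. This is a textbook-style illustration under stated assumptions, not the implementation of any production engine: the function name `pagerank`, the damping factor 0.85, the fixed number of iterations, the treatment of pages without outgoing links, and the four-page link graph are all choices made for this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to.

    Returns page -> rank, where the ranks sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small base share regardless of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumed handling of a page with no outgoing links:
                # spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "c" is linked to by three pages, "d" by none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, value in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(value, 3))
```

In this graph the page "c", which receives the most incoming links, ends up with the highest rank, and "d", which no page links to, ends up with the lowest. This matches the idea that links to a page are used to judge its importance.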
5. Development Trend
After years of development, search engines have become more and more powerful, and the services they provide are increasingly comprehensive. According to researchers' statistics, there are now thousands of search engines on the Internet, and Chinese search engines alone number in the hundreds; it can be called a hundred flowers in bloom. However, with the sharp increase in WWW information, current search engines still suffer from unfriendly interfaces, long response times, too many dead links, and too much duplicate and irrelevant information, so it is difficult for them to meet people's varied information needs. To adapt to the needs of different users, search engines will develop in the direction of intelligence, personalization, precision, specialization, cross-language retrieval, multimedia search, and so on.
5.1. Intelligent search engine: This is the future direction of search engine development and is known as the "third generation search engine". In this respect China Search is in front and has already launched such a search engine for user trial. The intelligence of a search engine is embodied in two aspects: first, understanding the search request, and second, analyzing the content of web pages. It uses intelligent agent technology to reason about the user's query plan, intent, and interests, automatically collects and filters information, and automatically submits the information of interest to the user. This also includes features such as diversified services, personalization, precise results, and cross-language search.
5.2. Pay attention to the precision of query results and improve the effectiveness of search: There are currently the following ways to address the problem of too many query results:
a) Build content-based search engines. Retrieval is based not on the literal form of the words but on an attempt to understand the user's request, selecting documents that meet the user's requirements according to the content of the documents.
That is, the engine looks for the real intent that the user's query statement does not state explicitly, and in this way achieves intelligent natural-language querying. The currently mature solution is to rely on Chinese information processing technologies such as semantic networks, Chinese word segmentation, syntactic analysis, and synonym handling to capture the user's needs as fully as possible.
b) Translate the user's question into a question already known to the system and then answer the known question, which reduces the dependence on natural language understanding.
c) Use text classification technology to classify the results and visualization technology to display the classification structure, so that users can browse only the categories they are interested in.
d) Cluster the results by site or by content to reduce the total amount of information.
e) Let the user refine the returned results; a secondary query within the results is a very effective means.
5.3. Implement cross-language search: The search engine retrieves from databases in multiple languages and returns documents in any language that can answer the user's question. If it is equipped with machine translation, the returned results can be displayed in the language the user is familiar with. This technology is still at the preliminary research stage, and the difficulty lies in the uncertainty of language in expression and semantics, but it is indeed a direction of development.
5.4. Provide support for natural language retrieval: To improve the search engine's understanding of the user, there must be a good language for posing search questions. To overcome the shortcomings of keyword retrieval and directory queries, natural language questions are now being adopted. For example, Google has Google Answer, dedicated to answering questions, and Microsoft has Answerbot. Users can enter a simple question, such as "How Can Kill Virus of Computer?". After analyzing the question, the search engine either gives the answer directly or guides the user to choose from several candidate questions.
Natural language has two advantages: it makes network communication more user-friendly, and it makes queries more convenient, direct, and effective. In the example above, many people using keyword queries would search for the single word "virus". The results would inevitably include all kinds of viruses and information on how viruses arise, producing a great deal of irrelevant information. With the question "How can Kill Virus of Computer?", the search engine can give the user information targeted at the question and thus improve retrieval efficiency.
5.5. Multimedia search engine: Network resources are rich and varied and come in many types. The information users need is not always in the form of web pages, so from the user's perspective search engines will inevitably be required to cover more kinds of network resources. Many search engines already provide search over web pages, news, pictures, music, and other resources. The scope can of course be widened further to newsgroups, software, ftp, flash, papers, etc.
5.6. Specialized search engine: This kind of engine builds a collection of information for a certain industry, subject, topic, or region. It is very practical; examples include business inquiry, enterprise inquiry, personal name inquiry, email address query, and recruitment information query. This kind of professional search engine is one of the directions for the future.
5.7. Desktop search engine: This kind of engine is actually a piece of software. After downloading it and placing it on the computer desktop, users no longer have to open the browser constantly and can carry out the whole search process directly through it. It can also search the local machine, the local area network, and the Internet. It goes completely beyond the traditional search mode, bypasses the browser, and truly realizes search everywhere. Moving search out of the browser is a development trend: Google, Yahoo, and others plan to launch their own desktop search software, and Microsoft also intends to build search into the desktop. In China, the "Network Pig" software has already been launched.
There are also other technical developments, such as meta search engines, mobile agents and XML technology, voice search technologies, etc. With the continuous development of technology, search engines will become a good helper for people.
6. References
[1] Li Xiaoming, Liu Jianguo. Search Engine Technology and Trend.
[2] Search Engine Expansion. Search Engine Development History. http://www.se-express.com/about/about.htm
[3] Blog China. Today - Search Engine Development History. http://www.blogchina.com/new/source/130.html
[4] Zhuang Yi, Li Haohong. Engine Technical Status and Development Trend. Computer Times, 2002, No. 8.
[5] Wang Hongmei, Zhu Hongxiu, Wang Ling. Chinese Search Engine Future Development Discussion. Journal of Northeast Electric Power Institute, 2001, Vol. 21, No. 4.
[6] Zhang Xiaogang, Li Minghu. Intelligent Search Engine Technology Research and Development. Computer Engineering and Application, 2001, No. 24.
[7] Ma Wei, Li Heng. Search Engine Performance Evaluation. New Century Library, 2003, No. 6.
[8] Feng Yuanjie, Liu Zhengchun, Wang Jianyi. Search Engine Main Performance Evaluation Index System Research. Information Journal, 2004, Vol. 23, No. 1.
[9] Lingmeixiu. Main Problems of Search Engines and Their Development Trend. College Library Work, 2001, Vol. 21, No. 5.
[10] Cai Ruiping. Search Function Characteristics and Skills of Search Engines. April 2003.
[11] Peng Honghui, Lin Yuyue. Internet Search Engine and Meta Search Engine. Computer Science, 2002, Vol. 29, No. 9.
[12] Li Yuanming. Characterization Analysis of the Search Engine Technology and Its Future Development Trend. Information Retrieval, 2002, No. 7.
[13] Lu Shengguang, Ding Fangzhong. Search Engine Usage Technical Review and Development Trend Discussion. Guangdong Communication Technology, 2002, Vol. 19, No. 5.