Hu Zhaohui (Department of Computer Science, Zhejiang University)  Wang Haiwei (Ningbo Haifeng Plastic Co., Ltd.)

---- I. Introduction

---- With the rapid development of the Internet, people rely more and more on the network to find the information they need. But the sheer volume of information on the Internet leaves us, as the saying goes, "rich in data, poor in information". How to find the information we need effectively has therefore become a key problem, and search engines were born to solve it.

---- There are now many search engines on the Internet. The better known ones include AltaVista, Yahoo, Infoseek, MetaCrawler and SavvySearch, and many Chinese search engines have also been established, such as Sohu, Sina and Northern Net. Of course, because the latter were established only recently, their search quality is still being improved.

---- AltaVista is a very fast search engine; its powerful hardware configuration enables it to handle complex queries. It is mainly keyword-based, and it covers both the Web and Usenet. It supports the Boolean operators "AND", "OR" and "NOT" as well as the proximity operator "NEAR", and it allows wildcards and "backward" searches (for example, you can find all the Web pages that link to a given page). You can decide whether to give extra weight to particular search terms and where to look for them within the document. The advantage of being able to query phrases rather than individual words is obvious: suppose we want to find the phrase "to be or not to be". If it is simply broken into separate words, those words are all stop words and the query returns nothing; treated as a whole phrase, however, it easily returns results such as pages about Hamlet or Shakespeare. The score a page receives depends on how many of the search phrases it contains, where they appear in the document, and how far apart they are within the document. The search results can also be translated into other languages.

---- Excite is called an "intelligent" search engine because it builds a concept-based index; of course, its so-called intelligence rests on a flexible application of probability and statistics. It can index by concept and by keyword at the same time, and it indexes the Web, Usenet and classified ads. It supports Boolean operators such as "AND", "OR" and "NOT", and the symbols "+" and "-" can also be used. Its disadvantage is that the returned results do not indicate the size and format of the web pages.

---- Infoseek is a simple but powerful engine. One of its advantages is an extensible classification for topic searches: your search phrases can be related to similar topic categories, and those topic phrases are automatically added to your query, giving your search a sharper topical focus. It also supports queries on images. It can roam the Web, Usenet, Usenet FAQs, and more. Boolean operators are not supported, but the symbols "+" and "-" (equivalent to "AND" and "NOT") can be used.

---- Yahoo cannot really be called a search engine; rather, it provides a hierarchical subject index that lets you move from a general topic down to a specific one, and it organizes and classifies the Web effectively.
For example, suppose you want to build a web page but do not know how to do it. To find information about building web pages on Yahoo, you can choose the topic "Computers and Internet"; under it you will find sub-topics such as web page authoring, CGI programming, Java, HTML and web design. Select the sub-topic you are looking for, and you finally get links to all the pages related to it. That is, if you are quite clear about which topic the information you want belongs to, this kind of directory lookup is better and more accurate than a general keyword search engine. You can also search Yahoo's own index, but then you are not in fact searching the entire Web; Yahoo does, however, provide options that let you pass the search on to other engines such as AltaVista. Note that Yahoo has actually classified and organized only a small part of the Web, and its coverage is not very good.

---- The basic principle of a search engine is to crawl across web pages with a network robot, discover new pages, bring them back into a local database, and then answer users' query requests by consulting that local database. Yahoo, for example, discovers about five million new web pages every day.

---- Search engines are generally implemented in one of two ways. The first is to build the index manually; Yahoo's pages, for instance, are classified by hand. Its drawbacks are that the coverage of the Web is relatively low and the freshness of the information cannot be guaranteed; in addition, queries are matched against keywords and against the descriptions and titles written for the pages, not against the full text. The second is to index web pages automatically, as AltaVista does. Such automatic classification of documents can make use of information-extraction techniques, but its accuracy may not match that of manual classification.

---- A search engine typically has a robot that visits sites regularly to check for changes and to find new sites. A site usually has a robots.txt file describing the areas of the server that it does not want the robot to visit, and the robot must obey this rule. With automatic indexing, the robot classifies a page into a category according to its content after fetching it. Page information is stored as metadata; typical metadata include the title, the IP address, a brief summary, keywords, the size of the file and the date of its last update. Although there are standards for metadata, many sites use templates of their own. The document-extraction mechanism and the indexing policy have a great influence on the effectiveness of a Web search engine. Advanced search options typically include Boolean operators, phrase matching and natural-language processing. The results of a query are presented to the user ranked by the extraction mechanism, with the most relevant placed first; for each extracted document its metadata are displayed, together with the URL where the document is located.

---- There are also specialized engines devoted to a single subject; they search and process only content on that topic, so their results are relatively precise.

---- At the same time, there is a type of search engine that has no robot of its own to collect web pages regularly.
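---- To make the metadata just described concrete, here is a minimal sketch of what such a per-page metadata record could look like in C++. The structure and field names are purely illustrative and are not taken from any particular engine; they simply mirror the fields listed above.

#include <string>

// Illustrative only: a per-page metadata record mirroring the fields mentioned
// above (title, IP address, summary, keywords, file size, last update).
struct PageMetadata {
    std::string title;         // title of the page
    std::string ip_address;    // IP address of the server the page came from
    std::string url;           // URL where the document is located
    std::string summary;       // brief introduction to the page
    std::string keywords;      // keywords describing the page
    long        doc_size = 0;  // size of the file in bytes
    std::string last_updated;  // date of the last update
};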
Engines such as SavvySearch and MetaCrawler work by sending the same query to several search engines at once and returning the combined results to the user. SavvySearch goes further: it analyses and compares the strengths of the individual search engines and submits different user queries to different engines; the user can, of course, also specify which engines are to be used.

---- An excellent search engine must deal with the following issues: (1) page classification; (2) natural-language processing; (3) scheduling and coordination of search strategies; (4) searches tailored to specific users.
Many search engines therefore apply artificial-intelligence techniques to these problems.

---- II. Implementation of the Network Spider

---- Many articles have introduced and analysed Web search engines, but very few describe their implementation in detail. Here we mainly introduce the implementation of a web engine with the basic functions. Using C-like class descriptions, we show how the engine collects web pages and stores them in a database, and then how the database is queried, according to the keywords entered by the user, to obtain the relevant pages.

---- 2.1 Database structure

---- First we have to create the database tables used to store the pages. The following tables are generally required:

---- 1. A dictionary table (WordDictionaryTBL), which records the meaningful words that appear in the documents and their frequencies. It mainly contains three fields for storing the words related to a web page: url_id, the unique ID number of each URL; word, the stemmed form of a word appearing in that URL; and the number of times that word appears in the page.

---- 2. A table storing the information of each URL. The main fields of this table (URLTBL) are:
    rec_id            the unique ID number of each record
    status            the status obtained when fetching the content of the URL; for example, http_status_timeout means the maximum allowed download time was exceeded
    url               the name of the URL
    last_modified     the time of the most recent change
    title             the title of this URL
    docsize           the size of the file at this URL
    last_index_time   the time of the last index
    next_index_time   the time of the next index
    tag               for web pages, the type of the page, e.g. text, HTML, or image
    hops              the number of hops (depth) at which the file is obtained
    fails             the number of times fetching the page has failed
    keywords          for web pages, the keywords related to this page
    description       for web pages, a description of the content of the page
    lang              the language used by the content of the page

---- 3. Many of the words in a document have no real meaning for retrieval, for example "about", "in", "at", "we", "this" in English, and in Chinese words meaning "and", "together", "about", and so on. We collectively call them stop words, so we need a table to hold them all. This table (StopWordTBL) has two fields:
    word  char(32)    the stop word itself
    lang  char(2)     the language it belongs to

---- 4. We also need a table about robots. As mentioned earlier, every web site generally has a robots.txt file indicating which areas of the site a network robot is permitted to access. This table (RobotTBL) mainly has the following fields:
    hostinfo    host information of the web site
    path        the directories that the robot is not allowed to access

---- 5. A table of web pages that should not be searched (for example, pages whose content is objectionable or that there is simply no need to search); its main field is the URL of the page.
---- 6. In addition, we need a table of the file types we want to fetch (FileTypeTBL). For a simple web search engine, for example, we may only need files with the suffixes .html, .htm, .shtml and .txt, and we simply ignore everything else. The main fields are the type of the file and its description.

---- The contents of the stop-word table have to be prepared by ourselves: based on statistics for the various languages, we collect the words that carry no meaning. The tables for document words, URLs and robots are updated while the pages are being fetched.

---- 2.2 Description of the page-acquisition algorithm

---- The concrete steps for fetching a page are as follows (the whole loop is sketched in code at the end of this sub-section):

---- We can set a maximum number of threads for the search program; these threads then search the network simultaneously. Using the information already in the database, they determine which pages need to be updated (how to decide which pages need updating is a subject of its own; there are many heuristic and intelligent algorithms, basically built on statistical models, and the simplest is of course to set a time range and re-fetch every page older than that range). Next we check whether the page is in the mask table; if it is, its record is removed from the URL table. Otherwise we go to the corresponding WWW site and fetch the file specified by the URL (note that different kinds of URL require different protocols: FTP sites use the FTP protocol, HTTP sites the HTTP protocol, and news sites the NNTP protocol). If the page has not been modified since the last fetch, we do not need to get its contents again and only set its last-update time to the current time. If the page has recently been modified, we fetch it and analyse its contents, mainly the links it contains, which are added to the corresponding database; we also judge whether the other documents referenced by the page, such as text files, graphics files, sound files and other multimedia files, are files we need, and if so they too are added to our corresponding database. At the same time, all the meaningful words in the page and their frequencies are extracted from its content and put into the corresponding database.
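---- The loop just described can be summarised in the following sketch. The types and helper functions here are hypothetical stand-ins for the database and network layers (the real objects are introduced in section 2.3); only the control flow is meant to follow the steps above.

#include <string>
#include <vector>

struct UrlRecord { long url_id; std::string url; };
struct WordCount { std::string stem; int count; };

// Hypothetical helpers standing in for the database and network layers.
std::vector<UrlRecord> SelectUrlsNeedingUpdate();              // e.g. pages older than the re-index threshold
bool IsMasked(const std::string &url);                         // is the page in the mask table?
void RemoveUrl(long url_id);                                   // delete the record from URLTBL
bool ModifiedSinceLastFetch(const UrlRecord &rec);             // compare with the stored last_modified value
void TouchLastIndexTime(long url_id);                          // unchanged page: only update the timestamp
bool FetchByScheme(const UrlRecord &rec, std::string &body);   // HTTP, FTP or NNTP depending on the URL
std::vector<std::string> ExtractLinks(const std::string &body);
std::vector<WordCount>   ExtractWords(const std::string &body);
void AddUrlIfWanted(const std::string &href);                  // keep only the file types listed in FileTypeTBL
void AddWordToDictionary(long url_id, const std::string &stem, int count);

// One pass of a single worker thread over the pages that are due for re-indexing.
void CrawlOnce() {
    for (const UrlRecord &rec : SelectUrlsNeedingUpdate()) {
        if (IsMasked(rec.url)) { RemoveUrl(rec.url_id); continue; }
        if (!ModifiedSinceLastFetch(rec)) { TouchLastIndexTime(rec.url_id); continue; }
        std::string body;
        if (!FetchByScheme(rec, body)) continue;               // network error: try again in a later pass
        for (const std::string &href : ExtractLinks(body))
            AddUrlIfWanted(href);                              // new links go into URLTBL
        for (const WordCount &w : ExtractWords(body))
            AddWordToDictionary(rec.url_id, w.stem, w.count);  // words go into WordDictionaryTBL
    }
}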
---- To describe this process better, let us look at the main objects and data structures involved. The objects fall into three levels: the first level deals with the WWW server, the second with each document, and the third with the full-text index of each document.

---- 2.3 The main objects and functions involved in the implementation

---- The following class describes a WWW site:

class CServer {
    // The main attributes include:
    char *url;            // URL name of the WWW site
    char *proxy;          // name of the proxy server used
    char *basic_auth;     // information for basic HTTP authentication
    int  proxy_port;      // port number of the proxy
    int  period;          // the period at which the site is visited
    int  net_errors;      // number of times the network connection has failed
    int  max_net_errors;  // maximum number of network errors allowed
    int  host_timeout;    // maximum delay allowed when downloading a file
    int  maxhops;         // maximum depth (number of hops) to which URLs are followed
    int  userobots;       // whether the robots.txt convention is obeyed
    int  bodyweight;      // weight of the words appearing between <body> and </body>
};
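---- Since CServer records whether the robots convention is obeyed (userobots), the following self-contained fragment may help show what obeying it amounts to. It is only an illustration, not part of the engine's code, and it ignores User-agent sections and other details of the real convention.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Collect the path prefixes listed in "Disallow:" lines of a robots.txt file.
std::vector<std::string> ParseDisallow(const std::string &robots_txt) {
    std::vector<std::string> disallowed;
    std::istringstream in(robots_txt);
    std::string line;
    const std::string key = "Disallow:";
    while (std::getline(in, line)) {
        if (line.compare(0, key.size(), key) == 0) {
            std::string path = line.substr(key.size());
            path.erase(0, path.find_first_not_of(" \t"));  // trim leading spaces
            if (!path.empty())
                disallowed.push_back(path);
        }
    }
    return disallowed;
}

// A path may be fetched if it does not start with any disallowed prefix.
bool MayFetch(const std::string &path, const std::vector<std::string> &disallowed) {
    for (const std::string &d : disallowed)
        if (path.compare(0, d.size(), d) == 0)
            return false;
    return true;
}

int main() {
    std::string robots = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /private/\n";
    std::vector<std::string> rules = ParseDisallow(robots);
    std::cout << MayFetch("/private/data.html", rules) << "\n";  // 0: not allowed
    std::cout << MayFetch("/index.html", rules) << "\n";         // 1: allowed
}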
---- The following are the main data members of the class that describes a document:

class CNetDocument {
    // The main attributes are:
    int   url_id;           // the ID number of this URL
    int   status;           // the status obtained when fetching this document
    int   size;             // the size of the document
    int   tag;              // a tag for the document, indicating whether it is HTML, text or some other type
    int   hops;             // the number of hops of this URL
    char *url;              // the name of the URL related to this document
    char *content_type;     // the content type of this document
    char *last_modified;    // the time of the most recent update
    char *title;            // the title of this document
    char *last_index_time;  // the time of the last index
    char *next_index_time;  // the time of the next index
    char *keywords;         // for web pages, the keywords contained in this page
    char *description;      // for web pages, a description of this page

    // The main methods are:
    FillDocInfo(...);        // obtain the information related to this document from the database
    AddHref(...);            // add a new link URL that appears in this page
    DeleteURL(...);          // delete an existing URL
    CanGetThisURL(...);      // decide from the configuration whether this page should be fetched
    // The following three methods fetch a document with different protocols, according to the URL:
    NNTPGet(...);
    FTPGet(...);
    HTTPGet(...);
    ParseHead(...);          // if the document was obtained via HTTP, parse its header information
    ParseMainBody(...);      // parse the body of the document that has been obtained
    ServerResponseType(...); // get the response message from the server side
    UpdateURLDB(...);        // update the data in the database
};

---- In fact, whenever we want to fetch a web page, we build a CNetDocument object; as the page is analysed, its relevant contents are put into the member variables of this object. Below are the main data members of the class for the full-text index:

class CIndexer {
    // The main attributes are:
    char *url;      // the URL name of the document we are processing
    int   mwords;   // the maximum number of words per page, set by us in advance
    int   nwords;   // the number of words actually obtained
    int   swords;   // the number of words we have already sorted
    WORD *word;     // all of the words
    char *buf;      // the space we allocate for the document

    // The main methods include:
    InitIndexer(...);    // perform the initial settings and allocations
    ParseGetFile(...);   // build a full-text index for the document obtained
    AddWord(...);        // add the qualifying words of the page to the word array
    IntoDB(...);         // store the information of the page's full-text index into the database
};

---- Generally speaking, we build a full-text index only for two types of URL: one is text/html and the other is text/plain.
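---- As a small, self-contained illustration of what AddWord essentially has to do — count how often each meaningful word occurs in a document — consider the following fragment. It uses a std::map instead of CIndexer's fixed word array, stop-word filtering is reduced to a hard-coded set, and stemming, which the real indexer would also apply, is omitted.

#include <cctype>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Count word frequencies in a piece of plain text, skipping stop words.
std::map<std::string, int> CountWords(const std::string &text,
                                      const std::set<std::string> &stopwords) {
    std::map<std::string, int> freq;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        std::string word;
        for (char c : token)                        // keep letters only, lower-cased
            if (std::isalpha(static_cast<unsigned char>(c)))
                word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (word.empty() || stopwords.count(word))  // skip empty tokens and stop words
            continue;
        ++freq[word];                               // a real indexer would stem the word first
    }
    return freq;
}

int main() {
    std::set<std::string> stop = {"to", "be", "or", "not", "the"};
    for (const auto &p : CountWords("To be, or not to be: that is the question.", stop))
        std::cout << p.first << " " << p.second << "\n";
}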
---- The data structure for a word is as follows:

typedef struct word_struct {
    int   count;   // the number of times the word appears
    int   code;    // the normal form of the word: the word may be "encouraging", whose normal
                   // form is "encourage"; this is in fact stemming, i.e. we keep only the stem of the word
    char *word;    // the content of the word
} WORD;

---- The following structure describes the links contained in a web page:

typedef struct href_struct {
    char *href;    // the name of the link
    int   hops;    // the number of hops at which it occurs
    int   stored;  // whether it has already been stored in the database
} HREF;

---- All URLs that need to be updated or that are newly generated are put into this structure; when their number exceeds a certain limit, they are written into the database.

---- The data structure describing a URL is as follows:

typedef struct url {
    char *schema;     // how the URL is to be fetched, e.g. HTTP, FTP, NNTP, etc.
    char *specific;   // the host name plus the path
    char *hostinfo;   // the host name plus the relevant protocol port
    char *hostname;   // the name of the host
    char *path;       // the specific path on the host
    char *filename;   // the name of the file
    char *anchor;     // the related anchor
    int   port;       // the port related to the protocol
} URL;

---- This structure holds the relevant properties of a URL. In fact, in the database we store only the descriptions of the pages and the keyword index information, and only for text and HTML pages; we do not store the actual contents of the pages.

---- III. Implementation of User Queries

---- Analysis of how the query requests submitted by users are handled:

---- A user who wants to find information on some topic generally does so by supplying several keywords related to that topic.

---- Let us look at the data structures and classes related to user queries. Below is the basic structure for a word and its weight:

typedef struct word_weight_pair {
    char word[WORD_LEN];
    int  weight;
} word_weight_pair;

---- The following class is mainly used to process and analyse the user's query:

class CUserQuery {
    char m_UserQuery[MAX_QUERYLEN];  // the user's query expression
    CPtrArray word_weight_col;       // a dynamic array of word_weight_pair structures
    int m_maxReturnSum;              // the maximum number of pages the user wants returned
    int search_mode;
    CObArray m_returnDoc;            // a dynamic array of CNetDocument objects
    NormalizeWord(char *oneword);    // normalize the word, i.e. perform stemming
    Find(char *odbcName);            // perform the database lookup and matching
};

---- The basic steps of the system implementation are as follows:

---- 1. Analyse the query expression entered by the user. In fact, as described for the Spider's search process above, a document is described by keywords, so each document can be represented as a collection of the following form:

    Document ::= <word1, weight1; word2, weight2; ...; wordn, weightn>
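---- The section breaks off here, but the representation above already suggests the kind of matching Find() has to perform. The sketch below assumes one simple ranking rule — add up the weights of the document words whose stems match the stems of the query words. This rule is our own illustration, not one prescribed by the article, and Stem() is a hypothetical stand-in for NormalizeWord().

#include <map>
#include <string>
#include <vector>

// A document represented as a collection of (stemmed word, weight) pairs.
using DocumentIndex = std::map<std::string, int>;

// Hypothetical stand-in for NormalizeWord(): reduce a word to its stem.
std::string Stem(const std::string &word);

// Score one document against the user's query terms by summing the weights of
// matching stems; the documents with the highest scores are returned first.
int ScoreDocument(const DocumentIndex &doc, const std::vector<std::string> &query) {
    int score = 0;
    for (const std::string &q : query) {
        auto it = doc.find(Stem(q));
        if (it != doc.end())
            score += it->second;
    }
    return score;
}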