Hu Zhaohui (Zhejiang University Computer Department)
Wang Haizhen (Ningbo Haifeng Plastic Co., Ltd.)
---- I. Introduction
---- With the rapid development of the Internet, people increasingly rely on it to find the information they need. However, because online information sources are so numerous, we often end up with what is called "Rich Data, Poor Information". How to find the information we need effectively has therefore become a key issue, and search engines were born to solve this problem.
---- There are now many search engines on the Internet; the better-known ones include AltaVista, Yahoo, Infoseek, MetaCrawler and SavvySearch. Many Chinese search engines have also been established, such as Sohu, Sina and Northern Net; of course, since they were established only recently, the quality of their search results still has room for improvement.
---- AltaVista is a very fast search engine; its powerful hardware configuration allows it to handle complex queries. It is mainly keyword based, and it roams both the Web and Usenet. It supports the Boolean operators "AND", "OR" and "NOT", adds the proximity operator "NEAR", and allows wildcards and "backward" searches (for example, you can find all the Web sites that link to a given page). You can decide whether to give a search term extra weight and where to look for it within the document. The advantage of being able to perform phrase queries rather than simple word queries is obvious. For example, suppose we want to find the phrase "to be or not to be". If it is simply broken into words, all of them are stop words and the query returns nothing; treated as a whole phrase, however, it easily returns results such as information about Hamlet or Shakespeare. The score a web page receives is based on how many of your search phrases it contains and on their positions in the document and the distances between them. The resulting search results can also be translated into other languages.
---- Excite is called an "intelligent" search engine because it builds a concept-based index. Of course, its so-called "intelligence" is really a flexible application of probability and statistics. It can index by concept and by keyword at the same time, and it can index the Web, Usenet and classified ads. It supports Boolean operators such as "AND", "OR" and "NOT", and the symbols "+" and "-" can also be used. Its disadvantage is that the returned results do not specify the size and format of each web page.
---- Infoseek is a simple but powerful index. One of its advantages is an extensible classification for topic search. You can combine your search phrases with similar topic categories, and those topic phrases are automatically added to your query, giving your search a better topical focus. It also supports queries on images. It can roam the Web, Usenet, Usenet FAQs, and more. Boolean operators are not supported, but you can use the symbols "+" and "-" (equivalent to "AND" and "NOT").
---- Yahoo cannot really be called a search engine site. It provides a hierarchical subject index that lets you move from a general topic down to a specific one; in other words, Yahoo classifies and organizes the Web in an effective way. For example, suppose you want to build a web page but do not know how. To find information on Yahoo about building web pages, you can choose the topic "Computers and Internet", and under this topic you will find sub-topics such as web page authoring, CGI programming, Java, HTML and web design. Select the sub-topic you are looking for, and in the end you get links to all the pages related to it. That is, if you are very clear about which topic the thing you are looking for belongs to, this kind of directory query is better and more precise than using a general search engine. You can search Yahoo's index, but you are not in fact searching the whole Web. Yahoo does provide options that let you forward your search to other search engines, such as AltaVista. It should be noted, however, that Yahoo actually classifies and organizes only a small part of the Web, and its effectiveness is not very good.
---- The basic principle of a search engine is to crawl across web pages with a network robot, discover new pages, and bring them back into a local database; a user's query is then answered by querying this local database. For example, Yahoo finds about 5 million new web pages every day.
---- Search engines generally use one of two implementation mechanisms. The first is manual indexing: Yahoo's pages, for example, are classified by hand. Its disadvantages are that coverage of the Web is relatively low and the latest information cannot be guaranteed; queries are matched against the keywords and the page descriptions and titles written by the staff, not against the full text. The second is automatic indexing of web pages: AltaVista, for example, is implemented entirely through automatic indexing. This kind of automatic document classification is in effect an application of information extraction technology, although its accuracy may not be as good as manual classification.
---- A search engine typically has a robot that regularly visits certain sites to check for changes and to find new sites. A site generally has a robots.txt file describing the areas of the server the robot is not allowed to visit, and the robot must obey this rule. If automatic indexing is used, after fetching a page the robot classifies it according to its content, and the page information is saved as metadata. Typical metadata include the title, the IP address, a brief description, keywords, the size of the file and the date of the last update. Although there are standards for metadata, many sites use their own templates. The document extraction mechanism and the indexing policy have a great influence on the effectiveness of a web search engine. Advanced search options typically include Boolean operators, phrase matching and natural language processing. The results of a query are presented to the user according to the extraction mechanism, with the most relevant placed first; the metadata of each extracted document is displayed to the user together with the URL address where the document is located.
---- There are also specialized engines devoted to a single subject; they only search and process content on that topic, so the relevance of the information they return is relatively high.
---- There is also a type of search engine that has no robot of its own to collect web pages regularly. SavvySearch and MetaCrawler, for example, work by sending a query to several search engines at the same time and returning the combined results to the user. SavvySearch actually analyzes and compares the capabilities of the individual search engines and submits each query to the engines best suited to it, although the user can also specify which search engines to use.
---- An excellent search engine must handle the following issues: (1) page classification; (2) natural language processing; (3) scheduling and coordination of search strategies; (4) searches tailored to specific users. Many search engines therefore apply artificial intelligence techniques to these problems.
---- II. Description of the Web Spider Implementation
---- There are many articles that introduce and analyze web search engines, but very few describe their implementation in detail. Here we introduce a web engine with basic functions. In this article we describe, in the form of C++-like pseudocode, how the web engine collects pages and stores them in the database, and how the database is queried with the keywords entered by the user to obtain the relevant pages.
---- 2.1 Database Structure
---- First, we have to create database tables to store our pages. In general, the following tables need to be established:
---- 1. A dictionary table, which in effect records the meaningful words found in each document and their frequencies of occurrence.
---- This table (WordDictionaryTBL) has three main fields, used to store the words associated with a web page:
url_id              the unique ID number of each URL
word                the stem of a word appearing in that URL
intag               the number of times the word appears in the page
---- 2. A table that stores the information about each URL.
---- The main fields in this table (URLTBL) are:
rec_id              the unique ID number of each record
status              the status of fetching the URL's content, e.g. HTTP_STATUS_TIMEOUT,
                    which means the maximum allowed download time was exceeded
url                 the URL string
content_type        the content type of the document
last_modified       the time of the most recent modification
title               the title of the document
docsize             the size of the document
last_index_time     the time the page was last indexed
next_index_time     the time the page will next be indexed
tag                 the tag of the page, indicating its type, e.g. text, HTML or image
hops                the number of failures when fetching the file
keywords            for web pages, the keywords related to this page
description         for web pages, a description of the page's content
lang                the language used by the document
---- 3. Many of the words in web pages are prepositions, modal particles or other very common words that carry little meaning by themselves, for example "about", "in", "at", "we", "this" in English, and words meaning "and", "together", "about" in Chinese. We collectively call them stop words, so we need a table that contains all of these stop words. This table (StopWordTBL) has two fields:
word char(32)       the stop word itself
lang char(2)        the language it belongs to
---- 4. We also need a table about robots. As we said before, every web site generally has a robots.txt file stating which areas of the server robots are not allowed to visit. This table (RobotTBL) mainly has the following fields:
hostinfo            host information of the web site
path                a directory that robots are not allowed to access
---- 5. A table of web pages that we need to block (for example, sites whose content is unhealthy or simply unwanted); its main field is the URL of the page.
---- 6. In addition, we need a table of the file types we want to fetch (FileTypeTBL). For a simple web search engine, for example, we may only need files with the suffixes .html, .htm, .shtml and .txt, and everything else is simply ignored. The main fields are the file type and its description.
---- The contents of the stop-word table are filled in by us in advance: according to statistics for the various languages, we put in the words that carry little meaning. The word, URL and robot tables are updated while the web pages are being fetched.
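---- To make the table structures above concrete, here is a minimal sketch of how two of them, WordDictionaryTBL and StopWordTBL, might be declared. The SQL text is embedded as C string constants; the column types are assumptions, and the actual schema depends on the database used.
// Sketch only: possible definitions of WordDictionaryTBL and StopWordTBL.
// Column names follow the field lists above; the types are illustrative.
const char *CREATE_WORD_DICTIONARY_TBL =
    "CREATE TABLE WordDictionaryTBL ("
    "  url_id  INTEGER,"      // unique ID of the URL the word was found in
    "  word    CHAR(32),"     // stem of the word
    "  intag   INTEGER"       // number of times the word appears in the page
    ")";

const char *CREATE_STOP_WORD_TBL =
    "CREATE TABLE StopWordTBL ("
    "  word  CHAR(32),"       // the stop word itself
    "  lang  CHAR(2)"         // language the stop word belongs to
    ")";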
---- 2.2 Description of the algorithm for fetching web pages
---- The steps for fetching a web page are as follows:
---- We can set the number of threads that the search program is allowed to open; these threads then search the network, using the information about web pages stored in the database to find the pages that need to be updated. (How to decide which pages need updating is a question worth studying in its own right; there are many heuristic and intelligent algorithms, basically based on statistical rules. The simplest approach is, of course, to set a time range and re-fetch every page that was last visited before that range.) The thread then checks whether the page is in the mask table; if it is, the corresponding record is removed from the URL table. Otherwise, we go to the corresponding WWW site to fetch the file specified by the URL. (Note that different URLs require different protocols: FTP sites need the FTP protocol, HTTP sites the HTTP protocol, and news sites the NNTP protocol.) If the page has not changed since the last visit, we do not need to fetch its content; we only set the last-update time to the current time. If the page has recently been modified, we fetch it and analyze its content, mainly the links it contains, which are added to the corresponding database table, and we check whether the page refers to other documents, such as text files, graphics files, sound files and other multimedia files that we want; if so, they are also added to our corresponding database. At the same time, all the meaningful words in the page and their numbers of occurrences are extracted and stored in the corresponding database. To describe this process better, let us look at the main objects and data structures involved. The objects fall into three levels: the first level is for the WWW server, the second for each page, and the third for the full-text index of each page.
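---- The following is a minimal sketch of this per-thread loop, written with the CNetDocument and CIndexer objects defined in section 2.3 below. IsMasked, RemoveUrlRecord and NOT_MODIFIED are hypothetical helpers standing in for the database checks described above, and the real method signatures may differ.
// Sketch only: processing one URL that was selected for updating.
void CrawlOneUrl(const char *url)
{
    if (IsMasked(url)) {              // the page is listed in the mask table
        RemoveUrlRecord(url);         // so drop its record from the URL table
        return;
    }

    CNetDocument doc;
    doc.FillDocInfo(url);             // load the stored metadata for this URL
    if (!doc.CanGetThisUrl())         // filtered out by type or robots settings
        return;

    // The protocol is chosen from the URL: HttpGet(), FtpGet() or NntpGet().
    int status = doc.HttpGet();

    if (status == NOT_MODIFIED) {     // page unchanged since the last visit
        doc.UpdateUrlDB();            // only refresh the index timestamps
        return;
    }

    doc.ParseMainBody();              // extract links, keywords and description;
                                      // new links go back into the URL table via AddHref(...)

    CIndexer indexer;
    indexer.InitIndexer(&doc);        // argument form is an assumption
    indexer.ParseGetFile();           // build the full-text word list for the page
    indexer.IntoDB();                 // write the words into WordDictionaryTBL

    doc.UpdateUrlDB();                // finally store the new metadata
}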
---- 2.3 Main objects and functions related to the implementation
---- The following class is defined for a web site:
class CServer {
The main properties are:
    char *url;              // URL name of the WWW site
    char *proxy;            // name of the proxy server used
    char *basic_auth;       // information for basic HTTP authentication
    int  proxy_port;        // port number of the proxy server
    int  period;            // re-indexing period for this site
    int  net_errors;        // number of times the network connection was unavailable
    int  max_net_errors;    // maximum number of network errors allowed
    int  read_timeout;      // maximum delay allowed when downloading a file
    int  maxhops;           // maximum depth (number of hops) a URL may reach
    int  userobots;         // whether to obey the rules in robots.txt
    int  bodyweight;        // weight of words between <body> and </body>
    int  titleweight;       // weight of words between <title> and </title>
    int  urlweight;         // weight of words in the URL of the document
    int  descweight;        // weight of words in <meta name="description" content="...">
    int  keywordweight;     // weight of words in <meta name="keywords" content="...">
The main methods are:
    FindServer();           // checks whether the server exists and can be connected to
    FillDefaultAttribute(); // fills in the default attributes used for all WWW servers
};
---- The member variables of this object hold the parameter settings for one site. We have a default setting for all sites, but some sites can also be given special settings of their own; these settings can be made in the configuration file.
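---- As a small sketch of how such settings might be applied in code (the site name and the concrete values below are purely illustrative):
#include <string.h>

// Sketch only: configure one CServer object, starting from the global defaults.
void ConfigureExampleSite(CServer &server)
{
    server.FillDefaultAttribute();          // start from the defaults for all sites
    server.url         = strdup("http://www.example.com");   // hypothetical site
    server.period      = 7 * 24 * 3600;     // re-index this site once a week
    server.maxhops     = 5;                 // do not follow links deeper than 5 hops
    server.userobots   = 1;                 // obey robots.txt for this site
    server.titleweight = 10;                // words inside <title> count more
    server.bodyweight  = 1;                 // words in the body get the base weight
}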
---- The following are the main data members of the class for a document:
class CNetDocument {
The main properties are:
    int  url_id;            // ID number of this URL
    int  status;            // status of fetching this document
    int  size;              // size of the document
    int  tag;               // tag of the document, indicating whether it is
                            // HTML, text or another type
    int  hops;              // number of hops of the URL
    char *url;              // name of the URL for this document
    char *content_type;     // content type of this document
    char *last_modified;    // time of the most recent update
    char *title;            // title of this document
    char *last_index_time;  // time the document was last indexed
    char *next_index_time;  // time the document will next be indexed
    char *keywords;         // keywords in this document
    char *description;      // description of this document
The main methods are:
    FillDocInfo(...);       // obtains the information about this document from the database
    AddHref(...);           // adds a new link found in the page
    DeleteURL(...);         // deletes an existing URL
    CanGetThisUrl(...);     // decides, according to the configuration, whether to fetch the page
    // The following three methods fetch a document with the protocol that matches its URL.
    NntpGet(...);
    FtpGet(...);
    HttpGet(...);
    ParseHead(...);         // if the HTTP protocol is used, parses the header information
    ParseMainBody(...);     // parses the body of the fetched document
    ServerResponseType(...); // obtains the response message from the server side
    UpdateUrlDB(...);       // updates the database with the document information
};
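---- As a small illustration of the three Get methods, the fetch protocol can be chosen from the schema of the URL; FetchBySchema is a hypothetical helper, not one of the methods listed above.
#include <string.h>

// Sketch only: dispatch to the protocol that matches the URL's schema.
int FetchBySchema(CNetDocument &doc, const char *schema)
{
    if (strcmp(schema, "http") == 0)  return doc.HttpGet();
    if (strcmp(schema, "ftp")  == 0)  return doc.FtpGet();
    if (strcmp(schema, "nntp") == 0)  return doc.NntpGet();
    return -1;                        // unsupported protocol: ignore this URL
}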
---- In fact, when we want to fetch a web page we first build a CNetDocument object, then analyze the page and put the relevant content into the member variables of this object. The following are the main data members of the class for the full-text index:
class CIndexer {
The main properties are:
    char *url;       // name of the URL we are processing
    int  mwords;     // maximum number of words per page, set in advance
    int  nwords;     // number of words actually obtained
    int  swords;     // number of words we have already sorted
    WORD *word;      // the contents of all the words
    char *buf;       // the buffer we allocate for the document
The main methods are:
    InitIndexer(...);    // performs the initial settings and allocations
    ParseGetFile(...);   // builds the full-text index of the fetched page
    AddWord(...);        // adds the usable words of the page to the word array
    IntoDB(...);         // writes the information of the page's full-text index into the database
};
---- Before extracting a web page we have to create a CIndexer object, which is mainly used for the full-text indexing of the page's content. Generally speaking, we build a full-text index only for two types of URL: one is text/html and the other is text/plain. The data structure WORD is as follows:
typedef struct word_struct {
    int  count;     // number of times the word appears
    int  code;      // the normalized form of the word; for example the word
                    // may be "encouraging", whose normalized form is "encourage".
                    // This is in fact a stemming operation: we keep only the stem of the word.
    char *word;     // the content of the word
} WORD;
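---- A minimal sketch of how AddWord might maintain this array is shown below; the signature is assumed, and StemWord is a hypothetical helper standing in for the stemming step just described.
#include <string.h>
#include <stdlib.h>

// Sketch only: keep one WORD entry per stem and count repeated occurrences.
void CIndexer::AddWord(const char *rawWord)
{
    char stem[32];
    StemWord(rawWord, stem, sizeof(stem));   // e.g. "encouraging" -> "encourage" (hypothetical helper)

    for (int i = 0; i < nwords; i++) {
        if (strcmp(word[i].word, stem) == 0) {
            word[i].count++;                 // the word was already seen in this page
            return;
        }
    }
    if (nwords < mwords) {                   // still room in the word array
        word[nwords].word  = strdup(stem);
        word[nwords].count = 1;
        nwords++;
    }
}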
---- The following structure describes the links contained in a web page:
typedef struct href_struct {
    char *href;     // name of the link
    int  hops;      // number of hops at which it occurred
    int  stored;    // whether it has already been stored in the database
} HREF;
---- All the URLs that need to be updated or are newly discovered are put into this structure, and when their number exceeds a certain limit they are written into the database in one batch.
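---- A sketch of this batching idea follows; the threshold value and the WriteHrefsToDB helper are hypothetical stand-ins for the actual database write.
#define HREF_FLUSH_THRESHOLD 1000            // illustrative batch size

// Sketch only: buffer discovered links and flush them to the URL table in batches.
static HREF g_hrefBuf[HREF_FLUSH_THRESHOLD];
static int  g_hrefCount = 0;

void AddHrefToBuffer(const HREF *h)
{
    g_hrefBuf[g_hrefCount++] = *h;
    if (g_hrefCount >= HREF_FLUSH_THRESHOLD) {
        WriteHrefsToDB(g_hrefBuf, g_hrefCount);   // hypothetical database write
        g_hrefCount = 0;
    }
}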
---- The data structure for a URL is as follows:
typedef struct url {
    char *schema;     // the protocol by which the URL is fetched, such as
                      // HTTP, FTP, NNTP, etc.
    char *specific;   // the host name plus the path
    char *hostinfo;   // the host name plus the protocol port
    char *hostname;   // the name of the host
    char *path;       // the specific path on the host
    char *filename;   // the name of the file
    char *anchor;     // the related anchor, if any
    int  port;        // the port of the protocol
} URL;
---- These are the data structures for the relevant properties of a URL. In fact, in the database we store only the page descriptions and the keyword index information of text and HTML pages; we do not store the actual contents of the web pages.
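---- As an illustration of how such a structure might be filled in, here is a rough sketch that splits a URL of the form "schema://host:port/path#anchor". It fills only some of the fields, assumes port 80 when none is given, and omits error handling.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

// Sketch only: split a URL string into part of the URL structure above.
void ParseUrl(const char *spec, URL *u)
{
    char host[256] = "", rest[1024] = "", schema[16] = "";
    int  port = 80;                                   // default, suitable only for HTTP

    if (sscanf(spec, "%15[^:]://%255[^:/]:%d%1023s", schema, host, &port, rest) < 4) {
        sscanf(spec, "%15[^:]://%255[^/]%1023s", schema, host, rest);
    }

    u->schema   = strdup(schema);
    u->hostname = strdup(host);
    u->port     = port;
    u->path     = strdup(rest[0] ? rest : "/");

    char *hash = strchr(u->path, '#');                // split off the anchor, if any
    u->anchor  = hash ? strdup(hash + 1) : strdup("");
    if (hash) *hash = '\0';
}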
---- III. Description of the User Query Implementation
---- Analysis of the implementation of the query requests submitted by users:
---- To query for information on a certain subject, the user generally provides several keywords associated with that subject.
---- Let us look at the data structures and classes related to user queries:
---- The following is the basic structure for a word and its weight:
typedef struct word_weight_pair {
    char word[WORD_LEN];
    int  weight;
} WORD_WEIGHT_PAIR;
---- The following class is mainly used to process and analyze the user's query:
class CUserQuery
{
    char m_UserQuery[MAX_QUERYLEN];   // the user's query expression
    CPtrArray word_weight_col;
    // a dynamic array of WORD_WEIGHT_PAIR structures
    int m_maxReturnSum;               // the maximum number of pages the user wants returned
    int search_mode;
    CObArray m_returnDoc;             // a dynamic array of CNetDocument objects
    NormalizeWord(char *oneWord);     // normalizes a word, i.e. performs stemming
    Find(char *odbcName);             // searches and matches against the database
};
---- The basic steps for system implementation are as follows:
---- 1. Analyze the query expression entered by the user. In the spider's search process described earlier, each document is in fact represented by its keywords, so a document can be expressed as a collection
Doc = <term_1, term_2, ..., term_n>, where term_i = <keyword_i, weight_i>.
---- In other words, each document is represented in the vector space model.
---- We also represent the query expression entered by the user in the vector space model. We assume that the order in which the user enters the keywords reflects their importance, so words that appear earlier get higher priority. We break the whole query down into its smallest units, the individual words, and perform a stemming operation on each of them; as mentioned above, the word "encouraging", for example, is transformed into the form "encourage". Then the stop words, such as "is", "as" and so on, which are stored in the StopWordTBL table, are removed. Finally all the normalized words are put into the dynamic array word_weight_col.
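---- A minimal sketch of this first step is shown below. ParseQuery and IsStopWord are hypothetical names, the weighting scheme is only illustrative, and word_weight_col and NormalizeWord are the members of CUserQuery listed above.
#include <string.h>
#include <stdlib.h>

// Sketch only: split the query, stem each word, drop stop words,
// and give earlier words higher weights.
void CUserQuery::ParseQuery()
{
    int position = 0;
    char *copy = strdup(m_UserQuery);

    for (char *tok = strtok(copy, " \t"); tok != NULL; tok = strtok(NULL, " \t")) {
        NormalizeWord(tok);                     // stem, e.g. "encouraging" -> "encourage"
        if (IsStopWord(tok))                    // hypothetical lookup against StopWordTBL
            continue;

        WORD_WEIGHT_PAIR *pair = new WORD_WEIGHT_PAIR;
        strncpy(pair->word, tok, WORD_LEN - 1);
        pair->word[WORD_LEN - 1] = '\0';
        pair->weight = 100 - 10 * position++;   // earlier words weigh more (illustrative)
        word_weight_col.Add(pair);              // dynamic array of WORD_WEIGHT_PAIR*
    }
    free(copy);
}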
---- 2. For each element of the dynamic array word_weight_col, that is, each WORD_WEIGHT_PAIR structure (containing a word and its weight), we look up the related records in the table WordDictionaryTBL; these records should cover all the words in word_weight_col.
---- Then we calculate how well each web page matches the query. The matching calculation proceeds as follows. First we sort all the retrieved records by URL address, since several records may belong to the same URL, and then score each web page: the score contributed by each record is INITSCORE * WEIGHT + (TOTALTIMES - 1) * WEIGHT * INCREMENT, where INITSCORE is the base score of each word, TOTALTIMES is the number of times the word appears in the web page, WEIGHT is the weight the word carries in different content segments (for example the keyword segment, the title segment or the body segment), and INCREMENT is the score added for each additional occurrence of the word.
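---- For example, with illustrative constants the per-record contribution could be computed as follows (a sketch only; the concrete constants and the exact formula depend on tuning):
// Sketch only: score contributed by one WordDictionaryTBL record.
const int INITSCORE = 100;    // illustrative base score of one matching word
const int INCREMENT = 10;     // illustrative bonus for each extra occurrence

int ScoreRecord(int totalTimes, int segmentWeight)
{
    // INITSCORE * WEIGHT + (TOTALTIMES - 1) * WEIGHT * INCREMENT
    return INITSCORE * segmentWeight
         + (totalTimes - 1) * segmentWeight * INCREMENT;
}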
---- 3. According to the m_maxReturnSum specified by the user, display the m_maxReturnSum pages with the highest matching scores.
---- IV. Conclusion