Chinese Search Engine Technology Revealed: System Architecture (Part 4)

xiaoxiao 2021-03-06 21

Source: e800.com.cn

Search Engine Indexing and Searching

For web spider technology and ranking technology, please refer to the author's other articles [1] [2]. Taking the Google search engine as an example, this article mainly introduces how a search engine indexes data and how it handles searches.

Data indexing is divided into three steps: extracting the web page content, identifying words, and building the index database.

Most information on the Internet exists in HTML format, and only text is processed for indexing. The text content therefore has to be extracted from each web page, filtering out script markup and useless advertising, while recording the layout and formatting information of the text [1].

Word identification is a critical part of a search engine. It recognizes the words in a web page with the help of a dictionary file. For Western languages, the different forms of a word must be recognized, such as singular and plural, past tense, compound words, and stems. For some Asian languages (Chinese, Japanese, Korean, etc.), the text must first be segmented into words [3]. Every word identified in a web page is assigned a unique wordID, which the data indexing module then uses.

Building the index database is the most complex part of data indexing. Two indexes are generally required: a document index and a keyword index. The document index assigns each web page a unique docID and, keyed by docID, records how many wordIDs appear in that page and the number of occurrences, positions, and format of each, forming a list of wordIDs for each docID. The keyword index is in fact the inverse of the document index: keyed by wordID, it records which web pages the word appears in (identified by docID) and the number of occurrences, positions, and format of the word in each page, forming a list of docIDs for each wordID. For the detailed data structures of the index data, interested readers can consult reference [4].

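
The three indexing steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the data structures described in [4]: the names TextExtractor, tokenize, and Indexer are hypothetical, words are split with a simple regular expression (no stemming and no Chinese word segmentation), only positions are recorded for each occurrence (no format information), and advertising is not filtered.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from a page, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Step 1: extract the text content of a web page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Step 2 (simplified assumption): lowercase alphanumeric runs as words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Indexer:
    """Step 3: build a document index and a keyword (inverted) index."""

    def __init__(self):
        self.word_ids = {}                 # word -> wordID (the dictionary)
        self.forward = {}                  # docID -> {wordID: [positions]}
        self.inverted = defaultdict(dict)  # wordID -> {docID: [positions]}

    def word_id(self, word):
        # Assign the next unused wordID the first time a word is seen.
        return self.word_ids.setdefault(word, len(self.word_ids))

    def add_page(self, html):
        doc_id = len(self.forward)         # unique docID per page
        hits = defaultdict(list)
        for pos, word in enumerate(tokenize(extract_text(html))):
            hits[self.word_id(word)].append(pos)
        self.forward[doc_id] = dict(hits)
        # The keyword index is the document index turned around.
        for wid, positions in hits.items():
            self.inverted[wid][doc_id] = positions
        return doc_id
```

The number of occurrences of a word in a page is simply the length of its position list, so it is not stored separately in this sketch.
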
The search process is the process of satisfying a user's search request. The user enters search keywords. The search server looks them up in the keyword dictionary and converts them to wordIDs, then retrieves the docID list for each wordID from the index database. It scans the docID lists, matching against the wordIDs, to extract the web pages that satisfy the query, and then computes the relevance of each page to the keywords. Based on the relevance scores, the top K results are returned to the user (K is the number of results per page, which differs between search engines). If the user views the second or a later page, the search is run again; for the second page, the results ranked K+1 to 2*K are returned to the user. The process is shown in the figure below.

[Figure: the search process]
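
The search steps described above can be sketched as follows. This is a hedged illustration, not the engine's actual procedure: the function name search, the sample dictionary and index, and the relevance score (the total number of occurrences of the query words in a page) are all assumptions made for the example, since the article does not specify how relevance is computed.

```python
# Hypothetical sample data: keyword dictionary and keyword index.
word_ids = {"search": 0, "engine": 1, "index": 2}   # word -> wordID
inverted = {                                        # wordID -> {docID: count}
    0: {0: 3, 1: 1, 2: 2, 3: 1},
    1: {0: 1, 2: 4, 3: 1},
    2: {1: 2, 3: 5},
}


def search(query, word_ids, inverted, page=1, k=10):
    """Return the docIDs for one result page, best match first."""
    # Convert keywords to wordIDs; an unknown word means no page matches.
    wids = []
    for word in query.lower().split():
        if word not in word_ids:
            return []
        wids.append(word_ids[word])
    if not wids:
        return []
    # Scan the shortest docID list and keep docIDs found in all the others.
    postings = sorted((inverted.get(w, {}) for w in wids), key=len)
    matches = [d for d in postings[0] if all(d in p for p in postings[1:])]
    # Placeholder relevance: total occurrences of the query words in the page.
    scored = [(sum(p[d] for p in postings), d) for d in matches]
    scored.sort(key=lambda s: (-s[0], s[1]))
    # Page 1 holds ranks 1..K, page 2 holds ranks K+1..2*K, and so on.
    start = (page - 1) * k
    return [d for _, d in scored[start:start + k]]


first_page = search("search engine", word_ids, inverted, page=1, k=2)
second_page = search("search engine", word_ids, inverted, page=2, k=2)
```

As in the text, a request for the second page runs the whole search again and only changes which slice of the ranked list is returned.
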

When reposting, please cite the original address: https://www.9cbs.com/read-45538.html
