Chinese search engine technology revealed: system architecture

xiaoxiao2021-03-06 59

The Internet has developed to where it is today partly because of the new experience it brings to people, and partly because of the network nodes that supply it with rich content. Before the Internet became widespread, a person looking for information first thought of a library full of books. What about today? Many people now choose a more convenient, faster, more comprehensive and more accurate way: the Internet. You can sit at home, click the mouse and find the information you want. Before the Internet was popular this was only a dream, but now it is possible. What helps you quickly find target information across the entire Internet is the increasingly valued search engine.

There is plenty of technical material about search engines, and reports on the search engine economy are everywhere, so this editor does not want to dwell on them. Instead, as the "Chinese Search Engine Technology Revealed" series is completed, the editor would like to talk about the far-reaching impact search engines have had on him. In 2000 there was a large amount of free personal home page space on the Internet. The editor had only just entered the IT circle, looked at that space with envy, and applied for one. After more than a month of study and repeated effort, the first personal home page of his life was born. It received very few visits each day, which was discouraging, and for a while he could not think of a good way to solve the problem. Then he came across an article on how to register a personal website with search engines, and registered his home page in the relevant sections of Sohu, NetEase and other search engines. Only today did the editor learn that the search engines prevalent at that time were "directory search engines".

That was the first time the editor used and came to know a search engine, and the rising visit count of the home page showed him its magic. It was because of search engines that the home page became known to more people, and the home page in turn brought many job opportunities. This experience is probably not the editor's alone; many people have had it and went on to work on the Internet because of it. As the saying goes, "The world is so wonderful; you won't know unless you look," to which the editor adds: "How to look? Search engines will help you!"

The Internet has developed rapidly over the past ten years or so, gradually reaching deep into people's lives and changing them. The Internet economy has weathered its storms as well: from a slow start to rapid expansion, from the bubble to gradual recovery; from "online advertising" to the "thumb economy", from "online games" to the "search economy". Search engines are now one of the most closely watched areas and a cradle for making millionaires. More and more companies want a share of this gold mine, and many of them choose to build their own search engines. Li Yanhong, president of the well-known domestic search engine company Baidu (http://www.baidu.com), has said that search is not a field everyone can enter, and that the entry threshold is relatively high.

How high is the threshold for search engines? It is mainly technical: fast crawling of web data, indexing and storage of massive data, ordering of search results, millisecond-level search efficiency, distributed processing and load balancing, natural language understanding, and so on. For a complex system, the techniques in each of these areas are important, but the architecture of the whole system cannot be ignored either, and the search engine is no exception.

Search engine technology and classification

The technical foundation of search engines is full-text retrieval. Research on full-text retrieval began abroad in the 1960s. Full-text retrieval usually refers to retrieval over the full text of documents, covering the storage, organization, presentation, querying and access of information. Its core is the indexing and retrieval of text information, and it is generally used by enterprises and institutions.

With the growth of information on the Internet, search engines gradually developed out of full-text retrieval technology and came into wide use, but a search engine is still different from a full-text retrieval system. The main differences between a search engine and full-text retrieval in the conventional sense are as follows:

1. Data volume. A traditional full-text retrieval system faces the enterprise's own data or data related to it. Its index library is mostly at the GB level, and even a large data set has only a few million records. Internet web search, by contrast, has to handle billions of web pages. The strategy of search engines is to use server clusters and distributed computing.

2. Content relevance. There is so much information that returning accurate results and ranking them is especially important. Google and other search engines use web link analysis, taking the number of links that point to a page on the Internet as a measure of its importance. The data sources of a full-text retrieval system are not heavily linked to each other, so links cannot serve as a basis for judging importance, and results can only be ranked by content.

3. Security. The data sources of an Internet search engine are all information published on the Internet, and apart from the text body other information matters little. The data sources of enterprise full-text retrieval are internal information subject to restrictions such as level and permission, with stricter requirements on how it may be queried, so the data is generally stored securely and centrally in a data warehouse to ensure data security and manageability.

4. Personalization and intelligence. Search engines serve all Internet visitors. Because of their data volume and number of users, computationally intensive intelligent techniques such as natural language processing, knowledge retrieval and knowledge mining are difficult to apply. This is also the direction in which search engine technology is currently working.

Beyond these differences, search engines themselves fall into three types:

Full-text search engine: The full-text search engine is the search engine in the true sense. Foreign representatives include Google (http://www.google.com), Yahoo (http://search.yahoo.com) and AllTheWeb (http://www.alltheweb.com); well-known domestic ones include Baidu (http://www.baidu.com) and Zhongsou (http://www.zhongsou.com). They retrieve the records that match the user's query conditions and return the results in a certain order. This is what is usually meant by a search engine.

Directory search engine: Although a directory index has a search function, it is not a real search engine in the strict sense. It is only a list of website links organized by category. Users can find the information they need through the category directory alone, without keyword queries. Well-known foreign directory search engines include Yahoo (http://www.yahoo.com), the Open Directory Project (http://www.dmoz.com/) and LookSmart (http://www.looksmart.com). In China, the search services of Sohu (http://www.sohu.com), Sina (http://www.sina.com) and NetEase (http://www.163.com) also offer this kind of function.

Meta search engine: On receiving a user's query, a meta search engine searches several other engines at the same time and returns the results to the user.

Well-known meta search engines include Dogpile (http://www.dogpile.com) and Vivisimo (http://www.vivisimo.com); others include http://www.soseen.com/ and Youke Search (http://www.yok.com). In presenting results, some list them by source engine, such as Dogpile, while others re-rank them according to their own rules, such as Vivisimo.

Other search engines such as Sina (http://search.sina.com), NetEase (http://search.163.com) and A9 (http://www.a9.com) call other full-text search engines, or do secondary development on top of those engines' search results.

Search engine system architecture

This section mainly describes the system architecture of the full-text search engine. Unless otherwise noted, "search engine" below means the full-text search engine. The working principle of a search engine can be seen as four steps: crawl web pages from the Internet → build the index database → search the index database → process and sort the search results.

1. Crawl web pages from the Internet. A web spider program that collects pages automatically accesses the Internet, follows the URLs in any web page to other web pages, repeats this process, and brings all the pages it has crawled back to the server.

2. Build the index database. The indexing system program analyzes the collected pages and extracts the relevant information (the URL of the page, its encoding type, the keywords contained in the page content, keyword positions, time, size, link relationships with other pages, and so on). It performs a large amount of complex computation according to a certain relevance algorithm to obtain the relevance of each page to each keyword in the page content and in hyperlinks, and then builds the web index database from this information.

3. Search the index database. After the user enters keywords, the search request is broken down and all pages that match the keywords are found in the web index database.

4. Sort the search results. The relevance information of every page for the keyword is already recorded in the index database, so it is only necessary to combine that information with the page rating into a relevance value and then sort. The higher the relevance, the higher the ranking. Finally, the page generation system organizes the link addresses and content summaries of the search results and returns them to the user.

The following figure is a typical search engine system architecture diagram. The parts of a search engine depend on one another. The processing flow is as follows:

The "Web Spider" crawls web pages from the Internet and sends them to the "Web Page Database". URLs are extracted from the pages ("Extract URL") and sent to the "URL Database". "Spider Control" takes page URLs from there and directs the "Web Spider" to crawl further pages, repeating the cycle until all pages have been crawled.

The system takes text information from the "Web Page Database" and sends it to the "Text Index" module, which builds the index and forms the "Index Database". At the same time, "Link Information Extraction" is performed, and the link information (including anchor text, the link itself, and so on) is sent to the "Link Database" to provide a basis for "Web Rating".

The "User" submits a query request to the "Query Server", which searches the "Index Database". "Web Rating" combines the query request and the link information to evaluate the relevance of the search results. The "Query Server" sorts the results by relevance, extracts content summaries around the keywords, and organizes the final page that is returned to the "User".
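The crawl cycle described above (Web Spider, Extract URL, URL Database, Spider Control, Web Page Database) can be sketched in a few lines. This is only an illustration under stated assumptions: the `FAKE_WEB` dictionary, the `example` URLs and the regular expression for links are all hypothetical stand-ins, and a real spider would fetch pages over HTTP and parse HTML properly.

```python
import re
from collections import deque

# Hypothetical in-memory "Internet" (URL -> HTML), used so the sketch runs offline.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> <a href="http://c.example/">C</a>',
    "http://c.example/": "no outgoing links",
}

# Simplified link extraction (an assumption; real pages need an HTML parser).
HREF = re.compile(r'href="([^"]+)"')


def crawl(seed, fetch=FAKE_WEB.get):
    """Crawl everything reachable from `seed` and return the page database."""
    url_db = deque([seed])  # "URL Database": URLs waiting to be crawled
    seen = {seed}           # URLs already queued, so no page is crawled twice
    page_db = {}            # "Web Page Database": URL -> page content

    while url_db:                # "Spider Control": loop until no URLs remain
        url = url_db.popleft()
        html = fetch(url)        # "Web Spider": grab the page
        if html is None:         # unreachable page, skip it
            continue
        page_db[url] = html
        for link in HREF.findall(html):  # "Extract URL"
            if link not in seen:
                seen.add(link)
                url_db.append(link)
    return page_db


pages = crawl("http://a.example/")
print(sorted(pages))
```

The `seen` set is what stops the loop on a cyclic web: pages A and B link to each other, yet each is fetched only once.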

Search engine indexing and search

For web spider technology and ranking technology, please refer to the author's other articles [1] [2]. Here, taking the Google search engine as the example, we mainly introduce the data indexing and search process of a search engine.

Data indexing is divided into three steps: extracting the content of the web page, identifying words, and building the index library.

Most information on the Internet exists in HTML format, and indexing only processes text information. The text content therefore has to be extracted from the web page, with script markup and useless advertising filtered out, and the text recorded together with its format information [1].

Word identification is a critical part of a search engine. Words in the web page are identified by means of a dictionary file. For Western-language text, different forms of a word have to be recognized, such as singular and plural, past tense, compound words and stems. Some Asian languages (Chinese, Japanese, Korean, etc.) require word segmentation [3]. Each word in the web page is identified and assigned a unique WordID for use by the data indexing module.

Building the index library is the most complex part of data indexing. Two indexes are generally needed: a document index and a keyword index. The document index assigns each web page a unique DocID. Ordered by DocID, it records which WordIDs appear in the page, how many times, at what positions, in what capitalization format, and so on, forming a list of WordID data for each DocID. The keyword index is in fact the inversion of the document index. Ordered by WordID, it records in which pages the word appears (represented by DocID), and the number of occurrences, positions, capitalization format, and so on in each page, forming a list of DocID data for each WordID. For the detailed data structures of the index, interested readers can see reference [4].
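The relationship between the document index and the keyword index can be shown with a small sketch. It is not the data structure of any real search engine: splitting on whitespace stands in for dictionary-based word identification (Chinese text would need a segmenter), only positions are recorded, and the function name `build_indexes` is an assumption made for this example.

```python
from collections import defaultdict


def build_indexes(pages):
    """Build a word dictionary, a document index and a keyword (inverted) index.

    `pages` is a list of plain-text page bodies; the list position is used as DocID.
    """
    lexicon = {}                  # word -> WordID
    forward = {}                  # document index: DocID -> {WordID: [positions]}
    inverted = defaultdict(dict)  # keyword index:  WordID -> {DocID: [positions]}

    for doc_id, text in enumerate(pages):
        hits = defaultdict(list)
        # Whitespace tokenization is a simplifying assumption.
        for pos, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))  # assign a unique WordID
            hits[word_id].append(pos)
        forward[doc_id] = dict(hits)
        # Invert: the same hits, keyed by WordID instead of DocID.
        for word_id, positions in hits.items():
            inverted[word_id][doc_id] = positions

    return lexicon, forward, dict(inverted)


lexicon, forward, inverted = build_indexes(["search engine index", "web search"])
print(inverted[lexicon["search"]])  # pages containing "search", with positions
```

The number of occurrences of a word in a page is simply the length of its position list, so it does not need to be stored separately in this sketch.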
The search process is the process of satisfying the user's search request. The user enters search keywords. The search server looks them up in the keyword dictionary and converts them to WordIDs, then obtains the DocID lists from the index library, scans the DocID lists and matches them against the WordIDs, and extracts the web pages that meet the conditions. It then calculates the relevance between each page and the keywords and returns the top K results by relevance value to the user (different search engines show a different number of results per page). If the user views the second page or a later page, the search is run again, and the results ranked K+1 to 2*K in the sorted order are organized into a page and returned to the user. Its processing flow is shown below:

Search engine refinement trend

As the search engine market grows, search engines are becoming more and more finely divided. The Internet has no national borders, and Li Yanhong, president of Baidu, has said that the search engine market is a winner's market. A search engine that wants a place of its own in the search market must have its own distinguishing features. Moreover, with hundreds of millions of netizens, search needs cannot all be the same. Different types of users need different types of search engines, and web search is only one of those needs. This means search engines will keep being refined, and search engines with their own specialties have emerged. Technically, the various search engines share a similar system architecture and differ in their data sources. Besides the web search engine described above, some typical search engines are listed below:

News search engine

Reading news is the main purpose of going online for many netizens, and news search has become an important tool for it. The technology of a news search engine is relatively simple. It generally scans well-known news websites at home and abroad, crawls the news pages, builds its own news database, and then provides search over it. The only demanding part is the scan frequency for news pages: some sites need to be scanned every few minutes.
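Whatever the data source, the search step described earlier follows the same flow: convert keywords to WordIDs, fetch the DocID lists, keep the pages that match, sort by relevance, and return one page of K results. A minimal sketch of that flow is given below. The scoring is an assumption made for illustration (the sum of occurrence counts, standing in for the relevance calculation), and the `search` function and sample data are hypothetical.

```python
def search(lexicon, inverted, query, page=1, k=10):
    """Return the DocIDs for result page `page`, with `k` results per page.

    lexicon:  word -> WordID
    inverted: WordID -> {DocID: occurrence count}
    """
    words = query.lower().split()
    # A keyword that is not in the dictionary cannot match any page.
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]

    # Keep only the pages whose DocID appears in every keyword's list.
    candidates = set(inverted[word_ids[0]])
    for word_id in word_ids[1:]:
        candidates &= set(inverted[word_id])

    # Assumed relevance: total occurrences of the keywords (ties broken by DocID).
    def relevance(doc_id):
        return sum(inverted[word_id][doc_id] for word_id in word_ids)

    ranked = sorted(candidates, key=lambda d: (-relevance(d), d))

    # Page 1 holds results 1..K, page 2 holds results K+1..2*K, and so on.
    start = (page - 1) * k
    return ranked[start:start + k]


lexicon = {"search": 0, "engine": 1, "news": 2}
inverted = {0: {0: 3, 1: 1, 2: 2}, 1: {0: 1, 2: 2}, 2: {1: 4}}
print(search(lexicon, inverted, "search", page=1, k=2))
print(search(lexicon, inverted, "search", page=2, k=2))
```

As in the text, this sketch re-runs the whole search for each page request rather than caching the ranked list; a real engine may choose differently.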

When reposting, please cite the original address: https://www.9cbs.com/read-87893.html
