In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and to produce much better search results than existing systems. The prototype, with a full-text and hyperlink database of at least 24,000,000 web pages, is available at http://google.stanford.edu/. Engineering a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms, and they answer tens of millions of queries every day. Despite the importance of large-scale search engines on the Web, very little academic research has been done on them. Furthermore, due to the rapid pace of technology and the growth of the Web, creating a search engine today is very different from creating one three years ago. This paper provides an in-depth description of our large-scale search engine; to our knowledge, it is the first such detailed public description. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved in using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. We also look at the problem of how to deal effectively with uncontrolled hypertext collections, where anyone can publish anything they want.
Keywords: World Wide Web, search engine, information retrieval, PageRank, Google

1 Introduction
The Web creates new challenges for information retrieval. The amount of information on the Web is growing rapidly, as is the number of new users inexperienced in the art of Web research. People are likely to surf the Web using its link graph, often starting with high-quality human-maintained indices such as Yahoo! or with search engines. Human-maintained lists cover popular topics effectively, but they are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low-quality matches. To make matters worse, some advertisers attempt to mislead automated search engines in order to gain people's attention. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the structure present in hypertext to provide much higher quality search results. We chose the name Google because it is a common spelling of googol, or 10^100, which fits well with our goal of building a very large-scale search engine.

1.1 Web Search Engines - Scaling Up: 1994-2000
Search engine technology has had to scale dramatically to keep up with the growth of the Web. In 1994, one of the first Web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web-accessible documents. As of November 1997, the top search engines claimed to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine Watch). It is foreseeable that by the year 2000 a comprehensive index of the Web will contain well over 1,000,000,000 documents. At the same time, the number of queries search engines handle has grown at an astonishing rate. In March 1997, the World Wide Web Worm received about 1,500 queries per day. In November 1997, AltaVista claimed it handled roughly 20 million queries per day.
With the increasing number of users on the Web, by the year 2000 automated search engines will be handling hundreds of millions of queries per day.
The goal of our system is to address many of these problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.

1.2 Google: Scaling with the Web
Creating a search engine which scales even to today's Web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second. These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress, such as disk seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access (see Section 4.2). Further, we expect that the cost to index and store text or HTML will remain small relative to the amount that will be available (see Appendix B). This results in favorable scaling properties for centralized systems like Google.

1.3 Design Goals
1.3.1 Improved Search Quality
Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. According to Best of the Web 1994 - Navigators, "The best navigation service should make it easy to find almost anything on the Web (once all the data is entered)." However, the Web of 1997 is quite different. Anyone who has used a search engine recently can readily attest that the completeness of the index is not the only factor in the quality of search results. "Junk results" often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines could find itself (that is, return its own search page in the top ten results in response to a query for its own name). One of the main causes of this problem is that the number of documents in the indices has increased by many orders of magnitude, but the user's ability to look at documents has not. People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision (the number of relevant documents returned, say, in the top tens of results). Indeed, we want our notion of "relevant" to include only the very best documents, since there may be tens of thousands of slightly relevant documents. This very high precision is important even at the expense of recall (the total number of relevant documents the system is able to return). There is considerable optimism that the use of more hypertextual information can help improve search and other applications. In particular, link structure and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2).

1.3.2 Academic Search Engine Research
Aside from tremendous growth, the Web has also become increasingly commercial over time.
In 1993, only 1.5% of web servers were on .com domains; by 1997 this had grown to over 60%. At the same time, search engines have migrated from the academic domain to the commercial one. Up until now, most search engine development has gone on at companies, with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented (see Appendix A). With Google, we have a strong goal to push more development and understanding into the academic realm. Another important design goal was to build a system that reasonable numbers of people can actually use. Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems. For example, there are many tens of millions of searches performed every day. However, it is very difficult to get this data, mainly because it is considered commercially valuable. Our final design goal was to build an architecture that can support novel research activities on large-scale web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form. One of our main goals in designing Google was to set up an environment where other researchers can come in quickly, process large chunks of the Web, and produce interesting results that would have been very difficult to produce otherwise. In the short time the system has been up, several papers have already used databases generated by Google, and many others are underway. Another goal we have is to set up a Spacelab-like environment where researchers or even students can propose and do interesting experiments on our large-scale web data.

2 System Features
The Google search engine has two important features that help it produce high-precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page, called PageRank, which is described in detail in [Page 98]. Second, Google utilizes link (anchor) text to improve search results.

2.1 PageRank: Bringing Order to the Web
The citation (link) graph of the Web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's PageRank, an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text-matching search restricted to web page titles performs admirably when PageRank prioritizes the results (a demo is available at google.stanford.edu). For the type of full-text searches in the main Google system, PageRank also helps a great deal.

2.1.1 Description of PageRank Calculation
Academic citation literature has been applied to the Web, largely by counting citations or backlinks to a given page; this gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally. PageRank is defined as follows. We assume pages T1 ... Tn point to (i.e., cite) page A. The parameter d is a damping factor which can be set between 0 and 1; we usually set d to 0.85. There are more details about d in the next section. C(A) is defined as the number of links going out of page A. The PageRank of page A is then given as follows:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one. PR(A) can be computed using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the Web.
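As a concrete illustration, the following is a minimal sketch of computing PR values by simple iteration over a toy link graph. The graph, the convergence tolerance, and the handling of dangling pages are illustrative assumptions; the production system operates on a link database of hundreds of millions of edges rather than an in-memory dictionary.

```python
# Minimal iterative PageRank sketch using the formula above.
# The toy link graph, tolerance, and max_iter are illustrative assumptions.

def pagerank(links, d=0.85, tol=1e-6, max_iter=100):
    """links: dict mapping page -> list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}                      # initial guess
    out_degree = {p: len(links.get(p, [])) for p in pages}

    for _ in range(max_iter):
        new_pr = {}
        for a in pages:
            # Sum PR(T)/C(T) over all pages T that link to A.
            incoming = sum(pr[t] / out_degree[t]
                           for t, targets in links.items()
                           if a in targets and out_degree[t] > 0)
            new_pr[a] = (1 - d) + d * incoming
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
            return new_pr
        pr = new_pr
    return pr

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(toy_graph))
```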
A PageRank for 26 million web pages can be computed in a few hours on a medium-sized workstation. There are many other details which are beyond the scope of this paper.

2.1.2 Intuitive Justification
PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but who eventually gets bored and starts again on another random page. The probability that the random surfer visits a page is its PageRank. The damping factor d is the probability, at each page, that the random surfer will get bored and request another random page. One important variation is to add the damping factor d only to a single page or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank; again, see [Page 98]. Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages with high PageRank values that point to it. Intuitively, pages that are well cited from many places around the Web are worth looking at. Also, a page that has perhaps only one citation from something like the Yahoo! homepage is also generally worth looking at. If a page is not high quality, or is a broken link, it is quite likely that Yahoo's homepage would not link to it. PageRank handles both of these cases, and everything in between, by recursively propagating weights through the link structure of the Web.

2.2 Anchor Text
The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases; this makes it possible to return web pages which have not actually been crawled. Note that pages which have never been crawled can cause some problems, since they are never checked for validity before being returned to the user; in this case, the search engine can even return a page that never actually existed but had hyperlinks pointing to it. However, it is possible to sort the results so that this particular problem rarely happens. The idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm, especially because it helps search non-text information and expands search coverage with fewer downloaded documents. We use anchor propagation mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed: in our current crawl of 24 million pages, we had over 259 million anchors which we indexed.

2.3 Other Features
Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits, and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size; words in a larger or bolder font are weighted higher than other words. Third, the full raw HTML of pages is available in a repository.

3 Related Work
Research on web search has a short history.
The World Wide Web Worm (WWWW) was one of the first web search engines.
It was later followed by several other academic search engines, many of which are now public companies. Compared to the growth of the Web and the importance of search engines, there are precious few documents about recent search engine technology. According to Michael Mauldin (chief scientist, Lycos Inc.), "the various services (including Lycos) closely guard the details of these databases." However, there has been a fair amount of work on specific features of search engines, especially work whose results are produced by post-processing the results of existing commercial search engines or by building small-scale "individualized" search engines. Finally, there has been a lot of research on information retrieval systems, especially on well-controlled collections. In the next two sections, we discuss some areas where this research needs to be extended to work better on the Web.

3.1 Information Retrieval
Work in information retrieval systems goes back many years and is well developed. However, most research on information retrieval systems is on small, well-controlled, homogeneous collections, such as collections of scientific papers or news stories on a related topic. Indeed, the primary benchmark for information retrieval, the Text Retrieval Conference (TREC), uses a fairly small, well-controlled collection for its benchmarks. Its "very large corpus" benchmark is only 20 GB, compared to the 147 GB from our crawl of 24 million web pages. Things that work well on TREC often do not produce good results on the Web. For example, the standard vector space model tries to return the document that most closely approximates the query, treating both the query and the document as vectors defined by the words that occur in them. On the Web, this strategy often returns very short documents that are the query plus a few words. For example, we have seen a major search engine return a page containing only "Bill Clinton Sucks" in response to the query "Bill Clinton". Some argue that on the Web users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position. If a user issues a query like "Bill Clinton", they should get reasonable results, since there is an enormous amount of high-quality information available on this topic. Given examples like these, we believe that the standard information retrieval work needs to be extended to deal effectively with the Web.

3.2 Differences Between the Web and Well-Controlled Collections
The Web is a vast collection of completely uncontrolled, heterogeneous documents. Documents on the Web show extreme variation, both internal to the documents and in the external meta information that might be available about them. For example, documents differ internally in their language (both human and programming), vocabulary (email addresses, links, zip codes, phone numbers, product numbers), type or format (text, HTML, PDF, images, sounds), and some are even machine generated (log files or output from a database). By external meta information we mean information that can be inferred about a document but is not contained within it. Examples include the reputation of the source, update frequency, quality, popularity or usage, and citations. Not only are the possible sources of external meta information varied, but the things being measured vary by many orders of magnitude as well. For example, compare the usage of a major homepage like Yahoo's, which receives millions of page views every day, with an obscure historical article which might receive one view every ten years. Clearly, these two items must be treated very differently by a search engine.
Another big difference between the Web and traditional well-controlled collections is that there is virtually no control over what people can put on the Web. Coupling this flexibility to publish anything with the enormous influence of search engines to route traffic, companies which deliberately manipulate search engines for profit have become a serious problem. This problem has not been addressed in traditional closed information retrieval systems. It is also interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly presented to the user is abused to manipulate search engines. There are even numerous companies which specialize in manipulating search engines for profit.

4 System Anatomy
First, we provide a high-level discussion of the architecture. Then, the important data structures are described in detail. Finally, the major applications (crawling, indexing, and searching) are examined in depth.

Figure 1. High Level Google Architecture

4.1 Google Architecture Overview
In this section, we give a high-level overview of how the whole system works, as pictured in Figure 1. Further sections discuss the applications and data structures not mentioned here. Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux. In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. A URL server sends lists of URLs to be fetched to the crawlers. The fetched web pages are then sent to the store server, which compresses and stores them in a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. A hit records the word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer also performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. The URL resolver reads the anchors file, converts relative URLs into absolute URLs and in turn into docIDs, puts the anchor text into the forward index associated with the docID that the anchor points to, and generates a database of links (pairs of docIDs), which is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.

4.2 Major Data Structures
Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched at little cost. Although CPUs and bulk input/output rates have improved dramatically over the years,
a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.
4.2.1 BigFiles
BigFiles are virtual files spanning multiple file systems and are addressable by 64-bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles the allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.

4.2.2 Repository
Figure 2. Repository Data Structure
The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC 1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over the significantly better compression offered by bzip: the compression rate of bzip was approximately 4 to 1 on the repository, compared to zlib's 3 to 1. In the repository, the documents are stored one after the other, each prefixed by its docID, length, and URL, as can be seen in Figure 2. The repository requires no other data structures to be accessed. This helps with data consistency and makes upgrades much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.

4.2.3 Document Index
The document index keeps information about each document. It is a fixed-width ISAM (Index Sequential Access Mode) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, the entry also contains a pointer into a variable-width file called docinfo, which contains its URL and title; otherwise the pointer points into the URL list, which contains just the URL. This design was driven by the desire for a reasonably compact data structure and the ability to fetch a record in one disk seek during a search. Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs, sorted by checksum. To find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file. URLs may be converted into docIDs in batch by doing a merge with this file; this is the technique the URL resolver uses. This batch mode of update is crucial, because otherwise we would have to perform one seek for every link, which, assuming one disk, would take more than a month for our 322 million link dataset.
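The checksum-based URL lookup described above can be sketched as follows. The checksum function and the in-memory representation are illustrative assumptions; the real system stores fixed-width records on disk sorted by checksum and resolves whole batches of URLs by merging.

```python
# Illustrative sketch of URL -> docID resolution via a sorted checksum list.
# crc32 and the in-memory lists stand in for the on-disk checksum file;
# checksum collisions are ignored for brevity.
import bisect
import zlib

def url_checksum(url: str) -> int:
    return zlib.crc32(url.encode("utf-8"))

class UrlToDocId:
    def __init__(self, entries):
        # entries: iterable of (url, docid), stored sorted by checksum
        # to mirror the checksum file used for binary search.
        pairs = sorted((url_checksum(u), d) for u, d in entries)
        self._checksums = [c for c, _ in pairs]
        self._docids = [d for _, d in pairs]

    def lookup(self, url: str):
        c = url_checksum(url)
        i = bisect.bisect_left(self._checksums, c)
        if i < len(self._checksums) and self._checksums[i] == c:
            return self._docids[i]
        return None  # URL has not been assigned a docID yet

resolver = UrlToDocId([("http://example.com/", 1), ("http://example.org/a", 2)])
print(resolver.lookup("http://example.org/a"))  # -> 2
```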
4.2.4 Lexicon
The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory at a reasonable price; in the current implementation it fits in the memory of a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added). It is implemented in two parts: a list of the words (concatenated together but separated by nulls) and a hash table of pointers. The words carry some auxiliary information for various functions, which is beyond the scope of this paper.

4.2.5 Hit Lists
A hit list is a list of occurrences of a particular word in a particular document, including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices, so it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization: simple encoding (a triple of integers), a compact encoding (a hand-optimized allocation of bits), and Huffman coding. In the end we chose the hand-optimized compact encoding. The details of the hits are shown in Figure 3. Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag; plain hits include everything else. A plain hit consists of a capitalization bit, the font size, and 12 bits of word position in the document (all positions higher than 4095 are labeled 4096). The font size is represented relative to the rest of the document using three bits (only seven values are actually used, because the value 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for the position within the anchor and 4 bits for a hash of the docID of the page the anchor occurs in. This gives us some limited phrase searching, as long as there are not too many anchors for a particular word. We expect to update the way anchor hits are stored to allow greater resolution in the position and docID hash fields. We use font size relative to the rest of the document because, when searching, we do not want to rank otherwise identical documents differently just because one of them uses a larger font. The length of a hit list is stored before the hits themselves. To save space, the length of the hit list is combined with the wordID in the forward index and with the docID in the inverted index, which limits it to 8 and 5 bits respectively (there are some tricks which allow 8 bits to be borrowed from the wordID). If the length does not fit in that many bits, an escape code is used in those bits and the next two bytes contain the actual length.
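To make the two-byte layout concrete, here is a small sketch. The exact ordering of the bit fields within the 16 bits is not specified in the text, so the packing below is an illustrative choice rather than the actual on-disk format.

```python
# Illustrative packing of hits into two bytes.
# Plain hit:  1 bit capitalization | 3 bits font size | 12 bits position.
# Fancy hit:  1 bit capitalization | font = 111       | 4 bits type | 8 bits position.
# The specific bit order is an assumption made for this sketch.

FANCY_FONT = 0b111  # font value 7 flags a fancy hit

def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    assert 0 <= font_size < FANCY_FONT, "font 7 is reserved for fancy hits"
    position = min(position, 0xFFF)  # positions beyond the 12-bit range are clamped
    return (int(capitalized) << 15) | (font_size << 12) | position

def pack_fancy_hit(capitalized: bool, hit_type: int, position: int) -> int:
    return ((int(capitalized) << 15) | (FANCY_FONT << 12)
            | ((hit_type & 0xF) << 8) | (position & 0xFF))

def unpack_hit(hit: int) -> dict:
    cap = bool(hit >> 15)
    font = (hit >> 12) & 0b111
    if font == FANCY_FONT:
        return {"cap": cap, "fancy": True,
                "type": (hit >> 8) & 0xF, "position": hit & 0xFF}
    return {"cap": cap, "fancy": False, "font": font, "position": hit & 0xFFF}

h = pack_plain_hit(capitalized=True, font_size=2, position=37)
assert h.bit_length() <= 16 and unpack_hit(h)["position"] == 37
```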
Figure 3. Forward and Reverse Indexes and the Lexicon

4.2.6 Forward Index
The forward index is actually already partially sorted. It is stored in a number of barrels (we use 64). Each barrel holds a range of wordIDs. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordIDs with their corresponding hit lists. This scheme requires slightly more storage because of duplicated docIDs, but the difference is very small for a reasonable number of buckets, and it saves considerable time and coding complexity in the final indexing phase done by the sorter. Furthermore, instead of storing actual wordIDs, we store each wordID as a relative difference from the minimum wordID of the barrel it falls into. This way, we can use just 24 bits for the wordIDs in the unsorted barrels, leaving 8 bits for the hit list length.

4.2.7 Inverted Index
The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into. It points to a doclist of docIDs together with their corresponding hit lists; this doclist represents all occurrences of that word in all documents. An important issue is the order in which docIDs should appear in the doclist. One simple solution is to store them sorted by docID, which allows quick merging of doclists for multiple-word queries. Another option is to store them sorted by a ranking of the word's occurrence in each document. This makes answering one-word queries trivial, but it makes merging for multi-word queries much harder, and it makes it very difficult to develop the index further with other ranking algorithms, since any change to the ranking function requires a rebuild. We chose a compromise: we keep two sets of inverted barrels, one set for hit lists which include title or anchor hits, and another set for all hit lists. We check the first set of barrels first and, if there are not enough matches within those barrels, we check the larger ones.
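The two-tier lookup can be sketched as follows. The in-memory dictionaries standing in for the short and full barrels, and the threshold of matches that triggers falling back to the full barrels, are illustrative assumptions.

```python
# Sketch of the two-tier inverted index lookup: consult the "short" barrels
# (title and anchor hits only) first, and fall back to the full barrels if
# too few documents match. Dictionaries stand in for the on-disk barrels.

def lookup(word_id, short_barrels, full_barrels, min_matches=40):
    """Return a doclist: a list of (docID, hit_list) pairs for word_id."""
    doclist = short_barrels.get(word_id, [])
    if len(doclist) < min_matches:
        # Not enough high-signal matches; scan the larger full-text barrels.
        doclist = full_barrels.get(word_id, [])
    return doclist

short = {7: [(12, [0x9025])]}                          # title/anchor hits only
full = {7: [(12, [0x9025]), (31, [0x2003, 0x2051])]}   # all hits
print(lookup(7, short, full, min_matches=2))
```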
4.3 Crawling the Web
Running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system. In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URL server serves lists of URLs to a number of crawlers (we typically run about 3). Both the URL server and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once; this is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers, which amounts to roughly 600 KB of data per second. A major performance stress is DNS lookup, so each crawler maintains its own DNS cache and does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to the host, sending the request, and receiving the response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
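A minimal modern sketch of this kind of crawler is shown below. It uses Python's asyncio together with the third-party aiohttp library; the original 1998 crawler obviously predates both, so the libraries, the connection limit, and the per-crawler DNS cache shown here are illustrative stand-ins rather than the original implementation.

```python
# Minimal asynchronous crawler sketch: a bounded number of concurrent fetches
# and a simple per-crawler DNS cache, loosely mirroring the design above.
import asyncio
import socket
from urllib.parse import urlsplit

import aiohttp

DNS_CACHE = {}  # hostname -> resolved IP (per-crawler cache)

async def resolve(host):
    if host not in DNS_CACHE:
        loop = asyncio.get_running_loop()
        infos = await loop.getaddrinfo(host, 80, type=socket.SOCK_STREAM)
        DNS_CACHE[host] = infos[0][4][0]
    return DNS_CACHE[host]

async def fetch(session, sem, url):
    async with sem:                         # keep roughly N connections open at once
        await resolve(urlsplit(url).hostname or "localhost")
        async with session.get(url) as resp:
            body = await resp.read()
            return url, resp.status, len(body)

async def crawl(urls, max_connections=300):
    sem = asyncio.Semaphore(max_connections)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, u) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

if __name__ == "__main__":
    print(asyncio.run(crawl(["http://example.com/"])))
```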
It turns out that running a crawler which connects to more than half a million servers and generates tens of millions of log entries produces a fair amount of email and phone calls. Because of the vast number of people coming online, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive email along the lines of, "Wow, you looked at a lot of pages from my web site. What did you want?" There are also people who do not know about the robots exclusion protocol and think that writing "this page is copyrighted and should not be indexed" on their page will protect it, which is, needless to say, difficult for a web crawler to understand. Also, because of the huge amount of data involved, unexpected things happen. For example, our system tried to crawl an online game, which resulted in lots of garbage messages in the middle of the game. This turned out to be an easy problem to fix, but it only came up after we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may occur on only one page out of the whole Web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers invariably cause such problems, significant resources must be devoted to reading the email and solving the problems as they come up.

4.4 Indexing the Web
Parsing: any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination. For maximum speed, instead of using YACC to generate a context-free grammar (CFG) parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser so that it runs at a reasonable speed and is very robust involved a fair amount of work.
Indexing documents into barrels: after each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table, the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordIDs, their occurrences in the current document are translated into hit lists and written into the forward barrels. The main difficulty with parallelizing the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we write a log of all the extra words that are not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel, and the small log file of extra words can be processed by one final indexer.
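The word-to-wordID step can be sketched as follows. The class, the log file format, and the way provisional IDs are assigned are assumptions made for illustration; in practice the extra-word logs from parallel indexers are merged in a final pass.

```python
# Sketch of converting words to wordIDs with an in-memory lexicon and a log
# of new words, so multiple indexers can run against a fixed base lexicon.
# The class and file formats here are illustrative assumptions.

class Lexicon:
    def __init__(self, base_words, extra_log_path="extra_words.log"):
        # The base lexicon is fixed; every indexer starts from the same mapping.
        self.word_to_id = {w: i for i, w in enumerate(base_words)}
        self.next_id = len(self.word_to_id)
        self.extra_log_path = extra_log_path

    def word_id(self, word):
        wid = self.word_to_id.get(word)
        if wid is None:
            # Word not in the base lexicon: assign a provisional ID and log it
            # so a final pass can merge the extra words from all indexers.
            wid = self.next_id
            self.next_id += 1
            self.word_to_id[word] = wid
            with open(self.extra_log_path, "a", encoding="utf-8") as log:
                log.write(f"{wid}\t{word}\n")
        return wid

lex = Lexicon(["the", "web", "search"])
print([lex.word_id(w) for w in ["web", "crawler", "search"]])  # -> [1, 3, 2]
```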
Sorting: in order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel. This process happens one barrel at a time, so it requires little temporary storage. The sorting phase is also parallelized: we simply run as many sorters as possible on different machines, with different sorters working on different buckets.
Since the barrels do not fit into main memory, the sorter further subdivides them into baskets which do fit into memory, based on wordID and docID. The sorter then loads each basket into memory, sorts it, and writes its contents into the short inverted barrel and the full inverted barrel.

4.5 Searching
The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seem to have made great progress in terms of efficiency, so we have focused more on quality of search in our research, although we believe our solutions can also scale to commercial volumes. The Google query evaluation process is shown in Figure 4:
1. Parse the query.
2. Convert the words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
Figure 4. Google Query Evaluation
To put a limit on response time, once a certain number of matching documents have been found, the searcher automatically goes to step 8. This means that it is possible for sub-optimal results to be returned. We are currently investigating other ways to solve this problem; in the past, we sorted the hits according to PageRank, which seemed to improve the situation.
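The query evaluation loop above can be sketched roughly as follows. The in-memory doclists, the rank function passed in as a parameter, and the match cutoff are simplifications and assumptions, not the production searcher.

```python
# Rough sketch of the query evaluation steps above, operating on in-memory
# doclists (wordID -> list of (docID, hit_list)). The rank() callable and
# the max_matches cutoff are placeholders for the real ranking machinery.

def evaluate(query_words, lexicon, short_barrels, full_barrels,
             rank, k=10, max_matches=1000):
    word_ids = [lexicon[w] for w in query_words if w in lexicon]   # steps 1-2
    if not word_ids:
        return []

    def scan(barrels):
        # Step 4: intersect the doclists to find docs matching all terms.
        doclists = [dict(barrels.get(wid, [])) for wid in word_ids]
        common = set(doclists[0])
        for dl in doclists[1:]:
            common &= set(dl)
        matches = []
        for doc_id in common:
            hits = [dl[doc_id] for dl in doclists]
            matches.append((rank(doc_id, word_ids, hits), doc_id))  # step 5
            if len(matches) >= max_matches:                         # response limit
                break
        return matches

    matches = scan(short_barrels)          # step 3: short barrels first
    if not matches:
        matches = scan(full_barrels)       # step 6: fall back to full barrels
    matches.sort(reverse=True)             # step 8: sort by rank
    return [doc_id for _, doc_id in matches[:k]]
```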
4.5.1 The Ranking System
Google maintains much more information about web documents than typical search engines. Every hit list includes position, font, and capitalization information; additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult. We designed our ranking function so that no single factor can have too much influence. First, consider the simplest case: a single-word query. In order to rank a document with a single-word query, Google looks at that document's hit list for the word. Google considers each hit to be one of several different types (title, anchor text, URL, plain text in a large font, plain text in a small font, ...), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list, and each count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off, so that beyond a certain point additional hits of a given type no longer affect the rank. We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document. For a multi-word query, the situation is more complicated. Multiple hit lists must now be scanned at once, so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple lists are matched up so that nearby hits are matched together, and a proximity is computed for every matched set of hits. The proximity is based on how far apart the hits are in the document (or anchor) and is classified into 10 different value "bins", ranging from a phrase match to "not even close". Counts are computed not only for every type of hit but for every type and proximity pair, and every type and proximity pair has a type-prox-weight. The counts are converted into count-weights, and we take the dot product of the count-weights and the type-prox-weights to compute the IR score. All of these numbers and matrices can be displayed along with the search results in a special debug mode; these displays have been very helpful in developing the ranking system.
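A stripped-down sketch of the single-word case follows. The particular type-weights, the tapering count-weight function, and the way PageRank is folded into the final score are illustrative assumptions; the paper does not publish these values.

```python
# Sketch of the single-word IR score: a dot product of tapered count-weights
# with per-type weights, combined with PageRank. The weights, the tapering
# function, and the PageRank combination are all illustrative assumptions.
import math

TYPE_WEIGHTS = {"title": 12.0, "anchor": 10.0, "url": 8.0,
                "plain_large": 4.0, "plain_small": 1.0}

def count_weight(count, cap=8.0):
    # Grows with the count at first, then tapers off so that a huge number
    # of hits of one type cannot dominate the rank.
    return cap * (1.0 - math.exp(-count / cap))

def ir_score(hits_by_type):
    return sum(count_weight(hits_by_type.get(t, 0)) * w
               for t, w in TYPE_WEIGHTS.items())

def final_rank(hits_by_type, pagerank, ir_weight=1.0, pr_weight=1.0):
    # One simple (assumed) way to combine the IR score with PageRank.
    return ir_weight * ir_score(hits_by_type) + pr_weight * math.log(pagerank + 1e-9)

print(final_rank({"title": 1, "plain_small": 3}, pagerank=0.85))
```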
4.5.2 Feedback
The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. To address this, our search engine has a user feedback mechanism: a trusted user may evaluate the results that are returned, and this feedback is saved. Then, when we modify the ranking function, we can compare against the earlier searches and see the impact of the change. Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results.

5 Results and Performance
The most important measure of a search engine is the quality of its search results. A complete user evaluation is beyond the scope of this paper, but our own experience with Google has shown that it produces better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 4 shows Google's results for a search on "Bill Clinton". These results demonstrate some of Google's features. The results are clustered by server, which helps considerably when sifting through result sets. A number of results come from the whitehouse.gov domain, which is what one may reasonably expect from such a search; currently most major commercial search engines do not return any results from whitehouse.gov, which is quite a failing. Notice that the first result has no title: it was not crawled, and Google relied on anchor text to determine that it was a good answer to the query. Similarly, the fifth result is an email address which, of course, cannot be crawled; it too is returned on the strength of anchor text. All of the results are reasonably high-quality pages and, at last check, none were broken links, largely because they all have high PageRank. The PageRanks are displayed as percentages along with red bar graphs. Finally, there are no results about a Bill other than Clinton or a Clinton other than Bill, because we place great importance on the proximity of word occurrences. Of course, a true test of a search engine's quality would involve an extensive user study or results analysis, for which there is no room here; instead, we invite the reader to try Google at http://google.stanford.edu/.

5.1 Storage Requirements
Aside from search quality, Google is designed to scale cost-effectively as the Web grows. One aspect of this is using storage efficiently. Table 1 breaks down some statistics and the storage requirements of Google. Due to compression, the repository requires only about 53 GB, just over one third of the total data it stores. At current disk prices, this makes the repository a relatively cheap source of useful data. More importantly, the total of all the data used by the search engine requires a comparable amount of storage, about 55 GB, and most queries can be answered using just the short inverted index. With better encoding and compression of the document index, a high-quality web search engine may fit onto a 7 GB drive of a new PC.

5.2 System Performance
It is important for a search engine to crawl and index efficiently. For Google, the major operations are crawling, indexing, and sorting. It is difficult to measure how long crawling took overall, because disks filled up, name servers crashed, and any number of other problems stopped the system. In total, it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day, or 48.5 pages per second. The indexer ran in parallel with the crawlers. The indexer ran just faster than the crawlers; we spent a considerable amount of time optimizing it so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.

5.3 Search Performance
Improving the performance of search was not the major focus of our research up to this point.
The current version of Google answers most queries in between 1 and 10 seconds. This time is mostly dominated by disk IO over NFS (since disks are spread over a number of machines). Furthermore, Google does not yet have any optimizations such as query caching, subindices on common terms, and other common techniques. We intend to speed up Google considerably through distribution and through hardware, software, and algorithmic improvements; our target is to handle several hundred queries per second. Table 2 has some sample query times from the current version of Google; they are repeated to show the speedups resulting from cached IO.

6 Conclusions
Google is designed to be a scalable search engine. The primary goal is to provide high-quality search results over a rapidly growing World Wide Web.
Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

6.1 Future Work
A large-scale web search engine is a complex system, and much remains to be done. Our immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. Some simple improvements to efficiency include query caching, smart disk allocation, and subindices. Another area which requires much research is updates: we need smart algorithms to decide which old pages should be recrawled and which new ones should be crawled. Some work toward this goal has already been done. One promising area of research is using proxy caches to build search databases, since they are driven by demand. We plan to add simple features already supported by commercial search engines, such as boolean operators, negation, and stemming. However, other features are only just starting to be explored, such as relevance feedback and clustering (Google currently supports simple hostname-based clustering). We also plan to support user context (such as the user's location) and result summarization. We are also working to extend the use of link structure and link text. Simple experiments indicate that PageRank can be personalized by increasing the weight of a user's home page or bookmarks. As for link text, we are experimenting with using the text surrounding links in addition to the link text itself. A web search engine is a very rich environment for research ideas; we have far too many to list here, so we expect this Future Work section to grow rather than shrink in the near future.

6.2 High Quality Search
The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. For example, the top result for a search for "Bill Clinton" on one of the most popular commercial search engines was the Bill Clinton Joke of the Day: April 14, 1997. Google is designed to provide high-quality search so that, as the Web continues to grow rapidly, information remains easy to find. To accomplish this, Google makes heavy use of hypertextual information, consisting of link structure and link (anchor) text; Google also uses proximity and font information. While evaluating a search engine is difficult, we have subjectively found that Google returns higher quality search results than current commercial search engines. The analysis of link structure via PageRank allows Google to evaluate the quality of web pages. The use of link text as a description of the page the link points to helps the search engine return relevant (and to some degree high-quality) results. Finally, the use of proximity information greatly increases relevance for many queries.

6.3 Scalable Architecture
Aside from the quality of search, Google is designed to scale. It must be efficient in both space and time, and constant factors are very important when dealing with the entire Web. In implementing Google, we have seen bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network IO. Google has evolved to overcome a number of these bottlenecks during various operations. Google's major data structures make efficient use of available storage space. Furthermore, the crawling, indexing, and sorting operations are efficient enough to build an index of a substantial portion of the Web, 24 million pages, in less than one week. We expect to be able to build an index of 100 million pages in less than a month.
6.4 A Research Tool
In addition to being a high-quality search engine, Google is a research tool. The data Google has collected has already resulted in many papers submitted to academic conferences, with many more on the way. Recent research has, for example, examined the limitations of queries about the Web that can be answered without having the Web available locally. This suggests that Google is not only an important research tool but a necessary one with a wide range of applications. We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology.
7 Acknowledgments
Scott Hassan and Alan Steremberg have been critical to the development of Google. Their talented contributions are irreplaceable, and the authors owe them much gratitude. We would also like to thank Hector Garcia-Molina, Rajeev Motwani, Jeff Ullman, and Terry Winograd and the whole WebBase group for their support and insightful discussions. Finally, we would like to recognize the generous support of our equipment donors IBM, Intel, and Sun, and of our funders. The research described here was conducted as part of the Stanford Integrated Digital Library Project, supported by the National Science Foundation under Cooperative Agreement IRI-9411306. Funding for this cooperative agreement is also provided by DARPA, NASA, Interval Research, and the industrial partners of the Stanford Digital Libraries Project.

9 Appendix B: Scalability
9.1 Scalability of Google
Google is designed to be scalable in the near term to a goal of about 100 million web pages; our disks and machines can handle roughly that amount. All of the time-consuming parts of the system are parallelized and roughly linear in time, including the crawlers, indexers, and sorters. We believe that most of the data structures will deal gracefully with the expansion. However, at around 100 million web pages we will be very close to all sorts of limits in common operating systems (we currently run on both Solaris and Linux), including addressable memory, the number of open file descriptors, network sockets and bandwidth, and other factors. We believe that expanding to a lot more than 100 million pages would greatly increase the complexity of our system.

9.2 Scalability of Centralized Indexing Architectures
As the capabilities of computers increase, it becomes possible to index a very large amount of text at a reasonable cost. Other, more bandwidth-intensive media such as video are, of course, likely to become more pervasive. But because the cost of producing text is low compared to media like video, text is likely to remain very common.

Figure 2. Workflow of the Google system (note: original figure from Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, 1998, http://www-db.stanford.edu/~backrub/Google.html)
1. Google uses a fast distributed crawler (Googlebot) to traverse the Web and hand the fetched pages to the store server.
2. The store server compresses the returned pages with zlib and stores them in the repository. For each page, the repository records the assigned document number (docID) together with the page length, the URL, the URL length, and the page content, so that page data can be recovered promptly if the system fails.
3. The indexer reads data from the repository and then performs the following four steps.
4. (a) After decompressing the data, the indexer parses every meaningful word in each page and converts it into a list of index items (hits) keyed by wordID; each hit records the keyword, its position, an approximation of its font size, and its capitalization. The hit lists are stored in the barrels, producing a forward index sorted by docID. Hits are divided into two kinds according to their importance: hits occurring in the URL, title, anchor text, or meta tags are considered more important and are called fancy hits; the rest are called plain hits.
In the system each hit occupies two bytes. A fancy hit uses one bit for capitalization, the binary code 111 (3 bits) to mark it as a fancy hit, 4 of the remaining 12 bits to encode the type of fancy hit (i.e., whether it occurs in the URL, title, anchor text, or a meta tag), and the remaining 8 bits for the position of the hit within the page. A plain hit uses one bit for capitalization, 3 bits for the font size, and the remaining 12 bits for the position within the page. The storage structure of the forward index and the hits is shown in Figure 3.
Figure 3. Storage structure of the forward index and hits
It is worth noting that when a fancy hit comes from anchor text, its 8 bits of position information are split into two parts: 4 bits for the position within the anchor text, and 4 bits for a hash of the docID of the page containing the anchor, which the URL resolver fills in when the anchor hit is deposited into the forward index.
(b) Besides parsing the meaningful words in each page, the indexer also parses all hypertext links and stores the key information about them (where each link points from and to, and its text) in the anchors file.
(c) The indexer generates the lexicon, which consists of two parts: the list of keywords and a table of pointers into the inverted file (as shown in Figure 3).
(d) The indexer also enters each parsed page into the document index, which is linked to the repository and records the URL and title of the page so that the original content stored in the repository can be located precisely. Newly discovered URLs are also passed to the URL server so that they can be crawled and indexed in the next cycle of the workflow.
5. The URL resolver reads the information in the anchors file and then performs the work described in step 6.
6. (a) It converts the URLs found in anchor text into the docIDs of the pages they point to; (b) it forms "link pairs" with the docIDs of the originating pages and stores them in the links database; (c) it attaches the docID of the page each anchor points to to the corresponding fancy (anchor) hits in the forward index.
7. The links database records the link relationships between pages, which are used to compute each page's PageRank value.
8. The document index passes pages that have not yet been crawled and indexed to the URL server, which in turn supplies the URLs to be traversed to the crawler, so that these pages will be indexed in the next cycle of the workflow.
9. The sorter re-sorts the forward index in the barrels to generate an inverted index ordered by wordID. The structure of the inverted index is shown in Figure 4.
Figure 4. Structure of the inverted index
10. DumpLexicon combines the lexicon generated by the indexer with the sorter's output to produce a new lexicon for the searcher to use. The searcher is run by a web server and answers queries using the newly generated lexicon together with the document index described above, the inverted index, and the PageRank values computed from the links database.