Source: e800.com.cn
What a search engine indexes is text: the object it ultimately handles is a text file. The pages a web spider fetches, however, come in many formats, including HTML, pictures, DOC, PDF, multimedia, and dynamic pages. Once these files have been fetched, the text inside them has to be extracted. Extracting it accurately matters in two ways: it bears directly on the accuracy of the search engine's results, and it affects how correctly the spider can follow the links it finds.

Documents such as DOC and PDF are generated by software from professional vendors, and those vendors supply corresponding text-extraction interfaces. The spider only needs to call these plug-in interfaces to pull out the document's text and related information with little effort (a sketch of this step appears below).

Documents such as HTML are different. HTML has a syntax of its own, using markup tags such as <font>, <b>, <h1>, and <table> to express fonts, colors, position, and other layout. Filtering out these tags is not difficult, because the tags follow fixed rules; the spider simply extracts the relevant information according to each tag's kind. While doing so, though, it needs to record a good deal of layout information, such as the font size of a run of text, whether it sits in a heading, and whether it is among the page's keywords, because these cues help in calculating how important each word is within the page (see the layout-aware sketch below).

Besides the title and body, an HTML page also carries many advertising links and shared channel links that have no relation to the body text. When the page content is extracted, these useless links have to be filtered out as well. For example, suppose a site has a "Product Introduction" channel whose link appears in the navigation bar of every page. If that link is not filtered, a search for "Product Introduction" would match every page on the site and return a pile of spam. Filtering such invalid links calls for statistics over a large number of page-structure patterns, from which common, uniform filtering rules can be extracted; some important or unusual sites still need individual handling. This in turn requires a certain extensibility in the spider (one common heuristic is sketched below).

For multimedia files and pictures, their content is generally judged from the linking anchor text (that is, the link text) and any comments attached to the file. For example, if a link whose text reads "Zhang Manyu photo" points to a BMP-format picture, the spider knows that the picture's content is a photo of Zhang Manyu, so a search for "Zhang Manyu" or "photo" can find it (see the anchor-text sketch below). Many multimedia files also carry file properties, and taking these into account gives a better understanding of the file's contents.

Dynamic web pages have always been a problem for web spiders. A dynamic page, in contrast to a static one, is generated by a program. Its benefits are that the page style can be changed quickly and uniformly and that the space occupied on the server can be reduced, but it also brings the spider some trouble. As development techniques keep advancing, there are more and more kinds of dynamic pages, such as ASP, JSP, and PHP. Pages of these kinds are comparatively easy for a spider to handle. What spiders find much harder are pages generated by scripting languages such as VBScript and JavaScript.
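To make the vendor-interface step concrete, here is a minimal Python sketch. It assumes the third-party pypdf package (any vendor-supplied extraction library would fill the same role for the spider), and "manual.pdf" is a hypothetical file name.

```python
# Minimal sketch: extracting text from a PDF through a parsing library.
# Assumes the third-party pypdf package (pip install pypdf); a vendor's
# own extraction interface would be called the same way by the spider.
from pypdf import PdfReader

def extract_pdf_text(path: str) -> str:
    """Return the concatenated text of every page in the PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Hypothetical usage:
# print(extract_pdf_text("manual.pdf")[:200])
```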
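The layout-aware extraction of HTML can be sketched with only the standard library. The tags tracked and the weights below are illustrative assumptions, not values from the article:

```python
# Sketch: strip HTML tags while recording layout cues (title, headings,
# bold) so each word can later be weighted by where it appeared.
# The WEIGHTS table is an assumed, illustrative scoring.
from html.parser import HTMLParser

WEIGHTS = {"title": 10, "h1": 5, "h2": 4, "b": 2}

class LayoutAwareExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # tags currently open
        self.terms = []   # (word, weight) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            self.stack.remove(tag)

    def handle_data(self, data):
        # A word inherits the strongest weight among its enclosing tags.
        weight = max((WEIGHTS.get(t, 1) for t in self.stack), default=1)
        for word in data.split():
            self.terms.append((word, weight))

parser = LayoutAwareExtractor()
parser.feed("<title>Product Introduction</title><body><h1>Spider</h1>plain text</body>")
print(parser.terms)
# [('Product', 10), ('Introduction', 10), ('Spider', 5), ('plain', 1), ('text', 1)]
```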
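For the navigation-link problem, one common heuristic (an assumption here; the article does not spell out its rules) is to treat a link that recurs on nearly every page of a site as boilerplate rather than content:

```python
# Sketch: links that occur on almost every page of a site (like a
# "Product Introduction" navigation entry) are treated as boilerplate
# and filtered out of extracted content. The threshold is an assumption.
from collections import Counter

def boilerplate_links(pages: list[set[str]], threshold: float = 0.8) -> set[str]:
    """Return links present on at least `threshold` of the pages."""
    counts = Counter(link for links in pages for link in links)
    cutoff = threshold * len(pages)
    return {link for link, n in counts.items() if n >= cutoff}

site = [
    {"/products", "/about", "/news/1"},
    {"/products", "/about", "/news/2"},
    {"/products", "/about", "/contact"},
]
print(boilerplate_links(site))  # {'/products', '/about'}
```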
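The anchor-text idea for multimedia can be sketched the same way; the extension check below is a simplification assumed for illustration:

```python
# Sketch: describe a linked image by the text of the link that points
# to it, as in the "Zhang Manyu photo" example above.
from html.parser import HTMLParser

IMAGE_EXTS = (".bmp", ".jpg", ".png", ".gif")  # assumed set of types

class AnchorTextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_href = None
        self.descriptions = {}   # image URL -> anchor text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(IMAGE_EXTS):
                self.current_href = href

    def handle_data(self, data):
        if self.current_href and data.strip():
            self.descriptions[self.current_href] = data.strip()

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None

collector = AnchorTextCollector()
collector.feed('<a href="zmy.bmp">Zhang Manyu photo</a>')
print(collector.descriptions)  # {'zmy.bmp': 'Zhang Manyu photo'}
```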
To handle such pages properly, the spider needs a script interpreter of its own. And when the data sits in a database and can only be obtained through the site's own database queries, the spider's job becomes very difficult. For a site like this, if the designer wants the data to be searchable by search engines, the site has to provide a way to traverse the entire database content (one possible form is sketched below).

Extracting web content has always been an important technology in web spiders. The whole system is generally built in plug-in form: a plug-in management service program dispatches pages of different formats to different plug-ins for processing (see the final sketch below).
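What such a traversal method might look like: the site enumerates every database record and emits a flat list of URLs for the spider to fetch. The table name, column, and URL pattern below are all assumptions for illustration.

```python
# Sketch: a site-side helper that walks the whole database and emits
# one crawlable URL per record, so database-backed pages become
# reachable. Table/column names and the URL pattern are hypothetical.
import sqlite3

def dump_sitemap(db_path: str, base_url: str) -> list[str]:
    """List one URL for every article stored in the database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT id FROM articles").fetchall()
    finally:
        conn.close()
    return [f"{base_url}/article?id={row[0]}" for row in rows]

# Hypothetical usage:
# for url in dump_sitemap("site.db", "http://example.com"):
#     print(url)
```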
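Finally, a minimal sketch of that plug-in arrangement, with all names invented for illustration: a small manager maps each document format to an extraction handler, so a newly encountered format only requires registering one more plug-in.

```python
# Sketch: a plug-in manager that dispatches each fetched document to
# the extractor registered for its format. All names are illustrative.
from typing import Callable

PLUGINS: dict[str, Callable[[bytes], str]] = {}

def register(fmt: str):
    """Decorator registering an extraction plug-in for one format."""
    def wrap(func: Callable[[bytes], str]):
        PLUGINS[fmt] = func
        return func
    return wrap

@register("html")
def extract_html(raw: bytes) -> str:
    # A real plug-in would strip tags and record layout, as above.
    return raw.decode("utf-8", errors="replace")

@register("pdf")
def extract_pdf(raw: bytes) -> str:
    return "(text via the vendor's PDF interface)"  # placeholder

def extract(fmt: str, raw: bytes) -> str:
    if fmt not in PLUGINS:
        raise ValueError(f"no plug-in registered for format: {fmt}")
    return PLUGINS[fmt](raw)

print(extract("html", b"<p>hello</p>"))
```

Keeping the dispatch table separate from the spider core is one way to get the extensibility the article calls for: the crawler never needs to know format details itself.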