Secrets of the Web Spider
Published: 2004-07-16 10:32:53 Source: Yourblog.com (Yourblog.org)
Readers who frequently visit Yahoo, Sohu, and other portal sites must have wondered: how is such a huge amount of web information collected? Is it registered and organized by hand? Of course not. The reason these search engines can find so much information so quickly is inseparable from one tool: the web spider.
The web spider counts among the most useful tools ever developed for the Internet. Today, if you want to gather information from ten million different sites, there is simply no alternative to a web spider.
A typical web spider (such as Yahoo's) works by examining one page for relevant information, then following every link on that page and repeating the process, pushing outward until the links are exhausted. Before long, thousands of pages and their information have landed in your database. The way the search spreads outward resembles a spider's web, which is the origin of the name "web spider".
Let's take a look at how to build a web spider. Before that, let's first go over a few concepts.
I. Basic Principles
A web spider lets us search a great many things. In fact, some commercial web spiders already earn a lot of money for their developers; a license for AltaVista's technology, for example, is worth $300,000. The following are the basic principles a web spider should satisfy:
* Collect information from every source
From a technical standpoint, a web spider should be able to obtain information from any source, without restriction, and there are a great many sources.
* Accuracy
Everyone has had this maddening experience: the search engine returns a million results, and only the last two are what you need (and that is the lucky case; what if they are buried somewhere in the middle?). A good web spider's results should therefore be sufficiently accurate, and in some cases it should offer a dedicated function that returns only a specific type of information. For example, www.enfused.com is designed for searching game-design material and returns only game-related results.
* Reasonably fresh results
This depends on the technique you use (more on that below). The web spider should retrieve updated information, or at least relatively new information. If it keeps returning stale results from years ago, you will crash before the system does.
* Reasonable speed
This needs no elaboration. Without enough speed, no matter how accurate your web spider is, it is useless.
II. Basic Techniques
There are several ways to build a web spider. The first kind, which we will call the conventional web spider, simply scans pages, searching for whatever you want, for example using a phrase as the search keyword. The second kind, the specific web spider, looks only at a particular part of each page. This kind is useful in particular situations, for example when you only want the news headlines from a certain site.
The conventional web spider is the simpler of the two. First, you do not need to know anything about the target pages in advance: you just search each page, and the pages it links to, for the keywords you want. You can also make the function ignore links within the same site, ensuring that each result comes from a different site.
Correspondingly, a specific web spider usually requires you to know something about the target pages in advance, such as their table layout. For example, to extract a news headline from a page, you must first know the HTML tag that surrounds the headline, so that you can search the correct part of the page directly. In this case a function that follows every link on the page is much less useful, because on other pages your spider will probably not find the same tags, and so cannot do its job.

When the spider runs is another choice: you can run it in advance or in real time. Running in advance means that when your spider runs, everything it collects is stored in a database for later use. Obviously you will not get the very latest data this way, but if you run the spider often, this is not a problem.
Running in real time means nothing is saved from run to run; the spider finds results on demand. For example, if you back your site's search box with a real-time spider, then every time a user enters a keyword and clicks the "Submit" button, the spider actually runs rather than merely querying a database. Although this guarantees that your data is always current, it is not the first choice for most sites, because the spider itself takes time to fetch and return data, and time is money. It is only worthwhile when the information sought is highly time-sensitive.
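To make the real-time pattern concrete, here is a minimal ASP sketch. SearchSites is a hypothetical stand-in for the spider logic developed in section III:

<%
' Real-time spider: runs on every request instead of reading stored data.
strKeyword = Request.Form("keyword")
If Len(strKeyword) > 0 Then
    ' SearchSites is assumed to crawl and return an array of matching URLs.
    arrResults = SearchSites(strKeyword)
    For Each strHit In arrResults
        Response.Write strHit & "<br>"
    Next
End If
%>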
III. Building a Web Spider
So how do you build a web spider with ASP? The answer is the Internet Transfer Control (ITC). This control, provided by Microsoft, lets your ASP program reach out to Internet resources: you can fetch web pages, talk to FTP servers, and more. In this article we will focus on fetching web pages.
A few caveats must be explained first. First, ASP has no right to access the Windows registry, which makes some of the constants and values the ITC relies on unavailable. You can usually get around this by configuring the ITC not to use its defaults, which means you must supply each value explicitly at run time.
Another, more serious problem concerns licensing. ASP cannot invoke the License Manager (the Windows facility that verifies the legal use of components and controls). The License Manager checks the current component's license key and compares it against the Windows registry; if they do not match, the component will not work. So deploying the ITC to another machine that lacks the required key will make it crash. One solution is to wrap the ITC inside a VB component of your own, and deploy the ITC's path and licensing along with that VB component. This is tedious work, but unfortunately essential.
Here is an example. You can create the ITC with the following code:
Set Inet1 = CreateObject("InetCtls.Inet")
Inet1.Protocol = 4              ' HTTP
Inet1.AccessType = 1            ' direct connection to the Internet
Inet1.RequestTimeout = 60       ' in seconds
Inet1.URL = strURL
strHTML = Inet1.OpenURL         ' grab the HTML page
Now strHTML holds the HTML of the entire page at strURL. To create a conventional web spider, you only need to call the InStr() function to see whether the string you are looking for appears in the page. You can also locate each HREF tag, parse out its URL, assign that to the control's URL property, and open the next page. The best way to visit every link is recursion. Note that although this method is easy to implement, it is neither very accurate nor very powerful. Many of today's search engines perform additional logical checks, such as counting how many times a phrase repeats in a page or how near related words appear to one another; some can even determine the passage and context in which the search terms occur. We leave those features for readers to explore on their own.
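Here is a minimal sketch of such a conventional spider, assuming the Inet1 control created above. The name CrawlPage, the depth limit, and the crude link parsing are our own choices, not part of the ITC:

' Recursively fetch a page, test it for a keyword, then follow its links.
' intDepth limits the recursion so the spider cannot run forever; a real
' spider would also remember visited URLs and could skip same-site links.
Sub CrawlPage(strURL, strKeyword, intDepth)
    Dim strHTML, intPos, intEnd, strLink
    If intDepth <= 0 Then Exit Sub

    Inet1.URL = strURL
    strHTML = Inet1.OpenURL                 ' grab the HTML page

    ' The simple InStr() test: does the keyword appear anywhere?
    If InStr(1, strHTML, strKeyword, vbTextCompare) Then
        Response.Write strURL & "<br>"
    End If

    ' Crude link parsing: find each href="..." and take the quoted URL.
    intPos = InStr(1, strHTML, "href=""", vbTextCompare)
    Do While intPos > 0
        intPos = intPos + Len("href=""")
        intEnd = InStr(intPos, strHTML, """")
        If intEnd = 0 Then Exit Do
        strLink = Mid(strHTML, intPos, intEnd - intPos)
        If LCase(Left(strLink, 7)) = "http://" Then
            CrawlPage strLink, strKeyword, intDepth - 1
        End If
        intPos = InStr(intEnd, strHTML, "href=""", vbTextCompare)
    Loop
End Sub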
IV. Specific Web Spiders
By contrast, a specific web spider is more complex. As we mentioned earlier, a specific web spider searches a particular part of a page, so it must know the page layout in advance. Let's take a look at the HTML below:
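A minimal page of the kind being described might look like this; the two comment markers are the ones referenced below, and the rest is placeholder content:

<html>
<body>
<!-- Put Headlines Here -->
<table>
<tr><td><a href="story1.html">First headline</a></td></tr>
<tr><td><a href="story2.html">Second headline</a></td></tr>
</table>
<!-- End Headlines -->
</body>
</html>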
In this page, we only care about what lies between "Put Headlines Here" and "End Headlines". You can build a function that returns only the results found in that region:
Function GetText(strText, strStartTag, strEndTag)
    ' Return the text between the first strStartTag and the following strEndTag.
    Dim intStart, intEnd
    GetText = ""
    intStart = InStr(1, strText, strStartTag, vbTextCompare)
    If intStart Then
        intStart = intStart + Len(strStartTag)  ' first character after the start tag
        intEnd = InStr(intStart, strText, strEndTag, vbTextCompare)
        If intEnd >= intStart Then
            GetText = Mid(strText, intStart, intEnd - intStart)
        End If
    End If
End Function
Following the ITC example above, you can simply pass strHTML to this function, along with "Put Headlines Here" and "End Headlines" as the start and end tags.
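For instance, assuming the markers appear in the page as HTML comments:

strHeadlines = GetText(strHTML, "<!-- Put Headlines Here -->", "<!-- End Headlines -->")
Response.Write strHeadlines    ' just the headline block, nothing else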
Note that the start and end tags need not be real HTML tags: they can be any text markers you like. In practice you will rarely find one clean HTML tag that delimits the search area, and you can only use whatever markers are convenient. For example, your tags might end up looking like this:

strStartTag = "</td>"
strEndTag = "</td></tr>"
Be sure that the marker you search for is unique within the HTML page, so that you extract exactly what you need. You can also search for links inside the text section you get back, but if you do not know the format of the pages they lead to, your web spider will be helpless there.
V. Saving Information
In most cases you will want to save the collected information in a database for later use. Your needs may cover a wide range, but before anything else, keep the following in mind:
* Check against the latest information in your database
If you regularly use a web spider to fetch news headlines from a site, first determine which titles already exist in your database, compare them with the results the spider returns, and add only the new ones (see the sketch after this list). This keeps you from saving a lot of duplicate data.
* Update information in place
Sometimes you do not want to add new records at all. For example, if you maintain an online index of the US population, you only need to update the figures already in the database; there is no need to insert new rows into the table.
* Save everything you need
If you are collecting headlines, be sure to also find the link each headline points to and save it; if there is no link, build one. For example, if I take headlines from www.yoursite.com and present them on www.mysite.com, and each headline links to an article on the external site, then I must save the http://www.yoursite.com links into the database so that they can still be reached.
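As a sketch of the first two points, here is one way to do the duplicate check with ADO. The Headlines table, its columns, and the DSN are hypothetical; adapt them to your own database:

' Insert a headline and its link only if the title is not already stored.
Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "DSN=MyNewsDB"                       ' hypothetical DSN

strSQL = "SELECT COUNT(*) AS cnt FROM Headlines WHERE Title = '" & _
         Replace(strTitle, "'", "''") & "'"
Set rs = conn.Execute(strSQL)

If rs("cnt") = 0 Then
    conn.Execute "INSERT INTO Headlines (Title, Link) VALUES ('" & _
        Replace(strTitle, "'", "''") & "', '" & _
        Replace(strLink, "'", "''") & "')"
Else
    ' To update in place instead, issue an UPDATE against the existing row.
End If

rs.Close
conn.Close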
VI. Conclusion
We have briefly introduced how to build a complete web spider, and all of the basic functions have been covered. What remains is for you to take it from here and add your own touches.