How to Construct a Spider Program in C#


Author: Jeff Heaton; translated by Cactus Studio

"spider

"(Spider) is a very useful program on the Internet. The search engine uses spider programs to collect web pages to the database. Enterprises use spider programs to monitor competitors' websites and track changes, personal users download the web page with spider programs for offline Using, developers use spider programs to scan their Web check-in-use ... For different users, spider programs have different purposes. So, how is the spider program work?

A spider is a semi-automatic program: just as a real spider travels across its web, a spider program travels across the web of links between pages. It is semi-automatic because it always needs an initial link (a starting point), but from there its behavior is determined by itself: it scans the links contained in the starting page, visits the pages those links point to, analyzes them, and then follows the links contained in those pages in turn. In theory, a spider will eventually visit every page on the Internet, because almost every page is referenced by at least one other page.

This article describes how to construct a spider program in C# that can download the contents of an entire website to a specified directory; the program's running interface is shown in Figure 1. You can easily use the several core classes provided here to construct a spider program of your own.

Figure 1: The spider program's running interface

C# is especially well suited to constructing spider programs because it has built-in HTTP access and multithreading capabilities, both of which are critical for a spider. The key problems to solve when constructing a spider program are:

(1) HTML analysis: an HTML parser is needed to analyze every page the spider encounters.

(2) Page processing: every downloaded page must be handled; the content may be saved to disk or analyzed further.

(3) Multithreading: only with multithreading can a spider be truly efficient.

(4) Determining when the work is complete: do not underestimate this problem; determining whether the task has finished is not easy, especially in a multithreaded environment.

1. HTML Analysis

The C# language itself does not include the ability to parse HTML, although it does support XML parsing; XML, however, has a strict syntax, and a parser designed for XML is useless for HTML, because HTML's syntax is far more lenient. For this we need to design our own HTML parser. The parser provided in this article is highly independent, and you can easily reuse it in other applications that process HTML with C#.

The HTML parser provided here is implemented by the ParseHTML class and is very convenient to use: first create an instance of the class, then set its Source property to the HTML document to be parsed:

ParseHTML parse = new ParseHTML();
parse.Source = "<p>Hello World</p>";

Next, you can use a loop to examine all the text and tags of the HTML document. Typically, the checking process starts by testing the Eof method:

while(!parse.Eof())
{
  char ch = parse.Parse();
  // ... (tag handling, shown below, goes here)
}

The Parse method returns the characters of the HTML document one at a time; it returns only non-tag characters. When an HTML tag is encountered, Parse returns 0, indicating that a tag has been reached. Having encountered a tag, we can process it with the GetTag() method:

if(ch == 0)
{
  HTMLTag tag = parse.GetTag();
}

Generally, one of a spider's most important tasks is to find every href attribute, which can be done with C#'s indexer facility. For example, the following code extracts the value of the href attribute, if one is present:

Attribute href = tag["HREF"];
string link = href.Value;

After obtaining an Attribute object, its Value property yields the attribute's value.
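Putting these pieces together, the following is a minimal sketch of a link extractor built from the ParseHTML API described above. ParseHTML, HTMLTag, and Attribute are assumed to expose exactly the members shown in this article; the null check is an assumption about how a missing attribute is reported.

// Collect every link target in an HTML document using the ParseHTML
// API described above. 'page' holds the downloaded HTML text.
// Requires System.Collections.Generic for List<string>.
List<string> links = new List<string>();
ParseHTML parse = new ParseHTML();
parse.Source = page;
while(!parse.Eof())
{
  char ch = parse.Parse();
  if(ch == 0)                          // 0 signals an HTML tag
  {
    HTMLTag tag = parse.GetTag();
    Attribute href = tag["HREF"];
    if(href != null)                   // not every tag carries an href
      links.Add(href.Value);
  }
}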

2. Handling HTML Pages

Let's look at how to handle an HTML page. The first thing to do, of course, is to download the page, which can be implemented with the HttpWebRequest class provided by the .NET framework:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(m_uri);
response = request.GetResponse();
stream = response.GetResponseStream();

Next we obtain a stream from the response. Before doing anything else, we must determine whether the file is binary or text, because the two types are handled differently. The following code determines whether the file is binary:

if(!response.ContentType.ToLower().StartsWith("text/"))
{
  SaveBinaryFile(response);
  return null;
}
string buffer = "", line;

If the file is not a text file, we read it as binary. If it is a text file, we first create a StreamReader from the stream and then append the contents of the text file to the buffer one line at a time:

reader = new StreamReader(stream);
while((line = reader.ReadLine()) != null)
{
  buffer += line + "\r\n";
}

After the entire file has been loaded, it is saved as a text file:

SaveTextFile(buffer);

Let's look at how these two kinds of files are stored.

The content type of a binary file does not start with "text/". The spider stores such a file directly to disk without further processing, because a binary file contains no HTML and therefore holds no links for the spider to follow. Here are the steps for writing a binary file.

First, prepare a buffer to temporarily hold the contents of the binary file:

byte[] buffer = new byte[1024];

Next, we must determine the local path and name under which to save the file. If we are downloading the myhost.com website into the local folder c:\test, and the binary file's web path and name are http://myhost.com/images/logo.gif, then the local path and name should be c:\test\images\logo.gif. At the same time, we must make sure that the images subdirectory has been created under the c:\test directory. This part of the task is performed by the convertFilename method:

string filename = convertFilename(response.ResponseUri);

The convertFilename method splits the HTTP address apart to create the corresponding directory structure.
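The article does not list convertFilename itself; under the assumptions above (files are saved beneath a local output root such as c:\test, held here in a hypothetical m_outputPath field), a minimal sketch might look like this:

// Hypothetical sketch of convertFilename: maps a web URI onto a local
// path under the output root and creates any missing subdirectories.
// Requires System.IO; m_outputPath is an assumed field, e.g. @"c:\test".
private string convertFilename(Uri uri)
{
  string relative = uri.AbsolutePath.TrimStart('/')
                       .Replace('/', Path.DirectorySeparatorChar);
  if(relative.Length == 0)
    relative = "index.html";           // default name for the site root
  string filename = Path.Combine(m_outputPath, relative);
  Directory.CreateDirectory(Path.GetDirectoryName(filename));
  return filename;
}

After determining the name and path of the output file, we can open an input stream for the web page and an output stream for the local file: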

Stream outStream = File.Create(filename);
Stream inStream = response.GetResponseStream();

Next, we read the contents of the web file and write them to the local file, which is easily done with a loop:

int l;
do
{
  l = inStream.Read(buffer, 0, buffer.Length);
  if(l > 0)
    outStream.Write(buffer, 0, l);
} while(l > 0);

After the entire file has been written, we close the input and output streams:

outStream.Close();
inStream.Close();

By comparison, downloading a text file is easier. The content type of a text file always starts with "text/". Suppose the file has already been downloaded into a string; that string can then be used to analyze the links contained in the page, and of course the file can also be saved to disk. The task of the following code is to save the text file:

string filename = convertFilename(m_uri);
StreamWriter outStream = new StreamWriter(filename);
outStream.Write(buffer);
outStream.Close();

Here we first open a file output stream, then write the contents of the buffer to it, and finally close the file.
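One design note on the line-by-line buffering shown earlier: concatenating strings with += copies the entire buffer on every line, which is quadratic in the page size. For large pages, a System.Text.StringBuilder is the usual alternative; a sketch (not the article's code) follows:

// More efficient line accumulation with StringBuilder.
// Requires System.Text.
StringBuilder sb = new StringBuilder();
string line;
while((line = reader.ReadLine()) != null)
{
  sb.Append(line).Append("\r\n");
}
string buffer = sb.ToString();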

3. Multithreading

Multithreading makes a computer appear to perform several operations simultaneously. Unless the computer contains multiple processors, however, this simultaneity is only a simulated effect, achieved by switching rapidly among multiple threads so that they appear to execute "at the same time". Generally speaking, multithreading improves a program's speed in only two situations. The first is when the computer has multiple processors; the second is when the program spends much of its time waiting for external events.

For a spider, the second situation is typical: every time it requests a URL, it must wait for the file to download before requesting the next URL. If the spider can request multiple URLs at once, the total download time is clearly reduced.

To this end, we encapsulate the operation of downloading a single URL in the DocumentWorker class. Whenever an instance of DocumentWorker is created, it enters a loop and waits for the next URL to process. Here is DocumentWorker's main loop:

while(!m_spider.Quit)
{
  m_uri = m_spider.ObtainWork();
  m_spider.SpiderDone.WorkerBegin();
  string page = GetPage();
  if(page != null)
    ProcessPage(page);
  m_spider.SpiderDone.WorkerEnd();
}

This loop runs until the Quit flag is set to true (the Quit flag is set to true when the user clicks the "Cancel" button). Inside the loop, we call ObtainWork to get a URL. ObtainWork waits until a URL becomes available, that is, until another thread has parsed a document and found a link. The Done class uses the WorkerBegin and WorkerEnd methods to determine when the entire download operation has finished.

As can be seen from Figure 1, the spider program lets the user decide how many threads to use. In practice, the optimal number of threads is affected by many factors. If your machine is powerful, or has two processors, a larger number of threads works well; conversely, if network bandwidth or machine performance is limited, increasing the number of threads will not necessarily improve performance.
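The article does not show how the worker threads are created. A hypothetical launch sequence, assuming DocumentWorker exposes a method (here called Process) that runs the main loop above, might look like this:

// Launch the user-selected number of worker threads; each one runs the
// DocumentWorker loop shown above. Requires System.Threading; the
// constructor signature and the Process method name are assumptions.
for(int i = 0; i < threadCount; i++)
{
  DocumentWorker worker = new DocumentWorker(spider);
  Thread thread = new Thread(new ThreadStart(worker.Process));
  thread.Start();
}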

4. Is the Task Complete?

Using multiple threads to download files effectively improves performance, but it also brings thread-management problems. One of the most complicated is: when is the spider finished? Here we make that judgment with a dedicated class, Done.

First, it is necessary to explain the specific meaning of "finished". The spider's work is complete only when there is no URL waiting to be downloaded in the system and all worker threads have ended their processing. In other words, completion means there are no URLs waiting for download and none being downloaded.

The Done class provides a WaitDone method whose function is to wait until the Done object detects that the spider has finished its work. Here is the code of the WaitDone method:

public void WaitDone()
{
  Monitor.Enter(this);
  while(m_activeThreads > 0)
  {
    Monitor.Wait(this);
  }
  Monitor.Exit(this);
}

The WaitDone method waits until there are no active threads left. Note, however, that there are no active threads at the very start of the download, so it would be easy to conclude, wrongly, that the spider has finished the moment it starts. To solve this problem we need another method, WaitBegin, which waits for the spider to enter its formal working phase. The usual calling order is: first call WaitBegin, then call WaitDone; WaitDone then waits for the spider to complete its work. Here is the code of WaitBegin:

public void WaitBegin()
{
  Monitor.Enter(this);
  while(!m_started)
  {
    Monitor.Wait(this);
  }
  Monitor.Exit(this);
}
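In the controlling thread, the calling order just described might be sketched as follows (spiderDone stands for the spider's shared Done instance; the name is an assumption):

// Main-thread completion test: wait until work has actually begun,
// then wait until no worker remains active.
spiderDone.WaitBegin();
spiderDone.WaitDone();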

The WaitBegin method waits until the m_started flag is set. The m_started flag is set by the WorkerBegin method, which a worker thread calls when it starts processing a URL; WorkerEnd is called when it finishes. WorkerBegin and WorkerEnd together help the Done object determine the current working state. Here is the code of the WorkerBegin method:

public void WorkerBegin()
{
  Monitor.Enter(this);
  m_activeThreads++;
  m_started = true;
  Monitor.Pulse(this);
  Monitor.Exit(this);
}

The WorkerBegin method first increments the count of active threads, then sets the m_started flag, and finally calls Pulse to notify any thread that may be waiting for the working phase to begin; as mentioned earlier, that is the WaitBegin method. WorkerEnd is called after each URL has been processed:

public void WorkerEnd()
{
  Monitor.Enter(this);
  m_activeThreads--;
  Monitor.Pulse(this);
  Monitor.Exit(this);
}

The WorkerEnd method decrements the m_activeThreads counter and calls Pulse to release any thread waiting on the Done object; as mentioned earlier, the method that may be waiting on the Done object is WaitDone.
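For reference, the whole Done class assembles from the four methods above into the following sketch; the two field declarations are implied by the article's code rather than shown in it.

// The Done class gathered from the methods shown in this article.
// Requires System.Threading for Monitor.
class Done
{
  private int m_activeThreads = 0;   // number of busy worker threads
  private bool m_started = false;    // set once any worker has begun

  public void WaitBegin()
  {
    Monitor.Enter(this);
    while(!m_started)
      Monitor.Wait(this);
    Monitor.Exit(this);
  }

  public void WaitDone()
  {
    Monitor.Enter(this);
    while(m_activeThreads > 0)
      Monitor.Wait(this);
    Monitor.Exit(this);
  }

  public void WorkerBegin()
  {
    Monitor.Enter(this);
    m_activeThreads++;
    m_started = true;
    Monitor.Pulse(this);
    Monitor.Exit(this);
  }

  public void WorkerEnd()
  {
    Monitor.Enter(this);
    m_activeThreads--;
    Monitor.Pulse(this);
    Monitor.Exit(this);
  }
}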

Conclusion: this article has introduced the basics of developing an Internet spider program; the accompanying source code will help you understand the topic further. The code provided here is quite flexible, and you can easily reuse it in your own programs.

Download the source code for this article: WebFinderCode.zip
