(Repost) How to build a spider program in C#

xiaoxiao2021-03-06  24

Spiders are very useful programs on the Internet. Search engines use spiders to collect web pages into their databases; companies use them to monitor competitors' websites and track changes; individual users use them to download web pages for offline reading; and developers use them to scan their own sites for broken links. Different users have different uses for a spider. So how does a spider program work?

A spider is a semi-automatic program. Just as a real spider travels along the threads of its web, a spider program travels across the Web by following links from page to page. It is semi-automatic because it always needs an initial link (a starting point), but from then on it decides for itself what to do: it scans the links contained in the start page, visits the pages those links point to, and then analyzes and follows the links contained in those pages. In theory, a spider will eventually visit every page on the Internet, because almost every page is referenced by at least a few other pages.

This article describes how to build a spider program in C# that can download the content of an entire website to a specified directory. The running program is shown in Figure 1. You can easily use the core classes provided here to build your own spider.

figure 1

C# is especially well suited to building spiders because it has built-in HTTP access and multithreading support, both of which are critical for a spider. The key problems to solve when building a spider are:

(1) HTML parsing: an HTML parser is needed to analyze every page the spider encounters.

(2) Page processing: every downloaded page must be handled. The downloaded content may be saved to disk or analyzed further.

(3) Multithreading: only with multiple threads can a spider be truly efficient.

(4) Determining when the job is finished: do not underestimate this problem. Deciding whether the work is complete is not simple, especially in a multithreaded environment.

First, HTML analysis

C# itself has no built-in support for parsing HTML, although it does support parsing XML. XML has a strict syntax, however, so a parser designed for XML is useless for HTML, whose syntax is much looser. We therefore need to write our own HTML parser. The parser provided here is highly self-contained, so you can easily reuse it in other C# applications that process HTML.

The HTML parser provided here is implemented by the ParseHTML class. It is very convenient to use: first create an instance of the class, then set its Source property to the HTML document to be parsed:

ParseHTML parse = new ParseHTML();
parse.Source = "<p>Hello World</p>";

Next, use a loop to examine all the text and tags in the HTML document. Typically, the loop starts by testing the Eof method:

while (!parse.Eof())
{
    char ch = parse.Parse();

The Parse method returns the characters contained in the HTML document, but only those that are not part of an HTML tag. When it encounters a tag, Parse returns 0 to indicate that an HTML tag has been reached. We can then call the GetTag() method to process it.

    if (ch == 0)
    {
        var tag = parse.GetTag();
    }
}
In general, one of a spider's most important tasks is to find every HREF attribute, which can be done with the C# indexer. For example, the following code extracts the value of the HREF attribute (if present).

Attribute href = tag["href"];
string link = href.Value;

Once you have an Attribute object, the attribute's value is available through Attribute.Value.
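The ParseHTML class itself is not reproduced in this article, so as a rough, self-contained illustration of the link-finding step, here is a minimal sketch that swaps in a regular expression for the parser. The class name LinkExtractor and the method ExtractLinks are hypothetical and are not part of the article's source code. A regular expression is far less robust than a real HTML parser, so treat this only as a stand-in for experimenting.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not the article's ParseHTML class.
public static class LinkExtractor
{
    // Matches href="..." or href='...' inside a tag. Unquoted values are ignored.
    private static readonly Regex HrefPattern = new Regex(
        "<[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
        RegexOptions.IgnoreCase);

    // Returns every quoted href value in the page, resolved against the page's own URL.
    public static List<Uri> ExtractLinks(Uri pageUri, string html)
    {
        var links = new List<Uri>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            // Relative links such as "images/logo.gif" are resolved against pageUri.
            if (Uri.TryCreate(pageUri, m.Groups[1].Value, out absolute))
                links.Add(absolute);
        }
        return links;
    }
}
```

For a page at http://myhost.com/index.html that contains a link with href="images/logo.gif", the returned list would hold http://myhost.com/images/logo.gif. A real spider would probably also filter the results, for example to skip mailto: links or links that leave the site.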

Second, processing HTML pages

Now let's look at how to process an HTML page. The first step, of course, is to download the page, which can be done with the HttpWebRequest class provided by the .NET Framework:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(m_uri);
response = request.GetResponse();
stream = response.GetResponseStream();

Next we obtain a stream from the response. Before doing anything else, we must determine whether the file is binary or text, because the two types are handled differently. The following code checks whether the file is binary.

if (!response.ContentType.ToLower().StartsWith("text/"))
{
    SaveBinaryFile(response);
    return null;
}
string buffer = "", line;

If the file is not a text file, we read it as a binary file. If it is a text file, we first create a StreamReader from the stream and then append the file's contents to the buffer one line at a time.

reader = new StreamReader(stream);
while ((line = reader.ReadLine()) != null)
{
    buffer += line + "\r\n";
}
Once the whole file has been loaded, save it as a text file.

SaveTextFile(buffer);

Let's take a look at how these two kinds of files are stored.

The content type of a binary file does not start with "text/", and the spider saves it directly to disk without any further processing. This is because a binary file contains no HTML, and therefore no HTML links for the spider to follow. The steps for writing a binary file are as follows.

First, prepare a buffer to hold the contents of the binary file temporarily.

byte[] buffer = new byte[1024];

Next, determine the local path and name under which the file will be saved. If you are downloading the myhost.com website to the local folder C:\Test, and the online path and name of the binary file is http://myhost.com/images/logo.gif, then the local path and name should be C:\Test\images\logo.gif. We must also make sure that the images subdirectory has been created under C:\Test. This part of the job is done by the convertFileName method.

string filename = convertFileName(response.ResponseUri);
The convertFileName method splits up the HTTP address and creates the corresponding directory structure. Once the name and path of the output file are known, you can open the input stream that reads the web page and the output stream that writes the local file.
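The body of convertFileName is not shown in this article. As a rough idea of what such a method might do, here is a minimal sketch under stated assumptions: the class name FileNameMapper, the method name ToLocalPath, the explicit outputRoot parameter, and the default document name index.html are all hypothetical and are not taken from the original source.

```csharp
using System;
using System.IO;

// Hypothetical sketch of a URL-to-local-path mapping; not the article's convertFileName.
public static class FileNameMapper
{
    // Maps a URL to a file path under outputRoot and creates the directories it needs.
    public static string ToLocalPath(string outputRoot, Uri uri)
    {
        // AbsolutePath looks like "/images/logo.gif"; the query string is ignored.
        string relative = uri.AbsolutePath.TrimStart('/');

        // Assumption: a URL that names a directory is stored as index.html.
        if (relative.Length == 0 || relative.EndsWith("/"))
            relative += "index.html";

        // Rebuild the path one segment at a time using the local separator.
        string path = outputRoot;
        foreach (string segment in relative.Split('/'))
            path = Path.Combine(path, segment);

        // Make sure the target directory (for example "images") exists.
        Directory.CreateDirectory(Path.GetDirectoryName(path));
        return path;
    }
}
```

With an output root of C:\Test and the URL http://myhost.com/images/logo.gif, this sketch would return C:\Test\images\logo.gif and create the images subdirectory. It does not handle characters that are legal in a URL but illegal in a file name, which a production spider would need to deal with.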

Stream outStream = File.Create(filename);
Stream inStream = response.GetResponseStream();

Next, read the contents of the web file and write them to the local file, which is easily done with a loop.

int l;
do
{
    l = inStream.Read(buffer, 0, buffer.Length);
    if (l > 0)
        outStream.Write(buffer, 0, l);
} while (l > 0);

After the whole file has been written, close the input and output streams.

outStream.Close();
inStream.Close();
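The binary-copy steps above can be collected into a single helper. The following is a minimal sketch, not the article's SaveBinaryFile method: the class name StreamCopier and the method name CopyToFile are hypothetical. It wraps the output stream in a using block so that the file is closed even if reading fails partway through, which the explicit Close calls above do not guarantee.

```csharp
using System.IO;

// Hypothetical helper for illustration; not the article's SaveBinaryFile.
public static class StreamCopier
{
    // Copies everything from input into a new file and returns the number of bytes written.
    public static long CopyToFile(Stream input, string filename)
    {
        byte[] buffer = new byte[1024];
        long total = 0;

        // The using block closes the file even if Read or Write throws.
        using (Stream output = File.Create(filename))
        {
            int read;
            // Read returns 0 once the end of the stream has been reached.
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }
}
```

The caller remains responsible for closing the input stream, for example the stream returned by response.GetResponseStream().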

By comparison, downloading a text file is easier. The content type of a text file always starts with "text/". Assume the file has already been downloaded and stored in a string. The string can be used to analyze the links contained in the page, and it can of course also be saved to a file on disk. The following code saves the text file.

string filename = convertFileName(m_uri);
StreamWriter outStream = new StreamWriter(filename);
outStream.Write(buffer);
outStream.Close();

Here we first open a file output stream, then write the contents of the buffer to it, and finally close the file.

Third, multithreading

Multithreading makes a computer appear to perform several operations at once. Unless the computer has multiple processors, however, this simultaneity is only simulated: the computer switches rapidly between threads to give the effect of doing several things at the same time. In general, multithreading actually speeds up a program in only two cases. The first is when the computer has multiple processors; the second is when the program spends much of its time waiting for an external event.

For a spider, the second case is typical: each time it requests a URL, it has to wait for the download to finish before it can request the next one. If the spider can request several URLs at the same time, the total download time clearly drops.

To this end, we encapsulate all the work of downloading a URL in the DocumentWorker class. As soon as a DocumentWorker is created, it enters a loop and waits for the next URL to process. Here is the main loop of DocumentWorker:

while (!m_spider.Quit)
{
    m_uri = m_spider.ObtainWork();
    m_spider.SpiderDone.WorkerBegin();
    string page = GetPage();
    if (page != null)
        ProcessPage(page);
    m_spider.SpiderDone.WorkerEnd();
}

The loop runs until the Quit flag is set to true (it is set when the user clicks the "Cancel" button). Inside the loop we call ObtainWork to get a URL. ObtainWork waits until a URL is available, which happens when other threads parse documents and find links. The Done class uses the WorkerBegin and WorkerEnd methods to determine when the whole download operation has finished.
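The article does not show how ObtainWork is implemented. One plausible way to get the blocking behavior described above is a queue guarded by a monitor, as in the minimal sketch below. The class name WorkQueue, the method AddUri, and the duplicate check are assumptions made for illustration; the real spider class may be organized quite differently.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch of a blocking URL queue; not the article's spider class.
public class WorkQueue
{
    private readonly Queue<Uri> m_waiting = new Queue<Uri>();
    private readonly HashSet<string> m_seen = new HashSet<string>();
    private readonly object m_lock = new object();

    // Queues a URL unless it has been queued before. Returns true if it was added.
    public bool AddUri(Uri uri)
    {
        lock (m_lock)
        {
            if (!m_seen.Add(uri.AbsoluteUri))
                return false;          // already queued once, so skip it
            m_waiting.Enqueue(uri);
            Monitor.Pulse(m_lock);     // wake one worker blocked in ObtainWork
            return true;
        }
    }

    // Blocks until a URL is available, then removes and returns it.
    public Uri ObtainWork()
    {
        lock (m_lock)
        {
            while (m_waiting.Count == 0)
                Monitor.Wait(m_lock);  // releases the lock while waiting
            return m_waiting.Dequeue();
        }
    }
}
```

Worker threads would call ObtainWork in their main loop, and the page-processing code would call AddUri for every link it finds. As written, ObtainWork never returns if no more URLs arrive, so a real implementation would also need a way to wake waiting workers when the Quit flag is set.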

As Figure 1 shows, the spider lets the user choose how many threads to use. In practice, the optimal number of threads depends on many factors. If your machine is powerful or has two processors, you can use a larger number of threads; conversely, if network bandwidth or machine performance is limited, a larger number of threads will not necessarily improve performance.

Fourth, is the job finished?

Downloading files with multiple threads improves performance, but it also introduces thread-management problems. One of the trickiest is: when has the spider finished its work? We use a dedicated class, Done, to decide. First we need to define what "finished" means. The spider's work is finished only when no URLs are waiting to be downloaded and all worker threads have completed their processing. In other words, finished means that there are no URLs waiting to be downloaded and none currently being downloaded.

The Done class provides a WaitDone method, which waits until the Done object detects that the spider has finished its work. Here is the code of the WaitDone method.

public void WaitDone()
{
    Monitor.Enter(this);
    while (m_activeThreads > 0)
    {
        Monitor.Wait(this);
    }
    Monitor.Exit(this);
}

The WaitDone method waits until there are no more active threads. Note, however, that there are no active threads at the very start of a download either, so the spider could easily stop as soon as it begins. To solve this, we need another method, WaitBegin, which waits for the spider to enter its working phase. The usual calling sequence is to call WaitBegin first and then WaitDone, which waits for the spider to finish. Here is the code of WaitBegin:

public void WaitBegin()
{
    Monitor.Enter(this);
    while (!m_started)
    {
        Monitor.Wait(this);
    }
    Monitor.Exit(this);
}

The WaitBegin method waits until the m_started flag is set. The flag is set by the WorkerBegin method. A worker thread calls WorkerBegin when it starts processing a URL and WorkerEnd when it finishes. Together, these two methods let the Done object track the current working state. Here is the code of the WorkerBegin method:

public void WorkerBegin()
{
    Monitor.Enter(this);
    m_activeThreads++;
    m_started = true;
    Monitor.Pulse(this);
    Monitor.Exit(this);
}

The WorkerBegin method first increments the count of active threads, then sets the m_started flag, and finally calls Pulse to notify a thread that may be waiting for work to start. As mentioned above, the method that may be waiting on the Done object at this point is WaitBegin. Each time a URL has been processed, the WorkerEnd method is called:

public void WorkerEnd()
{
    Monitor.Enter(this);
    m_activeThreads--;
    Monitor.Pulse(this);
    Monitor.Exit(this);
}

The WorkerEnd method decrements the m_activeThreads counter and calls Pulse to release a thread that may be waiting on the Done object. As mentioned above, the method that may be waiting at this point is WaitDone.
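Putting the four methods together, the whole class might look like the following sketch. It is a variation on the code above, not the author's exact class: it uses the lock statement (which releases the monitor even if an exception is thrown), a private lock object instead of this, and PulseAll instead of Pulse so that every waiting thread rechecks its condition. The ActiveThreads property is an addition made here for inspection only.

```csharp
using System.Threading;

// A sketch of the Done class with lock/PulseAll; a variation, not the original source.
public class Done
{
    private readonly object m_lock = new object();
    private int m_activeThreads = 0;
    private bool m_started = false;

    // Blocks until at least one worker has started processing a URL.
    public void WaitBegin()
    {
        lock (m_lock)
        {
            while (!m_started)
                Monitor.Wait(m_lock);
        }
    }

    // Blocks until no worker is processing a URL.
    public void WaitDone()
    {
        lock (m_lock)
        {
            while (m_activeThreads > 0)
                Monitor.Wait(m_lock);
        }
    }

    // Called by a worker just before it starts processing a URL.
    public void WorkerBegin()
    {
        lock (m_lock)
        {
            m_activeThreads++;
            m_started = true;
            Monitor.PulseAll(m_lock);
        }
    }

    // Called by a worker when it has finished processing a URL.
    public void WorkerEnd()
    {
        lock (m_lock)
        {
            m_activeThreads--;
            Monitor.PulseAll(m_lock);
        }
    }

    // Hypothetical addition: the current number of active workers.
    public int ActiveThreads
    {
        get { lock (m_lock) { return m_activeThreads; } }
    }
}
```

The controlling thread would call WaitBegin and then WaitDone, as described above. Note that this counter only tracks URLs being processed; as the definition of "finished" earlier in this section says, a complete check also has to confirm that no URLs are still waiting in the queue.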

Conclusion: This article has introduced the basics of developing an Internet spider. The source code provided below will help you understand the topic further. The code is very flexible, and you can easily adapt it for your own programs.

