C# - Automated WebSpider / WebRobot


Introduction

What is a webspider

A WebSpider, or crawler, is an automated program that follows links on websites and calls a WebRobot to handle the contents of each link.

What is a webrobot

A WebRobot is a program that processes the content found through a link. A WebRobot can be used for indexing a page or extracting useful information based on a predefined query; common examples are link checkers, e-mail address extractors, multimedia extractors and update watchers.

Background

I had a recent contract to build a web page link checker. This component had to be able to check links that were stored in a database as well as links on a website, both through the local file system and over the internet.

This article explains the WebRobot, the WebSpider and how to enhance the WebRobot through specialized content handlers. The code shown has some superfluous code, such as try blocks, variable initialization and minor methods, removed.

Class overview

The classes that make up the WebRobot are WebPageState, which represents a URI and its current state in the process chain, and an implementation of IWebPageProcessor, which performs the actual reading of the URI, calling content handlers and dealing with page errors.

The WebSpider has only one class, WebSpider. It maintains a list of pending / processed URIs as WebPageState objects and runs the WebPageProcessor against each WebPageState to extract links to other pages and to test whether the URIs are valid.

Using the code - WebRobot

Web page processing is handled by an object that implements IWebPageProcessor. The Process method expects to receive a WebPageState; this will be updated during page processing, and if all is successful the method will return true. Any number of content handlers can also be called after the page has been read, by assigning WebPageContentDelegate delegates to the processor.

public delegate void WebPageContentDelegate( WebPageState state );

public interface IWebPageProcessor
{
    bool Process( WebPageState state );

    WebPageContentDelegate ContentHandler { get; set; }
}


public class WebPageState
{
    private WebPageState( ) { }

    public WebPageState( Uri uri )
    {
        m_uri = uri;
    }

    public WebPageState( string uri )
        : this( new Uri( uri ) ) { }

    Uri    m_uri;                           // URI to be processed
    string m_content;                       // Content of web page
    string m_processInstructions = "";      // User defined instructions
                                            // for content handlers
    bool   m_processStarted      = false;   // Becomes true when processing starts
    bool   m_processSuccessfull  = false;   // Becomes true if process was successful
    string m_statusCode;                    // HTTP status code
    string m_statusDescription;             // HTTP status description, or exception message

    // Standard getters / setters....
}
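As a quick illustration of how WebPageState and the processor fit together, here is a minimal usage sketch (not from the original article; it assumes the standard getters / setters mentioned above expose Uri, StatusCode and StatusDescription as properties):

// Hypothetical usage sketch: check a single URI with the WebRobot classes above.
WebPageState state = new WebPageState( "http://www.bondibeer.com.au/" );
IWebPageProcessor processor = new WebPageProcessor( );

if ( processor.Process( state ) )
{
    Console.WriteLine( "OK: " + state.Uri );
}
else
{
    Console.WriteLine( "Failed: " + state.StatusCode + " - " + state.StatusDescription );
}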

The WebPageProcessor is an implementation of IWebPageProcessor that does the actual work of reading in the content, handling error codes / exceptions and calling the content handlers. The WebPageProcessor may be replaced or extended to provide additional functionality, though adding a content handler is generally a better option.

public class WebPageProcessor : IWebPageProcessor

{
    public bool Process( WebPageState state )
    {
        state.ProcessStarted     = true;
        state.ProcessSuccessfull = false;

        // Use WebRequest.Create to handle URI's for
        // the following schemes: file, http & https
        WebRequest  req = WebRequest.Create( state.Uri );
        WebResponse res = null;

        try
        {
            // Issue a response against the request.
            // If any problems are going to happen they
            // are likely to happen here in the form of an exception.
            res = req.GetResponse( );

            // If we reach here then everything is likely to be OK.
            if ( res is HttpWebResponse )
            {
                state.StatusCode        = ((HttpWebResponse)res).StatusCode.ToString( );
                state.StatusDescription = ((HttpWebResponse)res).StatusDescription;
            }
            if ( res is FileWebResponse )
            {
                state.StatusCode        = "OK";
                state.StatusDescription = "OK";
            }

            if ( state.StatusCode.Equals( "OK" ) )
            {
                // Read the contents into our state
                // object and fire the content handlers
                StreamReader sr = new StreamReader( res.GetResponseStream( ) );

                state.Content = sr.ReadToEnd( );

                if ( ContentHandler != null )
                {
                    ContentHandler( state );
                }
            }

            state.ProcessSuccessfull = true;
        }
        catch ( Exception ex )
        {
            HandleException( ex, state );
        }
        finally
        {
            if ( res != null )
            {
                res.Close( );
            }
        }

        return state.ProcessSuccessfull;
    }

    // Store any number of content handlers
    private WebPageContentDelegate m_contentHandler = null;

    public WebPageContentDelegate ContentHandler
    {
        get { return m_contentHandler; }
        set { m_contentHandler = value; }
    }
}

There are additional private methods in the WebPageProcessor to handle HTTP error codes and file not found errors when dealing with the "file://" scheme, as well as more severe exceptions.
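Those private methods are not reproduced in the article. As a hedged sketch only (the body below is an assumption; only the HandleException name comes from the Process method above), such a handler might look like this:

// Hypothetical sketch of the exception handler referenced in Process( ).
// The real implementation is not shown in the article.
private void HandleException( Exception ex, WebPageState state )
{
    WebException webEx = ex as WebException;

    if ( webEx != null && webEx.Response is HttpWebResponse )
    {
        // HTTP errors (404, 500, ...) still carry a response that can be reported.
        HttpWebResponse errRes = (HttpWebResponse)webEx.Response;
        state.StatusCode        = errRes.StatusCode.ToString( );
        state.StatusDescription = errRes.StatusDescription;
    }
    else if ( ex is FileNotFoundException )
    {
        // Missing files when using the "file://" scheme.
        state.StatusCode        = "NotFound";
        state.StatusDescription = ex.Message;
    }
    else
    {
        // More severe exceptions: record the message and move on.
        state.StatusCode        = "Exception";
        state.StatusDescription = ex.Message;
    }
}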

Using the code - WebSpider

The WebSpider class is really just a harness for calling the WebRobot in a particular way. It provides the robot with a specialized content handler for crawling through web links and maintains a list of both pending pages and already visited pages. The current WebSpider is designed to start from a given URI and to limit full page processing to a base path.

// Constructors
//
// Process a URI, until all links are checked,
// only add new links for processing if they
// point to the same host as specified in the startUri.
public WebSpider(
    string startUri )
    : this( startUri, -1 ) { }

// As above, only limit the links to uriProcessedCountMax.
public WebSpider(
    string startUri,
    int    uriProcessedCountMax )
    : this( startUri, "", uriProcessedCountMax,
            false, new WebPageProcessor( ) ) { }

// As above, except new links are only added if
// they are on the path specified by baseUri.
public WebSpider(
    string startUri,
    string baseUri,
    int    uriProcessedCountMax )
    : this( startUri, baseUri, uriProcessedCountMax,
            false, new WebPageProcessor( ) ) { }

// As above, except you can specify whether the web page
// content is kept after it is processed; by
// default this would be false to conserve memory
// when used on a large site.
public WebSpider(
    string            startUri,
    string            baseUri,
    int               uriProcessedCountMax,
    bool              keepWebContent,
    IWebPageProcessor webPageProcessor )
{
    // Initialize web spider...
}

Why is there a base path limit? Since there are trillions of pages on the internet, this spider will check all links that it finds to see if they are valid, but it will only add new links to the pending queue if those links belong within the context of the initial website or a sub path of that website.

So if we are starting from www.myhost.com/index.html and this page has links to www.myhost.com/pageWithSomeLinks.html and www.google.com/pageWithManyLinks.html, then the WebRobot will be called against both links to check that they are valid, but only the links found within www.myhost.com/pageWithSomeLinks.html will be added to the pending queue.

Call the Execute method to start the spider. This method adds the startUri to a queue of pending pages and keeps processing until there are no pages left to process.

public void Execute( )
{
    AddWebPage( StartUri, StartUri.AbsoluteUri );

    while ( WebPagesPending.Count > 0 &&
            ( UriProcessedCountMax == -1 ||
              UriProcessedCount < UriProcessedCountMax ) )
    {
        WebPageState state = (WebPageState)m_webPagesPending.Dequeue( );

        m_webPageProcessor.Process( state );

        if ( ! KeepWebContent )
        {
            state.Content = null;
        }

        UriProcessedCount++;
    }
}

A web page can only be added to the queue if the URI (excluding the anchor) points to a path or a valid page (e.g. .html, .aspx, .jsp, etc.) and has not already been seen before.
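The ValidPage helper called in AddWebPage below is not listed in the article. A minimal sketch, assuming it simply accepts plain paths and a set of known page extensions, could look like this:

// Hypothetical sketch of the ValidPage helper used by AddWebPage below.
// Assumes a page is valid when the URI is a path or ends in a known page extension.
private bool ValidPage( string localPath )
{
    string ext = Path.GetExtension( localPath ).ToLower( );

    return ext == ""      ||   // a plain path such as /products/
           ext == ".html" ||
           ext == ".htm"  ||
           ext == ".aspx" ||
           ext == ".jsp";
}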

private bool AddWebPage( Uri baseUri, string newUri )
{
    Uri uri = new Uri( baseUri, StrUtil.LeftIndexOf( newUri, "#" ) );

    if ( ! ValidPage( uri.LocalPath ) || m_webPages.Contains( uri ) )
    {
        return false;
    }

    WebPageState state = new WebPageState( uri );

    if ( uri.AbsoluteUri.StartsWith( baseUri.AbsoluteUri ) )
    {
        state.ProcessInstructions = "Handle Links";
    }

    m_webPagesPending.Enqueue( state );
    m_webPages.Add( uri, state );

    return true;
}
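The specialized content handler that the spider registers to crawl links is not reproduced in the article. As a rough sketch only (the handler name, regular expression and control flow below are assumptions, not the author's code), it would scan the content for href attributes and feed each one back through AddWebPage whenever the page carries the "Handle Links" instruction:

// Hypothetical sketch of the spider's link-extraction content handler.
private void HandleLinks( WebPageState state )
{
    if ( state.ProcessInstructions.IndexOf( "Handle Links" ) == -1 )
    {
        return;   // page is outside the base path, so do not collect its links
    }

    // Very loose href extractor; the article uses its own RegularExpression helpers.
    Regex hrefRegex = new Regex( "href\\s*=\\s*[\"']?([^\"' >]+)",
                                 RegexOptions.IgnoreCase );

    foreach ( Match m in hrefRegex.Matches( state.Content ) )
    {
        // Relative links are resolved against the current page's URI inside AddWebPage.
        AddWebPage( state.Uri, m.Groups[1].ToString( ) );
    }
}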

Examples of Running the Spider

The following code shows three examples of calling the WebSpider. The paths shown are examples only; they do not represent the true structure of this website. Note: the Bondi Beer website in the example is a site that I built using my own SiteGenerator. This easy to use program produces static websites from dynamic content such as proprietary data files, XML / XSLT files, databases, RSS feeds and more...

/*
 * Check for broken links found on this website, limit the spider to 100 pages.
 */
WebSpider spider = new WebSpider( "http://www.bondibeer.com.au/", 100 );
spider.Execute( );

/*
 * Check for broken links found on this website,
 * there is no limit on the number of pages,
 * but it will not look for new links on pages
 * that are not within the path
 * http://www.bondibeer.com.au/products/. This
 * means that the home page found at
 * http://www.bondibeer.com.au/home.html may be
 * checked for existence if it was called from
 * somepub/index.html, but any links within that
 * page will not be added to the pending list,
 * as it is on a lower path.
 */
spider = new WebSpider(
    "http://www.bondibeer.com.au/products/somepub/index.html",
    "http://www.bondibeer.com.au/products/", -1 );
spider.Execute( );

/*
 * Check for pages on the website that have funny
 * jokes or pictures of sexy women.
 */
spider = new WebSpider( "http://www.bondibeer.com.au/" );
spider.WebPageProcessor.ContentHandler +=
    new WebPageContentDelegate( FunnyJokes );
spider.WebPageProcessor.ContentHandler +=
    new WebPageContentDelegate( SexyWomen );
spider.Execute( );

private void FunnyJokes( WebPageState state )
{
    if ( state.Content.IndexOf( "funny joke" ) > -1 )
    {
        // Do something
    }
}

private void SexyWomen( WebPageState state )
{
    Match  m = RegexUtil.GetMatchRegEx( RegularExpression.SrcExtractor,
                                        state.Content );
    string image;

    while ( m.Success )
    {
        m     = m.NextMatch( );
        image = m.Groups[1].ToString( ).ToLower( );

        if ( image.IndexOf( "sexy" )  > -1 ||
             image.IndexOf( "women" ) > -1 )
        {
            DownloadImage( image );
        }
    }
}

Conclusion

The WebSpider is flexible enough to be used in a variety of useful scenarios, and could be a powerful tool for data mining websites on the internet and on an intranet. I would like to hear how people have used this code.

Outstanding Issues


state.ProcessInstructions - This is really just a quick hack to provide instructions that the content handlers can use as they see fit. I am looking for a more elegant solution to this problem.

MultiThreaded spider - This project first started off as a multi threaded spider, but that soon fell by the wayside when I found that performance was much slower when using threads to process each URI. It seems that the bottleneck is in GetResponse, which does not appear to run well across multiple threads.

Valid URI, but the query data returns a bad page - The current processor does not handle the scenario where the URI points to a valid page, but the page returned by the webserver is considered to be bad, e.g. http://www.validhost.com/validpage.html?opensubpage=invalidid. One idea to resolve this problem is to read the contents of the returned page and look for key pieces of information, but that technique is a little flakey.
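That last idea could be wired in as just another content handler. A minimal sketch (the handler name and the error marker string are made-up examples, not anything from the article) might look like this:

// Hypothetical content handler that flags "valid URI, bad page" responses
// by scanning the returned content for an error marker.
private void DetectSoftErrors( WebPageState state )
{
    // "An error has occurred" is an assumed marker; a real site would define its own.
    if ( state.Content != null &&
         state.Content.IndexOf( "An error has occurred" ) > -1 )
    {
        state.StatusCode        = "BadContent";
        state.StatusDescription = "Page returned OK but the content looks like an error page";
    }
}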

About David Cruwys

I have been programming commercially since 1990, with the last 4 years spent mainly in Java. I made the transition to .NET six months ago and have not looked back. I have written e-commerce solutions, desktop and mobile phone applications in a variety of languages (VB6, Delphi, Java, FoxPro, Clipper 87, etc.) and am currently developing a Web Application Framework in C#. I have just launched www.offyourbutt.com to showcase my products and services, and it will become a test bed for my C# framework. Click here to view David Cruwys's online profile.
