A Spider Implementation in C#

xiaoxiao 2021-03-06

I have seen a lot of introductions to spiders, crawlers, and robots, so with some idle time on my hands I tried writing one myself. The implementation is not necessarily perfect.

A spider is built on a few basic methods:

1: Get the content of a web page from its URL address;

2: Get all the URL addresses contained in that web content.

Here are the two methods:

-----------------------------------------------------

using System;
using System.Collections;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;
using System.Windows.Forms; // MessageBox

// 1: Get web content
public static string getHtmlContent(string url)
{
    string resultStr = string.Empty;
    System.Net.HttpWebRequest hreq = null;
    System.Net.HttpWebResponse hrep = null;
    Stream stream = null;
    StreamReader sReader = null;
    try
    {
        hreq = (HttpWebRequest)WebRequest.Create(url);
        hrep = (HttpWebResponse)hreq.GetResponse();
        stream = hrep.GetResponseStream();
        sReader = new StreamReader(stream, System.Text.Encoding.Default);
        resultStr = sReader.ReadToEnd();
    }
    finally
    {
        // Close only what was actually opened; the request may fail part-way
        if (sReader != null) sReader.Close();
        if (stream != null) stream.Close();
        if (hrep != null) hrep.Close();
    }
    return resultStr;
}

// 2: Get the link addresses contained in a page
public static ArrayList getHttpUrlList(string page, string curUrl, int index_s)
{
    ArrayList urlList = new ArrayList(25);
    Regex r;
    string urlStr = string.Empty;
    try
    {
        // Matches whatever follows "href =" up to the next whitespace or '>'.
        // The group name <1> is assumed; it is unreadable in the listing.
        r = new Regex("(?<=\\shref\\s*=)\\s*(?:(?<1>\"\\w*\")|(?<1>[^>\\s]*))");
        MatchCollection mc1 = r.Matches(page);
        urlList.Clear();
        foreach (Match m1 in mc1)
        {
            urlStr = completeUrl(m1.Value, curUrl);
            if (!urlList.Contains(urlStr)) urlList.Add(urlStr);
        }
    }
    catch (Exception e)
    {
        MessageBox.Show("Error" + e.Message);
        return null;
    }
    return urlList;
}

// Normalize a URL address
private static string completeUrl(string oldUrl, string curUrl)
{
    // 1: strip quotes and surrounding whitespace, lower-case the address
    oldUrl = oldUrl.Replace("\"", "").ToLower();
    oldUrl = oldUrl.Replace("'", "").Trim();
    // 2: prefix relative addresses with the current page address
    if (!oldUrl.StartsWith("http:")) oldUrl = curUrl + "/" + oldUrl;
    // 3: strip the "http://" prefix (as far as this step can be read)
    oldUrl = oldUrl.Replace("http://", "");
    oldUrl = oldUrl.Trim();
    return oldUrl;
}

-----------------------------------------------------
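To show how the two methods might fit together, here is a minimal sketch of a breadth-first crawl loop. It is not part of the original code: the class name CrawlSketch, the Crawl method, and the maxPages limit are all hypothetical, and the fetch and extract steps are passed in as delegates so the loop can be tried without network access.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // fetch:   URL -> page text
    // extract: (page text, URL of that page) -> links found on the page
    // Returns the URLs that were fetched, in the order they were visited.
    public static List<string> Crawl(
        string startUrl,
        int maxPages,
        Func<string, string> fetch,
        Func<string, string, IEnumerable<string>> extract)
    {
        var visited = new List<string>();
        var seen = new HashSet<string> { startUrl }; // every URL ever queued
        var pending = new Queue<string>();
        pending.Enqueue(startUrl);

        while (pending.Count > 0 && visited.Count < maxPages)
        {
            string url = pending.Dequeue();
            string page;
            try { page = fetch(url); }
            catch (Exception) { continue; } // skip pages that fail to load
            visited.Add(url);

            // A null result (extraction failed) is treated as "no links".
            foreach (string link in extract(page, url) ?? new string[0])
            {
                // HashSet.Add returns false for URLs that were queued before.
                if (seen.Add(link)) pending.Enqueue(link);
            }
        }
        return visited;
    }
}
```

A method like getHtmlContent could be supplied as fetch, and a small lambda around getHttpUrlList as extract. The seen set keeps the loop from fetching the same address twice, and maxPages keeps it from running forever.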

Open problems:

1: new StreamReader(stream, System.Text.Encoding.Default): after the response stream is obtained, it is decoded with Encoding.Default (equivalent to GB2312 on my machine), so pages in Big5 and other character sets come out garbled. I haven't found a good solution yet;

2: The regular expression for URLs given in the .NET help did not work well in my tests, so I modified it; the modified version still needs more testing.
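One possible direction for the encoding problem, offered as a sketch rather than a tested fix: take the charset name the server reports (HttpWebResponse exposes it as the CharacterSet property) and fall back to the default only when that name is missing or unknown. The class CharsetSketch and the method ResolveEncoding are hypothetical names introduced for this example.

```csharp
using System;
using System.Text;

public static class CharsetSketch
{
    // Map a charset name (for example the value of HttpWebResponse.CharacterSet)
    // to an Encoding, falling back when the name is missing or not recognised.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset)) return fallback;
        try
        {
            // Some servers wrap the charset name in quotes; strip them first.
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws for names this runtime does not know.
            return fallback;
        }
    }
}
```

In getHtmlContent this would be used as new StreamReader(stream, CharsetSketch.ResolveEncoding(hrep.CharacterSet, Encoding.Default)). On .NET Core and later, legacy code pages such as Big5 and GB2312 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance). Servers that omit or misreport the charset would still need another strategy, such as reading the charset declared in the page's meta tag.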
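As one alternative to experiment with, here is a sketch that captures the href value with a named group and lets System.Uri resolve relative addresses instead of concatenating strings. It is not the pattern from the original code: the class LinkSketch, the method ExtractLinks, and the group name url are assumptions made for this example, and the pattern has only been checked against simple anchor tags.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkSketch
{
    // href value in double quotes, in single quotes, or unquoted.
    private static readonly Regex Href = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>""']+))",
        RegexOptions.IgnoreCase);

    // Returns the distinct absolute http/https links found in page.
    public static List<string> ExtractLinks(string page, string baseUrl)
    {
        var links = new List<string>();
        var baseUri = new Uri(baseUrl);
        foreach (Match m in Href.Matches(page))
        {
            Uri absolute;
            // Uri.TryCreate resolves relative values such as "b.html" or
            // "/a.html" against the address of the page they were found on.
            if (!Uri.TryCreate(baseUri, m.Groups["url"].Value.Trim(), out absolute))
                continue;
            // Drop mailto:, javascript: and other non-web schemes.
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps)
                continue;
            if (!links.Contains(absolute.AbsoluteUri))
                links.Add(absolute.AbsoluteUri);
        }
        return links;
    }
}
```

Unlike completeUrl, this keeps the scheme and the original letter case of the path, which matters on servers with case-sensitive paths. A regular expression will still miss or mis-read some real-world HTML, so an HTML parser may be the more robust choice for a serious spider.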

When reposting, please cite the original address: https://www.9cbs.com/read-126250.html
