I have read a lot of introductions to spiders, crawlers, and robots, and having some idle time I decided to try writing one myself; the implementation is far from perfect, so please do not hesitate to point out its flaws;
A spider needs several basic methods:
1: Get the web page content for a given URL address;
2: Get all the URL addresses contained in that web page content;
The two methods are given below;
-----------------------------------------------------
using System.IO;
using System.Net;

// 1: Get web content;
public static string getHtmlContent (string url)
{
    string resultStr = string.Empty;
    HttpWebRequest hreq = null;
    HttpWebResponse hrep = null;
    Stream stream = null;
    StreamReader sReader = null;
    try
    {
        hreq = (HttpWebRequest) WebRequest.Create (url);
        hrep = (HttpWebResponse) hreq.GetResponse ();
        stream = hrep.GetResponseStream ();
        // Encoding.Default is the local ANSI code page (GB2312 on the author's machine)
        sReader = new StreamReader (stream, System.Text.Encoding.Default);
        resultStr = sReader.ReadToEnd ();
    }
    finally
    {
        // Close only what was actually opened, so a failed request does not throw again here
        if (sReader != null) sReader.Close ();
        if (stream != null) stream.Close ();
        if (hrep != null) hrep.Close ();
    }
    return resultStr;
}
// 2: Get the link addresses contained in the page;
public static ArrayList getHttpUrlList (string page, string curUrl, int index_s)
{
    ArrayList urlList = new ArrayList (25);
    Regex r;
    string urlStr = string.Empty;
    try
    {
        r = new Regex ("(? <= // s href // s * =) // s * (?: (?
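The idea behind the second method, pulling href values out of the page text and resolving them against the address of the current page, can be sketched on its own. The following is only a minimal sketch and not the expression used in the listing above: the class name LinkSketch, the method name ExtractLinks, and the pattern itself are assumptions made for illustration, and it returns a List<string> rather than an ArrayList.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkSketch
{
    // Hypothetical pattern: matches href="...", href='...' or an unquoted href value.
    // The named group "url" holds the raw attribute value in all three cases.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the distinct absolute http/https links found in page,
    // resolving relative links against baseUrl (the address of the page itself).
    public static List<string> ExtractLinks(string page, string baseUrl)
    {
        var result = new List<string>();
        var baseUri = new Uri(baseUrl);
        foreach (Match m in HrefPattern.Matches(page))
        {
            Uri absolute;
            // Uri.TryCreate resolves relative values such as "b.html" or "/a.html"
            if (!Uri.TryCreate(baseUri, m.Groups["url"].Value, out absolute))
                continue;
            // Skip mailto:, javascript: and other schemes a spider cannot fetch
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                continue;
            if (!result.Contains(absolute.AbsoluteUri))
                result.Add(absolute.AbsoluteUri);
        }
        return result;
    }
}
```

Leaving the resolution of relative addresses to the Uri class keeps the regular expression small; a pattern like this will still pick up href attributes on tags other than links, so it needs the same kind of testing as any hand-written expression.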
Open problems:
1: new StreamReader (stream, System.Text.Encoding.Default): after obtaining the response stream, the data is decoded with Encoding.Default (which is GB2312 on my local machine), so pages in BIG5 or other character sets come out garbled; I have not found a good solution yet;
2: The regular expression for URLs given in the .NET help did not work well in my tests, so I modified it; the modified version still needs more testing;
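For the first problem, one possible direction (a sketch, not a tested solution) is to stop decoding with a fixed encoding and instead look for the character set the page declares: first in the Content-Type response header, then in a meta tag near the top of the page, falling back to a default only when neither is present. The names CharsetSketch and PickEncoding below are hypothetical. The sketch also assumes the response has been read into a byte array before decoding, and on .NET Core and later it assumes the code page encodings (GB2312, BIG5) have been registered through CodePagesEncodingProvider.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSketch
{
    // Hypothetical pattern: finds charset=NAME in a Content-Type header or a <meta> tag.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);

    // contentType: the raw Content-Type header value (may be null or empty).
    // body: the undecoded response bytes.
    // fallback: the encoding to use when no usable charset is declared.
    public static Encoding PickEncoding(string contentType, byte[] body, Encoding fallback)
    {
        Encoding enc = FromDeclaration(contentType);
        if (enc != null) return enc;

        // Decode only the first bytes as ASCII; this is enough to read a <meta> tag
        // because charset names themselves are plain ASCII.
        int n = Math.Min(body.Length, 2048);
        string head = Encoding.ASCII.GetString(body, 0, n);
        enc = FromDeclaration(head);
        return enc ?? fallback;
    }

    // Returns the encoding named by the first charset=... found in text, or null.
    private static Encoding FromDeclaration(string text)
    {
        if (string.IsNullOrEmpty(text)) return null;
        Match m = CharsetPattern.Match(text);
        if (!m.Success) return null;
        try { return Encoding.GetEncoding(m.Groups[1].Value); }
        catch (ArgumentException) { return null; } // unknown or unsupported name
    }
}
```

In getHtmlContent this would mean copying the response stream into a byte array, calling PickEncoding (hrep.ContentType, bytes, System.Text.Encoding.Default), and decoding with the returned encoding's GetString method. The header is parsed by hand here rather than taken from HttpWebResponse.CharacterSet, so that a header with no charset still lets the meta tag be consulted. Pages that declare the wrong character set, or none at all, will still come out garbled.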