Can't find a good way to get a page's encoding via HttpWebRequest (repost)

xiaoxiao 2021-03-06

Today I wrote a simple web crawler. It uses .NET's HttpWebRequest to fetch a page, extracts the links on the page with a regex, and crawls them recursively. Starting from http://blog.sunmast.com (encoded as UTF-8), the crawl ran normally and no errors turned up. But when it crawled other Chinese websites (encoded as GB2312), the text came out garbled, which was to be expected, so I modified the code to:

using System;
using System.IO;
using System.Net;
using System.Text;

HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.163.com");
req.AllowAutoRedirect = true;
req.MaximumAutomaticRedirections = 3;
req.UserAgent = "Mozilla/6.0 (MSIE 6.0; Windows NT 5.1; Natas.Robot)";
req.KeepAlive = true;
req.Timeout = 4000;

// Get the stream from the returned web response
HttpWebResponse webresponse = null;
try
{
    webresponse = (HttpWebResponse)req.GetResponse();
}
catch (System.Net.WebException ex)
{
    string message = "error response exception: " + ex.Message;
    Console.WriteLine(message);
}

if (webresponse != null)
{
    // The encoding is hard-coded to GB2312 here
    StreamReader stream = new StreamReader(webresponse.GetResponseStream(), Encoding.GetEncoding("GB2312"));
    // Todo ...
}

However, when this program fetches a UTF-8 encoded Chinese website, that text comes out garbled instead. Looking at MSDN, I saw that HttpWebResponse has ContentEncoding and CharacterSet properties, and hoped to use them to choose the StreamReader encoding according to the encoding of the page actually fetched. So I tested them, but for many websites (including Microsoft and Sina) neither property gave a usable value: the output was string.Empty. Searching Google and Baidu for this problem turned up two articles on CodeProject, but both of them read the page encoding from the HttpWebResponse.ContentEncoding property. So why is it an empty string?
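A possible explanation, which is not from the original post: ContentEncoding reflects the HTTP Content-Encoding header (compression such as gzip), not the character set, and CharacterSet is taken from the charset parameter of the Content-Type header, which a server may leave out. When the header says nothing, the page itself often declares its charset in a <meta> tag. The sketch below shows one way to fall back through header, <meta> tag, and a default. The class PageEncoding, the method Detect, and the 4096-byte sniffing window are hypothetical names and choices made for illustration, not part of .NET.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original post): picks an encoding for a
// downloaded page from the header charset, then a <meta> tag, then a default.
public static class PageEncoding
{
    // Matches charset=xxx in both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    public static Encoding Detect(string headerCharset, byte[] body, Encoding fallback)
    {
        // 1. Trust the HTTP header when it names an encoding we can load.
        Encoding fromHeader = TryGet(headerCharset);
        if (fromHeader != null) return fromHeader;

        // 2. Sniff the start of the body. ISO-8859-1 maps every byte to one
        //    char, so ASCII markup stays readable whatever the real encoding is.
        int count = Math.Min(body.Length, 4096); // assumed window size
        string head = Encoding.GetEncoding("iso-8859-1").GetString(body, 0, count);
        Match m = MetaCharset.Match(head);
        if (m.Success)
        {
            Encoding fromMeta = TryGet(m.Groups[1].Value);
            if (fromMeta != null) return fromMeta;
        }

        // 3. Nothing usable was declared: use the caller's default.
        return fallback;
    }

    private static Encoding TryGet(string name)
    {
        if (string.IsNullOrWhiteSpace(name)) return null;
        try { return Encoding.GetEncoding(name.Trim().Trim('"', '\'')); }
        catch (ArgumentException) { return null; } // unknown or unavailable encoding
    }
}
```

To use it with the request code, the response stream would first be copied into a byte array (for example via Stream.CopyTo into a MemoryStream), then decoded with PageEncoding.Detect(webresponse.CharacterSet, bytes, Encoding.UTF8).GetString(bytes). Note that on .NET Core and .NET 5+, GB2312 is only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); without that, this sketch would quietly fall through to the default.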
