2013-01-13 16 views
0

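Loading Japanese web pages with HtmlAgilityPack

I have the following code, which loads a web page and parses it with HtmlAgilityPack. It works for most pages, but when I try to load a Japanese page the encoding appears to be wrong. How can I fix this? More precisely, how can I set the encoding based on the encoding the page itself declares?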

using System.Net;
using System.Text;
using HtmlAgilityPack;

class Program {

    private const string HttpMethod = "GET";

    private const string UserAgent =
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7";

    static void Main(string[] args) {
        var request = WebRequest.Create("http://infoseek.co.jp/") as HttpWebRequest;
        if (request == null)
            return;
        request.Method = HttpMethod;
        request.UserAgent = UserAgent;
        var response = request.GetResponse() as HttpWebResponse;
        if (response == null)
            return;
        var stream = response.GetResponseStream();
        var document = new HtmlDocument {
            OptionCheckSyntax = true,
            OptionFixNestedTags = true,
            OptionAutoCloseOnEnd = true,
            OptionDefaultStreamEncoding = Encoding.UTF8,
            OptionReadEncoding = true
        };
        // UTF-8 is hard-coded here, regardless of what the page declares.
        document.Load(stream, Encoding.UTF8);
        var d = document.DocumentNode;
    }
}

Answers

0

I tried getting the encoding from the HttpWebResponse object with the code below. Do you see any problems with it, or do you have other ideas?

using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class Program {

    private const string HttpMethod = "GET";

    private const string UserAgent =
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7";

    static void Main(string[] args) {
        var request = WebRequest.Create("http://infoseek.co.jp/") as HttpWebRequest;
        if (request == null)
            return;
        request.Method = HttpMethod;
        request.UserAgent = UserAgent;
        var response = request.GetResponse() as HttpWebResponse;
        if (response == null)
            return;
        // Use the charset announced in the Content-Type header, if any.
        var encoding = TryGetEncoding(response);
        var stream = response.GetResponseStream();
        var document = new HtmlDocument {
            OptionCheckSyntax = true,
            OptionFixNestedTags = true,
            OptionAutoCloseOnEnd = true,
            OptionReadEncoding = true,
            OptionDefaultStreamEncoding = encoding
        };
        document.Load(stream, encoding);
        var d = document.DocumentNode;
    }

    // Returns the encoding named by the charset of the Content-Type header,
    // falling back to UTF-8 when it is missing or not recognized.
    private static Encoding TryGetEncoding(HttpWebResponse response) {
        // response.ContentEncoding is the Content-Encoding header (e.g. gzip),
        // not a character set, so it is not used as a fallback here.
        var charset = response.CharacterSet;
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;
        try {
            return Encoding.GetEncoding(charset);
        } catch (ArgumentException) {
            // Unknown or unsupported charset name.
            return Encoding.UTF8;
        }
    }
}
0

infoseek.co.jp responds with the HTTP header

Content-Type text/html; charset=EUC-JP 

and its HTML contains the tag

<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP"> 

In .NET, use code page 51932 (EUC-JP) to decode the response.
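That decoding step might look roughly like the sketch below. The class and method names (`EucJpExample`, `DecodeEucJp`) are made up for illustration, and the `RegisterProvider` call assumes the code runs on .NET Core or .NET 5+, where legacy code pages are not available until they are registered; on the .NET Framework that line should not be needed.

```csharp
using System;
using System.Text;

static class EucJpExample
{
    // Hypothetical helper: decodes raw response bytes as EUC-JP (code page 51932).
    public static string DecodeEucJp(byte[] body)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as 51932 must be registered before they can be looked up.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(51932).GetString(body);
    }
}
```

The `body` argument is the undecoded response, for example the result of copying the response stream into a `MemoryStream` and calling `ToArray()`.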

+0

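Yes, I know. But the problem is that this tag is only accessible after the document has been loaded, whereas the encoding is needed while the document is being loaded. Do you have any ideas? –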
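One possible way around this chicken-and-egg problem is to buffer the response as bytes and look for the charset declaration before handing anything to the parser. The sketch below is only an illustration: `CharsetSniffer`, `SniffEncoding`, and the 4096-byte window are assumptions, not part of HtmlAgilityPack. It also assumes an ASCII-compatible encoding (EUC-JP, Shift_JIS, UTF-8 and similar) and, on .NET Core or .NET 5+, that the code-pages provider has already been registered so that names such as EUC-JP resolve.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Matches charset=... inside a <meta> tag, e.g. charset=EUC-JP or charset="utf-8".
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: looks for a charset declaration near the start of the
    // page and returns the matching Encoding, or the fallback if none is usable.
    public static Encoding SniffEncoding(byte[] body, Encoding fallback)
    {
        // The declaration itself is plain ASCII, so decoding the head of the
        // document as ASCII is enough to find it in ASCII-compatible encodings.
        int length = Math.Min(body.Length, 4096);
        string head = Encoding.ASCII.GetString(body, 0, length);

        Match match = MetaCharset.Match(head);
        if (!match.Success)
            return fallback;
        try
        {
            return Encoding.GetEncoding(match.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return fallback;
        }
    }
}
```

With the bytes buffered, the document can then be loaded from a fresh stream, for example `document.Load(new MemoryStream(body), CharsetSniffer.SniffEncoding(body, Encoding.UTF8))`.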

+0

Take a look at the last answer in this thread: http://htmlagilitypack.codeplex.com/discussions/60174. It uses System.Net.WebClient to retrieve the page as a string, then passes that string in to create the HtmlDocument. – devio
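The general idea of fetching the page first and parsing the resulting string could be sketched as follows. This is not the code from the linked thread: `PageDownloader`, `EncodingFromContentType`, and `DownloadHtml` are hypothetical names, the UTF-8 fallback is an assumption, and `WebClient` is flagged as obsolete on recent .NET versions although it still works. On .NET Core or .NET 5+, charsets such as EUC-JP also require the code-pages provider to be registered beforehand.

```csharp
using System;
using System.Net;
using System.Net.Mime;
using System.Text;

static class PageDownloader
{
    // Hypothetical helper: picks the Encoding named in a Content-Type header value
    // such as "text/html; charset=EUC-JP", or the fallback if there is none.
    public static Encoding EncodingFromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentType))
            return fallback;
        try
        {
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset) ? fallback : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return fallback; // malformed header value
        }
        catch (ArgumentException)
        {
            return fallback; // unknown or unsupported charset name
        }
    }

    // Downloads the page as raw bytes, then decodes them using the charset
    // from the response's Content-Type header (UTF-8 if none is given).
    public static string DownloadHtml(string url)
    {
        using (var client = new WebClient())
        {
            byte[] body = client.DownloadData(url);
            string contentType = client.ResponseHeaders[HttpResponseHeader.ContentType];
            return EncodingFromContentType(contentType, Encoding.UTF8).GetString(body);
        }
    }
}
```

The decoded string can then be given to the parser with `document.LoadHtml(PageDownloader.DownloadHtml("http://infoseek.co.jp/"))`, so HtmlAgilityPack never has to guess the byte encoding itself.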