2015-04-16 28 views

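Downloading a web page with the correct character encoding

I have to download web pages to text files and analyze the words in them.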

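The pages are in different charsets (ISO-8859-1, windows-1252, ...). I have tried several solutions from SO, like this and this, but none of them worked: I am still reading m&#xED;nimo where I should read mínimo, or M&eacute;xico where I should read México.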

Can someone point me in the right direction? Thanks!

public static string DownloadString(string address) 
{ 
    string strWebPage = ""; 
    // create request 
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(address); 
    // get response 
    System.Net.HttpWebResponse objResponse; 
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse(); 
    // get correct charset and encoding from the server's header 
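    string Charset = objResponse.CharacterSet; 
    // the header may carry no charset; fall back to ISO-8859-1 so GetEncoding does not throw 
    if (string.IsNullOrEmpty(Charset)) Charset = "ISO-8859-1"; 
    Encoding encoding = Encoding.GetEncoding(Charset); 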

    // read response into memory stream 
    MemoryStream memoryStream; 
    using (Stream responseStream = objResponse.GetResponseStream()) 
    { 
     memoryStream = new MemoryStream(); 

     byte[] buffer = new byte[1024]; 
     int byteCount; 
     do 
     { 
      byteCount = responseStream.Read(buffer, 0, buffer.Length); 
      memoryStream.Write(buffer, 0, byteCount); 
     } while (byteCount > 0); 
    } 

    // set stream position to beginning 
    memoryStream.Seek(0, SeekOrigin.Begin); 

    StreamReader sr = new StreamReader(memoryStream, encoding); 
    strWebPage = sr.ReadToEnd(); 

    // Check real charset meta-tag in HTML 
    int CharsetStart = strWebPage.IndexOf("charset="); 
    if (CharsetStart > 0) 
    { 
     CharsetStart += 8; 
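     int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', '\'', ';', '>' }, CharsetStart); 
     // IndexOfAny returns -1 when no terminator is found; keep the header charset in that case 
     string RealCharset = CharsetEnd > CharsetStart 
       ? strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart) 
       : Charset; 

     // real charset meta-tag in HTML differs from supplied server header? (charset names are case-insensitive) 
     if (!string.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase)) 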
     { 
      // get correct encoding 
      Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset); 

      // reset stream position to beginning 
      memoryStream.Seek(0, SeekOrigin.Begin); 

      // reread response stream with the correct encoding 
      StreamReader sr2 = new StreamReader(memoryStream, CorrectEncoding); 

      strWebPage = sr2.ReadToEnd(); 
      // Close and clean up the StreamReader 
      sr2.Close(); 
     } 
    } 

    // dispose the first stream reader object 
    sr.Close(); 

    return strWebPage; 
} 

Answer


This is not an encoding problem. Those funny strings like &#xED; are called HTML entities.

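After reading the page with the correct encoding, use HttpUtility.HtmlDecode (from the System.Web assembly) to convert the HTML entities into characters. System.Net.WebUtility.HtmlDecode does the same without a reference to System.Web.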
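
A minimal sketch of that decoding step, using the two strings from the question. The class name EntityDemo and the helper DecodeEntities are made up for illustration. The sketch uses WebUtility.HtmlDecode and assumes the input has already been read with the right charset.

```csharp
using System;
using System.Net;

public static class EntityDemo
{
    // Hypothetical helper (not part of the question's code): turns HTML
    // entities such as &#xED; or &eacute; into the characters they stand for.
    public static string DecodeEntities(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    public static void Main()
    {
        // Numeric (hexadecimal) entity: returns "mínimo"
        Console.WriteLine(DecodeEntities("m&#xED;nimo"));

        // Named entity: returns "México"
        Console.WriteLine(DecodeEntities("M&eacute;xico"));
    }
}
```

In the question's DownloadString this would probably amount to changing the last line to return WebUtility.HtmlDecode(strWebPage). Note that this only decodes entities. It does not strip tags, so the markup still has to be removed before counting words.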


So simple, and I could not find it myself. Thanks! – Atsu