Both HttpWebRequest and HtmlAgilityPack are unable to get text from a table

This is the original function I wrote to get the HTML of a web page and work with it the same way as "IE.document".

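The same code works fine when parsing some websites, but now I get an error at "doc.write". I think this happens because the page uses "iso-8859-1" encoding, while the second column of the table I am trying to parse uses a different encoding.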

Function mWebRe(ByVal mUrl As String) As MSHTML.HTMLDocument 
    Dim request As HttpWebRequest = WebRequest.Create(mUrl) 
    request.Timeout = 10000 
    Dim doc As MSHTML.IHTMLDocument2 = New MSHTML.HTMLDocument 
    Try 
     Dim response As HttpWebResponse = request.GetResponse() 
     'this is the original code 
     'Dim reader As StreamReader = New StreamReader(response.GetResponseStream()) 

     'this is an attempt without effects 
     Dim reader As StreamReader = New StreamReader(response.GetResponseStream(), Encoding.GetEncoding("iso-8859-1")) 
     Dim WebContent As String = reader.ReadToEnd() 'Here the text seems to be 
     doc.clear() 
     doc.write(WebContent) 'Here I get error on loading page 
     doc.close() 

     ' The following is a must do, to make sure that the data is fully load. 
     While (doc.readyState <> "complete") 
      Thread.Sleep(50) 
     End While 

    Catch ex As Exception 
     Return Nothing 
    End Try 
    Return doc 
End Function

I tried modifying the code, and I also tried HtmlAgilityPack (which I had never used before), without success.

I need the content of the second "table" (which has no ID), so I wrote the code below (it is not able to get the correct innerText of the cells):

Dim web As HtmlAgilityPack.HtmlWeb = New HtmlWeb() 
    web.OverrideEncoding = Encoding.GetEncoding("ISO-8859-1") 
    Dim doc As HtmlAgilityPack.HtmlDocument = web.Load(mUrl) 

    For Each Table As HtmlNode In doc.DocumentNode.SelectNodes("//table") 
     For Each Row As HtmlNode In Table.SelectNodes("//tr") 
      For Each Cell As HtmlNode In Row.SelectNodes("//td") 
       Dim mTxt As String = Cell.InnerText 
      Next 

     Next 
    Next

This is the start of the web page's source code:

<?xml version="1.0" encoding="iso-8859-1"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

This is an excerpt of the rows I want to extract:

<tr> 
<td class="tableValues" align="center" valign="top" >Mar 24/12/2013</td> 
<td class="tableValues" align="left" valign="top" >&#73;sc&#114;it&#116;&#111; &#97;&#108; &#82;u&#111;&#108;<!--span-->&#111;<!--i>&#52;</i--></td> 
<td class="tableValues" align="left" valign="top" ></td> 
</tr>

I think the second column has a different encoding, but I don't know how to convert it to the correct text. Any suggestion is appreciated.

Source

2015-07-02 genespos

I have just solved it by inserting the code below into the HtmlAgilityPack code. However, I would be grateful if anyone could suggest a better solution.

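For Each Cell As HtmlNode In Row.SelectNodes("//td")
    Dim mTxt As String = Cell.InnerText
    If mTxt.Contains("&#") Then
        ' Decode character references such as &#73; into plain characters
        Dim StrOk As String = WebUtility.HtmlDecode(mTxt)
        ' Remove the HTML comments that are still present in the text
        StrOk = Regex.Replace(StrOk, "<!--.+?-->", String.Empty)
        Debug.Print(StrOk)
    End If
Next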
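
For reference, the `&#73;`-style sequences in the second column are HTML numeric character references rather than a separate encoding, so decoding them is independent of the page's "iso-8859-1" declaration. The following is only a minimal sketch of the same idea as a standalone helper, using nothing beyond the .NET base class library. The module name `CellTextCleaner` and the function `CleanCellText` are hypothetical names introduced here for illustration, and the sketch assumes the caller passes in the raw markup of one cell. It strips comments before decoding, so that text produced by the decoding step can never be mistaken for a comment.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

' Hypothetical helper, not part of the original post.
Public Module CellTextCleaner
    ' Matches one HTML comment; Singleline lets "." cross line breaks.
    Private ReadOnly CommentPattern As New Regex("<!--.*?-->", RegexOptions.Singleline)

    ' Removes HTML comments from the raw cell markup, then decodes
    ' character references (for example &#73; becomes "I").
    Public Function CleanCellText(rawCellHtml As String) As String
        If String.IsNullOrEmpty(rawCellHtml) Then Return String.Empty
        Dim withoutComments As String = CommentPattern.Replace(rawCellHtml, String.Empty)
        Return WebUtility.HtmlDecode(withoutComments).Trim()
    End Function

    Public Sub Main()
        ' Cell content copied from the row shown in the question.
        Dim raw As String = "&#73;sc&#114;it&#116;&#111; &#97;&#108; &#82;u&#111;&#108;<!--span-->&#111;<!--i>&#52;</i-->"
        Console.WriteLine(CleanCellText(raw))
    End Sub
End Module
```

With the sample cell from the question this should print "Iscritto al Ruolo". Inside the loop above it could be called as `CleanCellText(Cell.InnerHtml)` or `CleanCellText(Cell.InnerText)`, depending on which property still carries the comments in the HtmlAgilityPack version in use.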

Source

2015-07-02 18:34:14 genespos
