
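Both HttpWebRequest and HtmlAgilityPack fail to read the text of a table correctly. Below is the original function I wrote to fetch the HTML of a web page and parse it as a document, as I would with "IE.document".

The same code worked fine with some sites, but now I get an error at "doc.write". I think this happens because the page declares the "iso-8859-1" encoding while the second column of the table I am trying to parse uses a different encoding.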

Function mWebRe(ByVal mUrl As String) As MSHTML.HTMLDocument
    Dim request As HttpWebRequest = WebRequest.Create(mUrl)
    request.Timeout = 10000
    Dim doc As MSHTML.IHTMLDocument2 = New MSHTML.HTMLDocument
    Try
        Dim response As HttpWebResponse = request.GetResponse()
        'this is the original code
        'Dim reader As StreamReader = New StreamReader(response.GetResponseStream())

        'this is an attempt without effects
        Dim reader As StreamReader = New StreamReader(response.GetResponseStream(), Encoding.GetEncoding("iso-8859-1"))
        Dim WebContent As String = reader.ReadToEnd() 'Here the text seems to be
        doc.clear()
        doc.write(WebContent) 'Here I get error on loading page
        doc.close()

        ' The following is a must do, to make sure that the data is fully load.
        While (doc.readyState <> "complete")
            Thread.Sleep(50)
        End While

    Catch ex As Exception
        Return Nothing
    End Try
    Return doc
End Function

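I tried modifying the code, and I also tried HtmlAgilityPack (which I had never used before), without success.

I need the content of the second "table" (it has no ID), so I wrote the following code, which does not return the correct InnerText of the cells: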

Dim web As HtmlAgilityPack.HtmlWeb = New HtmlWeb()
web.OverrideEncoding = Encoding.GetEncoding("ISO-8859-1")
Dim doc As HtmlAgilityPack.HtmlDocument = web.Load(mUrl)

For Each Table As HtmlNode In doc.DocumentNode.SelectNodes("//table")
    For Each Row As HtmlNode In Table.SelectNodes("//tr")
        For Each Cell As HtmlNode In Row.SelectNodes("//td")
            Dim mTxt As String = Cell.InnerText
        Next

    Next
Next

This is the start of the page source:

<?xml version="1.0" encoding="iso-8859-1"?> 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 

This is an excerpt of the rows I want to extract:

<tr> 
<td class="tableValues" align="center" valign="top" >Mar 24/12/2013</td> 
<td class="tableValues" align="left" valign="top" >&#73;sc&#114;it&#116;&#111; &#97;&#108; &#82;u&#111;&#108;<!--span-->&#111;<!--i>&#52;</i--></td> 
<td class="tableValues" align="left" valign="top" ></td> 
</tr> 

I think the second column has a different encoding, but I don't know how to convert it to the correct text. Any suggestion is appreciated.

Answer


I just solved it by inserting the code below into the HtmlAgilityPack code. However, if anyone can suggest a better solution, I would appreciate it.

For Each Cell As HtmlNode In Row.SelectNodes("//td")
    Dim mTxt As String = Cell.InnerText
    If mTxt.Contains("&#") Then
        ' Turn numeric character references such as &#73; into characters
        Dim StrOk As String = WebUtility.HtmlDecode(mTxt)
        ' Drop the HTML comments left inside the cell text
        StrOk = Regex.Replace(StrOk, "<!--.+?-->", String.Empty)
        Debug.Print(StrOk)
    End If
Next
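
For what it's worth, the `&#73;`-style sequences in the second column are HTML numeric character references rather than a different byte encoding, which is why changing the StreamReader or OverrideEncoding setting has no effect on them. The same clean-up can be pulled out into a small helper. This is only a sketch: `CleanCellText` and `CellTextDemo` are hypothetical names of mine, not part of HtmlAgilityPack, and the sample string is the cell content from the question. It removes the comments before decoding, so decoded text cannot be mistaken for comment markup.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module CellTextDemo
    ' Hypothetical helper (not from the original post): strips HTML comments,
    ' then decodes character references such as &#73; into plain characters.
    Function CleanCellText(ByVal raw As String) As String
        If String.IsNullOrEmpty(raw) Then Return String.Empty
        ' Singleline lets a comment span several lines.
        Dim withoutComments As String =
            Regex.Replace(raw, "<!--.*?-->", String.Empty, RegexOptions.Singleline)
        Return WebUtility.HtmlDecode(withoutComments).Trim()
    End Function

    Sub Main()
        ' Cell content copied from the second column in the question.
        Dim sample As String =
            "&#73;sc&#114;it&#116;&#111; &#97;&#108; &#82;u&#111;&#108;<!--span-->&#111;<!--i>&#52;</i-->"
        Console.WriteLine(CleanCellText(sample)) ' prints: Iscritto al Ruolo
    End Sub
End Module
```

Inside the loop this would be called as `CleanCellText(Cell.InnerText)`, with no need for the `Contains("&#")` check. Separately, an XPath expression that starts with `//` searches from the document root even when it is called on a row or table node, so `Row.SelectNodes(".//td")` is probably what was intended if only the cells of the current row are wanted.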