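HttpWebRequest and HtmlAgilityPack are both unable to get the text from a table

This is the original function I wrote to fetch a web page's HTML and load it into an MSHTML document, so that I can work with it the same way as with "IE.document".
The same code parsed several other websites fine, but now I get an error at "doc.write". I think this is because the page declares "iso-8859-1" encoding, while the second column of the table I am trying to parse uses a different encoding.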
Function mWebRe(ByVal mUrl As String) As MSHTML.HTMLDocument
    Dim request As HttpWebRequest = WebRequest.Create(mUrl)
    request.Timeout = 10000
    Dim doc As MSHTML.IHTMLDocument2 = New MSHTML.HTMLDocument
    Try
        Dim response As HttpWebResponse = request.GetResponse()
        'this is the original code
        'Dim reader As StreamReader = New StreamReader(response.GetResponseStream())
        'this is an attempt without effects
        Dim reader As StreamReader = New StreamReader(response.GetResponseStream(), Encoding.GetEncoding("iso-8859-1"))
        Dim WebContent As String = reader.ReadToEnd() 'Here the text seems to be
        doc.clear()
        doc.write(WebContent) 'Here I get error on loading page
        doc.close()
        ' The following is a must do, to make sure that the data is fully load.
        While (doc.readyState <> "complete")
            Thread.Sleep(50)
        End While
    Catch ex As Exception
        Return Nothing
    End Try
    Return doc
End Function
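To rule the charset in or out, one option is to take the encoding from the response instead of hard-coding it. The sketch below is only an illustration: `ResolveEncoding` is a hypothetical helper, not part of the function above, and it assumes the charset name would come from `HttpWebResponse.CharacterSet`, which can be empty or unrecognised for some servers. In those cases it falls back to ISO-8859-1, the encoding named in the page's XML declaration.

```vbnet
Imports System
Imports System.Text

Module CharsetSketch
    ' Hypothetical helper: maps a charset name (for example the value of
    ' HttpWebResponse.CharacterSet) to an Encoding, falling back to
    ' ISO-8859-1 when the name is missing or not recognised.
    Function ResolveEncoding(ByVal charset As String) As Encoding
        Dim fallback As Encoding = Encoding.GetEncoding("iso-8859-1")
        If String.IsNullOrWhiteSpace(charset) Then Return fallback
        Try
            ' Some servers wrap the charset in quotes, so strip them first.
            Return Encoding.GetEncoding(charset.Trim().Trim(""""c))
        Catch ex As ArgumentException
            ' Thrown when the name is not a known encoding.
            Return fallback
        End Try
    End Function

    Sub Main()
        Console.WriteLine(ResolveEncoding("utf-8").WebName)
        Console.WriteLine(ResolveEncoding("no-such-charset").WebName)
    End Sub
End Module
```

The result could then be passed to the `StreamReader` constructor in place of the fixed `Encoding.GetEncoding("iso-8859-1")`. If the text still comes out wrong with the encoding the server reports, the charset is probably not the cause.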
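I tried modifying the code, and I also tried HtmlAgilityPack (which I had never used before), without success.
I need the contents of the second "table" (it has no ID), so I wrote the following code, which does not get the correct innerText of the cells: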
Dim web As HtmlAgilityPack.HtmlWeb = New HtmlWeb()
web.OverrideEncoding = Encoding.GetEncoding("ISO-8859-1")
Dim doc As HtmlAgilityPack.HtmlDocument = web.Load(mUrl)
For Each Table As HtmlNode In doc.DocumentNode.SelectNodes("//table")
    For Each Row As HtmlNode In Table.SelectNodes("//tr")
        For Each Cell As HtmlNode In Row.SelectNodes("//td")
            Dim mTxt As String = Cell.InnerText
        Next
    Next
Next
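One thing to note about the loops above: an XPath expression that starts with `//` searches from the document root even when it is called on a row or table node, so each inner loop visits every `tr` and `td` in the page. The following is a minimal sketch of a relative version, assuming the page has already been loaded into a string. `ReadSecondTable` is a hypothetical helper name, and `(//table)[2]` assumes the wanted table is the second one in document order.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module RelativeXPathSketch
    ' Hypothetical helper: returns the cell texts of the second <table>,
    ' one inner list per row.
    Function ReadSecondTable(ByVal html As String) As List(Of List(Of String))
        Dim result As New List(Of List(Of String))()
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' XPath positions are 1-based: this is the second table in document order.
        Dim table As HtmlNode = doc.DocumentNode.SelectSingleNode("(//table)[2]")
        If table Is Nothing Then Return result

        ' The leading "." keeps the search inside the current node.
        ' SelectNodes returns Nothing when nothing matches, so guard before looping.
        Dim rows As HtmlNodeCollection = table.SelectNodes(".//tr")
        If rows Is Nothing Then Return result

        For Each row As HtmlNode In rows
            Dim cells As HtmlNodeCollection = row.SelectNodes("./td")
            If cells Is Nothing Then Continue For
            Dim texts As New List(Of String)()
            For Each cell As HtmlNode In cells
                texts.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim())
            Next
            result.Add(texts)
        Next
        Return result
    End Function
End Module
```

This only changes which nodes are visited. It does not by itself change what `InnerText` returns for a given cell.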
This is the start of the page's source code:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
This is an excerpt of the row I want to extract:
<tr>
<td class="tableValues" align="center" valign="top" >Mar 24/12/2013</td>
<td class="tableValues" align="left" valign="top" >Iscritto al Ruol<!--span-->o<!--i>4</i--></td>
<td class="tableValues" align="left" valign="top" ></td>
</tr>
I think the second column has a different encoding, but I don't know how to convert it to the correct text. Any suggestions are appreciated.
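Looking at the excerpt again, the second cell contains HTML comments (`<!--span-->` and `<!--i>4</i-->`) inside the text, so the odd result may come from those comments rather than from the encoding. A minimal sketch of stripping them from a cell's inner HTML before reading the text follows. `CellText` is a hypothetical helper, and the regular expressions are a rough approach that assumes simple, well-formed comments and tags.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module CommentStripSketch
    ' Hypothetical helper: removes <!-- ... --> comments, then any remaining
    ' tags, and finally decodes HTML entities such as &amp;.
    Function CellText(ByVal innerHtml As String) As String
        Dim noComments As String = Regex.Replace(innerHtml, "<!--.*?-->", String.Empty, RegexOptions.Singleline)
        Dim noTags As String = Regex.Replace(noComments, "<[^>]+>", String.Empty)
        Return WebUtility.HtmlDecode(noTags).Trim()
    End Function

    Sub Main()
        ' The second cell from the excerpt above.
        Console.WriteLine(CellText("Iscritto al Ruol<!--span-->o<!--i>4</i-->")) ' prints "Iscritto al Ruolo"
    End Sub
End Module
```

With HtmlAgilityPack, the input would be `Cell.InnerHtml` instead of `Cell.InnerText`. Whether `InnerText` itself includes comment text may depend on the library version, so that is worth checking on the installed one.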