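Scraping text content using Html Agility Pack
I'll try to be as specific as I can. Basically I'm writing a crawler in VB.NET, and I'm mostly interested in extracting the text content of a page. My application currently uses a WebBrowser control to download the HTML source of the body into a text box, as follows: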
Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
Dim url As String = "<url>"
WebBrowser1.Navigate(url)
End Sub
Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
TextBox2.Text = WebBrowser1.Document.Body.OuterHtml
End Sub
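Now TextBox2 holds junk HTML containing href attributes, img tags, ads, scripts and so on, but I need to strip all of that out from here and get just the plain text.
I could use regular expressions to filter all of this out, but I think HAP (Html Agility Pack) is better suited to the job, since it is a proper HTML parser.
Searching here led me to this page, which discusses a whitelist technique suggested by "Meltdown":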
HTML Agility Pack strip tags NOT IN whitelist
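But how do I apply it in VB.NET? It seems like a good idea.
Any advice would be appreciated.
Edit: I found a VB.NET version of the code, shown below, but there seems to be an error at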
If i IsNot DeletableNodesXpath.Count - 1 Then
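Error: 'IsNot' requires operands that have reference types, but this operand has the value type 'Integer'.
Here is the code:
Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack
Public NotInheritable Class HtmlSanitizer
Private Sub New()
End Sub
Private Shared ReadOnly Whitelist As IDictionary(Of String, String())
Private Shared DeletableNodesXpath As New List(Of String)()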
Shared Sub New()
Whitelist = New Dictionary(Of String, String())() From { _
{"a", New String() {"href"}}, _
{"strong", Nothing}, _
{"em", Nothing}, _
{"blockquote", Nothing}, _
{"b", Nothing}, _
{"p", Nothing}, _
{"ul", Nothing}, _
{"ol", Nothing}, _
{"li", Nothing}, _
{"div", New String() {"align"}}, _
{"strike", Nothing}, _
{"u", Nothing}, _
{"sub", Nothing}, _
{"sup", Nothing}, _
{"table", Nothing}, _
{"tr", Nothing}, _
{"td", Nothing}, _
{"th", Nothing} _
}
End Sub
Public Shared Function Sanitize(input As String) As String
If input.Trim().Length < 1 Then
Return String.Empty
End If
Dim htmlDocument = New HtmlDocument()
htmldocument.LoadHtml(input)
SanitizeNode(htmldocument.DocumentNode)
Dim xPath As String = HtmlSanitizer.CreateXPath()
Return StripHtml(htmldocument.DocumentNode.WriteTo().Trim(), xPath)
End Function
Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
SanitizeNode(parentNode.ChildNodes(i))
Next
End Sub
Private Shared Sub SanitizeNode(node As HtmlNode)
If node.NodeType = HtmlNodeType.Element Then
If Not Whitelist.ContainsKey(node.Name) Then
If Not DeletableNodesXpath.Contains(node.Name) Then
'DeletableNodesXpath.Add(node.Name.Replace("?",""));
node.Name = "removeableNode"
DeletableNodesXpath.Add(node.Name)
End If
If node.HasChildNodes Then
SanitizeChildren(node)
End If
Return
End If
If node.HasAttributes Then
For i As Integer = node.Attributes.Count - 1 To 0 Step -1
Dim currentAttribute As HtmlAttribute = node.Attributes(i)
Dim allowedAttributes As String() = Whitelist(node.Name)
If allowedAttributes IsNot Nothing Then
If Not allowedAttributes.Contains(currentAttribute.Name) Then
node.Attributes.Remove(currentAttribute)
End If
Else
node.Attributes.Remove(currentAttribute)
End If
Next
End If
End If
If node.HasChildNodes Then
SanitizeChildren(node)
End If
End Sub
Private Shared Function StripHtml(html As String, xPath As String) As String
Dim htmlDoc As New HtmlDocument()
htmlDoc.LoadHtml(html)
If xPath.Length > 0 Then
Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
For Each node As HtmlNode In invalidNodes
node.ParentNode.RemoveChild(node, True)
Next
End If
Return htmlDoc.DocumentNode.WriteContentTo()
End Function
Private Shared Function CreateXPath() As String
Dim _xPath As String = String.Empty
For i As Integer = 0 To DeletableNodesXpath.Count - 1
If i IsNot DeletableNodesXpath.Count - 1 Then
_xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
Else
_xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
End If
Next
Return _xPath
End Function
End Class
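Can someone please help?
Have you tried a C# to VB converter? –
Hi Meltdown, yes, I found your VB version here [link](http://stackoverflow.com/questions/3140919/stripping-all-html-tags-with-html-agility-pack), but it gives an error on the line **If i IsNot DeletableNodesXpath.Count - 1 Then**, saying **'IsNot' requires operands that have reference types, but this operand has the value type 'Integer'** – Kevin
The C# version is here: http://htmlagilitypack.codeplex.com/discussions/215674#post460616 –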
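On the error itself: in VB.NET, `IsNot` compares object references, so comparing two `Integer` values needs `<>` instead (for example `If i <> DeletableNodesXpath.Count - 1 Then`). Below is a minimal sketch of the same XPath-joining step as a standalone module. `XPathJoinSketch` and `BuildXPath` are hypothetical names used only for illustration, not part of Html Agility Pack or the original code, and `String.Join` is swapped in here for the last-index check, assuming the only goal is to separate the `//name` segments with `|`.

```vb
Imports System
Imports System.Collections.Generic

Module XPathJoinSketch
    ' Hypothetical helper: turns node names into "//name1|//name2".
    ' String.Join places the separator only between items, so no
    ' "is this the last index?" comparison is needed.
    Function BuildXPath(names As List(Of String)) As String
        Dim parts As New List(Of String)()
        For Each nodeName As String In names
            parts.Add("//" & nodeName)
        Next
        Return String.Join("|", parts)
    End Function

    Sub Main()
        Dim names As New List(Of String) From {"script", "img"}
        Console.WriteLine(BuildXPath(names)) ' prints //script|//img
    End Sub
End Module
```

If the loop from the original `CreateXPath` is kept as is, changing `IsNot` to `<>` on that one line should be enough to make the comparison compile.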