Extracting text content with Html Agility Pack

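I'll try to be as specific as I can. I'm basically writing a crawler in VB.NET, and I'm mainly interested in extracting the text content of a page. My application currently uses a WebBrowser control to load the HTML source of the body into a text box, as follows: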

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click 
    Dim url As String = "<url>" 
    WebBrowser1.Navigate(url) 
End Sub 

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted 
    TextBox2.Text = WebBrowser1.Document.Body.OuterHtml 
End Sub

From here, TextBox2 holds the HTML along with all the junk in it (hrefs, imgs, ads, scripts and so on), but I need to strip all of that markup and get plain text.

I could apply regular expressions to strip all of that out, but I think HAP, being an HTML parser, is better suited to the job.
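For reference, a minimal sketch of what the parser-based approach could look like, assuming the HtmlAgilityPack assembly is referenced. The module name `PlainTextDemo` and the function name `ExtractText` are made up for illustration, and the choice to drop only `script` and `style` elements is an assumption; a real page may need more rules.

```vbnet
Imports System
Imports HtmlAgilityPack

Module PlainTextDemo
    ' Hypothetical helper: returns the visible text of an HTML string.
    Function ExtractText(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim junk As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//script|//style")
        If junk IsNot Nothing Then
            For Each node As HtmlNode In junk
                ' Remove the element together with its contents.
                node.Remove()
            Next
        End If

        ' InnerText concatenates the remaining text nodes;
        ' DeEntitize turns entities such as &amp; back into characters.
        Return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim()
    End Function

    Sub Main()
        Console.WriteLine(ExtractText("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
    End Sub
End Module
```

In the WebBrowser handler above, this would be called as `TextBox2.Text = PlainTextDemo.ExtractText(WebBrowser1.Document.Body.OuterHtml)`.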

Searching here led me to this page, which discusses a whitelist technique suggested by "Meltdown":

HTML Agility Pack strip tags NOT IN whitelist

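But how do I apply it in VB.NET, since it seems like a good idea?

Could someone please advise?

Edit: I found a VB.NET version of the code, shown below, but there seems to be an error at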

If i IsNot DeletableNodesXpath.Count - 1 Then

Error: 'IsNot' requires operands that have reference types, but this operand has the value type 'Integer'.

Here is the code:

Public NotInheritable Class HtmlSanitizer 
Private Sub New() 
End Sub 

Private Shared ReadOnly Whitelist As IDictionary(Of String, String()) 
Private Shared DeletableNodesXpath As New List(Of String)() 

Shared Sub New() 
    Whitelist = New Dictionary(Of String, String())() From { _ 
     {"a", New() {"href"}}, _ 
     {"strong", Nothing}, _ 
     {"em", Nothing}, _ 
     {"blockquote", Nothing}, _ 
     {"b", Nothing}, _ 
     {"p", Nothing}, _ 
     {"ul", Nothing}, _ 
     {"ol", Nothing}, _ 
     {"li", Nothing}, _ 
     {"div", New() {"align"}}, _ 
     {"strike", Nothing}, _ 
     {"u", Nothing}, _ 
     {"sub", Nothing}, _ 
     {"sup", Nothing}, _ 
     {"table", Nothing}, _ 
     {"tr", Nothing}, _ 
     {"td", Nothing}, _ 
     {"th", Nothing} _ 
    } 
End Sub 

Public Shared Function Sanitize(input As String) As String 
    If input.Trim().Length < 1 Then 
     Return String.Empty 
    End If 
    Dim htmlDocument = New HtmlDocument() 

    htmldocument.LoadHtml(input) 
    SanitizeNode(htmldocument.DocumentNode) 
    Dim xPath As String = HtmlSanitizer.CreateXPath() 

    Return StripHtml(htmldocument.DocumentNode.WriteTo().Trim(), xPath) 
End Function 

Private Shared Sub SanitizeChildren(parentNode As HtmlNode) 
    For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1 
     SanitizeNode(parentNode.ChildNodes(i)) 
    Next 
End Sub 

Private Shared Sub SanitizeNode(node As HtmlNode) 
    If node.NodeType = HtmlNodeType.Element Then 
     If Not Whitelist.ContainsKey(node.Name) Then 
      If Not DeletableNodesXpath.Contains(node.Name) Then 
       'DeletableNodesXpath.Add(node.Name.Replace("?","")); 
       node.Name = "removeableNode" 
       DeletableNodesXpath.Add(node.Name) 
      End If 
      If node.HasChildNodes Then 
       SanitizeChildren(node) 
      End If 

      Return 
     End If 

     If node.HasAttributes Then 
      For i As Integer = node.Attributes.Count - 1 To 0 Step -1 
       Dim currentAttribute As HtmlAttribute = node.Attributes(i) 
       Dim allowedAttributes As String() = Whitelist(node.Name) 
       If allowedAttributes IsNot Nothing Then 
        If Not allowedAttributes.Contains(currentAttribute.Name) Then 
         node.Attributes.Remove(currentAttribute) 
        End If 
       Else 
        node.Attributes.Remove(currentAttribute) 
       End If 
      Next 
     End If 
    End If 

    If node.HasChildNodes Then 
     SanitizeChildren(node) 
    End If 
End Sub 

Private Shared Function StripHtml(html As String, xPath As String) As String 
    Dim htmlDoc As New HtmlDocument() 
    htmlDoc.LoadHtml(html) 
    If xPath.Length > 0 Then 
     Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath) 
     For Each node As HtmlNode In invalidNodes 
      node.ParentNode.RemoveChild(node, True) 
     Next 
    End If 
    Return htmlDoc.DocumentNode.WriteContentTo() 


End Function 

Private Shared Function CreateXPath() As String 
    Dim _xPath As String = String.Empty 
    For i As Integer = 0 To DeletableNodesXpath.Count - 1 
     If i IsNot DeletableNodesXpath.Count - 1 Then 
      _xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString()) 
     Else 
      _xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString()) 
     End If 
    Next 
    Return _xPath 
End Function 
End Class

Could someone please help?

Source

2011-07-26 Kevin

Have you tried a C# to VB converter?

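Hi 消融, yes, I found your VB version here: [link](http://stackoverflow.com/questions/3140919/stripping-all-html-tags-with-html-agility-pack), but it gives an error on the line **If i IsNot DeletableNodesXpath.Count - 1 Then**, saying **'IsNot' requires operands that have reference types, but this operand has the value type 'Integer'** – Kevin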

The C# version is here: http://htmlagilitypack.codeplex.com/discussions/215674#post460616

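Instead of using IsNot, just use <>. You're basically checking that the value of one integer is not equal to the value of another integer minus 1.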

I believe IsNot can't be used with integers; it compares object references, so it only works on reference types.
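As a rough sketch of that fix, here is the XPath-building loop with <> in place of IsNot, written as a standalone module so it runs without Html Agility Pack. The module name, the function names, and the sample node names ("script", "img") are made up for illustration; in the question's class the list would be DeletableNodesXpath. The second function is an optional alternative, assuming .NET 4 or later, that avoids the index comparison altogether.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module XPathDemo
    ' Same loop as in the question; the only change is <> instead of IsNot.
    Function CreateXPathWithLoop(names As List(Of String)) As String
        Dim xPath As String = String.Empty
        For i As Integer = 0 To names.Count - 1
            If i <> names.Count - 1 Then
                ' Not the last entry: append a trailing "|" separator.
                xPath &= String.Format("//{0}|", names(i))
            Else
                xPath &= String.Format("//{0}", names(i))
            End If
        Next
        Return xPath
    End Function

    ' Alternative: let String.Join place the separators.
    Function CreateXPathWithJoin(names As List(Of String)) As String
        Return String.Join("|", names.Select(Function(n) "//" & n))
    End Function

    Sub Main()
        Dim names As New List(Of String) From {"script", "img"}
        Console.WriteLine(CreateXPathWithLoop(names)) ' prints //script|//img
        Console.WriteLine(CreateXPathWithJoin(names)) ' prints //script|//img
    End Sub
End Module
```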

Edit: I just noticed this is super, super old. I only just saw the July 26 date!

Source

2012-07-26 09:07:58 ianbailey
