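Parsing blocks of text with HtmlAgilityPack

I am building a small web analysis tool and need a way to extract every block of text containing more than X words from a given URL.

The method I am currently using looks like this: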

    public string getAllText(string _html)
    {
        string _allText = "";
        try
        {
            HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
            document.LoadHtml(_html);

            var root = document.DocumentNode;
            var sb = new StringBuilder();
            foreach (var node in root.DescendantNodesAndSelf())
            {
                if (!node.HasChildNodes)
                {
                    string text = node.InnerText;
                    if (!string.IsNullOrEmpty(text))
                        sb.AppendLine(text.Trim());
                }
            }

            _allText = sb.ToString();
        }
        catch (Exception)
        {
        }

        _allText = System.Web.HttpUtility.HtmlDecode(_allText);

        return _allText;
    }

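The problem is that I get all of the text back, even when it is menu text, a footer with three words, and so on.

I want to analyze the actual content of the page, so my idea is to somehow keep only the text that is likely to be content (that is, blocks of text with more than X words).

Any ideas on how to achieve this?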

2012-11-17 Jacqueline

Can you post your HTML? – Karthik

The HTML will vary; some pages may put the text in p, – Jacqueline

Well, a first approach could be a simple word-count analysis of each node.InnerText value using the string.Split function:

string[] words; 
words = text.Split((string[]) null, StringSplitOptions.RemoveEmptyEntries);

and append the text only when words.Length is greater than 3.
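The word-count check can be sketched as a small helper that uses only the .NET base class library. This is a minimal sketch, not a definitive implementation: the names `TextBlockFilter`, `CountWords`, `KeepLongBlocks`, and `minWords` are hypothetical and do not come from the question or from HtmlAgilityPack, and a "word" is assumed to be any run of non-whitespace characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of HtmlAgilityPack): keeps only text blocks
// that contain more than a given number of whitespace-separated words.
public static class TextBlockFilter
{
    // Counts words in a block. Passing a null separator array makes
    // string.Split treat any whitespace character as a delimiter.
    public static int CountWords(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return 0;
        return text.Split((string[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Returns the trimmed blocks with more than minWords words, one per line.
    public static string KeepLongBlocks(IEnumerable<string> blocks, int minWords)
    {
        var sb = new StringBuilder();
        foreach (string block in blocks)
        {
            string text = (block ?? "").Trim();
            if (CountWords(text) > minWords)
                sb.AppendLine(text);
        }
        return sb.ToString();
    }
}
```

In the loop from the question, each `node.InnerText` value would presumably be one block, so the `sb.AppendLine(text.Trim())` call could be guarded by `TextBlockFilter.CountWords(text) > 3`. The threshold is a judgment call; a larger value filters menus and footers more aggressively but may also drop short headings.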

Also see this question's answer for more tricks on collecting raw text.

2012-11-17 08:56:07
