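Parsing HTML with Html Agility Pack

Can someone please help me with parsing sequential HTML tags using Html Agility Pack in C#? I have two questions, listed below.

In this case, I want to parse the HTML below and store it in a structure (list, stack, etc.) so that I can use the data efficiently.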

<h3> header </h3> 
<p> paragraph 1</p> 
<p> 
<a href="www.google.com">Google</a> 
<a href="www.gizmodo.com">Gizmodo</a> 
</p> 
<ul> 
<li> something is here with a download 
<a href="www.google.com">link</a> 
</li> 
<li> hello 
<img src="www.imagesource.com"/> 
</li> 
</ul>

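How can I process this data in sequential order?

If I use var ParaTags = HtmlDocument.DocumentNode.Descendants("p");, I only get the "p" tags. I don't know how to get the "h3" first and then the "p" tags in order, because the "p" tags are not inside the "h3".

The following code returns all of the hyperlinks: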

var links =
    from para in document.DocumentNode.Descendants("p")
    // GetAttributeValue returns "" when href is missing, so no NullReferenceException
    from hyperLink in para.Descendants("a").Where(x => x.GetAttributeValue("href", "") != "")
    select hyperLink;

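What is the best way to parse and store this mixed content of strings, hyperlinks, and images, so that I can output it efficiently later? A list, a stack? In other words, I want to store every possible piece of content from the HTML and, if possible, keep its formatting, so that once I reload it into the application I can reproduce the content in the proper format.

Thanks!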

2012-08-15 Jerry

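It's not clear what information you want to extract from this HTML and store. Do you want to extract all the 'href' attributes of the hyperlinks? Or the 'href' attributes and the 'src' attributes of the images? – 2012-08-15 07:23:53

I want to get all possible content from that HTML, including the h3, every p, the li items, the href values, and the img src. The formatting too, if possible. Thanks. – Jerry 2012-08-15 07:43:41

If memory serves, you can use an XmlReader on the HtmlDocument class, which would let you read each tag sequentially, in order. But I'm not sure what output you expect. Could you show some of the content you would want rebuilt into the exact HTML? – Pooli 2012-08-15 08:01:01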
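
As a rough sketch of reading the tags in sequence (this is not the XmlReader approach mentioned in the comment above): Html Agility Pack's Descendants() walks the tree in document order, so filtering on element names should return the h3 before the p elements. The SequentialTags class, the Read helper, the wanted set, and the inline HTML string are all assumptions made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SequentialTags
{
    // Hypothetical helper: returns "name: text" for each wanted element,
    // in the order the elements appear in the document.
    public static List<string> Read(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Assumed set of tags of interest; adjust to taste.
        var wanted = new HashSet<string> { "h3", "p", "li" };

        return document.DocumentNode
            .Descendants()                          // depth-first, document order
            .Where(node => wanted.Contains(node.Name))
            .Select(node => node.Name + ": " + node.InnerText.Trim())
            .ToList();
    }

    public static void Main()
    {
        // Shortened version of the HTML from the question.
        const string html =
            "<h3> header </h3><p> paragraph 1</p><ul><li> hello </li></ul>";

        foreach (var line in Read(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

Filtering a single Descendants() pass, rather than calling Descendants("p") and Descendants("h3") separately, is what keeps the relative order of the different tags.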

If you want to extract all the href and src attributes, you can try this:

using System;
using System.Linq;
using HtmlAgilityPack;

public class Program
{
    static void Main()
    {
        var document = new HtmlDocument();
        document.Load("test.html");

        // Visit every node in document order and keep those with an href or src attribute.
        var links =
            from element in document.DocumentNode.Descendants()
            let href = element.Attributes["href"]
            let src = element.Attributes["src"]
            where href != null || src != null
            select href != null ? href.Value : src.Value;

        foreach (var link in links)
        {
            Console.WriteLine(link);
        }
    }
}

Output:

www.google.com 
www.gizmodo.com 
www.google.com 
www.imagesource.com

2012-08-15 07:29:19

But I also need to extract the text of the h3 and the p! – Jerry 2012-08-15 07:41:40
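
One possible way to keep the text together with the links and images, sketched under assumptions: walk the nodes in document order and append each piece to a flat list of typed items. The ContentItem type, its Kind strings, and the Extract helper are hypothetical names for illustration, not part of Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical record for one piece of content; not part of Html Agility Pack.
public class ContentItem
{
    public string Kind { get; set; }    // "text", "link", or "image"
    public string Parent { get; set; }  // name of the enclosing element, e.g. "h3", "p", "li", "a"
    public string Value { get; set; }   // the text, the href, or the src
}

public static class ContentExtractor
{
    public static List<ContentItem> Extract(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        var items = new List<ContentItem>();

        // Descendants() enumerates in document order, so the list keeps the page order.
        foreach (var node in document.DocumentNode.Descendants())
        {
            string parent = node.ParentNode.Name;

            if (node.NodeType == HtmlNodeType.Text)
            {
                // Skip whitespace-only text nodes between tags.
                string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (text.Length > 0)
                    items.Add(new ContentItem { Kind = "text", Parent = parent, Value = text });
            }
            else if (node.Name == "a")
            {
                items.Add(new ContentItem { Kind = "link", Parent = parent, Value = node.GetAttributeValue("href", "") });
            }
            else if (node.Name == "img")
            {
                items.Add(new ContentItem { Kind = "image", Parent = parent, Value = node.GetAttributeValue("src", "") });
            }
        }

        return items;
    }

    public static void Main()
    {
        const string html =
            "<h3> header </h3><p><a href=\"www.google.com\">Google</a></p>" +
            "<ul><li> hello <img src=\"www.imagesource.com\"/></li></ul>";

        foreach (var item in Extract(html))
        {
            Console.WriteLine(item.Kind + " (" + item.Parent + "): " + item.Value);
        }
    }
}
```

A flat list is used here because it is simple to replay in order. If the exact markup has to be reproduced, storing each top-level node's OuterHtml alongside (or instead of) these items may be a simpler option.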
