Using Match in C#

I want to extract the contents of a string between two delimiter strings. Say I'm parsing the following HTML string:

<html> 
    <head> 
     RANDOM JAVASCRIPT AND CSS AHHHHHH!!!!!!!! 
    </head> 
    <body> 
     <table class="table"> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
      <tr><a href="/subdir/members/Name">Name</a></tr> 
     </table> 
    </body> 
</html>

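and I want to isolate the contents of the table (everything inside the element with the table class).

Right now I'm using regular expressions to accomplish this: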

string pagesource = (method that extracts the html source and stores it into a string); 
string[] splitSource = Regex.Split(pagesource, "<table class=\"member\">"); 
string[] memberList = Regex.Split(splitSource[1], "</table>"); 
// the contents of the table will be in memberList[0] 
//method to extract links from the table 
ExtractLinks(memberList[0]);

I've been looking for other ways to do this extraction, and I came across the Match object in C#.

I tried to do something like this:

Match match = Regex.Match(pageSource, "<table class=\"members\">(.|\n)*?</table>");

The goal was to extract the text matched between the two delimiters, but when I run it, the match value is:

match.Value = </table>

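My question, therefore, is: is there a way to extract the data from my string that is slightly easier, more readable, or shorter than my regex approach? For this simple example regex is fine, but for more complex examples I find myself coding the equivalent of scribbles all over the screen.

I'd really like to use Match, because it looks like a very neat class, but I can't seem to get it to work for my needs. Can anyone help me with this?

Thanks very much!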

Source

2012-06-13 gfppaste

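Minor note: the part of the regex between the two table tags should be '(.|\n)*?'. If you don't wrap '.|\n' in parentheses, the '*?' applies only to the character before it (\n in this case). –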
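To make the precedence point concrete, here is a minimal sketch comparing the two patterns. The `AlternationDemo` class and its `Sample` string are made up for illustration; the sample indents the closing tag the way the HTML in the question does. Without parentheses, `|` splits the whole pattern into two alternatives, so the second alternative can match the closing tag by itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Hypothetical sample input; the closing tag is indented, as in the question.
    public const string Sample =
        "<table class=\"members\">\n  <tr>A</tr>\n  </table>";

    // Without parentheses the pattern splits at '|' into two alternatives:
    //   <table class="members">.     or     \n*?</table>
    // '.' does not match '\n' by default, so the first alternative fails here
    // and the second one matches the closing tag alone.
    public static string WithoutGroup()
    {
        return Regex.Match(Sample, "<table class=\"members\">.|\n*?</table>").Value;
    }

    // With parentheses, *? repeats the whole (.|\n) choice, so the match
    // runs from the opening tag to the first closing tag.
    public static string WithGroup()
    {
        return Regex.Match(Sample, "<table class=\"members\">(.|\n)*?</table>").Value;
    }
}
```

With this sample, `WithoutGroup()` returns only `</table>`, which is the symptom described in the question, while `WithGroup()` returns the whole table.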

Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – jrummell

[Don't parse HTML with regex](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) – Shai

Use an HTML parser, such as HTML Agility Pack.

var doc = new HtmlDocument(); 

using (var wc = new WebClient()) 
using (var stream = wc.OpenRead(url)) 
{ 
    doc.Load(stream); 
} 

var table = doc.DocumentNode.Element("html").Element("body").Element("table"); 
string tableHtml = table.OuterHtml;

Source

2012-06-13 13:13:11

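I'm actually trying out HTML Agility Pack, but the lack of documentation is horrible! The new download has no chm either, so to find help I've basically been reading the manifest that ships with it... all in all, it hasn't been a friendly experience! – gfppaste

@gfppaste, there's no real need for documentation; the API is very self-explanatory and very similar to LINQ to XML. I learned it using IntelliSense, and it's quite intuitive. –

You can use XPath with HtmlAgilityPack: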

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument(); 
doc.LoadHtml(s); 
var elements = doc.DocumentNode.SelectNodes("//table[@class='table']"); 

foreach (var ele in elements) 
{ 
    MessageBox.Show(ele.OuterHtml); 
}

Source

2012-06-13 13:19:49

You have to add parentheses to the regex to capture the match:

Match match = Regex.Match(pageSource, "<table class=\"members\">((.|\n)*?)</table>");

Anyway, it seems only Chuck Norris can parse HTML correctly with regex.
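For completeness, a minimal sketch of reading the captured text back out of the Match. The `TableExtractor` class, the `ExtractTableBody` method, and its `cssClass` parameter are illustrative names, not something from the question. It uses `RegexOptions.Singleline` so that `.` also matches newlines, which replaces the `(.|\n)` alternation, and it assumes the opening tag is written exactly as `<table class="...">` with no other attributes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TableExtractor
{
    // Returns the text between <table class="..."> and the first following
    // </table>, or null when the pattern does not match.
    // "cssClass" is a hypothetical parameter added for this sketch.
    public static string ExtractTableBody(string pageSource, string cssClass)
    {
        string pattern =
            "<table class=\"" + Regex.Escape(cssClass) + "\">(.*?)</table>";

        // Singleline makes '.' match '\n' too, so (.|\n) is not needed.
        Match match = Regex.Match(pageSource, pattern, RegexOptions.Singleline);

        // Groups[0] is the whole match; Groups[1] is the first parenthesised
        // group, i.e. only the text between the two delimiters.
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Checking `match.Success` before reading the group avoids silently passing an empty string on to something like `ExtractLinks` when the page layout changes. This is still fragile against attribute order, extra whitespace, and nested tables, which is why the parser-based answers are the safer route.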

Source

2012-06-13 13:20:59
