Using a regular expression to get the text between multiple HTML tags

Using a regular expression, I want to be able to get the text between multiple DIV tags. For example, the following:

<div>first html tag</div> 
<div>another tag</div>

would output:

first html tag 
another tag

The regex pattern I am using matches only my last div tag and misses the first one. Code:

static void Main(string[] args) 
    { 
     string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>"; 
     string pattern = "(<div.*>)(.*)(<\\/div>)"; 

     MatchCollection matches = Regex.Matches(input, pattern); 
     Console.WriteLine("Matches found: {0}", matches.Count); 

     if (matches.Count > 0) 
      foreach (Match m in matches) 
       Console.WriteLine("Inner DIV: {0}", m.Groups[2]); 

     Console.ReadLine(); 
    }

Output:

Matches found: 1
Inner DIV: This is ANOTHER test

2013-04-14 ben

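Is it imperative that you use a regex for this task? HTML is a context-free grammar and cannot be parsed with regular expressions. Usually you can get close, but you would be better off using an HTML parser. See http://stackoverflow.com/a/1732454/2022565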

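Replace your pattern with a non-greedy match. In your pattern .* is greedy, so <div.*> runs from the first <div to the last > that still lets the rest of the pattern match, which is why only the final div is captured. Adding ? makes each quantifier stop at the earliest possible point:

static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    string pattern = "<div.*?>(.*?)<\\/div>";

    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);

    if (matches.Count > 0)
        foreach (Match m in matches)
            Console.WriteLine("Inner DIV: {0}", m.Groups[1]);

    Console.ReadLine();
}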

2013-04-14 23:19:07 coolmine

It finds two matches, but my program displays empty value(s) – ben

The code above should work. Note that it uses m.Groups[1] rather than m.Groups[2], because I have no reason to capture the tags themselves. http://www.rubular.com/r/XQrcobmfAK – coolmine
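A minimal sketch comparing the two patterns from this thread on the question's input may make the difference concrete. The class name GreedyVsLazyDemo and the helper Capture are hypothetical names introduced only for this illustration; the patterns and the input string come from the posts above.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns the text of the given capture group for every match.
    public static string[] Capture(string input, string pattern, int group)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Groups[group].Value;
        return result;
    }

    static void Main()
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";

        // Greedy: "<div.*>" extends to the last '>' that still allows a match,
        // so the whole input collapses into a single match.
        string[] greedy = Capture(input, "(<div.*>)(.*)(</div>)", 2);

        // Lazy: each quantifier stops as early as possible, giving one match per div.
        // Only one group is captured here, hence group index 1.
        string[] lazy = Capture(input, "<div.*?>(.*?)</div>", 1);

        Console.WriteLine("Greedy: {0}", string.Join(" | ", greedy));
        Console.WriteLine("Lazy:   {0}", string.Join(" | ", lazy));
    }
}
```

With the lazy pattern, asking for group 2 returns an empty string for every match because that group does not exist, which is one plausible cause of the empty values mentioned in the comment above.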

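First, remember that an HTML file will contain newline characters ("\n"), which you did not include in the string you are using to check your regex.

Second, take your regex: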

((<div.*>)(.*)(<\\/div>))+ //This Regex will look for any amount of div tags, but it must see at least one div tag. 

((<div.*>)(.*)(<\\/div>))* //This regex will look for any amount of div tags, and it will not complain if there are no results at all.
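On the newline point above: by default the dot in a .NET regex does not match "\n", so a div whose content spans several lines is skipped unless RegexOptions.Singleline is set. The following is only a sketch; the class name NewlineDemo, the helper CountDivs, and the sample input are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class NewlineDemo
{
    // Hypothetical helper: counts div matches under the given regex options.
    public static int CountDivs(string html, RegexOptions options)
    {
        return Regex.Matches(html, "<div[^>]*>(.*?)</div>", options).Count;
    }

    static void Main()
    {
        // The content of the second div spans two lines.
        string html = "<div>first html tag</div>\n<div>another\ntag</div>";

        // Without Singleline, '.' stops at '\n', so only the first div matches.
        Console.WriteLine(CountDivs(html, RegexOptions.None));       // prints 1

        // With Singleline, '.' also matches '\n', so both divs match.
        Console.WriteLine(CountDivs(html, RegexOptions.Singleline)); // prints 2
    }
}
```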

These are also good places to look for this kind of information:

http://www.regular-expressions.info/reference.html

http://www.regular-expressions.info/refadv.html

Mayman

2013-04-14 23:20:19 Mayman

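The short version is that you cannot do this correctly in all cases. There will always be some valid HTML for which a regular expression fails to extract the information you want.

The reason is that HTML is a context-free grammar, which is more expressive than anything a regular expression can describe.

Here is an example: what if you have multiple nested divs?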

<div><div>stuff</div><div>stuff2</div></div>

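Depending on how they are written, the regexes listed in the other answers could grab any of:

<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>

because that is what regexes do when they try to parse HTML.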

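You cannot write a regular expression that understands how to interpret all of these cases, because regular expressions are not capable of it. If you are dealing with a very specific, constrained set of HTML, it may be possible, but you should keep this limitation in mind.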
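As a concrete check, here is a small sketch that runs a non-greedy pattern of the kind suggested in the other answers against the nested example. The class name NestedDivDemo and the helper InnerTexts are hypothetical. Regex.Matches returns non-overlapping matches, so in this run only two of the candidates listed above appear, and the first capture is not the text of any single div.

```csharp
using System;
using System.Text.RegularExpressions;

static class NestedDivDemo
{
    // Hypothetical helper: applies a non-greedy div pattern and returns group 1 of each match.
    public static string[] InnerTexts(string html)
    {
        MatchCollection matches =
            Regex.Matches(html, "<div[^>]*>(.*?)</div>", RegexOptions.Singleline);
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Groups[1].Value;
        return result;
    }

    static void Main()
    {
        string html = "<div><div>stuff</div><div>stuff2</div></div>";

        // The regex pairs the outer opening tag with the first closing tag,
        // because it has no notion of nesting depth.
        foreach (string text in InnerTexts(html))
            Console.WriteLine("[{0}]", text);
        // prints:
        // [<div>stuff]
        // [stuff2]
    }
}
```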

2013-04-14 23:28:30

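Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?

CsQuery also looks quite useful (it basically uses a CSS-selector-style syntax to get elements). See https://stackoverflow.com/a/11090816/618649.

CsQuery is basically "jQuery for C#", which is pretty much the exact search term I used to find it.

If you could do this in a web browser, you could easily use jQuery, with syntax like $("div").each(function(idx){ alert(idx + ": " + $(this).text()); }) (only you would obviously output the results to a log or to the screen, or use them in a web service call, or whatever you need to do).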
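To illustrate the parser-based approach without an external package, here is a sketch that uses XElement from System.Xml.Linq as a stand-in. This is not the Html Agility Pack or CsQuery API, and it assumes the markup is well-formed XML with a single root element, which real-world HTML often is not. The class name ParserDemo, the helper DivTexts, and the wrapping body element are assumptions made for this example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class ParserDemo
{
    // Hypothetical helper: parses well-formed markup and returns the text of every div.
    // XElement.Parse throws an XmlException on markup that is not well-formed XML.
    public static string[] DivTexts(string xhtml)
    {
        XElement root = XElement.Parse(xhtml);
        return root.Descendants("div").Select(d => d.Value).ToArray();
    }

    static void Main()
    {
        // A root element is required, so the divs are wrapped in <body>.
        string xhtml = "<body><div>first html tag</div>"
                     + "<div class=\"something\">another tag</div></body>";

        foreach (string text in DivTexts(xhtml))
            Console.WriteLine(text);
    }
}
```

Because a parser tracks nesting, nested divs are returned as separate elements in document order, with the outer div's Value being the concatenated text of its descendants.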

2013-04-15 01:55:31 Craig

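A downvote with no explanation or comment. Thanks! The fact is that HTML/XML is very painful to process with regular expressions. It is not that you cannot do it, and I certainly have on plenty of occasions, but CSS selector syntax is a much cleaner proposition. – Craig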

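I think this code should work:

string htmlSource = "<div>first html tag</div><div>another tag</div>";
// [^>]*? skips any attributes on the opening tag; (.*?) lazily captures the inner text.
string pattern = @"<div[^>]*?>(.*?)</div>";
// IgnoreCase also matches <DIV>; Singleline lets '.' match newlines.
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
List<string> l = new List<string>(); // needs using System.Collections.Generic
foreach (Match match in matches)
{
    l.Add(match.Groups[1].Value);
}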

2014-07-15 03:12:09

Since the others did not mention HTML tags with attributes, here is my solution for handling them:

// <TAG(.*?)>(.*?)</TAG> 
// Example 
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>"); 
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!"); 
Console.Write(m.Groups[2].Value); // will print -> World
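The template in the comment above can be turned into a small helper that takes the tag name as a parameter. This is only a sketch: the class name TagTextDemo and the method GetTagContents are hypothetical, and two details differ from the snippet above. It uses [^>]* instead of a captured (.*?) for the attributes, so the inner text is group 1, and it adds \b after the tag name so that a search for h1 does not also match a longer name that merely starts with h1. It still shares the nesting limitation described earlier in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagTextDemo
{
    // Hypothetical helper: returns the inner text of every <tag ...>...</tag> pair.
    public static List<string> GetTagContents(string html, string tag)
    {
        // Escape the tag name so it is treated literally inside the pattern.
        string name = Regex.Escape(tag);
        string pattern = "<" + name + @"\b[^>]*>(.*?)</" + name + ">";

        var contents = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern,
                     RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            contents.Add(m.Groups[1].Value);
        }
        return contents;
    }

    static void Main()
    {
        List<string> found =
            GetTagContents("Hello <h1 style='color: red;'>World</h1> !!", "h1");
        Console.WriteLine(string.Join(", ", found)); // prints World
    }
}
```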

2016-10-01 11:58:41
