How can I extract values from HTML using RegEx?

Consider the following HTML:

<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq: <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>

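I would like to get the values inside the <span> elements. I would also like to get the value of the class attribute on each <span> element.

Ideally, I could run some HTML through a function and get back a dictionary of the extracted entities (based on the parsing described above).

The HTML above is a snippet from a larger source file that cannot be parsed with an XML parser, so I am looking for a regular expression that can extract the information of interest.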


2011-03-16 Paul Fryer

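What programming language are you using? There are libraries that accept HTML that is not valid XML and still let you query it with XPath expressions and the like. – 2011-03-16 15:26:37

Programming language = .NET – 2011-03-16 15:32:40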

Use this tool (free): http://www.radsoftware.com.au/regexdesigner/

Use this expression:

"<span[^>]*>(.*?)</span>"

The value of group 1 (in each match) will be the text you need.

In C# it would look like this:

    using System.Text.RegularExpressions;

    // Group 1 lazily captures the text between the opening and closing span tags.
    Regex regex = new Regex("<span[^>]*>(.*?)</span>");
    string toMatch = "<span class=\"ajjsjs\">Some text</span>";
    // Matches returns an empty collection when nothing matches, so no IsMatch check is needed.
    foreach (Match m in regex.Matches(toMatch))
    {
        string val = m.Groups[1].Value;
        // Do something with the value
    }

Amended in response to the comment:

    // Group 1 captures the class attribute, group 2 the inner text.
    // Assumes class is the span's only attribute and is double-quoted.
    Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
    string toMatch = "<span class=\"ajjsjs\">Some text</span>";
    foreach (Match m in regex.Matches(toMatch))
    {
        // "class" is a reserved word in C#, so the variable needs another name.
        string cssClass = m.Groups[1].Value;
        string val = m.Groups[2].Value;
        // Do something with the class and value
    }


2011-03-16 15:53:22

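My sample code won't work for nested spans, but then again there aren't any in the sample HTML you provided. – 2011-03-16 16:03:57

This works for getting the values, thanks. Do you have any idea how I could get the value of the "class" attribute as well? – 2011-03-16 16:09:36

This is exactly what I was looking for - you rock! Thanks – 2011-03-16 16:21:41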

Assuming you have no nested span tags, the following should work:

/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/

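I have only done some basic testing on it, but it will match the span tag's class (if there is one) along with the data, up to the point where the tag is closed.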


2011-03-16 15:39:35

Cool, do you have any idea how I could use this in C# to return a dictionary of the extracted values? Thanks. – 2011-03-16 15:50:40
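One way the pattern above might be used from C# to build such a dictionary is sketched below. The class name SpanExtractor, the method name Extract, and the choice to key the dictionary by class name with a list of values per class are all assumptions made for illustration; none of them come from the thread. Spans without a class attribute are filed under an empty-string key.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SpanExtractor
{
    // Same pattern as in the answer above, without the /.../ delimiters:
    // group 1 = class attribute (optional), group 2 = inner text.
    private static readonly Regex SpanPattern = new Regex(
        "<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns class name -> all values found in spans with that class.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match m in SpanPattern.Matches(html))
        {
            // Group 1 does not participate when the span has no class attribute.
            string cssClass = m.Groups[1].Success ? m.Groups[1].Value : "";
            string value = m.Groups[2].Value.Trim();

            List<string> values;
            if (!result.TryGetValue(cssClass, out values))
            {
                values = new List<string>();
                result[cssClass] = values;
            }
            values.Add(value);
        }
        return result;
    }
}
```

A list per class is used because the sample HTML repeats classes such as xn-money several times, so a plain Dictionary<string, string> would silently keep only one value per class. Like the pattern itself, this sketch does not handle nested spans.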

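I strongly suggest you use a real HTML or XML parser instead. You cannot reliably parse HTML or XML with regular expressions - the most you can do is come close, and the closer you get, the more complicated and time-consuming your regex becomes. If you have a large HTML file to parse, it is very likely to break any simple regex pattern.

A regular expression like <span[^>]*>(.*?)</span> will work on your example, but there is a lot of valid XML that is difficult or even impossible to parse with a regex (for example, a nested span such as <span>foo <span>bar</span></span> will break the pattern above). If you want something that will work on other HTML samples, a regex is not the way to go here.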
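To make the nested-span failure concrete, here is a small sketch that runs the simple pattern against a nested input. The class name NestedSpanDemo, the method name FirstCapture, and the input string are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of any library.
public static class NestedSpanDemo
{
    // Returns what group 1 of the simple pattern captures, or null if there is no match.
    public static string FirstCapture(string html)
    {
        Match m = Regex.Match(html, "<span[^>]*>(.*?)</span>");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling FirstCapture("<span>foo <span>bar</span></span>") returns the string "foo <span>bar": the lazy quantifier stops at the first closing tag, which belongs to the inner span, so the capture contains a stray opening tag and the outer span is never closed from the regex's point of view.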

Since your HTML is not valid XML, consider the HTML Agility Pack, which I have heard is very good.
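For comparison, a rough sketch of the same extraction with the HTML Agility Pack might look like the following. It assumes the HtmlAgilityPack NuGet package is installed; the class name SpanParser, the method name Extract, and the dictionary shape (class name mapped to a list of values) are assumptions for illustration, not something specified in the thread.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed

// Hypothetical helper, not part of the HTML Agility Pack itself.
public static class SpanParser
{
    // Returns class name -> text of every span carrying that class attribute.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of HTML that is not well-formed XML

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span[@class]");
        if (spans == null)
        {
            return result;
        }

        foreach (HtmlNode span in spans)
        {
            string cssClass = span.GetAttributeValue("class", "");
            // InnerText drops nested markup; DeEntitize decodes entities such as &#160;.
            string text = HtmlEntity.DeEntitize(span.InnerText).Trim();

            List<string> values;
            if (!result.TryGetValue(cssClass, out values))
            {
                values = new List<string>();
                result[cssClass] = values;
            }
            values.Add(text);
        }
        return result;
    }
}
```

Because the parser builds a node tree, nested spans are each visited as their own node, and attribute order or quoting style does not matter.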


2011-03-16 15:53:18
