regex
  • html-content-extraction
  • text-extraction
  • 2011-03-16 70 views 2 likes 
    2

    Consider the following HTML. How can I use a regex to extract values from it?

    <p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq: <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p> 
    

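    I want to get the values inside the <span> elements. I also want to get the value of the class attribute on each <span> element.

    Ideally, I could run some HTML through a function and get back a dictionary of extracted entities (based on the <span> parsing described above).

    The code above is a snippet from a larger source HTML file that does not parse with an XML parser. So I am looking for a regular expression that could help extract the information of interest.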

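    What programming language are you using? There are libraries that accept HTML that is not valid XML and still let you query it with XPath expressions and the like. – 2011-03-16 15:26:37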

    Programming language = .NET – 2011-03-16 15:32:40

    Answers

    6

    Use this tool (free): http://www.radsoftware.com.au/regexdesigner/

    Use this expression:

    "<span[^>]*>(.*?)</span>" 
    

    Group 1 of each match will contain the text you need.

    In C# it would look like this:

      // Requires: using System.Text.RegularExpressions;
      Regex regex = new Regex("<span[^>]*>(.*?)</span>");
      string toMatch = "<span class=\"ajjsjs\">Some text</span>";

      // Matches returns an empty collection when nothing matches,
      // so no separate IsMatch check is needed.
      foreach (Match m in regex.Matches(toMatch))
      {
          // Group 1 holds the text between <span ...> and </span>
          string val = m.Groups[1].Value;
          // Do something with the value
      }
    

    Amended in response to the comment:

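      // Requires: using System.Text.RegularExpressions;
      // Note: this pattern expects class to be the only attribute on the span.
      Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
      string toMatch = "<span class=\"ajjsjs\">Some text</span>";

      foreach (Match m in regex.Matches(toMatch))
      {
          // "class" is a reserved word in C#, so the variable is named className
          string className = m.Groups[1].Value; // value of the class attribute
          string val = m.Groups[2].Value;       // text inside the span
          // Do something with the class and value
      }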
    
    My sample code does not work for nested spans, but then again there are none in the sample HTML you provided. – 2011-03-16 16:03:57

    This works for getting the values, thanks. Do you have any idea how I could get the value of the "class" attribute as well? – 2011-03-16 16:09:36

    This is exactly what I was looking for. You rock! Thanks – 2011-03-16 16:21:41

    2

    Assuming you have no nested span tags, the following should work:

    /<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/

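    I have only done some basic testing on it, but it will match the span tag's class (if present) as well as the data up to the point where the tag is closed.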
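
    As a rough sketch of how the pattern above might be used from C# to build the dictionary the question asks for: the class name SpanExtractor, the method name Extract, and the choice to key the dictionary by class name with a list of values per class are all assumptions made for illustration, not part of the original answer. Spans without a class end up under the empty-string key.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (names are illustrative only).
public static class SpanExtractor
{
    // Same pattern as above: group 1 = class attribute (optional),
    // group 2 = text between the opening and closing span tags.
    private static readonly Regex SpanPattern = new Regex(
        "<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns class name -> all span values found with that class.
    // Like the pattern itself, this assumes there are no nested spans.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match m in SpanPattern.Matches(html))
        {
            string className = m.Groups[1].Value; // "" when the span has no class
            string value = m.Groups[2].Value;

            List<string> values;
            if (!result.TryGetValue(className, out values))
            {
                values = new List<string>();
                result[className] = values;
            }
            values.Add(value);
        }
        return result;
    }
}
```

    Calling SpanExtractor.Extract(html) on the snippet from the question would group the dates under "xn-chron" and the amounts under "xn-money". A list per key is used because the same class appears several times in that snippet.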

    Cool. Do you have any idea how I could use this in C# to return a dictionary of extracted values? Thanks. – 2011-03-16 15:50:40

    1

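    I strongly recommend using a real HTML or XML parser instead. You cannot reliably parse HTML or XML with regular expressions - the most you can do is get close, and the closer you get, the more complex and time-consuming your regex becomes. If you have a large HTML file to parse, it is quite likely to break any simple regex pattern.

    A regex like <span[^>]*>(.*?)</span> will work on your example, but there is a lot of valid markup that is difficult or even impossible to parse with a regex (for example, <span>foo <span>bar</span></span> will break the pattern above). If you want something that also works on other HTML samples, a regex is not the way to go here.

    Since your HTML is not valid XML, consider the HTML Agility Pack, which I have heard is very good.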
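
    To make the nested-span failure concrete, here is a small sketch (the class and method names are made up for illustration). The lazy group stops at the first closing tag it meets, so the capture contains a stray opening tag and loses the end of the outer span.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative only: shows what the simple pattern captures on nested spans.
public static class NestedSpanDemo
{
    // Returns group 1 of the first match, or null if nothing matches.
    public static string FirstCapture(string html)
    {
        Match m = Regex.Match(html, "<span[^>]*>(.*?)</span>");
        return m.Success ? m.Groups[1].Value : null;
    }
}

// NestedSpanDemo.FirstCapture("<span>foo <span>bar</span></span>")
// returns "foo <span>bar" rather than the full content of the outer span.
```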
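
    A minimal sketch of what this could look like, assuming the HtmlAgilityPack NuGet package is referenced. The class name AgilityPackSpanExtractor, the method name Extract, and the dictionary shape (class name mapped to a list of texts) are assumptions for illustration and do not come from the answer itself.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed via NuGet

// Hypothetical helper (names are illustrative only).
public static class AgilityPackSpanExtractor
{
    // Returns class name -> text of every span carrying that class.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not valid XML

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span[@class]");
        if (spans == null)
        {
            return result;
        }

        foreach (HtmlNode span in spans)
        {
            string className = span.GetAttributeValue("class", "");
            // InnerText drops nested tags; DeEntitize decodes entities such as &#160;
            string text = HtmlEntity.DeEntitize(span.InnerText).Trim();

            List<string> values;
            if (!result.TryGetValue(className, out values))
            {
                values = new List<string>();
                result[className] = values;
            }
            values.Add(text);
        }
        return result;
    }
}
```

    Unlike the regex approach, this handles nested spans: each span is visited as its own node, and the outer span's InnerText includes the text of its children.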
