Extracting strings in C#

-1

I have an HTML file containing markup like this:

<div style="border: 0px red solid; width: 633px; position: relative; margin: 0px; 
                   float: right"> 
                   <font style="font-size: 8pt; color: Navy; font-weight: Bold;">Unit Name: </font>My Unit Name&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font style="font-size: 8pt; color: Navy; font-weight: Bold;"> 
                    Manager: </font>My Manager Name&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <font style="font-size: 8pt; 
                     color: Navy; font-weight: Bold;">Category: </font>My Category 
                  </div> 
                  <div style="border: 0px red solid; width: 122px; position: relative; margin: 0px; 
                   padding: 0px;"> 
                   <button name="sSdewfwo87kjLKH7624QAZMLLPIdyt75576rtffTfdef22de" style="font-family: Tahoma;" 
                    onclick="OpenMyWin2(1,843442,8445,'bf61fd588f00cbe7a37dab20c62e1c63')"> 
                    More Info</button></div>

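I want to extract the information that follows Category:, Manager:, and Unit Name:. How can I use a regular expression to extract these values from a large HTML file? The file may contain 100 similar items.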

Source

2012-09-26 Ehsan Sadeghi

The best way to handle this kind of task is with a dedicated library such as the [HTML Agility Pack](http://htmlagilitypack.codeplex.com) – Steve

Parsing HTML with regular expressions is a no-no. For a laugh, read [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) –

I would suggest you consider using this tool: http://htmlagilitypack.codeplex.com/

It lets you easily parse any HTML you want.
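As a rough illustration of that suggestion, the sketch below walks the `<font>` label elements and reads the text node that follows each one. It assumes the HtmlAgilityPack NuGet package is referenced and that the code is compiled as C# 9 or later (top-level statements). The shortened `html` sample and the `//font` XPath are assumptions based on the markup in the question, not something the answer itself specifies.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

// Shortened sample shaped like the question's markup (an assumption for illustration).
string html =
    "<div><font style=\"color: Navy;\">Unit Name: </font>My Unit Name&nbsp;&nbsp; " +
    "<font style=\"color: Navy;\">\n Manager: </font>My Manager Name&nbsp;&nbsp; " +
    "<font style=\"color: Navy;\">Category: </font>My Category \n</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectNodes returns null (not an empty collection) when nothing matches.
var labels = doc.DocumentNode.SelectNodes("//font");
if (labels != null)
{
    foreach (HtmlNode font in labels)
    {
        // The label is the text inside <font>; the value is the text node right after it.
        string label = HtmlEntity.DeEntitize(font.InnerText).Trim().TrimEnd(':');
        string value = HtmlEntity.DeEntitize(font.NextSibling?.InnerText ?? "").Trim();
        Console.WriteLine($"{label} = {value}");
    }
}
```

In a real page the XPath would probably need to be narrower than `//font`, for example restricted to the enclosing `<div>`, so that unrelated `<font>` elements are not picked up.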

Source

2012-09-26 13:39:23 berliner

Using regular expressions to parse HTML is a bad idea, but if you must use a regex, use this pattern:

>\s*Unit Name:[^>]*>([^<]+).*?>\s*Manager:[^>]*>([^<]+).*?>\s*Category:[^>]*>([^<]+)

This can be reduced to

>\s*(?:Unit Name|Manager|Category):[^>]*>([^<]+)

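To trim the tail of the captured text, replace ([^<]+) in the regex with (\w+). Note that (\w+) stops at the first non-word character, so it captures only the first word of a multi-word value.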
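A minimal sketch of how the reduced pattern might be applied from C#, assuming C# 9 or later (top-level statements). The shortened `html` sample, the extra capture group around the label, and the `fields` list are additions for illustration and are not part of the answer. Instead of switching to (\w+), this sketch keeps ([^<]+) and cleans the captured text afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Shortened sample shaped like the question's markup (an assumption for illustration).
string html =
    "<div><font style=\"color: Navy;\">Unit Name: </font>My Unit Name&nbsp;&nbsp; " +
    "<font style=\"color: Navy;\">\n Manager: </font>My Manager Name&nbsp;&nbsp; " +
    "<font style=\"color:\n Navy;\">Category: </font>My Category \n</div>";

// The reduced pattern from the answer, with the label captured too
// (group 1 = label, group 2 = value); the label group is an illustrative addition.
var pattern = new Regex(@">\s*(Unit Name|Manager|Category):[^>]*>([^<]+)");

var fields = new List<KeyValuePair<string, string>>();
foreach (Match m in pattern.Matches(html))
{
    // Group 2 still carries trailing &nbsp; entities and whitespace, so clean it up.
    string value = m.Groups[2].Value.Replace("&nbsp;", " ").Trim();
    fields.Add(new KeyValuePair<string, string>(m.Groups[1].Value, value));
}

foreach (var field in fields)
    Console.WriteLine($"{field.Key} = {field.Value}");
```

The longer three-group pattern contains `.*?`, and `.` does not match a newline by default in .NET. Because the markup in the question breaks a tag across lines, that pattern would likely need `RegexOptions.Singleline`. The reduced pattern has no `.`, so it needs no such option.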

Source

2012-09-26 13:44:42

I have an HTML string in which my pattern repeats 50 times per page. I used IndexOf and indexes, and that solved it. –

Maybe this can help you. It uses lookahead and lookbehind zero-width assertions.

(?<=(Category:|Manager:|Unit Name:) (</font>)?).*?(?=(&|<))

[Image: RegexBuddy screenshot]

Source

2012-09-26 13:49:51

I believe there is no need to use lookaround, since it may hurt the regex... –
