2015-05-18 78 views
0

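I'm currently writing a script to parse bits of content out of an HTML document. How do I make a regex stop at the first occurrence of a string that appears multiple times?

Here is an example of the code I'm parsing: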

<div class="tab-content"> 
<div class="tab-pane fade in active" id="how-to-take"> 
<div class="panel-body"> 
<h3>What is Pantoprazole?</h3> 
Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is 
used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is 
a condition where the acid in the stomach washes back up into the esophagus. <br/> Pantoprazole is a proton pump 
inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach. 
<h3>How To Take</h3> 
Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water 
</div> 
</div> 
<div class="tab-pane fade" id="alternative-treatments"> 
<div class="panel-body"> 
<h3>Alternatives</h3> 
Antacids taken as required Antacids are alkali liquids or tablets 
that can neutralise the stomach acid. A dose may give quick relief. 
There are many brands which you can buy. You can also get some on 
prescription. If you have mild or infrequent bouts of dyspepsia you 
may find that antacids used as required are all that you need.<br/> 
</div> 
</div> 
<div class="tab-pane fade" id="side-effects"> 
<div class="panel-body"> 
<p>Most people who take acid reflux medication do not have any side-effects. 
However, side-effects occur in a small number of users. The most 
common side-effects are:</p> 
<ul> 

I'm trying to capture all of the content inside:

<div class="tab-pane fade in active" id="how-to-take"> 
<div class="panel-body"> 

</div> 

I have written the following regular expression:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n(?:<\/div>) 

and have also tried:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n<\/div> 

But it doesn't seem to stop at the first <\/div>; it keeps going until the last </div> in the code.
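A likely cause is that `[\s\S]+` is greedy: it consumes the rest of the input and backtracks only as far as the last </div>, and the leading `.*?` does not change that. The sketch below compares a greedy and a lazy capture on a shortened copy of the markup. The class name `PanelRegexDemo`, the sample string, and the use of `\s*` in place of `\n` are assumptions made for illustration; the lazy form only works here because the panel body contains no nested div, which is one reason a parser is the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the original script.
public static class PanelRegexDemo
{
    // Shortened sample shaped like the markup in the question.
    public const string Html =
        "<div class=\"tab-pane fade in active\" id=\"how-to-take\">\n" +
        "<div class=\"panel-body\">\n" +
        "<h3>How To Take</h3>\n" +
        "Take the tablets 1 hour before a meal\n" +
        "</div>\n" +
        "</div>\n" +
        "<div class=\"tab-pane fade\" id=\"alternative-treatments\">\n" +
        "<div class=\"panel-body\">\n" +
        "<h3>Alternatives</h3>\n" +
        "</div>\n" +
        "</div>\n";

    // The two opening tags, tolerating any whitespace between them.
    const string Prefix =
        "<div class=\"tab-pane fade in active\" id=\"how-to-take\">\\s*" +
        "<div class=\"panel-body\">\\s*";

    // Greedy: [\s\S]+ runs to the end of the input, then backtracks
    // only until the LAST </div> allows the pattern to match.
    public static string Greedy(string html) =>
        Regex.Match(html, Prefix + @"([\s\S]+)\s*</div>").Groups[1].Value;

    // Lazy: [\s\S]+? grows one character at a time and stops as soon
    // as the FIRST </div> allows the pattern to match.
    public static string Lazy(string html) =>
        Regex.Match(html, Prefix + @"([\s\S]+?)\s*</div>").Groups[1].Value;
}
```

Calling `PanelRegexDemo.Lazy(PanelRegexDemo.Html)` captures only the "How To Take" panel, while `PanelRegexDemo.Greedy(...)` also swallows the "Alternatives" tab.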

+3

This is easy to do if you [don't use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). You can use 'HtmlAgilityPack'. –

+0

This software is internal only; I just want to get it done quickly :). It won't be used again after I've run it :) – user1838222

+1

[How to use HTML Agility Pack](http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack). Here is the regex you are looking for, but you really should use a parser. '(?s)\s*\s*((?:(?!).)*?)\s*' –

Answers

3

Don't use regex to parse HTML. You can use HtmlAgilityPack.

Then this works as desired:

var doc = new HtmlAgilityPack.HtmlDocument(); 
doc.LoadHtml(File.ReadAllText("Path")); 
var divPanelBody = doc.DocumentNode.SelectSingleNode("//div[@class='panel-body']"); 
string text = divPanelBody.InnerText.Trim(); // null check omitted 

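Result:

What is Pantoprazole? Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is a condition where the acid in the stomach washes back up into the esophagus. Pantoprazole is a proton pump inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach. How To Take Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water

Here is another approach using LINQ, which I prefer over XPath syntax: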

var divPanelBody = doc.DocumentNode.Descendants("div") 
    .FirstOrDefault(d => d.GetAttributeValue("class", "") == "panel-body"); 

Note that both approaches are case-sensitive, so they won't find Panel-Body. You can easily make the last approach case-insensitive:

var divPanelBody = doc.DocumentNode.Descendants("div") 
    .FirstOrDefault(d => d.GetAttributeValue("class", "").Equals("panel-body", StringComparison.InvariantCultureIgnoreCase)); 
0

You can do this by using HtmlAgilityPack:

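// Requires: using System.Text; using HtmlAgilityPack;
// Concatenates the inner HTML of every <div class="panel-body"> in the document.
public string GetInnerHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var nodes = doc.DocumentNode.SelectNodes("//div[@class=\"panel-body\"]");
    if (nodes == null)
    {
        return string.Empty;
    }
    StringBuilder sb = new StringBuilder();
    foreach (var n in nodes)
    {
        sb.Append(n.InnerHtml);
    }
    return sb.ToString();
}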