Remove specific tags only when they are inside another specific tag

I want to remove all `br` and `p` tags that are inside a `table`, but leave the ones outside tables alone.

For example,

Initial HTML file:

... 
<p>Hello</p> 
<table> 
    <tr> 
    <td><p>Text example <br>continues...</p></td> 
    <td><p>Text example <br>continues...</p></td> 
    <td><p>Text example <br>continues...</p></td> 
    <td><p>Text example <br>continues...</p></td> 
    </tr> 
</table> 
<p>Bye<br></p> 
<p>Bye<br></p> 
...

My goal:

... 
<p>Hello</p> 
<table> 
    <tr> 
    <td>Text example continues...</td> 
    <td>Text example continues...</td> 
    <td>Text example continues...</td> 
    <td>Text example continues...</td> 
    </tr> 
</table> 
<p>Bye<br></p> 
<p>Bye<br></p> 
...

Right now, this is how I clean it up:

loop do
    if html.match(/<table>(.*?)(<\/?(p|br)*?>)(.*?)<\/table>/) != nil
        html = html.gsub(/<table>(.*?)(<\/?(p|br)*?>)(.*?)<\/table>/,'<table>\1 \4</table>')
    else
        break
    end
end

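That works great, but the problem is that I have 1xxx files, each about 1000 lines long, and each one takes 1-3 hours. ((1-3 hours) * (thousands of files)) = pain!

I would like to do this with Sanitize or some other method, but so far I haven't found a way.

Can anyone help me?

Thanks in advance! Manu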

Source

2013-07-30 user2634870

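http://stackoverflow.com/a/1732454/438992 In other words, use an actual HTML parser.

^ To add to the above, look into `Nokogiri`.

**Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration. As soon as the HTML changes from what you expect, your code will break. See http://htmlparsing.com/ruby for examples of how to properly parse HTML with Ruby modules that have already been written, tested, and debugged.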

Using Nokogiri:

require 'nokogiri' 

doc = Nokogiri::HTML::Document.parse <<-_HTML_ 
<p>Hello</p> 
<table> 
    <tr> 
    <td><p>Text example <br>continues...</p></td> 
    <td><p>Text example <br>continues...</p></td> 
    <td><p>Text example <br>continues...</p></td> 
    <td><p>Text example <br>continues...</p></td> 
    </tr> 
</table> 
<p>Bye<br></p> 
<p>Bye<br></p> 
_HTML_ 

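# Replace each <p> directly under table/tr/td with its text content,
# which drops the <p> wrapper and any <br> inside it.
doc.xpath("//table/tr/td/p").each do |el|
    el.replace(el.text)
end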

puts doc.to_html

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> 
<html><body> 
<p>Hello</p> 
<table><tr> 
<td>Text example continues...</td> 
    <td>Text example continues...</td> 
    <td>Text example continues...</td> 
    <td>Text example continues...</td> 
    </tr></table> 
<p>Bye<br></p> 
<p>Bye<br></p> 
</body> 
</html>

Source

2013-07-30 16:31:25

The paragraph tags also need to be removed from the table.

@JustinKo OK, ignore that. Give me a moment.

@JustinKo Done.
