Removing the content of an HTML tag, including the tag itself, in Perl

I have about 100 files, and I need to go through each of them and delete everything between <style> and </style>, along with the tags themselves.

For example,

<html> 
<head> <title> Example </title> </head> 
<style> 
p{color: red; 
background-color: #FFFF; 
} 
div {...... 
... 
} 
</style> 
<body> 
<p> hi I'm a paragraph. </p> 
</body> 
</html>

should become

<html> 
<head> <title> Example </title> </head> 
<body> 
<p> hi I'm a paragraph. </p> 
</body> 
</html>

In addition, in some files the style markup looks like

<style type="text/css"> blah </style>

or

<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">

I need to remove all 3 patterns. How can I do this in Perl?

Source

2012-10-03 Chankey Pathak

It looks like you want to process HTML with regular expressions. That is a [surprisingly bad idea](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)!

use strict; 
use warnings; 

use XML::LibXML qw(); 

my $qfn = 'a.html'; 

my $doc = XML::LibXML->load_html(location => $qfn); 
my $root = $doc->documentElement(); 

for my $style_node ($root->findnodes('//style')) { 
    $style_node->parentNode()->removeChild($style_node); 
} 

{ 
    # Overwrite the original file with the cleaned document
    open(my $fh, '>', $qfn) 
     or die("Can't open $qfn for writing: $!\n"); 
    print($fh $doc->toStringHTML()); 
}

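It correctly handles:

style elements with attributes or spaces in their tags,
style elements that span multiple lines,
style tags that span multiple lines,
lines that contain part of a style element along with other content,
documents with multiple style elements,
things that look like style tags inside attribute values,
things that look like style tags inside CDATA blocks,
things that look like style tags inside comments.

As of this update, the other solutions handle only 2 or 3 of these.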
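To make one of these cases concrete, here is a minimal sketch (core Perl only) of how a naive substitution misfires when text that looks like a style tag sits inside an attribute value. The input string and the regex are illustrative assumptions, not taken from any answer in this thread.

```perl
use strict;
use warnings;

# Hypothetical input: the text "<style>" appears inside an attribute value,
# ahead of the real style element.
my $html = '<a title="<style>">x</a><p>keep me</p><style>p{}</style>' . "\n";

# A naive regex cannot tell the attribute text from a real tag, so it starts
# matching inside the attribute and runs to the first </style>.
(my $naive = $html) =~ s{<style[^>]*>.*?</style>}{}gs;

print $naive;
```

With this input the substitution starts inside the title attribute, so the paragraph that should have survived is deleted along with the style element. A parser sees the attribute value as plain text and does not have this problem.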

Source

2012-10-03 06:03:31 ikegami

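The output is here: http://pastebin.com/uWAamD19. It adds a DOCTYPE and [...], which I don't need.

How do I remove the DOCTYPE and [...]? I don't know why it is generated automatically.

You can't. That's just what it does. There's no reason to remove it. – ikegami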

One way, using sed:

sed '/<style>/,/<\/style>/d' file.txt

Result:

<html> 
<head> <title> Example </title> </head> 
<body> 
<p> hi I'm a paragraph. </p> 
</body> 
</html>

Source

2012-10-03 06:07:12 Steve

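Nice, good to know there is a sed solution.

That assumes there is nothing on a line except the style element. It also assumes the pattern never appears in attribute values, CDATA, or comments. And so on. – ikegami

@ikegami: You're right. I made some assumptions about the input. An XML parser should be used. +1 – Steve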

perl -lne 'print unless(/<style>/.../<\/style>/)' your_file

Tested as follows:

> cat temp 
<html> 
<head> <title> Example </title> </head> 
<style> 
p{color: red; 
background-color: #FFFF; 
} 
div {...... 
... 
} 
</style> 
<body> 
<p> hi I'm a paragraph. </p> 
</body> 
</html> 


> perl -lne 'print unless(/<style>/.../<\/style>/)' temp 
<html> 
<head> <title> Example </title> </head> 
<body> 
<p> hi I'm a paragraph. </p> 
</body> 
</html> 
>

If you want to do it in place, then:

perl -i -lne 'print unless(/<style>/.../<\/style>/)' your_file

Source

2012-10-03 09:53:35 Vijay

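This assumes there is nothing on a line other than the style element. It assumes the style element has no attributes. It assumes the pattern does not appear in attribute values, CDATA, or comments. Etc. – ikegami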
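The "no attributes" assumption can be seen in a small sketch. It re-implements the one-liner's line-range logic with an explicit flag instead of the `...` operator, so that it can be called on in-memory lists. The helper name `drop_style_lines` and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the one-liner: drop every line from one that
# contains "<style>" through the next later line that contains "</style>".
sub drop_style_lines {
    my @kept;
    my $inside = 0;
    for my $line (@_) {
        if ($inside) {
            # Still inside the range; the closing line is dropped as well
            $inside = 0 if $line =~ m{</style>};
            next;
        }
        if ($line =~ /<style>/) {
            $inside = 1;
            next;
        }
        push @kept, $line;
    }
    return @kept;
}

# The plain form from the question is removed as intended ...
my @plain = drop_style_lines("<head>\n", "<style>\n", "p{color: red;}\n", "</style>\n", "<body>\n");

# ... but the attribute form never matches /<style>/, so it survives.
my @attrs = drop_style_lines("<head>\n", "<style type=\"text/css\"> blah </style>\n", "<body>\n");

print "plain: ", scalar(@plain), " lines kept\n";
print "attrs: ", scalar(@attrs), " lines kept\n";
```

The first call keeps two lines and the second keeps all three, so the `<style type="text/css">` element from the question is left untouched by this approach.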

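ikegami is right: you really should use at least an HTML/XML parser for this task. Personally, I like the Mojo::DOM parser. It is a document object model interface to your HTML, and it supports CSS3 selectors, which makes it really flexible when you need it. This is a very simple case for it, though: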

#!/usr/bin/env perl 

use strict; 
use warnings; 

use Mojo::DOM; 

my $content = <<'END'; 
<html> 
<head> <title> Example </title> </head> 
<style> 
p{color: red; 
background-color: #FFFF; 
} 
div {...... 
... 
} 
</style> 
<body> 
<p> hi I'm a paragraph. </p> 
</body> 
</html> 
END 

my $dom = Mojo::DOM->new($content); 
$dom->find('style')->pluck('remove'); 

print $dom;

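The pluck method is a little confusing, but it is really just shorthand for calling a method on each resulting object. An equivalent line would be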

$dom->find('style')->each(sub{ $_->remove });

which is easier to understand but less cute.

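Having read your edit, in which you have to handle more than just the basic form, I want to stress further that this is why you use a parser to modify HTML rather than letting your regex grow to absurd proportions.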

Now let's say that the $content variable also contains these lines

<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css"> 
<link rel="icon" href="somefile.jpg">

and you want to remove the first but not the second. You can do this in one of two ways.

$dom->find('link')->each(sub{ $_->remove if $_->{rel} eq 'stylesheet' });

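This mechanism uses the object's methods (Mojo::DOM exposes attributes as hash keys) to remove only the link tags that have rel=stylesheet. However, because Mojo::DOM has full CSS3 selector support, you can instead use a selector to find only those elements in the first place: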

$dom->find('link[rel=stylesheet]')->pluck('remove');

CSS3 selectors can be joined with commas to find all tags that match either selector, so we can simply use this one line

$dom->find('style, link[rel=stylesheet]')->pluck('remove');

and get rid of all your offending stylesheets in one fell swoop!

Source

2012-10-03 21:18:11

Can you tell that I'm proctoring an exam and have nothing better to do than think about CSS3 selectors? :-)

Unfortunately I have to use Perl 5.8 :/

See [perlbrew](http://perlbrew.pl/)

Another possible solution is to use HTML::TreeBuilder.

#!/usr/bin/perl 

use strict; 
use warnings; 
use HTML::TreeBuilder 5; # Ensure weak references in use 

foreach my $file_name (@ARGV) { 
    my $tree = HTML::TreeBuilder->new; # empty tree 
    $tree->parse_file($file_name); 
    # print "Hey, here's a dump of the parse tree of $file_name:\n"; 
    # $tree->dump; # a method we inherit from HTML::Element 
    foreach my $e ($tree->look_down(_tag => "style")) { 
     $e->delete(); 
    } 
    foreach my $e ($tree->look_down(_tag => "link", rel => "stylesheet")) { 
     $e->delete(); 
    } 
    print "And here it is, bizarrely rerendered as HTML:\n", 
    $tree->as_HTML, "\n"; 

    # Now that we're done with it, we must destroy it. 
    $tree = $tree->delete; # Not required with weak references 
}

Source

2012-10-04 04:41:19

I came up with a way; you can try the following:

#! /usr/bin/perl -w 
use strict; 
my $line = << 'END'; 
<html> 
<head> <title> Example </title> </head> 
<style> 
p{color: red; 
background-color: #FFFF; 
} 
div {...... 
... 
} 
</style> 
<body> 
<p> hi I'm a paragraph. </p> 
</body> 
</html> 
END 

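# Match the whole opening tag (with or without attributes), the shortest body
# up to </style>, and any trailing whitespace; /s lets "." span newlines.
$line =~ s{<style\b[^>]*>.*?</style>\s*}{}gis; 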
print $line;
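If a parser module really cannot be installed (the asker mentions being limited to Perl 5.8), the same idea can be stretched to cover all three patterns from the question. This is only a rough sketch under that assumption: the helper name `strip_styles` and the sample input are hypothetical, and it shares every weakness of regex-based HTML editing raised earlier in the thread (attribute values, comments, CDATA).

```perl
use strict;
use warnings;

# Hypothetical helper: remove <style>...</style> blocks and stylesheet links.
sub strip_styles {
    my ($html) = @_;

    # <style> with or without attributes, body may span lines (/s),
    # tag names matched case-insensitively (/i)
    $html =~ s{<style\b[^>]*>.*?</style\s*>[ \t]*\n?}{}gis;

    # <link ...> tags whose rel attribute starts with "stylesheet"
    $html =~ s{<link\b[^>]*\brel\s*=\s*["']?stylesheet\b[^>]*>[ \t]*\n?}{}gi;

    return $html;
}

my $in = join '',
    "<head>\n",
    "<style type=\"text/css\"> blah </style>\n",
    "<link rel=\"stylesheet\" type=\"text/css\" href=\"x.css\">\n",
    "<link rel=\"icon\" href=\"somefile.jpg\">\n",
    "</head>\n";

print strip_styles($in);
```

On this input the style element and the stylesheet link are removed, while the icon link is kept. For real-world HTML the parser-based answers remain the safer choice.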

Source

2012-10-06 01:34:37 yizhengming
