2012-05-16
3

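Fixing HTML attribute values that contain double quotes

I have a set of HTML files whose tags contain illegal syntax in their attribute values. For example: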

<a name="Conductor, "neutral""></a> 

<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," /> 

<b>Table of Contents:</b><ul class="xoxo"><li><a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">What are "series" and "parallel" circuits?</a> 

I am trying to process the files with parsefile_html($file_name) from Perl's XML::Twig module. When it reads a file with this syntax, it gives this error:

x has an invalid attribute name 'y""' at C:/strawberry/perl/site/lib/XML/Twig.pm line 893 

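What I need is either a way to make the module accept the bad syntax and deal with it, or a regular expression that finds the double quotes inside attribute values and replaces them with single quotes.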

Answers

2

Given your HTML examples, the following code works:

use Modern::Perl; 

my $html = <<end; 
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," /> 
<a name="Conductor, "neutral""></a> 
end 

$html =~ s/(?<=content=")(.*?)(?="\s*\/>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg; 
$html =~ s/(?<=name=")(.*?)(?="\s*>)/do{my $capture = $1; $capture =~ s|"||g;$capture}/eg; 

say $html; 

Output:

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," /> 
<a name="Conductor, neutral"></a> 

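My concern is that variable-length lookbehind is not implemented, so if there is whitespace before or after the equals sign, the pattern will fail to match. However, the pages were most likely generated consistently, so the match should not fail.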
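
One possible way around the fixed-width lookbehind limit is \K (Perl 5.10 and later), which lets the prefix contain any amount of whitespace. The following is only a sketch under the same one-tag-per-line assumption as the code above; the helper name strip_inner_quotes is hypothetical and not part of the original answer:

```perl
use strict;
use warnings;

# Hypothetical helper: remove stray double quotes inside a name="..." value.
# \K drops everything matched so far from the text being replaced, so the
# prefix may contain variable-width whitespace, which a lookbehind cannot.
sub strip_inner_quotes {
    my ($html) = @_;
    $html =~ s{ name \s* = \s* " \K (.*?) (?= " \s* > ) }{
        (my $value = $1) =~ tr/"//d;   # copy the capture, delete its quotes
        $value;
    }egx;
    return $html;
}

print strip_inner_quotes(qq{<a name = "Conductor, "neutral""></a>\n});
# prints: <a name = "Conductor, neutral"></a>
```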

Of course, try the substitution on a copy of the files first.
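
As a rough illustration of working on a copy, the sketch below writes the repaired text to a new file and leaves the original untouched. The helper name fix_to_copy, the ".fixed" suffix, and the $fix callback are all assumptions made for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: read $file, apply the code reference $fix to its
# contents, and write the result to "$file.fixed". The original file is
# only opened for reading, so it is never modified.
sub fix_to_copy {
    my ($file, $fix) = @_;

    open my $in, '<', $file or die "Cannot read $file: $!";
    my $html = do { local $/; <$in> };   # slurp the whole file
    close $in;

    my $copy = "$file.fixed";
    open my $out, '>', $copy or die "Cannot write $copy: $!";
    print {$out} $fix->($html);
    close $out or die "Cannot close $copy: $!";

    return $copy;                        # path of the repaired copy
}
```

It could be called as fix_to_copy($file_name, sub { my $h = shift; ...; return $h }), with the substitutions from this answer inside the callback.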

+0

Works great! That is one intense regex. I'll accept your answer, but some comments in the code would be helpful. –

+0

Oops! It doesn't do much for this string: '' –

1

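The only way I can think of to do this reasonably safely is with two nested evaluated (/e) substitutions. The program below uses this idea and works with your data.

The outer substitution finds all the tags in the string and replaces each with a tag containing adjusted attribute values.

The inner substitution finds all the attribute values within a tag and replaces each with the same value with all double quotes removed.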

use strict; 
use warnings; 

my $html = <<'HTML'; 
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," /> 
<a name="Conductor, "neutral""></a> 
<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?"> 
HTML 

$html =~ s{(<[^>]+>)}{ 

    my $tag = $1; 

    $tag =~ s{ \w+= " \K ([^=<>]+) (?= " (?: \s+\w+= | \s*/?>)) } 
    { 
    (my $attr = $1) =~ tr/"//d; 
    $attr; 
    }egx; 

    $tag; 
}eg; 

print $html; 

Output

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," /> 
<a name="Conductor, neutral"></a> 
<a href="1.html" title="Page 1: What are series and parallel circuits?"> 
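
The question also mentioned turning the inner double quotes into single quotes rather than deleting them. A minimal sketch of that variant, reusing the nested substitution above with tr/"/'/ in place of tr/"//d; the wrapper name requote_attributes is an assumption for illustration:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the nested substitution shown above.
sub requote_attributes {
    my ($html) = @_;

    # Outer pass: visit each tag (assumes no '>' inside attribute values).
    $html =~ s{(<[^>]+>)}{
        my $tag = $1;

        # Inner pass: visit each attribute value within the tag.
        $tag =~ s{ \w+= " \K ([^=<>]+) (?= " (?: \s+\w+= | \s*/?>)) }{
            (my $attr = $1) =~ tr/"/'/;   # swap inner " for ' instead of deleting
            $attr;
        }egx;

        $tag;
    }eg;

    return $html;
}

print requote_attributes(
    qq{<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">\n}
);
# prints: <a href="1.html" title="Page 1: What are 'series' and 'parallel' circuits?">
```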
+0

This is excellent, Borodin. The two steps you take in the solution make a lot of sense. – Kenosis

+0

I once had a > in an attribute value, back when I was still matching tags like this. That was a really bad day :) –

+0

@brian: I guess they would have to be encoded as '&gt;' - even in attributes. Like that! – Borodin
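
To make the pitfall from these comments concrete, here is a small sketch of how the tag pattern <[^>]+> behaves when an attribute value contains an unencoded >. The sample tag and the helper name find_tags are invented for the illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return everything the tag pattern from the answer
# above considers a tag.
sub find_tags {
    my ($html) = @_;
    return $html =~ /(<[^>]+>)/g;
}

# The '>' inside the title value ends the match early, so the "tag"
# that comes back is cut off in the middle of the attribute.
print "$_\n" for find_tags('<a title="x > y" href="1.html">');
# prints: <a title="x >
```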
