Fixing HTML attribute values that contain double quotes

I have a set of HTML files whose tags contain illegal syntax in their attribute values. For example:

<a name="Conductor, "neutral""></a>

or

<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />

or

<b>Table of Contents:</b><ul class="xoxo"><li><a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">What are "series" and "parallel" circuits?</a>

I am trying to process the files with the parsefile_html($file_name) method of Perl's XML::Twig module. When it reads a file with this syntax, it gives this error:

x has an invalid attribute name 'y""' at C:/strawberry/perl/site/lib/XML/Twig.pm line 893

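What I need is one of two things: a way to make the module accept the bad syntax and deal with it, or a regular expression that finds the double quotes inside attribute values and replaces them with single quotes.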

Source

2012-05-16 Nate Glenn

Given your HTML examples, the following code works:

use Modern::Perl;

my $html = <<'END';
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
END

# Capture everything between the attribute's enclosing quotes,
# then delete any double quotes found inside the captured value.
$html =~ s/(?<=content=")(.*?)(?="\s*\/>)/do { my $capture = $1; $capture =~ s|"||g; $capture }/eg;
$html =~ s/(?<=name=")(.*?)(?="\s*>)/do { my $capture = $1; $capture =~ s|"||g; $capture }/eg;

say $html;

Output:

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," /> 
<a name="Conductor, neutral"></a>

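My one concern is that variable-length lookbehind is not implemented, so if there is any whitespace before or after the equals sign, the match will fail. However, the pages were most likely generated consistently, so the match should not fail in practice.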
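One way around the fixed-width lookbehind limit is `\K`, which drops everything matched so far from the text being replaced, so the prefix can contain `\s*`. The following is only a sketch: it assumes Perl 5.14 or later (for `tr///r`), the sample markup with spaces around `=` is invented for illustration, and like the patterns above it assumes the attribute being fixed is the last one in its tag.

```perl
use strict;
use warnings;

# Hypothetical input: same kind of markup, but with spaces around '='.
my $html = <<'END';
<meta name = "keywords" content ="Conductor, "hot",Conductor, "neutral"," />
<a name= "Conductor, "neutral""></a>
END

# \K keeps the attribute name, '=' and opening quote out of the replacement,
# so only the captured value is rewritten. tr/"//dr returns a copy of the
# value with the double quotes deleted.
$html =~ s{\bcontent\s*=\s*"\K(.*?)(?="\s*/>)}{ $1 =~ tr/"//dr }eg;
$html =~ s{\bname\s*=\s*"\K(.*?)(?="\s*>)}{ $1 =~ tr/"//dr }eg;

print $html;
```

The `name` pattern does not touch the `<meta>` line here because its lookahead requires `>` directly after the closing quote, and `.` does not match a newline.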

Of course, try the substitution on a copy of the files first.

Source

2012-05-16 04:09:32 Kenosis

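Works great! That is a pretty intense regex. I will accept your answer, but some comments would be helpful. –

Oops! It doesn't do much for this string: '' –

Repeat the second substitution (that is, add another one) using (?<=title="). The match tries to capture everything between the double quotes that enclose the attribute value, then runs another substitution to remove the extra quotes it finds, and finally replaces the old value with the cleaned one. – Kenosis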
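The extra substitution described in this comment could look roughly like the sketch below. It copies the shape of the name= substitution from the answer and only swaps the attribute name; the input line is the title example from the question, and the pattern still assumes title is the last attribute in the tag.

```perl
use strict;
use warnings;

my $html = qq{<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">\n};

# Same pattern as for name=, with title= in the lookbehind:
# capture the value up to the closing quote before '>', then delete
# every double quote inside the captured value.
$html =~ s/(?<=title=")(.*?)(?="\s*>)/do { my $capture = $1; $capture =~ s|"||g; $capture }/eg;

print $html;
```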

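The only reasonably safe way I can think of to do this is with two nested evaluated (/e) substitutions. The program below uses this idea and works with your data.

The outer substitution finds all the tags in the string and replaces each one with a tag containing the adjusted attribute values.

The inner substitution finds all the attribute values in the tag and replaces each one with the same value with all double quotes removed.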

use strict;
use warnings;

my $html = <<'HTML';
<meta name="keywords" content="Conductor, "hot",Conductor, "neutral",Hot wire,Neutral wire,Double insulation,Conductor, "ground",Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," />
<a name="Conductor, "neutral""></a>
<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">
HTML

# Outer substitution: visit every tag in the string.
$html =~ s{(<[^>]+>)}{

    my $tag = $1;

    # Inner substitution: delete double quotes inside each attribute value.
    $tag =~ s{ \w+= " \K ([^=<>]+) (?= " (?: \s+\w+= | \s*/?>)) }
    {
        (my $attr = $1) =~ tr/"//d;
        $attr;
    }egx;

    $tag;
}eg;

print $html;

Output

<meta name="keywords" content="Conductor, hot,Conductor, neutral,Hot wire,Neutral wire,Double insulation,Conductor, ground,Ground fault,GFCI,Ground Fault Current Interrupter,Ground fault,GFCI,Ground Fault Current Interrupter,Arc fault circuit interrupter,Arc fault breaker,AFCI," /> 
<a name="Conductor, neutral"></a> 
<a href="1.html" title="Page 1: What are series and parallel circuits?">
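The question also mentioned replacing the inner double quotes with single quotes rather than deleting them. A possible variant of the nested substitution, offered only as a sketch, changes the `tr` call to translate `"` to `'` instead of using the delete flag; everything else follows the program above, and the input is a subset of the question's examples.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<a name="Conductor, "neutral""></a>
<a href="1.html" title="Page 1: What are "series" and "parallel" circuits?">
HTML

# Outer substitution: visit every tag in the string.
$html =~ s{(<[^>]+>)}{
    my $tag = $1;

    # Inner substitution: turn stray double quotes inside each value
    # into single quotes (tr/"/'/ instead of tr/"//d).
    $tag =~ s{ \w+= " \K ([^=<>]+) (?= " (?: \s+\w+= | \s*/?>)) }
             { (my $attr = $1) =~ tr/"/'/; $attr }egx;

    $tag;
}eg;

print $html;
```

As in the original pattern, values that themselves contain `=`, `<` or `>` are not matched by `[^=<>]+` and would be left unchanged.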

Source

2012-05-16 14:11:10 Borodin

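This is excellent, Borodin. The two steps you take in your solution make a lot of sense. – Kenosis

I once had a > in an attribute value, back when I was still matching tags like this. That was a really bad day :) –

@brian: I guess they have to be encoded as '&gt;', even in attributes. Like that! – Borodin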
