How do I do a conditional global regex substitution in Perl?

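I have a variable $content that contains a mix of text, HTML img tags, and URLs. How can I perform a conditional global regex substitution on it in Perl?

I want to inject a string into some of the URLs, depending on a condition.

For example, suppose $content contains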

ABC <img src="http://url1.com/keep.jpg"> 
DEF <img src="http://random-url.com/replace.jpg"> 
GHI <img src="http://url2.com/keep.jpg">

I want to edit $content so that it becomes

ABC <img src="http://url1.com/keep.jpg"> 
DEF <img src="http://wrapper-url.com/random-url.com/replace.jpg"> 
GHI <img src="http://url2.com/keep.jpg">

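I have a list of regex conditions for URLs to keep: a whitelist, which the kept URLs above match. Any image URL that is not on the whitelist should be rewritten with the wrapper URL prefix.

My idea is: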

if image tags matched in $content { 
    if match is in 'whitelist' 
    do nothing 
    else 
    inject prefix replacement 
}

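I don't know how to do a conditional global regex substitution, because everything is in a single string variable.

I need to implement this in Perl.

Additional information:

My 'whitelist' currently has only 5 entries, which are basically keywords and domains.

Here is what I have been doing to match the 'whitelist'.

For example: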

if ($_ =~ /s3\.static\.cdn\.net/) { 
    # whitelist to keep, subdomain match 
} 
elsif ($_ =~ /keyword-to-keep/) { 
    # whitelist to keep, url keyword match 
} 
elsif ($_ =~ /cdn\.domain\.com/) { 
    # whitelist to keep, subdomain match 
} 
elsif ($_ =~ /whitelist-domain\.net/) { 
    # whitelist to keep, domain match 
} 
elsif ($_ =~ /i\.whitelist-domain\.com/) { 
    # whitelist to keep, subdomain match 
} 
else { 
    # matched, do something about it with injection 
}

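A less-than-perfect solution I can think of is to globally replace all img URLs with the prefix injected.

Then do another global substitution that removes the prefix from the URLs that match the 'whitelist'.

Is there a more efficient solution to my problem?

Thanks.

Source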

2016-04-02 KDX

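You really need a proper HTML parser for this. Please show your *list of regex conditions* – Borodin

I have edited the original question to add some of the regex conditions I have been using to check the 'whitelist' to keep. – KDX

You can use HTML::TokeParser::Simple to find the img tags and extract the URL from their src attribute.
You can extract the host name from the URL with URI::URL.
You can convert your whitelist into a set, so that host names can be looked up easily and efficiently.
You can use the s/// operator to wrap the host names that are not on the whitelist.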

use strict;
use warnings;
use 5.020;
use HTML::TokeParser::Simple;
use URI::URL;
use List::Util qw{ any };

# Hosts that must be left untouched (exact host-name match)
my @white_list = qw(
    s3.static.cdn.net
    cdn.domain.com
    whitelist-domain.net
    i.whitelist-domain.com
);
# Create a set:
my %white_list = map { $_ => undef } @white_list;

# Keywords that exempt a host when they appear anywhere in it
my @accepted_keywords = qw(
    xxx.xxx
    cool
);
# Escape any special regex characters appearing in the keywords:
@accepted_keywords = map { quotemeta $_ } @accepted_keywords;

my $wrapper_host = "wrapper-url.com";

my $content = <<END_OF_CONTENT;
ABC <img src="http://i.whitelist-domain.com/keep.jpg">
DEF <img src="http://random-url.com/replace.jpg">
GHI <img src="http://cdn.domain.com/keep.jpg">
XYZ <img src="http://random-url.com/replace.jpg">
ZZZ <img src="http://xxx.xxx/keep.jpg">
ZZZ <img src="http://xxxXxxx/replace.jpg">
ZZZ <img src="http://waycool.com/keep.jpg">
END_OF_CONTENT

my $parser = HTML::TokeParser::Simple->new(\$content);

my ($src, $url, $host);
while (my $token = $parser->get_token()) {

    if ($token->is_tag('img')) {
        if ($src = $token->get_attr('src')) {
            $url  = URI::URL->new($src);
            $host = $url->host;

            # Whitelisted hosts skip straight to the continue block, unchanged
            next if exists($white_list{$host});
            next if any { $host =~ /$_/ } @accepted_keywords;

            # Insert the wrapper host right after the scheme
            $src =~ s/(http:\/\/)/$1$wrapper_host\//xms;
            $token->set_attr('src', $src);
        }
    }
}
continue {
    # Runs for every token, including those skipped with next
    print $token->as_is;
}

--output:-- 
ABC <img src="http://i.whitelist-domain.com/keep.jpg"> 
DEF <img src="http://wrapper-url.com/random-url.com/replace.jpg"> 
GHI <img src="http://cdn.domain.com/keep.jpg"> 
XYZ <img src="http://wrapper-url.com/random-url.com/replace.jpg"> 
ZZZ <img src="http://xxx.xxx/keep.jpg"> 
ZZZ <img src="http://wrapper-url.com/xxxXxxx/replace.jpg"> 
ZZZ <img src="http://waycool.com/keep.jpg">

Source

2016-04-03 04:06:55 7stud

Using HTML::TokeParser::Simple turned out to be a cleaner solution to my problem. With minor modifications, the solution works very well for me. Thanks. – KDX

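As others have mentioned, using REs to parse HTML is strongly discouraged, for reasons that are explained in many places.

Since your sample data is very simple, you can ignore that advice as long as you keep the limitations in mind. Some things to consider are:

What if a whitelist keyword matches part of a domain?
And vice versa: what if a domain (.net) is part of the path?
What happens if the scheme is not http(s)?
What if the URL is not double-quoted? Or not quoted at all?
What if something in the "pre-text" looks like a tag?
Are the whitelist entries case-sensitive? Domain names are not; paths are; so what should be done?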
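
The first two points and the case-sensitivity point can be illustrated with a small sketch. It assumes a plain-regex host extraction and a made-up set of allowed hosts; host_of, substring_ok and host_ok are hypothetical helper names, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the host out of an http(s) URL with a plain regex.
sub host_of {
    my ($url) = @_;
    my ($host) = $url =~ m{^https?://([^/:?\#]+)}i;
    return defined $host ? lc $host : undef;   # host names are case-insensitive
}

# Assumed whitelist of exact host names (illustration only).
my %allowed_host = map { $_ => 1 } qw(cdn.domain.com whitelist-domain.net);

# Naive check: the pattern may match anywhere in the URL, case-sensitively.
sub substring_ok {
    my ($url) = @_;
    return $url =~ /cdn\.domain\.com/ ? 1 : 0;
}

# Stricter check: compare the lower-cased host against the set.
sub host_ok {
    my ($url) = @_;
    my $host = host_of($url);
    return defined $host && $allowed_host{$host} ? 1 : 0;
}

for my $url (
    'http://cdn.domain.com/a.jpg',
    'http://CDN.Domain.com/a.jpg',
    'http://evil.example/cdn.domain.com/a.jpg',
    'http://cdn.domain.com.evil.example/a.jpg',
) {
    printf "%-45s substring=%d host=%d\n", $url, substring_ok($url), host_ok($url);
}
```

The two checks agree only on the first URL: the substring test rejects the upper-case spelling of a whitelisted host and accepts the last two URLs, where the whitelisted name appears in the path or as a prefix of a different host.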

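I've used a few principles in the solution below:

Separate regex specification from regex use
Always use extended-mode regexes, i.e. use the "/x" option
Pre-process the whitelist into an array of RE "tests" to pass
UNIX filter style: read from STDIN, write to STDOUT, warn on STDERR
Use a module for the URL

Taking into account some, but not all, of the things to consider above, this basically does it: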

use v5.12;
use URI::URL;

my $wrapper_host   = "wrapper-url.com";
my $whitelist_file = "whitelist.txt";
URI::URL::strict(1);   # Will croak if cannot determine scheme

# Regex specifications, kept separate from their use below
my $text_re    = qr/^(\s* [^<]+ \s*) /x;
my $quoted_str = qr/ " ([^"]+) " /x;
my $img_tag_re = qr/ < img \s+ src= $quoted_str > /x;

# Pre-process the whitelist file into an array of compiled regexes
my @whitelist_rules;
open(my $white, '<', $whitelist_file) or die "$whitelist_file: $!\n";
while (<$white>) {
    chomp;
    next unless length;   # a blank line would become a regex that matches everything
    s/\./\\./g;           # escape every '.', not only the first one
    push @whitelist_rules, qr/$_/;
}
close $white;

while (<>) {

    # Parse the line into text and url
    my $text; my $url;
    if (/ $text_re $img_tag_re /x) {
        $text = $1;
        $url  = URI::URL->new($2);   # may croak
    }
    else {
        warn "Can't make sense of line $., skipping...";
        next;
    }

    # iterate over @whitelist_rules to see if this one is exempt
    my $on_whitelist = 0;
    for my $r (@whitelist_rules) {
        $on_whitelist++ if $url =~ /$r/i;             # Note: '/i'
        # $on_whitelist++ if $url->netloc =~ /$r/i;   # alternatively ...
        # $on_whitelist++ if $url->path =~ /$r/i;     # alternatively ...
    }

    # If it's not on the whitelist, wrap netloc
    if (! $on_whitelist) {
        $url->path($url->netloc . $url->path);
        $url->netloc($wrapper_host);
    }

    # output the transformed line
    say $text . $url;
}
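
For input as simple as the sample data, the conditional substitution can also be done in one pass over the whole $content string with s///ge, where the replacement side is Perl code that decides per match. The following is only a sketch under those assumptions (double-quoted http(s) URLs, whitelist patterns copied from the question); wrap_unless_whitelisted and rewrite_img_urls are hypothetical names.

```perl
use strict;
use warnings;

# Whitelist patterns as given in the question (matched case-insensitively here).
my @whitelist = (
    qr/s3\.static\.cdn\.net/i,
    qr/keyword-to-keep/i,
    qr/cdn\.domain\.com/i,
    qr/whitelist-domain\.net/i,
    qr/i\.whitelist-domain\.com/i,
);
my $wrapper_host = 'wrapper-url.com';

# Return the URL unchanged if any whitelist pattern matches, otherwise wrapped.
sub wrap_unless_whitelisted {
    my ($scheme, $rest) = @_;
    for my $re (@whitelist) {
        return "$scheme$rest" if $rest =~ $re;
    }
    return "$scheme$wrapper_host/$rest";
}

# Rewrite every double-quoted http(s) src attribute inside an img tag.
sub rewrite_img_urls {
    my ($content) = @_;
    $content =~ s{
        (<img\b[^>]*?\bsrc=")   # capture 1: the tag up to the opening quote
        (https?://)             # capture 2: the scheme
        ([^"]+)                 # capture 3: host and path
    }{ $1 . wrap_unless_whitelisted($2, $3) }gexi;
    return $content;
}

my $content = join "\n",
    'ABC <img src="http://i.whitelist-domain.com/keep.jpg">',
    'DEF <img src="http://random-url.com/replace.jpg">',
    'GHI <img src="http://cdn.domain.com/keep.jpg">';

print rewrite_img_urls($content), "\n";
```

The /g flag visits every img tag and the /e flag evaluates the replacement as code, so the whitelist test runs once per URL and no second "undo" pass is needed. The limitations listed earlier in this answer still apply.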

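Source

2016-04-03 00:17:51 Marty

Thank you for the detailed analysis of scenarios I had not thought of. I ended up using HTML::TokeParser::Simple to extract the image URLs instead of using REs, matched them against my whitelist, and then saved the result back into the original $content variable. – KDX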
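
The approach described in this comment might look roughly like the sketch below, which collects the tokens in a string instead of printing them. It is a guess at the shape of such a modification, not KDX's actual code; wrap_images is a hypothetical name, the host set is a made-up sample, and HTML::TokeParser::Simple is assumed to be installed.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Assumed whitelist of exact host names (illustration only).
my %white_list   = map { $_ => 1 } qw(cdn.domain.com i.whitelist-domain.com);
my $wrapper_host = 'wrapper-url.com';

# Return a copy of the HTML in which non-whitelisted img hosts are wrapped.
sub wrap_images {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my $result = '';
    while (my $token = $parser->get_token) {
        if ($token->is_start_tag('img')) {
            my $src = $token->get_attr('src');
            # Split the URL into scheme, host and the rest of the URL
            if (defined $src && $src =~ m{^(https?://)([^/]+)(.*)}s) {
                my ($scheme, $host, $rest) = ($1, $2, $3);
                $token->set_attr('src', "$scheme$wrapper_host/$host$rest")
                    unless exists $white_list{lc $host};
            }
        }
        $result .= $token->as_is;   # every token is kept, changed or not
    }
    return $result;
}

my $content = join "\n",
    'ABC <img src="http://i.whitelist-domain.com/keep.jpg">',
    'DEF <img src="http://random-url.com/replace.jpg">',
    'GHI <img src="http://cdn.domain.com/keep.jpg">';

# Save the rewritten HTML back into the original variable
$content = wrap_images($content);
print $content, "\n";
```

Returning the rewritten HTML from a function keeps the parsing loop separate from whatever is later done with $content.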
