2013-10-25 60 views
2
some text and some text too bad, 
some too  bad again some bad 
and other words bad, it is too  bad 

I am trying to replace every occurrence of the word "bad" with "good", with one exception:

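If "bad" is preceded by the word "too", it should not be changed to "good". There may be one or more whitespace characters between "too" and "bad", or even the HTML space &nbsp;.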

So after the regex has processed the text, it should be:

some text and some text too bad, 
    some too  bad again some good 
    and other words good, it is too  bad 

I tried something like this, but it does not work properly.

$text ~= s/(too(\s+|\s* \s*))bad/good/ig; 

Please help.

+2

Regex experts can work miracles, but in the end someone has to understand and maintain such code. –

Answers

-1

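You can try decoding the HTML spaces and then using a regex whose replacement is evaluated as an expression, checking whether the preceding word is too: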

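#!/usr/bin/env perl

use strict;
use warnings;
use HTML::Entities;

while (<DATA>) {
    # Decode only &nbsp; in place, turning it into a non-breaking space.
    _decode_entities($_, { nbsp => "\xA0" });
    # /e evaluates the replacement: keep the whole match ($&) after "too",
    # otherwise rebuild it with "good" in place of "bad".
    s/(\w+)(\s+)bad/$1 eq 'too' ? $& : "$1$2good"/eg;
    # Re-encode so the non-breaking space is printed as &nbsp; again.
    encode_entities($_);
    print $_;
}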

__DATA__ 
some text and some text too bad, 
some too&nbsp; bad again some bad 
and other words bad, it is too  bad 

Run it like:

perl script.pl 

It yields:

some text and some text too bad, 
some too&nbsp; bad again some good 
and other words good, it is too  bad 
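
As a possible alternative, the decode/encode round trip can be avoided by matching &nbsp; literally. The following is only a sketch, not part of the answer above: the name replace_bad is hypothetical, and it assumes that &nbsp; is the only entity form that can appear between "too" and "bad". The first alternative captures "too", its separators and "bad" so that they can be put back unchanged; any other standalone "bad" falls through to the second alternative and becomes "good".

```perl
use strict;
use warnings;

# Hypothetical helper: replace "bad" with "good" unless it follows "too".
sub replace_bad {
    my ($text) = @_;
    # Group 1 is defined only when the "too ... bad" alternative matched;
    # in that case the matched text is kept as it is.
    $text =~ s/(\btoo(?:\s|&nbsp;)+bad\b)|\bbad\b/defined $1 ? $1 : 'good'/gie;
    return $text;
}

print replace_bad("some too&nbsp; bad again some bad\n");
# prints: some too&nbsp; bad again some good
```

Because of /i the match is case-insensitive, but a replaced "Bad" always becomes lowercase "good".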
+0

So a non-breaking space becomes a breakable one? – Borodin

+0

@Borodin: Thanks for noticing that bug. I have added the 'encode_entities()' function to fix it. – Birei

+0

Thanks Borodin and @Birei, it really helped me a lot –

1

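I don't believe this can be done conveniently with a regex. It is complicated further because the idea of a word is unclear: for example, you want "bad," (with a trailing comma) to be treated as the word "bad".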

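This program works by splitting the string into words and separators, and then changing every occurrence of "bad" to "good" unless it is preceded by "too" (ignoring upper and lower case). Besides whitespace and &nbsp;, I have included comma, semicolon, period and colon in the list of possible separators. You may want to adjust this to get the results you expect.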

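use strict;
use warnings;

my $text = <<END;
some text and some text too bad,
some too&nbsp; bad again some bad
and other words bad, it is too  bad
END

# The capturing group makes split return the separators as well,
# so words sit at even indices and separators at odd indices.
my @tokens = split /((?:[\s,;.:]|&nbsp;)+)/, $text;

for my $i (grep { lc $tokens[$_] eq 'bad' } 0 .. $#tokens) {
    # The previous word, if there is one, is two positions back.
    next if $i >= 2 and lc $tokens[$i-2] eq 'too';
    $tokens[$i] = 'good';
}

print join '', @tokens;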

Output

some text and some text too bad, 
some too&nbsp; bad again some good 
and other words good, it is too  bad
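
To see why the program looks two positions back for the preceding word, it may help to print the token list for a short string. This is only an illustrative sketch with a made-up input; it reuses the same split pattern as the program above.

```perl
use strict;
use warnings;

# Same separator pattern as above; the capturing group keeps the separators.
my @tokens = split /((?:[\s,;.:]|&nbsp;)+)/, 'it is too&nbsp; bad';

# Show the tokens with "|" between them.
print join('|', @tokens), "\n";
# prints: it| |is| |too|&nbsp; |bad
```

Here "bad" is at index 6 and "too" at index 4, with the separator "&nbsp; " between them at index 5.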