How to match a string with diacritics in Perl?

For example, match "Nation" in "Îñţérñåţîöñåļîžåţîöñ" without extra modules. Is this possible in newer Perl versions (5.14, 5.15, etc.)?

I found the answer! Thanks to tchrist.

The right solution, using a UCA match (thanks to https://stackoverflow.com/users/471272/tchrist).

# find start/end offsets of the matched UTF-8 substring (non-overlapping) 
use 5.014; 
use strict; 
use warnings; 
use utf8; 
use Unicode::Collate; 
binmode STDOUT, ':encoding(UTF-8)'; 
my $str = "Îñţérñåţîöñåļîžåţîöñ" x 2; 
my $look = "Nation"; 
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1 
    ); 

my @match = $Collator->match($str, $look); 
if (@match) { 
    my $found = $match[0]; 
    my $f_len = length($found); 
    say "match result: $found (length is $f_len)"; 
    my $offset = 0; 
    while ((my $start = index($str, $found, $offset)) != -1) {             
     my $end = $start + $f_len; 
     say sprintf("found at: %s,%s", $start, $end); 
     $offset = $end; # resume right after this match so an adjacent match is not skipped 
    } 
}
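The loop above searches with Perl's built-in index for exact copies of the first matched substring, so a later occurrence spelled with different diacritics would be missed. A minimal sketch of an alternative, assuming the same input strings, uses the collator's own index method, which in list context returns the position and length of the next match (and an empty list when there is none). The @spans array is a name introduced here for illustration only.

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
my $look = "Nation";

# normalization must be undef: the search methods croak otherwise,
# because offsets would refer to the normalized string.
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);

my @spans;        # hypothetical helper: list of [start, end) pairs
my $offset = 0;

# In list context index() returns (position, length), or () if nothing is found.
while (my ($pos, $len) = $Collator->index($str, $look, $offset)) {
    push @spans, [$pos, $pos + $len];
    $offset = $pos + $len;    # continue right after the current match
}

for my $span (@spans) {
    my ($start, $end) = @$span;
    say sprintf("found at: %s,%s (%s)", $start, $end,
                substr($str, $start, $end - $start));
}
```

Each reported substring compares equal to $look at level 1, whatever diacritics or letter case it carries.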

Wrong (but working) solution

The magic piece of code:

$str = Unicode::Normalize::NFD($str); $str =~ s/\pM//g;

Code example:

use 5.014; 
    use utf8; 
    use Unicode::Normalize; 

    binmode STDOUT, ':encoding(UTF-8)'; 
    my $str = "Îñţérñåţîöñåļîžåţîöñ"; 
    my $look = "Nation"; 
    say "before: $str\n"; 
    $str = NFD($str); 
    # M is short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html) 
    $str =~ s/\pM//og; # remove "marks" 
    say "after: $str"; 
    say "is_match: ", $str =~ /$look/i || 0;


2011-09-15 nordicdyno

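+1 for the hairy example. – Bojangles

I don't know if there is any direct support, but you could normalize to fully decomposed form and then strip any characters with the "joining" property (ISTR there is such a property, though I don't know what it is called) – tripleee

Googling "perl remove all diacritics" turns up a lot of promising-looking matches –

The right solution with UCA (thanks to tchrist):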

# check whether $str contains $look at collation level 1 (diacritics and case ignored) 
use 5.014; 
use utf8; 
use Unicode::Collate; 
binmode STDOUT, ':encoding(UTF-8)'; 
my $str = "Îñţérñåţîöñåļîžåţîöñ" x 2; 
my $look = "Nation"; 
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1 
    ); 

my @match = $Collator->match($str, $look); 
say "match ok!" if @match;

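P.S. "Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment." © tchrist, Why does modern Perl avoid UTF-8 by default?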


2011-09-16 06:06:32 nordicdyno

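What do you mean by "without extra modules"?

Here is a solution with use Unicode::Normalize; (see on perl doc).

I removed the "ţ" and the "ļ" from your string; my Eclipse did not want to save the script with them.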

use strict; 
use warnings; 
use utf8; # pragma names are case-sensitive: it is utf8, not UTF8 
use Unicode::Normalize; 

my $str = "Îñtérñåtîöñålîžåtîöñ"; 

for ($str) { # the variable we work on 
    ## convert to Unicode first 
    ## if your data comes in Latin-1, then uncomment: 
    #$_ = Encode::decode('iso-8859-1', $_); 
    $_ = NFD($_); ## decompose 
    s/\pM//g;   ## strip combining characters 
    s/[^\0-\x80]//g; ## clear everything else 
} 

if ($str =~ /nation/) { 
    print $str . "\n"; 
}

The output is

Internationaliation

The "ž" is removed from the string; it seems not to be a combining character.

The code in the for loop is from this page: How to remove diacritic marks from characters

Another interesting read is from Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
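A quick check of how "ž" behaves under NFD, as a sketch that assumes the script is saved as UTF-8 and that use utf8 is in effect. Under those assumptions the letter decomposes into "z" plus a combining caron, so stripping marks should leave the "z" in place; losing the whole letter would then point at the file encoding rather than at the character itself. The variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize;

binmode STDOUT, ':encoding(UTF-8)';

# Decompose the precomposed letter (U+017E) into base letter + combining mark.
my $decomposed = NFD("ž");
my @codepoints = map { sprintf "U+%04X", ord } split //, $decomposed;

# Strip the combining marks, as the for loop in the answer does.
(my $stripped = $decomposed) =~ s/\pM//g;

print "code points after NFD: @codepoints\n";
print "after stripping marks: $stripped\n";
```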

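Update:

As @tchrist pointed out, there is an algorithm that is better suited to this, called UCA (Unicode Collation Algorithm). @nordicdyno has already provided an implementation in his question.

The algorithm is described here: Unicode Technical Standard #10, Unicode Collation Algorithm

The Perl module is described here on perldoc.perl.org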
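To make the idea of collation levels concrete, here is a small sketch that compares the same strings at levels 1 to 3 with Unicode::Collate's eq method, assuming the default collation table shipped with the module. Level 1 looks only at base letters, level 2 adds diacritics, and level 3 adds case. The %eq hash is a name made up for this illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my %eq;    # hypothetical: level => { diacritics => 0|1, case => 0|1 }

for my $level (1 .. 3) {
    my $collator = Unicode::Collate->new(level => $level);
    $eq{$level} = {
        # same base letters, different diacritics
        diacritics => $collator->eq("nation", "ñåţîöñ") ? 1 : 0,
        # same letters, different case
        case       => $collator->eq("nation", "Nation") ? 1 : 0,
    };
    printf "level %d: diacritics ignored=%d, case ignored=%d\n",
        $level, $eq{$level}{diacritics}, $eq{$level}{case};
}
```

Only level 1 treats all three spellings as equal, which is why the matching code in the question creates its collator with level => 1.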


2011-09-15 12:29:24 stema

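Thanks! In my environment the "ž" was not removed and everything works fine. (vim + Mac OS X + perl 5.14.0) – nordicdyno

This is not the way to do this. You want a level-one UCA match, which is primary strength only and therefore disregards diacritics. – tchrist

@tchrist I have learned a lot about Unicode from your answers and comments (thanks), but I think it is still not enough. To be honest, I have no idea what your comment means. (What does UCA stand for?) – stema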
