Normalizing ASCII characters

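I need to normalize strings such as "quée", but I can't seem to convert extended ASCII characters (é, á, í, and so on) into their Roman/English equivalents. I've tried several different approaches, but nothing has worked so far. There is a fair amount of material on this topic, but I can't find an answer that actually solves the problem.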

Here is my code:

#transliteration solution (works great with standard chars but doesn't find the 
#special ones) - I've tried looking for both \x{130} and é with the same result. 
$mystring =~ tr/\\x{130}/e/; 

#converting into array, then iterating through and replacing the specific char 
#(same result as the above solution) 
my @breakdown = split("",$mystring); 

foreach (@breakdown) { 
    if ($_ eq "\x{130}") { 
        $_ = "e"; 
        print "\nArray Output: @breakdown\n"; 
    } 
    $lowercase = join("",@breakdown); 
}

2012-05-24 Andrew Coomes

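1) This article should provide a fairly good (if convoluted) way.

It offers a solution that converts every accented Unicode character into its base character plus a separate accent; once that is done, you can simply remove the accent characters on their own.

2) Another option is on CPAN: Text::Unaccent::PurePerl (an improved pure-Perl version of Text::Unaccent).

3) Also, this SO answer suggests Text::Unidecode: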

$ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")' 
    ete
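
The decompose-then-strip idea from option 1 can be sketched with the core Unicode::Normalize module. This is only a minimal sketch, not the linked article's code: the helper name strip_accents is made up for illustration, and it assumes the input is already a decoded character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;
use utf8;                          # the string literals below are UTF-8
use Unicode::Normalize qw(NFD);    # core module, provides NFD()

# Hypothetical helper: decompose each character, then drop the accents.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "é" becomes "e" followed by U+0301
    $decomposed =~ s/\p{M}//g;     # remove all combining marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_accents("quée"), "\n";      # prints "quee"
print strip_accents("éÉíóñúÑ"), "\n";   # prints "eEionuN"
```

Characters that have no decomposition (for example ¿ or ¡) pass through unchanged, so a separate cleanup step is still needed if those should go as well.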

2012-05-24 17:26:58 DVK

Wonderful solution, it works great! Thanks!

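The reason your original code doesn't work is that \x{130} is not é. It is LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130 or İ). You meant \x{E9} or just \xE9 (the braces are optional for two digits), which is LATIN SMALL LETTER E WITH ACUTE (U+00E9).

Also, you have an extra backslash in your tr; it should look like tr/\xE9/e/.

With those changes, your code will work, although I'd still recommend using one of the modules on CPAN for this sort of thing. I like Text::Unidecode myself, because it handles much more than just accented characters.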
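
For illustration, here is a small sketch of the question's two approaches with both fixes applied. The variable names $via_tr and $via_loop are made up for this example, and the sketch assumes $mystring holds decoded characters; if it held raw UTF-8 bytes, é would be the two bytes \xC3\xA9 and neither comparison would match.

```perl
use strict;
use warnings;

# "\xE9" is U+00E9 (é); the escape avoids depending on the file's encoding.
my $mystring = "qu\xE9e";

# Transliteration with the correct code point and a single backslash.
(my $via_tr = $mystring) =~ tr/\xE9/e/;

# The split-and-loop variant with the same fix. The loop variable
# aliases each array element, so assigning to it updates @breakdown.
my @breakdown = split //, $mystring;
for my $char (@breakdown) {
    $char = 'e' if $char eq "\xE9";
}
my $via_loop = join '', @breakdown;

print "$via_tr\n";      # prints "quee"
print "$via_loop\n";    # prints "quee"
```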

2012-05-24 18:00:26 cjm

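Thanks for your help! I made your changes and it works now. I'm actually using a module in the delivered version, since that seems to be the most elegant approach, but it's good to know I wasn't too far off.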

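After working and reworking it, this is what I have now. It does everything I want, except that I would like to keep the spaces in the middle of the input string so that words stay separate.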

use Unicode::Normalize;   # provides NFD(), used below

open FILE, "funnywords.txt" or die "Cannot open funnywords.txt: $!"; 

# Iterate through funnywords.txt 
while (<FILE>) { 
    chomp; 

    # Show initial text from file 
    print "In: '$_' -> "; 

    my $inputString = $_; 

    # $inputString is scoped within a for each loop which dissects 
    # unicode characters (example: "é" splits into "e" and "´") 
    # and throws away accent marks. Also replaces all 
    # non-alphanumeric characters with spaces and removes 
    # extraneous periods and spaces. 
    for ($inputString) { 
        $inputString = NFD($inputString);   # decompose/dissect 
        s/^\s//; s/\s$//;                   # strip begin/end spaces 
        s/\pM//g;                           # strip odd pieces 
        s/\W+//g;                           # strip non-word chars 
    } 

    # Convert to lowercase 
    my $outputString = "\L$inputString"; 

    # Output final result 
    print "$outputString\n"; 
}

Not entirely sure why it's coloring some of the regexes and comments red...

Here are a few example lines from "funnywords.txt":

quée

22.

？éÉíóñúÑ¿¡

[。這？ ]

褐，阿利

2012-05-25 18:56:22

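For your second question, about getting rid of any remaining symbols while keeping letters and digits: change your last regex from s/\W+//g to s/[^a-zA-Z0-9 ]+//g. Since you have already normalized the rest of the input, that regex will remove anything that is not a-z, A-Z, 0-9, or a space. Square brackets with a ^ at the start mean you want to match every character that is not listed in the rest of the brackets.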
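
A rough sketch of the whole pipeline with that regex swapped in might look like the following. The helper name normalize_line is hypothetical, the sample inputs are made up rather than taken from funnywords.txt, and the input is assumed to be decoded character data.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper combining the steps discussed in this thread.
sub normalize_line {
    my ($line) = @_;
    my $text = NFD($line);           # decompose accented characters
    $text =~ s/\pM//g;               # strip the combining marks
    $text =~ s/[^a-zA-Z0-9 ]+//g;    # keep only ASCII letters, digits, spaces
    $text =~ s/^ +| +$//g;           # trim leading and trailing spaces
    return lc $text;                 # convert to lowercase
}

print normalize_line("aqu\x{ED}, all\x{ED}"), "\n";      # prints "aqui alli"
print normalize_line("\x{BF}Qu\x{E9} tal?"), "\n";       # prints "que tal"
print normalize_line("22."), "\n";                       # prints "22"
```

Because the space is listed inside the negated class, spaces between words survive, which is the behavior that s/\W+//g was removing.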

2012-05-30 23:40:37 Zephyrie
