Non-English characters in Perl
See this block of Perl code:
#!/usr/bin/perl -w -CS
use feature 'unicode_strings';
open IN, "<", "wiki.txt";
open OUT, ">", "wikicorpus.txt";
binmode(IN, ':utf8');
binmode(OUT, ':utf8');
## Condition plain text English sentences or word lists into a form suitable for constructing a vocabulary and language model
while (<IN>) {
# Remove starting and trailing tags (e.g. <s>)
# s/\<[a-z\/]+\>//g;
# Remove ellipses
s/\.\.\./ /g;
# Remove unicode 2500 (hex E2 94 80) used as something like an m-dash between words
# Unicode 2026 (horizontal ellipsis)
# Unicode 2013 and 2014 (en- and em-dash)
s/[\x{2500}\x{2026}\x{2013}\x{2014}]/ /g;
# Remove dashes surrounded by spaces (e.g. phrase - phrase)
s/\s-+\s/ /g;
# Remove dashes between words with no spaces (e.g. word--word)
s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g;
# Remove dash at a word end (e.g. three- to five-year)
s/(\w)-\s/$1 /g;
# Remove some punctuation
s/([\"\?,;:%???!()\[\]{}<>_\.])/ /g;
# Remove quotes
s/[\p{Initial_Punctuation}\p{Final_Punctuation}]/ /g;
# Remove trailing space
s/ $//;
# Remove double single-quotes
s/''//g;
s/ ''/ /g;
# Replace accented e with normal e for consistency with the CMU pronunciation dictionary
s/?/e/g;
# Remove single quotes used as quotation marks (e.g. some 'phrase in quotes')
s/\s'([\w\s]+[\w])'\s/ $1 /g;
# Remove double spaces
s/\s+/ /g;
# Remove leading space
s/^\s+//;
chomp($_);
print OUT uc($_) . "\n";
# print uc($_) . " ";
}
print OUT "\n";
There seems to be a non-English character on line 49, i.e. the line s/?/e/g;. So when I run this, the warning "Quantifier follows nothing in regex" comes up.
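How can I deal with this problem? How can I make Perl recognize the character? I have to run this code with perl 5.10.
Another small question: what does "-CS" in the first line mean?
Thanks, all.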
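The '?' was probably not a '?' when the file was originally written; the file has likely been corrupted by a failed character-set conversion somewhere. – OmnipotentEntity 2012-08-16 05:08:29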
'-CS' means that STDOUT, STDERR and STDIN are assumed to be UTF-8. – OmnipotentEntity 2012-08-16 05:12:35
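To illustrate what -CS does, here is a minimal sketch that sets the UTF-8 layer on the three standard handles explicitly. It is meant as a rough equivalent rather than an exact reimplementation of the switch, and the sample string is only an illustration.

```perl
use strict;
use warnings;

# Roughly what -CS does: mark the three standard handles as UTF-8.
binmode(STDIN,  ':utf8');
binmode(STDOUT, ':utf8');
binmode(STDERR, ':utf8');

# U+00E9 (e-acute) is now written to STDOUT UTF-8 encoded,
# i.e. as the two bytes C3 A9.
print "caf\x{E9}\n";
```

Setting the layers inside the script also avoids relying on the #! line. According to perlrun, since perl 5.10.1 a -C given on the #! line must also be given on the command line, because the standard streams are already set up by the time the #! line is parsed.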
@OmnipotentEntity Please see the comment in the code; I guess the ? should be an accented e. How should I modify it? – Denzel 2012-08-16 05:16:44
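One possible way to modify it, assuming the lost character really was é (U+00E9), is to write the character as a \x{...} escape. The script source then stays pure ASCII and cannot be damaged by another charset conversion. The helper name normalize_e below is hypothetical and only serves the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace e-acute (U+00E9, assumed to be the
# character that was lost) with a plain "e". The \x{E9} escape keeps
# the source file ASCII-only.
sub normalize_e {
    my ($text) = @_;
    $text =~ s/\x{E9}/e/g;
    return $text;
}

# "caf\x{E9}" is "café" written with an escape.
print normalize_e("caf\x{E9}"), "\n";
```

In the script from the question, the same idea would mean replacing s/?/e/g; with s/\x{E9}/e/g;. Because IN is read through the :utf8 layer, an é in wiki.txt is decoded to U+00E9 before the substitution runs. A bare ? is a regex quantifier with nothing before it to repeat, which is what produces the "Quantifier follows nothing in regex" message. As far as I know, use feature 'unicode_strings' only exists from Perl 5.12 onward, so on 5.10 that line would probably have to be removed as well.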