Non-English characters in Perl

See this block of Perl code:

#!/usr/bin/perl -w -CS 

use feature 'unicode_strings'; 

open IN, "<", "wiki.txt"; 
open OUT, ">", "wikicorpus.txt"; 

binmode(IN, ':utf8'); 
binmode(OUT, ':utf8'); 

## Condition plain text English sentences or word lists into a form suitable for constructing a vocabulary and language model 

while (<IN>) { 

    # Remove starting and trailing tags (e.g. <s>) 
    # s/\<[a-z\/]+\>//g; 

    # Remove ellipses 
    s/\.\.\./ /g; 

    # Remove unicode 2500 (hex E2 94 80) used as something like an m-dash between words 
    # Unicode 2026 (horizontal ellipsis) 
    # Unicode 2013 and 2014 (en- and em-dash) 
    s/[\x{2500}\x{2026}\x{2013}\x{2014}]/ /g; 

    # Remove dashes surrounded by spaces (e.g. phrase - phrase) 
    s/\s-+\s/ /g; 

    # Remove dashes between words with no spaces (e.g. word--word) 
    s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g; 

    # Remove dash at a word end (e.g. three- to five-year) 
    s/(\w)-\s/$1 /g; 

    # Remove some punctuation 
    s/([\"\?,;:%???!()\[\]{}<>_\.])/ /g; 

    # Remove quotes 
    s/[\p{Initial_Punctuation}\p{Final_Punctuation}]/ /g; 

    # Remove trailing space 
    s/ $//; 

    # Remove double single-quotes 
    s/''//g; 
    s/ ''/ /g; 

    # Replace accented e with normal e for consistency with the CMU pronunciation dictionary 
    s/?/e/g; 

    # Remove single quotes used as quotation marks (e.g. some 'phrase in quotes') 
    s/\s'([\w\s]+[\w])'\s/ $1 /g; 

    # Remove double spaces 
    s/\s+/ /g; 

    # Remove leading space 
    s/^\s+//; 

    chomp($_); 

    print OUT uc($_) . "\n"; 
# print uc($_) . " "; 
} 

print OUT "\n";

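Line 49 seems to contain a non-English character, namely the line s/?/e/g;. When I run the script, I get the warning Quantifier follows nothing in regex;.

How should I deal with this? How do I make Perl recognize the character? I have to run this code with perl 5.10.

Another small question: what does "-CS" in the first line mean?

Thanks, all.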

2012-08-16 Denzel

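The '?' was not a question mark in the file as originally written; the file was probably corrupted by a failed character-set conversion somewhere along the way. – OmnipotentEntity 2012-08-16 05:08:29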

'-CS' means that STDOUT, STDERR and STDIN are assumed to be utf-8 – OmnipotentEntity 2012-08-16 05:12:35
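To illustrate the comment above, here is a minimal sketch of roughly what -CS arranges, done from inside the script with binmode instead of a command-line switch. The sample word is made up for the demonstration. As far as I know, newer Perls (from 5.10.1 on) reject -C on the #! line unless it is also passed on the command line, so setting the layers explicitly may be the more portable route; treat that as an assumption to check against perlrun for your version.

```perl
use strict;
use warnings;

# Roughly what -CS sets up: a UTF-8 layer on each standard handle.
binmode(STDIN,  ':utf8');
binmode(STDOUT, ':utf8');
binmode(STDERR, ':utf8');

# Hypothetical sample text: "caf" followed by U+00E9 (e with acute accent).
my $word = "caf\x{E9}";

# With the layer in place, the character is written out as UTF-8 bytes.
print "$word\n";
```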

@OmnipotentEntity See the comment in the code; I guess the ? should be an accented e. How should I fix it? – Denzel 2012-08-16 05:16:44

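I think your problem is that your editor doesn't handle unicode characters, so the program is damaged before it ever reaches perl. Since this apparently isn't your own program, it may well have been damaged before it reached you.

Until the whole tool chain handles unicode correctly, you have to be careful to encode non-ascii characters in a way that preserves them. It's a pain, and there is no simple solution. Consult the perl manual for how to embed unicode characters safely.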
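One way to follow that advice is to keep the source file pure ASCII and spell the character by its code point, so no editor or copy/paste step can mangle it. The sketch below shows three equivalent spellings of U+00E9; the variable names are only illustrative. (Typing the literal character under `use utf8;` also works, but only if every tool that touches the file preserves the bytes.)

```perl
use strict;
use warnings;

# Three ASCII-only ways to write U+00E9 (e with acute accent) in source code.
my $by_escape = "\x{E9}";         # hex escape inside a double-quoted string
my $by_chr    = chr(0xE9);        # built from the numeric code point
my $by_pack   = pack("U", 0xE9);  # packed as a Unicode character

# All three hold the same single character.
my $all_equal = ($by_escape eq $by_chr && $by_chr eq $by_pack) ? 1 : 0;

printf "U+%04X, identical: %d\n", ord($by_escape), $all_equal;
```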

2012-08-16 05:16:31 ddyer

Yes, encoding problems do get painful. Your explanation is inspiring. Thanks – Denzel 2012-08-16 05:18:58

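Judging by the comment line just before the failing line, the character to be replaced is an accented "e", presumably one with an acute accent: "é". Assuming your input is Unicode, it can be written in Perl as \x{00E9}. See also http://www.fileformat.info/info/unicode/char/e9/index.htm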
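A minimal sketch of that substitution, assuming the damaged character really was U+00E9. The helper name and the sample string are hypothetical, not part of the original script. In the script itself the equivalent change would be to write the failing line as s/\x{00E9}/e/g;, which relies on the input already being decoded (the script's binmode(IN, ':utf8') takes care of that).

```perl
use strict;
use warnings;

# Hypothetical helper: replace every U+00E9 with a plain ASCII "e".
sub strip_e_acute {
    my ($text) = @_;
    $text =~ s/\x{00E9}/e/g;
    return $text;
}

# Hypothetical sample, written with escapes so the source stays ASCII.
my $sample = "caf\x{00E9} r\x{00E9}sum\x{00E9}";
my $plain  = strip_e_acute($sample);

print "$plain\n";   # prints "cafe resume"
```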

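I guess you copy/pasted this script from a web page on a server that was not configured to serve the page in the correct character encoding. See also http://en.wikipedia.org/wiki/Mojibake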

2012-08-16 06:09:19 tripleee

Exactly. Copy-paste is a nightmare. – Denzel 2012-08-16 19:10:03
