7

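I want to parse UTF-8 strings into "bite-sized" segments. For example, I want to break text into "sentences". Is there a set of characters that covers full-stop punctuation for all international languages?

Is there a comprehensive collection of characters (or a regular expression) that corresponds to the end of a sentence in all languages? I'm looking for something that would capture the Latin period, exclamation and question marks, the Chinese and Japanese full stop, and so on.

The same question as above, but for commas as well.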

+0

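Sentence splitting is a hard problem, but I upvoted your question because a) that is not obvious to newcomers and b) it is still useful to learn about the Unicode properties for international full stops and the like. – hippietrail 2013-05-06 22:22:30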

Answers

3

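I haven't come across a compilation of this information, and I expect it would be a major effort to collect it. For some widely used languages, you can get the information from The Chicago Manual of Style. There is some information about punctuation commonly used in different languages at http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters-other.html, but it covers only a small set of languages and does not single out sentence-ending characters.

Using characters alone is not enough, because, for example, in English the period "." appears in many contexts where it does not terminate a sentence, as in "e.g." or "1.5".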
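
To illustrate the point, here is a minimal Python sketch of a purely character-based splitter. The function name `naive_sentences` and the sample text are made up for this illustration. Requiring whitespace after the terminator happens to protect "1.5", but the abbreviation "e.g." still produces a false break.

```python
import re

# Naive rule: break wherever ".", "!" or "?" is followed by whitespace.
NAIVE_BREAK = re.compile(r"(?<=[.!?])\s+")

def naive_sentences(text):
    """Split text after every terminator character that precedes whitespace."""
    return [part for part in NAIVE_BREAK.split(text) if part]

sample = "Prices rose 1.5 percent, e.g. in Europe. Nobody was surprised."
for sentence in naive_sentences(sample):
    print(repr(sentence))
# 'Prices rose 1.5 percent, e.g.'
# 'in Europe.'
# 'Nobody was surprised.'
```

The text contains two sentences, but the splitter reports three.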

+1

Actually it's even worse than that, because some languages don't have sentence markers at all - Thai, for example. – Joel 2012-02-29 21:55:57

+1

Yes, in the Siamese (Thai) I read, a simple space is usually used at the end of a sentence. – JDelage 2012-02-29 22:28:12

+0

Unicode does include that information in its fancier properties. – tchrist 2012-03-01 00:16:11

3

Chinese, Japanese, and Korean use 。 (U+3002 IDEOGRAPHIC FULL STOP). Thai uses a space. See this list of Unicode full stop equivalents.

+1

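For example, the character DIGIT ONE FULL STOP is not an equivalent of the full stop; it is just a number character (its compatibility decomposition contains FULL STOP, but it is certainly not meant to terminate a sentence there). – 2012-03-01 05:25:13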
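
The point about DIGIT ONE FULL STOP can be checked with Python's standard `unicodedata` module. This is only a small sketch; U+2488 is the code point of that character.

```python
import unicodedata

ch = "\u2488"  # DIGIT ONE FULL STOP

print(unicodedata.name(ch))               # DIGIT ONE FULL STOP
print(unicodedata.category(ch))           # No (a number, not punctuation)
print(unicodedata.decomposition(ch))      # <compat> 0031 002E
print(unicodedata.normalize("NFKC", ch))  # 1.

# A real FULL STOP, for comparison, is punctuation.
print(unicodedata.category("."))          # Po
```

The compatibility decomposition contains U+002E, which is why the character turns up in lists of "equivalents", yet its general category marks it as a number.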

6

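You need to look at the code points that have either the \p{Sentence_Break=STerm} or the \p{Sentence_Break=ATerm} property and that also have the \p{Terminal_Punctuation} property. Running the unichars script against Unicode v6.1, we learn that these code points satisfy all of those criteria: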

$ unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}' 
U+00021 ‭ ! GC=Po SC=Common  EXCLAMATION MARK 
U+0002E ‭ . GC=Po SC=Common  FULL STOP 
U+0003F ‭ ? GC=Po SC=Common  QUESTION MARK 
U+00589 ‭ ։ GC=Po SC=Common  ARMENIAN FULL STOP 
U+0061F ‭ ؟ GC=Po SC=Common  ARABIC QUESTION MARK 
U+006D4 ‭ ۔ GC=Po SC=Arabic  ARABIC FULL STOP 
U+00700 ‭ ܀ GC=Po SC=Syriac  SYRIAC END OF PARAGRAPH 
U+00701 ‭ ܁ GC=Po SC=Syriac  SYRIAC SUPRALINEAR FULL STOP 
U+00702 ‭ ܂ GC=Po SC=Syriac  SYRIAC SUBLINEAR FULL STOP 
U+007F9 ‭ ߹ GC=Po SC=Nko   NKO EXCLAMATION MARK 
U+00964 ‭ । GC=Po SC=Common  DEVANAGARI DANDA 
U+00965 ‭ ॥ GC=Po SC=Common  DEVANAGARI DOUBLE DANDA 
U+0104A ‭ ၊ GC=Po SC=Myanmar  MYANMAR SIGN LITTLE SECTION 
U+0104B ‭ ။ GC=Po SC=Myanmar  MYANMAR SIGN SECTION 
U+01362 ‭ ። GC=Po SC=Ethiopic  ETHIOPIC FULL STOP 
U+01367 ‭ ፧ GC=Po SC=Ethiopic  ETHIOPIC QUESTION MARK 
U+01368 ‭ ፨ GC=Po SC=Ethiopic  ETHIOPIC PARAGRAPH SEPARATOR 
U+0166E ‭ ᙮ GC=Po SC=Canadian_Aboriginal CANADIAN SYLLABICS FULL STOP 
U+01803 ‭ ᠃ GC=Po SC=Common  MONGOLIAN FULL STOP 
U+01809 ‭ ᠉ GC=Po SC=Mongolian MONGOLIAN MANCHU FULL STOP 
U+01944 ‭ ᥄ GC=Po SC=Limbu  LIMBU EXCLAMATION MARK 
U+01945 ‭ ᥅ GC=Po SC=Limbu  LIMBU QUESTION MARK 
U+01AA8 ‭ ᪨ GC=Po SC=Tai_Tham  TAI THAM SIGN KAAN 
U+01AA9 ‭ ᪩ GC=Po SC=Tai_Tham  TAI THAM SIGN KAANKUU 
U+01AAA ‭ ᪪ GC=Po SC=Tai_Tham  TAI THAM SIGN SATKAAN 
U+01AAB ‭ ᪫ GC=Po SC=Tai_Tham  TAI THAM SIGN SATKAANKUU 
U+01B5A ‭ ᭚ GC=Po SC=Balinese  BALINESE PANTI 
U+01B5B ‭ ᭛ GC=Po SC=Balinese  BALINESE PAMADA 
U+01B5E ‭ ᭞ GC=Po SC=Balinese  BALINESE CARIK SIKI 
U+01B5F ‭ ᭟ GC=Po SC=Balinese  BALINESE CARIK PAREREN 
U+01C3B ‭ ᰻ GC=Po SC=Lepcha  LEPCHA PUNCTUATION TA-ROL 
U+01C3C ‭ ᰼ GC=Po SC=Lepcha  LEPCHA PUNCTUATION NYET THYOOM TA-ROL 
U+01C7E ‭ ᱾ GC=Po SC=Ol_Chiki  OL CHIKI PUNCTUATION MUCAAD 
U+01C7F ‭ ᱿ GC=Po SC=Ol_Chiki  OL CHIKI PUNCTUATION DOUBLE MUCAAD 
U+0203C ‭ ‼ GC=Po SC=Common  DOUBLE EXCLAMATION MARK 
U+0203D ‭ ‽ GC=Po SC=Common  INTERROBANG 
U+02047 ‭ ⁇ GC=Po SC=Common  DOUBLE QUESTION MARK 
U+02048 ‭ ⁈ GC=Po SC=Common  QUESTION EXCLAMATION MARK 
U+02049 ‭ ⁉ GC=Po SC=Common  EXCLAMATION QUESTION MARK 
U+02E2E ‭ ⸮ GC=Po SC=Common  REVERSED QUESTION MARK 
U+03002 ‭ 。 GC=Po SC=Common  IDEOGRAPHIC FULL STOP 
U+0A4FF ‭ ꓿ GC=Po SC=Lisu   LISU PUNCTUATION FULL STOP 
U+0A60E ‭ ꘎ GC=Po SC=Vai   VAI FULL STOP 
U+0A60F ‭ ꘏ GC=Po SC=Vai   VAI QUESTION MARK 
U+0A6F3 ‭ ꛳ GC=Po SC=Bamum  BAMUM FULL STOP 
U+0A6F7 ‭ ꛷ GC=Po SC=Bamum  BAMUM QUESTION MARK 
U+0A876 ‭ ꡶ GC=Po SC=Phags_Pa  PHAGS-PA MARK SHAD 
U+0A877 ‭ ꡷ GC=Po SC=Phags_Pa  PHAGS-PA MARK DOUBLE SHAD 
U+0A8CE ‭ ꣎ GC=Po SC=Saurashtra SAURASHTRA DANDA 
U+0A8CF ‭ ꣏ GC=Po SC=Saurashtra SAURASHTRA DOUBLE DANDA 
U+0A92F ‭ ꤯ GC=Po SC=Kayah_Li  KAYAH LI SIGN SHYA 
U+0A9C8 ‭ ꧈ GC=Po SC=Javanese  JAVANESE PADA LINGSA 
U+0A9C9 ‭ ꧉ GC=Po SC=Javanese  JAVANESE PADA LUNGSI 
U+0AA5D ‭ ꩝ GC=Po SC=Cham   CHAM PUNCTUATION DANDA 
U+0AA5E ‭ ꩞ GC=Po SC=Cham   CHAM PUNCTUATION DOUBLE DANDA 
U+0AA5F ‭ ꩟ GC=Po SC=Cham   CHAM PUNCTUATION TRIPLE DANDA 
U+0AAF0 ‭ ꫰ GC=Po SC=Meetei_Mayek MEETEI MAYEK CHEIKHAN 
U+0AAF1 ‭ ꫱ GC=Po SC=Meetei_Mayek MEETEI MAYEK AHANG KHUDAM 
U+0ABEB ‭ ꯫ GC=Po SC=Meetei_Mayek MEETEI MAYEK CHEIKHEI 
U+0FE52 ‭ ﹒ GC=Po SC=Common  SMALL FULL STOP 
U+0FE56 ‭ ﹖ GC=Po SC=Common  SMALL QUESTION MARK 
U+0FE57 ‭ ﹗ GC=Po SC=Common  SMALL EXCLAMATION MARK 
U+0FF01 ‭ ! GC=Po SC=Common  FULLWIDTH EXCLAMATION MARK 
U+0FF0E ‭ . GC=Po SC=Common  FULLWIDTH FULL STOP 
U+0FF1F ‭ ? GC=Po SC=Common  FULLWIDTH QUESTION MARK 
U+0FF61 ‭ 。 GC=Po SC=Common  HALFWIDTH IDEOGRAPHIC FULL STOP 
U+11047 ‭ GC=Po SC=Brahmi  BRAHMI DANDA 
U+11048 ‭ GC=Po SC=Brahmi  BRAHMI DOUBLE DANDA 
U+110BE ‭ GC=Po SC=Kaithi  KAITHI SECTION MARK 
U+110BF ‭ GC=Po SC=Kaithi  KAITHI DOUBLE SECTION MARK 
U+110C0 ‭ GC=Po SC=Kaithi  KAITHI DANDA 
U+110C1 ‭ GC=Po SC=Kaithi  KAITHI DOUBLE DANDA 
U+11141 ‭ GC=Po SC=Chakma  CHAKMA DANDA 
U+11142 ‭ GC=Po SC=Chakma  CHAKMA DOUBLE DANDA 
U+11143 ‭ GC=Po SC=Chakma  CHAKMA QUESTION MARK 
U+111C5 ‭ GC=Po SC=Sharada  SHARADA DANDA 
U+111C6 ‭ GC=Po SC=Sharada  SHARADA DOUBLE DANDA 
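
As a rough illustration of how such a list might be used, the Python sketch below hard-codes a small hand-copied subset of the code points above, because the standard `re` module has no `\p{...}` property support. The names `TERMINATORS` and `split_on_terminators` are invented for this example, and the approach is deliberately naive.

```python
import re

# A hand-copied subset of the code points listed above (not the full list).
TERMINATORS = (
    "\u0021"  # EXCLAMATION MARK
    "\u002E"  # FULL STOP
    "\u003F"  # QUESTION MARK
    "\u0589"  # ARMENIAN FULL STOP
    "\u061F"  # ARABIC QUESTION MARK
    "\u06D4"  # ARABIC FULL STOP
    "\u0964"  # DEVANAGARI DANDA
    "\u0965"  # DEVANAGARI DOUBLE DANDA
    "\u1362"  # ETHIOPIC FULL STOP
    "\u203C"  # DOUBLE EXCLAMATION MARK
    "\u203D"  # INTERROBANG
    "\u3002"  # IDEOGRAPHIC FULL STOP
    "\uFF01"  # FULLWIDTH EXCLAMATION MARK
    "\uFF0E"  # FULLWIDTH FULL STOP
    "\uFF1F"  # FULLWIDTH QUESTION MARK
    "\uFF61"  # HALFWIDTH IDEOGRAPHIC FULL STOP
)

# A segment is text up to and including a run of terminators plus trailing
# whitespace, or a final run of text that has no terminator at all.
SEGMENT = re.compile(
    r"[^{t}]*[{t}]+\s*|[^{t}]+".format(t=re.escape(TERMINATORS))
)

def split_on_terminators(text):
    """Return the segments of text that end in one of TERMINATORS."""
    pieces = (m.group().strip() for m in SEGMENT.finditer(text))
    return [p for p in pieces if p]

print(split_on_terminators("今日は晴れです。明日は雨でしょう。 Really?! Yes."))
# ['今日は晴れです。', '明日は雨でしょう。', 'Really?!', 'Yes.']
print(split_on_terminators("Version 1.5 is out."))
# ['Version 1.', '5 is out.']
```

The second call shows why a character list alone does not give correct sentence boundaries.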

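To go the other way around - that is, to find the properties of a given code point rather than the code points that have a given set of properties - use the companion uniprops script, which prints all the properties of a given code point: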

$ uniprops -a . \? \! 
U+002E ‹.› \N{FULL STOP} 
    \pP \p{Po} 
    All Any ASCII Assigned Basic_Latin Case_Ignorable CI Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn 
     Pattern_Syntax PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print 
     X_POSIX_Punct 
    Age=1.1 Block=Basic_Latin Bidi_Class=Common_Separator BC=CS Bidi_Class=CS Block=ASCII BLK=ASCII Canonical_Combining_Class=0 
     Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na 
     East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA 
     Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U 
     Line_Break=Infix_Numeric LB=IS Line_Break=IS Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 
     Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 
     IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=AT Sentence_Break=ATerm SB=AT 
     Word_Break=MB Word_Break=MidNumLet WB=MB _Case_Ignorable _X_Begin 
U+003F ‹?› \N{QUESTION MARK} 
    \pP \p{Po} 
    All Any ASCII Assigned Basic_Latin Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn Pattern_Syntax PatSyn 
     POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct 
    Age=1.1 Block=Basic_Latin Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=ASCII BLK=ASCII Canonical_Combining_Class=0 
     Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na 
     East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA 
     Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U 
     Line_Break=EX Line_Break=Exclamation LB=EX Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 
     Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 
     IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=ST Sentence_Break=STerm SB=ST 
     Word_Break=Other WB=XX Word_Break=XX _X_Begin 
U+0021 ‹!› \N{EXCLAMATION MARK} 
    \pP \p{Po} 
    All Any ASCII Assigned Basic_Latin Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn Pattern_Syntax PatSyn 
     POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct 
    Age=1.1 Block=Basic_Latin Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=ASCII BLK=ASCII Canonical_Combining_Class=0 
     Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na 
     East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA 
     Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U 
     Line_Break=EX Line_Break=Exclamation LB=EX Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 
     Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 
     IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=ST Sentence_Break=STerm SB=ST 
     Word_Break=Other WB=XX Word_Break=XX _X_Begin 

I suspect you should look more closely into the whole Sentence_Break property.

There is also a 3rd script in the suite, uninames, which does things like this:

$ uninames sentence 
; 037E  GREEK QUESTION MARK 
     = erotimatiko 
     * sentence-final punctuation 
     * 003B is the preferred character 
     x (question mark - 003F) 
     : 003B semicolon 
⁚ 205A  TWO DOT PUNCTUATION 
     * historically used to indicate the end of a sentence or change of speaker 
     * extends from baseline to cap height 
     x (presentation form for vertical two dot leader - FE30) 
     x (greek acrophonic epidaurean two - 1015B) 
    110BE  KAITHI SECTION MARK 
     * marks end of sentence 

I find these three programs indispensable for exploring Unicode properties. You can install them via the CPAN Unicode::Tussle suite, or inspect them individually here.

+3

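The Sentence_Break property classifies characters according to whether they *may* terminate a sentence or some other grammatical construct. This information is not language-sensitive; a sentence terminator in one language may be just a word separator in another. UAX #29, http://unicode.org/reports/tr29/, contains some information on using this data for text segmentation, along with considerable caveats. – 2012-03-01 05:39:59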
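
To give a feel for what context-sensitive rules add on top of a bare character list, here is a much-simplified Python sketch. It is not the UAX #29 algorithm. It only imitates two ideas found there: no break between a full stop and a following digit, and no break when the next word starts with a lowercase letter. The name `split_with_context` and the tiny terminator sets are assumptions made for this illustration.

```python
import re

ATERM = "."                 # ambiguous terminator (Sentence_Break=ATerm)
CANDIDATE = re.compile(r"[.!?\u3002]+")  # "." plus a few STerm characters

def split_with_context(text):
    """Split at terminators, skipping full stops that look sentence-internal."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        rest = text[end:]
        if match.group()[-1] in ATERM:
            # "1.5": a digit directly after the full stop means no break.
            if rest[:1].isdigit():
                continue
            # "e.g. in": a lowercase continuation means no break.
            if rest.lstrip()[:1].islower():
                continue
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_with_context(
    "Prices rose 1.5 percent, e.g. in Europe. Nobody was surprised."
))
# ['Prices rose 1.5 percent, e.g. in Europe.', 'Nobody was surprised.']
print(split_with_context("Mr. Smith arrived."))
# ['Mr.', 'Smith arrived.']
```

The second call shows one of the limitations: an abbreviation followed by a capitalized word is still treated as a sentence end, which cannot be fixed without language-specific knowledge.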
