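I want to parse UTF-8 strings into "bite-sized" segments. For example, I would like to break text into "sentences". Is there a set of characters that covers full-stop punctuation for all languages?
Is there a comprehensive collection of characters (or a regular expression) that corresponds to end-of-sentence across all languages? I am looking for something that would capture the Latin period, the exclamation and question marks, the Chinese and Japanese full stops, and so on.
The same question as above, but for commas as well.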
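I have not come across a compilation of this information, and I expect that collecting it would be a major effort. For some widely used languages, you can get the information from The Chicago Manual of Style. There is some information on the punctuation commonly used in different languages at http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters-other.html, but it covers only a small set of languages and does not single out sentence-final characters.
Looking at characters alone is not enough, because in English, for example, the period "." appears in many contexts where it does not terminate a sentence, as in "e.g." or "1.5".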
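To make the point above concrete, here is a minimal Python sketch of a deliberately naive splitter that treats every ".", "!" and "?" as a sentence end. The function name `naive_split` and the sample text are made up for illustration.

```python
import re

def naive_split(text):
    """Cut after every run of '.', '!' or '?', ignoring context (deliberately naive)."""
    # Each match is a run of non-terminators followed by any terminators.
    pieces = re.findall(r"[^.!?]+[.!?]*", text)
    return [p.strip() for p in pieces if p.strip()]

sample = "It costs 1.5 dollars, e.g. at the market. Cheap!"
for piece in naive_split(sample):
    print(repr(piece))
```

The number "1.5" and the abbreviation "e.g." are both cut apart, which is why a character list alone cannot decide where sentences end.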
Chinese, Japanese and Korean use 。. Thai uses a space. See this list of Unicode full stop equivalents.
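For example, the character DIGIT ONE FULL STOP is not an equivalent of the full stop; it is just a number character (it has a compatibility decomposition that contains FULL STOP, but it should certainly not be treated as terminating a sentence). – 2012-03-01 05:25:13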
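You need to look at code points that have the \p{Sentence_Break=STerm} or \p{Sentence_Break=ATerm} property and that also have the \p{Terminal_Punctuation} property. Running the unichars script against Unicode v6.1, we learn that these code points satisfy all of those conditions: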
$ unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
U+00021 ! GC=Po SC=Common EXCLAMATION MARK
U+0002E . GC=Po SC=Common FULL STOP
U+0003F ? GC=Po SC=Common QUESTION MARK
U+00589 ։ GC=Po SC=Common ARMENIAN FULL STOP
U+0061F ؟ GC=Po SC=Common ARABIC QUESTION MARK
U+006D4 ۔ GC=Po SC=Arabic ARABIC FULL STOP
U+00700 ܀ GC=Po SC=Syriac SYRIAC END OF PARAGRAPH
U+00701 ܁ GC=Po SC=Syriac SYRIAC SUPRALINEAR FULL STOP
U+00702 ܂ GC=Po SC=Syriac SYRIAC SUBLINEAR FULL STOP
U+007F9 ߹ GC=Po SC=Nko NKO EXCLAMATION MARK
U+00964 । GC=Po SC=Common DEVANAGARI DANDA
U+00965 ॥ GC=Po SC=Common DEVANAGARI DOUBLE DANDA
U+0104A ၊ GC=Po SC=Myanmar MYANMAR SIGN LITTLE SECTION
U+0104B ။ GC=Po SC=Myanmar MYANMAR SIGN SECTION
U+01362 ። GC=Po SC=Ethiopic ETHIOPIC FULL STOP
U+01367 ፧ GC=Po SC=Ethiopic ETHIOPIC QUESTION MARK
U+01368 ፨ GC=Po SC=Ethiopic ETHIOPIC PARAGRAPH SEPARATOR
U+0166E ᙮ GC=Po SC=Canadian_Aboriginal CANADIAN SYLLABICS FULL STOP
U+01803 ᠃ GC=Po SC=Common MONGOLIAN FULL STOP
U+01809 ᠉ GC=Po SC=Mongolian MONGOLIAN MANCHU FULL STOP
U+01944 ᥄ GC=Po SC=Limbu LIMBU EXCLAMATION MARK
U+01945 ᥅ GC=Po SC=Limbu LIMBU QUESTION MARK
U+01AA8 ᪨ GC=Po SC=Tai_Tham TAI THAM SIGN KAAN
U+01AA9 ᪩ GC=Po SC=Tai_Tham TAI THAM SIGN KAANKUU
U+01AAA ᪪ GC=Po SC=Tai_Tham TAI THAM SIGN SATKAAN
U+01AAB ᪫ GC=Po SC=Tai_Tham TAI THAM SIGN SATKAANKUU
U+01B5A ᭚ GC=Po SC=Balinese BALINESE PANTI
U+01B5B ᭛ GC=Po SC=Balinese BALINESE PAMADA
U+01B5E ᭞ GC=Po SC=Balinese BALINESE CARIK SIKI
U+01B5F ᭟ GC=Po SC=Balinese BALINESE CARIK PAREREN
U+01C3B ᰻ GC=Po SC=Lepcha LEPCHA PUNCTUATION TA-ROL
U+01C3C ᰼ GC=Po SC=Lepcha LEPCHA PUNCTUATION NYET THYOOM TA-ROL
U+01C7E ᱾ GC=Po SC=Ol_Chiki OL CHIKI PUNCTUATION MUCAAD
U+01C7F ᱿ GC=Po SC=Ol_Chiki OL CHIKI PUNCTUATION DOUBLE MUCAAD
U+0203C ‼ GC=Po SC=Common DOUBLE EXCLAMATION MARK
U+0203D ‽ GC=Po SC=Common INTERROBANG
U+02047 ⁇ GC=Po SC=Common DOUBLE QUESTION MARK
U+02048 ⁈ GC=Po SC=Common QUESTION EXCLAMATION MARK
U+02049 ⁉ GC=Po SC=Common EXCLAMATION QUESTION MARK
U+02E2E ⸮ GC=Po SC=Common REVERSED QUESTION MARK
U+03002 。 GC=Po SC=Common IDEOGRAPHIC FULL STOP
U+0A4FF ꓿ GC=Po SC=Lisu LISU PUNCTUATION FULL STOP
U+0A60E ꘎ GC=Po SC=Vai VAI FULL STOP
U+0A60F ꘏ GC=Po SC=Vai VAI QUESTION MARK
U+0A6F3 ꛳ GC=Po SC=Bamum BAMUM FULL STOP
U+0A6F7 ꛷ GC=Po SC=Bamum BAMUM QUESTION MARK
U+0A876 ꡶ GC=Po SC=Phags_Pa PHAGS-PA MARK SHAD
U+0A877 ꡷ GC=Po SC=Phags_Pa PHAGS-PA MARK DOUBLE SHAD
U+0A8CE ꣎ GC=Po SC=Saurashtra SAURASHTRA DANDA
U+0A8CF ꣏ GC=Po SC=Saurashtra SAURASHTRA DOUBLE DANDA
U+0A92F ꤯ GC=Po SC=Kayah_Li KAYAH LI SIGN SHYA
U+0A9C8 ꧈ GC=Po SC=Javanese JAVANESE PADA LINGSA
U+0A9C9 ꧉ GC=Po SC=Javanese JAVANESE PADA LUNGSI
U+0AA5D ꩝ GC=Po SC=Cham CHAM PUNCTUATION DANDA
U+0AA5E ꩞ GC=Po SC=Cham CHAM PUNCTUATION DOUBLE DANDA
U+0AA5F ꩟ GC=Po SC=Cham CHAM PUNCTUATION TRIPLE DANDA
U+0AAF0 ꫰ GC=Po SC=Meetei_Mayek MEETEI MAYEK CHEIKHAN
U+0AAF1 ꫱ GC=Po SC=Meetei_Mayek MEETEI MAYEK AHANG KHUDAM
U+0ABEB ꯫ GC=Po SC=Meetei_Mayek MEETEI MAYEK CHEIKHEI
U+0FE52 ﹒ GC=Po SC=Common SMALL FULL STOP
U+0FE56 ﹖ GC=Po SC=Common SMALL QUESTION MARK
U+0FE57 ﹗ GC=Po SC=Common SMALL EXCLAMATION MARK
U+0FF01 ! GC=Po SC=Common FULLWIDTH EXCLAMATION MARK
U+0FF0E . GC=Po SC=Common FULLWIDTH FULL STOP
U+0FF1F ? GC=Po SC=Common FULLWIDTH QUESTION MARK
U+0FF61 。 GC=Po SC=Common HALFWIDTH IDEOGRAPHIC FULL STOP
U+11047 GC=Po SC=Brahmi BRAHMI DANDA
U+11048 GC=Po SC=Brahmi BRAHMI DOUBLE DANDA
U+110BE GC=Po SC=Kaithi KAITHI SECTION MARK
U+110BF GC=Po SC=Kaithi KAITHI DOUBLE SECTION MARK
U+110C0 GC=Po SC=Kaithi KAITHI DANDA
U+110C1 GC=Po SC=Kaithi KAITHI DOUBLE DANDA
U+11141 GC=Po SC=Chakma CHAKMA DANDA
U+11142 GC=Po SC=Chakma CHAKMA DOUBLE DANDA
U+11143 GC=Po SC=Chakma CHAKMA QUESTION MARK
U+111C5 GC=Po SC=Sharada SHARADA DANDA
U+111C6 GC=Po SC=Sharada SHARADA DOUBLE DANDA
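As a rough illustration of how such a list might be used, here is a Python sketch that splits text at a small hand-copied subset of the code points listed above. Python's standard `re` module does not support `\p{Sentence_Break=...}`, so the characters are spelled out explicitly. The names `TERMINATORS` and `split_candidates` are made up for this example, the subset is far from complete, and the result is only a list of candidate boundaries because context is ignored.

```python
import re

# A hand-copied subset of the code points listed above (not the full list).
TERMINATORS = "".join([
    "\u0021",  # EXCLAMATION MARK
    "\u002E",  # FULL STOP
    "\u003F",  # QUESTION MARK
    "\u0589",  # ARMENIAN FULL STOP
    "\u061F",  # ARABIC QUESTION MARK
    "\u06D4",  # ARABIC FULL STOP
    "\u0964",  # DEVANAGARI DANDA
    "\u1362",  # ETHIOPIC FULL STOP
    "\u3002",  # IDEOGRAPHIC FULL STOP
    "\uFF01",  # FULLWIDTH EXCLAMATION MARK
    "\uFF0E",  # FULLWIDTH FULL STOP
    "\uFF1F",  # FULLWIDTH QUESTION MARK
    "\uFF61",  # HALFWIDTH IDEOGRAPHIC FULL STOP
])

_ESCAPED = re.escape(TERMINATORS)
# One segment: a run of non-terminators followed by any run of terminators.
_SEGMENT = re.compile("[^" + _ESCAPED + "]+[" + _ESCAPED + "]*")

def split_candidates(text):
    """Return candidate sentences, cutting after each run of terminators."""
    return [s.strip() for s in _SEGMENT.findall(text) if s.strip()]

print(split_candidates("今日は晴れです。明日は雨ですか？Yes!"))
```

Because "." is in the set, this still cuts inside "1.5" and "e.g."; the list only says which characters can end a sentence, not when they do.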
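To go the other way around, that is, to find the properties of a given code point rather than the code points that have a given set of properties, use the companion uniprops script, which prints all the properties of a given code point: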
$ uniprops -a . \? \!
U+002E ‹.› \N{FULL STOP}
\pP \p{Po}
All Any ASCII Assigned Basic_Latin Case_Ignorable CI Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
Pattern_Syntax PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print
X_POSIX_Punct
Age=1.1 Block=Basic_Latin Bidi_Class=Common_Separator BC=CS Bidi_Class=CS Block=ASCII BLK=ASCII Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na
East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
Line_Break=Infix_Numeric LB=IS Line_Break=IS Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0
Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0
IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=AT Sentence_Break=ATerm SB=AT
Word_Break=MB Word_Break=MidNumLet WB=MB _Case_Ignorable _X_Begin
U+003F ‹?› \N{QUESTION MARK}
\pP \p{Po}
All Any ASCII Assigned Basic_Latin Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn Pattern_Syntax PatSyn
POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
Age=1.1 Block=Basic_Latin Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=ASCII BLK=ASCII Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na
East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
Line_Break=EX Line_Break=Exclamation LB=EX Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0
Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0
IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=ST Sentence_Break=STerm SB=ST
Word_Break=Other WB=XX Word_Break=XX _X_Begin
U+0021 ‹!› \N{EXCLAMATION MARK}
\pP \p{Po}
All Any ASCII Assigned Basic_Latin Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn Pattern_Syntax PatSyn
POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
Age=1.1 Block=Basic_Latin Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=ASCII BLK=ASCII Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na
East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
Line_Break=EX Line_Break=Exclamation LB=EX Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0
Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0
IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=ST Sentence_Break=STerm SB=ST
Word_Break=Other WB=XX Word_Break=XX _X_Begin
I suspect you should look further into the whole Sentence_Break property.
There is also a 3rd script in the suite, uninames, which does things like this:
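For readers without the Perl scripts at hand, a much smaller lookup is possible with Python's standard `unicodedata` module. This is only a sketch: `unicodedata` exposes a handful of properties (name, general category, bidi class, East Asian width) and does not expose Sentence_Break or Terminal_Punctuation. The helper name `basic_props` is made up for illustration.

```python
import unicodedata

def basic_props(ch):
    """Return the few Unicode properties that the stdlib exposes for one character."""
    return {
        "code_point": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
    }

for ch in ".?!":
    print(basic_props(ch))
```

The values agree with the corresponding GC, Bidi_Class and East_Asian_Width entries in the uniprops output above.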
$ uninames sentence
; 037E GREEK QUESTION MARK
= erotimatiko
* sentence-final punctuation
* 003B is the preferred character
x (question mark - 003F)
: 003B semicolon
⁚ 205A TWO DOT PUNCTUATION
* historically used to indicate the end of a sentence or change of speaker
* extends from baseline to cap height
x (presentation form for vertical two dot leader - FE30)
x (greek acrophonic epidaurean two - 1015B)
110BE KAITHI SECTION MARK
* marks end of sentence
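A rough Python counterpart is sketched below. It is not equivalent to uninames: Python's `unicodedata` module only knows character names, not the annotations shown above, so it can match "FULL STOP" in a name but cannot find the "marks end of sentence" note. The function name `find_by_name` is made up, and the hits depend on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata

def find_by_name(fragment):
    """Return (code point, name) pairs whose character name contains fragment."""
    fragment = fragment.upper()
    hits = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points (unassigned, surrogates, controls) yield "".
        name = unicodedata.name(chr(cp), "")
        if fragment in name:
            hits.append((cp, name))
    return hits

for cp, name in find_by_name("full stop"):
    print("U+%04X %s" % (cp, name))
```

The results include characters such as DIGIT ONE FULL STOP, which contain "FULL STOP" in their names without being sentence terminators, so a name search is a way to explore, not a way to build the final list.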
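I find these three programs indispensable for exploring Unicode properties. You can install them with the CPAN Unicode::Tussle suite, or examine them individually here.
The Sentence_Break property classifies characters according to whether they *may* terminate a sentence or some other grammatical construct. The information is not language-sensitive, and a sentence terminator in one language may be just a word separator in another. UAX #29, http://unicode.org/reports/tr29/, contains some information on using this property for text segmentation, along with its considerable limitations. – 2012-03-01 05:39:59
Sentence segmentation is a hard problem, but I upvoted your question because a) that is not obvious to newcomers, and b) it is still useful to learn about the Unicode properties for international full stops and the like. – hippietrail 2013-05-06 22:22:30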
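To illustrate why context matters, here is a small Python sketch that treats "." differently from "!" and "?", loosely in the spirit of the ATerm/STerm distinction. It is not an implementation of the UAX #29 rules: it only knows ASCII terminators, and the heuristics (no break before a digit, a letter, or a following lowercase word) are simplifying assumptions chosen for this example. The function name `split_with_context` is made up.

```python
def split_with_context(text):
    """Split at '!' and '?' always, and at '.' only when the context allows it."""
    sentences = []
    start = 0
    n = len(text)
    for i, ch in enumerate(text):
        if ch in "!?":
            is_break = True
        elif ch == ".":
            following = text[i + 1] if i + 1 < n else ""
            # First non-space character after the period, if any.
            k = i + 1
            while k < n and text[k] == " ":
                k += 1
            next_word_char = text[k] if k < n else ""
            if following.isdigit() or following.isalpha():
                is_break = False  # "1.5" or the first period in "e.g."
            elif next_word_char.islower():
                is_break = False  # a lowercase word follows, as in "e.g. at"
            else:
                is_break = True
        else:
            is_break = False
        if is_break:
            piece = text[start:i + 1].strip()
            if piece:
                sentences.append(piece)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_with_context("It costs 1.5 dollars, e.g. at the market. Cheap!"))
print(split_with_context("See Mr. Smith."))
```

The first call keeps "1.5" and "e.g." intact, while the second still breaks after "Mr.", which shows the kind of limitation that remains without language-specific data such as abbreviation lists.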