Regular expressions with latin1 and NLTK

I want to tokenize some text in Portuguese. I think I am doing almost everything right, but there is one problem and I can't work out what is going wrong. I am trying this code:

import nltk
import re

text = '''Família S.A. dispõe de $12.400 milhões para concorrência. A 
âncora desse négócio é conhecida no coração do Órgão responsável. '''
pattern = r'''(?x) # set flag to allow verbose regexps
    ([A-Z]\.)+  # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*  # words with optional internal hyphens
    | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
    | \.\.\.   # ellipsis
    | [][.,;"'?():-_`] # these are separate tokens; includes ], [
    '''

print nltk.regexp_tokenize(text, pattern, flags=re.UNICODE)

and I get this result:

['Fam\xc3', 'lia', 'S.A.', 'disp\xc3\xb5e', 'de', '$12.400', 'milh\xc3\xb5es', 'para', 'concorr\xc3\xaancia', '.', 'A', '\xc3', 'ncora', 'desse', 'n\xc3', 'g\xc3\xb3cio', '\xc3', 'conhecida', 'no', 'cora\xc3', '\xc3', 'o', 'do', '\xc3', 'rg\xc3', 'o', 'respons\xc3', 'vel', '.']

It works as expected for some words, but splits others, such as ['Família' = 'Fam\xc3', 'lia'] or ['coração' = 'cora\xc3', '\xc3', 'o'].

Any help?

Source

2014-10-16 Marcelo

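Have you tried 'for w in text.split(): print w'? – kums 2014-10-16 22:17:27

@kums, I know the split() function does this job, but it also fails when we have punctuation such as ".,;?". Besides, I would like to get this regex solution working because it seems, IMO, very flexible. – Marcelo 2014-10-17 12:01:40

What encoding are you using? Your code works fine for me when I run it in a GUI with UTF-8 set as the default encoding. Your problem appears to be an encoding issue rather than a problem with the code itself. – 2014-10-20 01:31:01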
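To illustrate why an encoding mismatch would produce exactly the splits shown in the question, here is a minimal sketch. It assumes Python 3 and uses only the standard `re` module instead of NLTK; the helper name `tokenize_bytes_as_code_points` is hypothetical. Decoding the bytes as Latin-1 maps each byte to one character, which appears to be how a byte string gets matched under `re.UNICODE` in Python 2.

```python
import re

def tokenize_bytes_as_code_points(raw):
    """Treat every byte as one character (Latin-1 decoding), then match \\w+."""
    return re.findall(r'\w+', raw.decode('latin-1'))

sample = 'Família coração'

# In UTF-8, each accented letter takes two bytes, e.g. 'í' is b'\xc3\xad'.
# The first byte (0xC3) looks like the letter 'Ã', the second often looks
# like a non-word symbol, so the word is cut in the middle.
utf8_tokens = tokenize_bytes_as_code_points(sample.encode('utf-8'))

# In Latin-1, each accented letter is a single byte that is itself a letter,
# so the words stay whole.
latin1_tokens = tokenize_bytes_as_code_points(sample.encode('latin-1'))

print(utf8_tokens)    # ['FamÃ', 'lia', 'coraÃ', 'Ã', 'o']
print(latin1_tokens)  # ['Família', 'coração']
```

The 'Ã' here is the character for byte 0xC3, which matches the '\xc3' fragments in the output above.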

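If someone runs into the same problem, just change the default encoding. For Portuguese, I use the 'latin-1' character set and decode with it when printing the words to get the correct characters. Check this out: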

#!/usr/bin/env python 
# -*- coding: latin-1 -*- 
""" Splitting text in Portuguese (encoding 'latin-1') using regex.
"""
import nltk 
import re 

print "\n****** Using Regex to tokenize ******" 
text = '''Família-Empresa S.A. dispõe de $12.400 milhões para concorrência. A 
âncora, desse negócio, é conhecida no coração do Órgão responsável. ''' 
pattern = r'''(?x) # set flag to allow verbose regexps 
    ([A-Z]\.)+  # abbreviations, e.g. U.S.A. 
    | \w+(-\w+)*  # words with optional internal hyphens 
    | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82% 
    | \.\.\.   # ellipsis 
    | [][.,;"'?():-_`] # these are separate tokens; includes ], [ 
    ''' 
result = nltk.regexp_tokenize(text, pattern, flags=re.UNICODE) 
for w in result: 
    print w.decode('latin-1') 

print result

The result is:

****** Using Regex to tokenize ****** 
Família-Empresa 
S.A. 
dispõe 
de 
$12.400 
milhões 
para 
concorrência 
. 
A 
âncora 
, 
desse 
negócio 
, 
é 
conhecida 
no 
coração 
do 
Órgão 
responsável 
. 
['Fam\xedlia-Empresa', 'S.A.', 'disp\xf5e', 'de', '$12.400', 'milh\xf5es', 'para', 'concorr\xeancia', '.', 'A', '\xe2ncora', ',', 'desse', 'neg\xf3cio', ',', '\xe9', 'conhecida', 'no', 'cora\xe7\xe3o', 'do', '\xd3rg\xe3o', 'respons\xe1vel', '.']
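For readers on Python 3, the same idea can be sketched by decoding the bytes to `str` before tokenizing. This is only an illustration under stated assumptions: it uses `re.findall` from the standard library rather than `nltk.regexp_tokenize`, the groups are made non-capturing so `findall` returns whole matches, the hyphen in the punctuation class is moved to the end so it is read literally, and the names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re

# Same alternatives as the pattern above, with non-capturing groups.
TOKEN_PATTERN = re.compile(r'''(?x)
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [\]\[.,;"'?():_`-]    # punctuation kept as separate tokens
''')

def tokenize(raw, encoding):
    """Decode the raw bytes first, then tokenize the resulting text."""
    return TOKEN_PATTERN.findall(raw.decode(encoding))

text = ('Família-Empresa S.A. dispõe de $12.400 milhões para concorrência. A '
        'âncora, desse negócio, é conhecida no coração do Órgão responsável. ')

# The bytes may arrive in either encoding; decoding with the matching
# codec gives the same tokens in both cases.
tokens_from_utf8 = tokenize(text.encode('utf-8'), 'utf-8')
tokens_from_latin1 = tokenize(text.encode('latin-1'), 'latin-1')

for token in tokens_from_utf8:
    print(token)
```

Decoding with the codec that matches the bytes is the key step; the pattern itself is unchanged in spirit.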

Thanks to @JustinBarber for the comments, which gave me some clues for solving this problem.

That's all, folks!

Source

2014-10-20 11:21:58 Marcelo
