
Spanish word tokenizer: I want to tokenize Spanish sentences into words. Is the following the correct way to do this, or is there a better approach?

import nltk 
from nltk.tokenize import word_tokenize 

def spanish_word_tokenize(s): 
    # Split a leading inverted question/exclamation mark off the token it is
    # attached to, since word_tokenize leaves it glued to the following word.
    for w in word_tokenize(s): 
        if w[0] in ("¿", "¡"): 
            yield w[0] 
            yield w[1:] 
        else: 
            yield w 

sentences = "¿Quién eres tú? ¡Hola! ¿Dónde estoy?" 

spanish_sentence_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle') 

for s in spanish_sentence_tokenizer.tokenize(sentences): 
    print(list(spanish_word_tokenize(s))) 

Looks fine to me, but the fact that you need to do this at all looks more like a bug on the nltk side; maybe you should report it to them – Copperfield


Please avoid cross-posting your question between SO and github: https://github.com/nltk/nltk/issues/1558 – alvas

Answer


Cf. NLTK github issue #1214; there are quite a few alternative tokenizers in NLTK =)

For example, using the NLTK port of @jonsafari's toktok tokenizer:

>>> import nltk 
>>> nltk.download('perluniprops') 
[nltk_data] Downloading package perluniprops to 
[nltk_data]  /Users/liling.tan/nltk_data... 
[nltk_data] Package perluniprops is already up-to-date! 
True 
>>> nltk.download('nonbreaking_prefixes') 
[nltk_data] Downloading package nonbreaking_prefixes to 
[nltk_data]  /Users/liling.tan/nltk_data... 
[nltk_data] Package nonbreaking_prefixes is already up-to-date! 
True 
>>> from nltk.tokenize.toktok import ToktokTokenizer 
>>> toktok = ToktokTokenizer() 
>>> sent = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?" 
>>> toktok.tokenize(sent) 
[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?', u'\xa1Hola', u'!', u'\xbf', u'D\xf3nde', u'estoy', u'?'] 
>>> print " ".join(toktok.tokenize(sent)) 
¿ Quién eres tú ? ¡Hola ! ¿ Dónde estoy ? 

>>> from nltk import sent_tokenize 
>>> sentences = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?" 
>>> [toktok.tokenize(sent) for sent in sent_tokenize(sentences, language='spanish')] 
[[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?'], [u'\xa1Hola', u'!'], [u'\xbf', u'D\xf3nde', u'estoy', u'?']] 

>>> print '\n'.join([' '.join(toktok.tokenize(sent)) for sent in sent_tokenize(sentences, language='spanish')]) 
¿ Quién eres tú ? 
¡Hola ! 
¿ Dónde estoy ? 

If you hack the code a little and add u'\xa1' at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/toktok.py#L51, you should be able to get:

[[u'\xbf', u'Qui\xe9n', u'eres', u't\xfa', u'?'], [u'\xa1', u'Hola', u'!'], [u'\xbf', u'D\xf3nde', u'estoy', u'?']]
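
If you would rather not patch toktok.py, a non-invasive alternative (my own sketch, not part of the original answer) is to post-process the toktok output in the same spirit as the spanish_word_tokenize generator from the question:

# Sketch only: split a leading ¡ or ¿ off each token after toktok has run,
# so the stock NLTK ToktokTokenizer can be used without modification.
# Assumes the NLTK data shown above has already been downloaded.
from nltk import sent_tokenize 
from nltk.tokenize.toktok import ToktokTokenizer 

toktok = ToktokTokenizer() 

def split_inverted_punct(tokens): 
    for token in tokens: 
        if len(token) > 1 and token[0] in (u"¡", u"¿"): 
            yield token[0] 
            yield token[1:] 
        else: 
            yield token 

text = u"¿Quién eres tú? ¡Hola! ¿Dónde estoy?" 
for sent in sent_tokenize(text, language='spanish'): 
    print(list(split_inverted_punct(toktok.tokenize(sent)))) 
# Expected output (Python 3 repr), given the tokenizer behaviour shown above:
# ['¿', 'Quién', 'eres', 'tú', '?']
# ['¡', 'Hola', '!']
# ['¿', 'Dónde', 'estoy', '?']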