I'm working through some of the exercises in the NLTK book on accessing text from the web and from disk (Chapter 3). When I call word_tokenize, I get an error.

Here is my code:

>>> import nltk 
>>> from urllib.request import urlopen 
>>> url = "http://www.gutenberg.org/files/2554/2554.txt" 
>>> raw = urlopen(url).read() 
>>> tokens = nltk.word_tokenize(raw) 

Here is the traceback:

Traceback (most recent call last): 
  File "<stdin>", line 1, in <module> 
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize 
    return [token for sent in sent_tokenize(text, language) 
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize 
    return tokenizer.tokenize(text) 
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize 
    return list(self.sentences_from_text(text, realign_boundaries)) 
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text 
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)] 
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize 
    return [(sl.start, sl.stop) for sl in slices] 
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in <listcomp> 
    return [(sl.start, sl.stop) for sl in slices] 
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries 
    for sl1, sl2 in _pair_iter(slices): 
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter 
    prev = next(it) 
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text 
    for match in self._lang_vars.period_context_re().finditer(text): 
TypeError: cannot use a string pattern on a bytes-like object 

Can someone explain to me what is going on here and why I can't seem to use word_tokenize correctly?

Many thanks!
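
What the traceback is complaining about can be checked directly: in Python 3, urlopen(...).read() returns bytes, while NLTK's Punkt tokenizer applies str regex patterns to the text it is given. A minimal check (a sketch, assuming Python 3):

>>> from urllib.request import urlopen 
>>> raw = urlopen("http://www.gutenberg.org/files/2554/2554.txt").read() 
>>> type(raw)  # bytes, but word_tokenize needs str 
<class 'bytes'> 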

Which version of the NLTK book are you reading? The [online version](http://www.nltk.org/book/ch03.html#electronic-books) applies '.decode("utf8")' to the result of 'read()' (it works, and is equivalent to the accepted answer). – alexis
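
The idiom that comment describes would look something like this (a sketch of the chained form, not a quote from the book):

>>> raw = urlopen(url).read().decode('utf8')  # decode applied directly to the bytes returned by read() 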

The print edition is badly outdated ;P and sometimes the online one is too: http://www.nltk.org/book =( – alvas

I'm reading the print version –

Answer

You have to convert the HTML (which is fetched as a bytes object) into a string using decode('utf-8'):

>>> import nltk 
>>> from urllib.request import urlopen 
>>> url = "http://www.gutenberg.org/files/2554/2554.txt" 
>>> raw = urlopen(url).read() 
>>> raw = raw.decode('utf-8') 
>>> tokens = nltk.word_tokenize(raw) 
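
A possible variation on the same fix (a sketch, not part of the original answer): rather than hard-coding 'utf-8', read the charset the server declares in the HTTP response headers and fall back to UTF-8 when none is given:

>>> import nltk 
>>> from urllib.request import urlopen 
>>> response = urlopen("http://www.gutenberg.org/files/2554/2554.txt") 
>>> charset = response.headers.get_content_charset() or 'utf-8'  # None when the server declares no charset 
>>> raw = response.read().decode(charset) 
>>> tokens = nltk.word_tokenize(raw) 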
Thanks, this solved my problem –

These are plain-text (UTF) files, not HTML. – alexis