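Error using nltk word_tokenize

I am working through some exercises from the NLTK book on accessing text from the web and from disk (chapter 3). When I call word_tokenize, I get an error.
This is my code: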
>>> import nltk
>>> from urllib.request import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> tokens = nltk.word_tokenize(raw)
This is the traceback:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Users\u0084411\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: cannot use a string pattern on a bytes-like object
Can someone explain what exactly is going on here, and why I can't seem to use word_tokenize properly?
Many thanks!
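Which version of the NLTK book are you reading? The [online version](http://www.nltk.org/book/ch03.html#electronic-books) applies `.decode('utf8')` to the result of `read()` (which works, and is equivalent to the accepted answer). – alexis
The print edition is seriously out of date ;P Sometimes the online one is too: http://www.nltk.org/book =( – alvas
I am reading the print version. –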
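For anyone landing here with the same traceback, the comments above point at the cause: in Python 3, `urlopen(url).read()` returns `bytes`, while the tokenizer's regular expressions are compiled from `str` patterns. Below is a minimal sketch of the fix, not the book's exact code. The helper name `fetch_text` and its `encoding` parameter are made up for illustration, and UTF-8 is an assumption taken from the online chapter. The second half reproduces the same `TypeError` offline with the standard `re` module, so it runs without NLTK or a network connection.

```python
import re
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download url and return its body as str.

    Hypothetical helper; the encoding default is an assumption.
    """
    with urlopen(url) as response:
        raw_bytes = response.read()       # bytes in Python 3
    return raw_bytes.decode(encoding)     # str, which str-based regexes accept


# Offline demonstration of the same failure, using only the standard library.
word_pattern = re.compile(r"\w+")         # a str pattern, as in the traceback
sample_bytes = b"Crime and Punishment, by Fyodor Dostoevsky"

try:
    word_pattern.findall(sample_bytes)    # str pattern applied to bytes
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Decoding first turns the bytes into str, and the same pattern then works.
sample_text = sample_bytes.decode("utf-8")
words = word_pattern.findall(sample_text)
```

With a helper like this, `raw = fetch_text(url)` followed by `nltk.word_tokenize(raw)` should get past the error, provided the file really is UTF-8 encoded; if decoding fails, the encoding argument is the first thing to check.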