The same Python source code on two different machines produces different behavior

Two machines running Ubuntu 14.04.1. The same source code is run on the same data. One works fine; the other throws a "codec can't decode byte 0xe2" error. Why is that? (And more importantly, how do I fix it?)
The offending code seems to be:
def tokenize(self):
    """Tokenizes text using NLTK's tokenizer, starting with sentence tokenizing"""
    tokenized = ''
    for sentence in sent_tokenize(self):
        tokenized += ' '.join(word_tokenize(sentence)) + '\n'
    return Text(tokenized)
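One thing worth flagging in passing: tokenized starts life as a byte string here, so if sent_tokenize ever hands back unicode, the += and join will themselves trigger implicit coercions. A unicode-throughout variant, as a sketch (assuming the rest of the Text class can cope with unicode):

def tokenize(self):
    """Tokenizes text using NLTK's tokenizer, starting with sentence tokenizing"""
    tokenized = u''  # unicode literals throughout, so no implicit ascii coercion
    for sentence in sent_tokenize(self):
        tokenized += u' '.join(word_tokenize(sentence)) + u'\n'
    return Text(tokenized)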
OK... I dropped into interactive mode and imported sent_tokenize from nltk.tokenize on both machines. The machine that works is perfectly happy with the following:
>>> fh = open('in/train/legal/legal1a_lm_7.txt')
>>> foo = fh.read()
>>> fh.close()
>>> sent_tokenize(foo)
On the problem machine, the UnicodeDecodeError gives the following traceback:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 355, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
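If I'm reading Python 2's coercion rules correctly, the failure in _tokenize_words looks like an implicit str-to-unicode conversion: combining a byte string that contains non-ASCII bytes with a unicode string makes Python decode the bytes with the default ascii codec. A minimal reproduction of that mechanism (my assumption about what punkt is doing, not verified against the source on both machines):

>>> 'caf\xc3\xa9'.split(u'\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

That would also explain how two machines can differ: if one machine's nltk splits on u'\n' where the other splits on '\n' (say, because the installed nltk versions differ), only the former forces the coercion.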
Breaking the input file down line by line (via split('\n')) and running each one through sent_tokenize gets us to the offending line:
If you have purchased these Services directly from Cisco Systems, Inc. (「Cisco」), this document is incorporated into your Master Services Agreement or equivalent services agreement (「MSA」) executed between you and Cisco.
which is actually:
>>> bar[5]
'If you have purchased these Services directly from Cisco Systems, Inc. (\xe2\x80\x9cCisco\xe2\x80\x9d), this document is incorporated into your Master Services Agreement or equivalent services agreement (\xe2\x80\x9cMSA\xe2\x80\x9d) executed between you and Cisco.'
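The \xe2\x80\x9c and \xe2\x80\x9d sequences are the UTF-8 encodings of the curly quotation marks U+201C and U+201D, so decoding the line as UTF-8 first should (assuming the file really is UTF-8) hand punkt clean unicode:

>>> '\xe2\x80\x9c'.decode('utf-8')
u'\u201c'
>>> sentence = bar[5].decode('utf-8')  # now a unicode object, not bytes
>>> sent_tokenize(sentence)            # expected to succeed on both machines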
Update: both machines show a UnicodeDecodeError for:
unicode(bar[5])
but only one machine shows an error for:
sent_tokenize(bar[5])
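The unicode() half of that is at least unsurprising: with no encoding argument, unicode() falls back to the default ascii codec, so it should fail identically on any machine once the data contains 0xe2. Decoding at the file boundary instead, as a sketch (assuming UTF-8 input; codecs.open makes read() return unicode):

import codecs
from nltk.tokenize import sent_tokenize

with codecs.open('in/train/legal/legal1a_lm_7.txt', encoding='utf-8') as fh:
    foo = fh.read()   # unicode, not bytes

sent_tokenize(foo)    # punkt never has to guess an encoding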
Please show us the code that raises the exception, along with the input data that triggers it and the full traceback. – 2014-12-03 17:22:30
You still need to include the traceback and sample data. – 2014-12-03 17:31:57
Code snippet edited in. The whole project is in Tk, so I'll try to pare the traceback down, but it may take some time. Having looked at that code, I wonder whether changing the strings to unicode (u'' & u'\n') might not be a bad idea... – dbl 2014-12-03 17:32:08