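UnicodeDecodeError, ascii processing Snowball stemming algorithm in python
I am having some trouble reading ordinary files into a program I am writing. The issue I am currently running into is that the pdf is based on some sort of mutated utf-8 that includes a BOM, which throws a wrench into my entire operation. Within my application I am using the Snowball stemming algorithm, which requires ASCII input. There are a number of topics about resolving utf-8 errors, but none of them involve sending the result into the Snowball algorithm, or consider that ASCII is what I want the end result to be. Currently the file I am using is a Notepad file saved with the standard ANSI encoding. The specific error message I get is this: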
  File "C:\Users\svictoroff\Desktop\Alleyoop\Python_Scripts\Keywords.py", line 38, in Map_Sentence_To_Keywords
    Word = Word.encode('ascii', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 0: ordinal not in range(128)
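My understanding is that, within Python, including the ignore argument simply skips over any non-ASCII characters it encounters, so that I would bypass any BOM or special characters, but clearly this is not the case. The actual code being called is here: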
def Map_Sentence_To_Keywords(Sentence, Keywords):
    '''Takes in a sentence and a list of Keywords, returns a tuple where the
    first element is the sentence, and the second element is a set of
    all keywords appearing in the sentence. Uses Snowball algorithm'''
    Equivalence = stem.SnowballStemmer('english')
    Found = []
    Sentence = re.sub(r'^(\W*?)(.*)(\n?)$', r'\2', Sentence)
    Words = Sentence.split()
    for Word in Words:
        Word = Word.lower().strip()
        Word = Word.encode('ascii', 'ignore')
        Word = Equivalence.stem(Word)
        Found.append(Word)
    return (Sentence, Found)
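By including the general non-greedy, non-word-character regex removal I also hoped that troublesome characters would be stripped from the front of the string, but again this is not the case. I have tried a number of other encodings besides ascii, and a strict base64 encoding works, but it is far from ideal for my application. Any ideas on how to fix this in an automated manner?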
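One reading of the traceback, offered as a sketch rather than a definitive fix: in Python 2, calling `.encode()` on a byte string first decodes it implicitly with the default `ascii` codec, and that hidden step is the one that fails on byte 0x96, before the `'ignore'` handler is ever consulted. Decoding explicitly first avoids this. The example below assumes the Notepad "ANSI" file is Windows-1252 (`cp1252`), where 0x96 is an en dash; `to_ascii` and its `source_encoding` parameter are hypothetical names, not part of the original code.

```python
def to_ascii(raw, source_encoding='cp1252'):
    """Hypothetical helper: decode bytes explicitly, then drop non-ASCII characters.

    source_encoding is an assumption about the input file, not a known fact.
    """
    text = raw.decode(source_encoding)          # bytes -> text, explicit codec
    ascii_bytes = text.encode('ascii', 'ignore')  # 'ignore' now applies as intended
    return ascii_bytes.decode('ascii')          # ASCII-only text


raw_word = b'\x96word'  # 0x96 is an en dash in cp1252

# This is the implicit step Python 2 performs inside str.encode(); it fails here.
try:
    raw_word.decode('ascii')
except UnicodeDecodeError as error:
    print(error)

# Decoding with an explicit codec first lets the non-ASCII character be dropped.
print(to_ascii(raw_word))
```

If the stemmer should receive a byte string rather than text, the final `.decode('ascii')` can be dropped.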
The initial decoding of Element fails, but the unicode error is only returned when it is actually passed to the encoder:
for Element in Curriculum_Elements:
    try:
        Element = Element.decode('utf-8-sig')
    except:
        print Element
    Curriculum_Tuples.append(Map_Sentence_To_Keywords(Element, Keywords))
def scraping(File):
    '''Takes in txt file of curriculum, removes all newlines and returns that occur \
    after a lowercase character, then splits at all remaining newlines'''
    Curriculum_Elements = []
    Document = open(File, 'rb').read()
    Document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Document)
    Curriculum_Elements = Document.split('\r\n')
    return Curriculum_Elements
The code above generates the curriculum elements shown.
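Because `scraping` opens the file with `'rb'`, every element it returns is a byte string, which is why each one later needs decoding. A possible alternative is to decode once at read time, sketched here. `read_elements` is a hypothetical name, and `cp1252` is an assumption about what Notepad's "ANSI" means on this machine; a file that really is UTF-8 with a BOM would call for `utf-8-sig` instead.

```python
import io
import re


def read_elements(path, encoding='cp1252'):
    """Hypothetical variant of scraping() that returns text instead of bytes."""
    # newline='' keeps '\r\n' intact so the original split still applies.
    with io.open(path, 'r', encoding=encoding, newline='') as handle:
        document = handle.read()
    # Same joining rule as scraping(): a newline after a letter or comma becomes a space.
    document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', document)
    return document.split('\r\n')
```

With the elements already decoded, only the final conversion to ASCII remains before stemming.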
for Element in Curriculum_Elements:
    try:
        Element = unicode(Element, 'utf-8-sig', 'ignore')
    except:
        print Element
This typecasting hackaround actually works, but converting back to ascii afterwards is a bit clunky. It produces this warning:
Warning (from warnings module):
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 19
    if input[:3] == codecs.BOM_UTF8:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
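Are you sure it should be non-greedy? Suppose you have "^A^AHello, World!". Then, because of the non-greediness, **none** of the "^A"s are matched in the first capture. They end up in the second capture and, therefore, in your replacement string.
Good point, but my problem really only occurs when the characters can't be recognized.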
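The point about greediness can be checked directly. The sketch below assumes a sample string that begins with two control-A characters (`\x01`) and compares the original lazy pattern with a greedy variant; the greedy variant is shown only for illustration and is not something the original code uses.

```python
import re

sample = '\x01\x01Hello, World!'  # assumed sample: two leading control-A characters

# Original pattern: the lazy \W*? matches as little as possible (nothing),
# so the control characters fall into the second group and survive.
lazy = re.sub(r'^(\W*?)(.*)(\n?)$', r'\2', sample)

# Greedy variant: \W* consumes the leading non-word characters,
# so only the remainder lands in the second group.
greedy = re.sub(r'^(\W*)(.*)(\n?)$', r'\2', sample)

print(repr(lazy))
print(repr(greedy))
```

Here the lazy pattern leaves the string unchanged, while the greedy one strips the leading control characters.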