How to find non-ASCII characters in a file using a regular expression in Python

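I have a string that contains characters from [a-z] as well as characters such as á, ü, ó, ñ, and so on. I am currently using a regular expression to get every line of a file that is made up of these characters.

Sample of spanishList.txt: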

adan 
celular 
tomás 
justo 
tom 
átomo 
camara 
rosa 
avion

Python code (charactersToSearch comes from the Flask route @application.route('/<charactersToSearch>')):

print (charactersToSearch) 
#'átdsmjfnueó' 
... 
#encode 
charactersToSearch = charactersToSearch.encode('utf-8') 
query = re.compile('[' + charactersToSearch + ']{2,}$', re.UNICODE).match 
words = set(word.rstrip('\n') for word in open('spanishList.txt') if query(word)) 
...

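When I do this, I expect to get the words in the text file that are made up of the characters in charactersToSearch. It works for words without special characters: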

... 
#after doing further searching for other conditions, return list of found words. 
return '<br />'.join(sorted(set(word for (word, path) in solve()))) 
>>> adan 
>>> justo 
>>> tom

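The only problem is that it ignores every word in the file that is not pure ASCII. I should also get tomás and átomo.

I have tried encoding, UTF-8, and using u'[...]', but I have not been able to make it work for all characters. The file and the program (# -*- coding: utf-8 -*-) are also in UTF-8.

Source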

2014-06-25 santybm

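Did you try `query = re.compile(u'[' + charactersToSearch + ']{2,}$', re.UNICODE).match` without encoding `charactersToSearch` as UTF-8, leaving it as unicode instead? –

To clarify, are you considering 'á' to be non-ASCII? It is 225 in the extended table. (But it can also be represented as 'a' + an acute accent.) – zx81

@JoranBeasley Yes. I have tried it both ways, but each time I get the list of words without any special characters. – santybm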
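
The remark in the comments above that 'á' can also be written as 'a' plus a combining acute accent can matter for matching, because the two forms look identical but compare unequal. A minimal sketch in Python 3, assuming the standard-library unicodedata module is used to normalize text to NFC before it reaches the regex (the variable names are illustrative and not part of the original code):

```python
import unicodedata

composed = "\u00e1"      # 'á' as a single code point (225)
decomposed = "a\u0301"   # 'a' followed by a combining acute accent

# The two strings render the same but are different sequences.
print(composed == decomposed)            # False

# Normalizing to NFC composes the pair into the single code point.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)            # True
print(len(decomposed), len(normalized))  # 2 1
```

If the word list and the search characters might come from different sources, normalizing both the same way before building the character class avoids this mismatch.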

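I was able to figure out the problem. After getting the string from the Flask application route, encode it (otherwise it gives an error), and then decode both charactersToSearch and each word from the file.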

charactersToSearch = charactersToSearch.encode('utf-8')

Then decode it as UTF-8. If you leave out the previous line, it gives you an error.

UNIOnlyAlphabet = charactersToSearch.decode('UTF-8') 
query = re.compile('[' + UNIOnlyAlphabet + ']{2,}$', re.U).match

Finally, read the UTF-8 file and, when using the query, do not forget to decode every word from the file.

words = set(word.decode('UTF-8').rstrip('\n') for word in open('spanishList.txt') if query(word.decode('UTF-8')))

That should do it. The results now include both regular and special characters.

justo 
tomás 
átomo 
adan 
tom
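
For comparison, the same approach can be sketched in Python 3, where strings are already Unicode and the file can be decoded on open. This is only an illustration under stated assumptions: the helper names build_query and matching_words are hypothetical, an in-memory io.StringIO stands in for open('spanishList.txt', encoding='utf-8'), and the search characters are chosen for the demo.

```python
import io
import re

def build_query(characters):
    # re.escape guards against characters such as ']' or '^' that are
    # special inside a character class.
    return re.compile("[" + re.escape(characters) + "]{2,}$").match

def matching_words(lines, characters):
    query = build_query(characters)
    # Strip whitespace first so '$' anchors at the real end of the word.
    stripped = (line.strip() for line in lines)
    return set(word for word in stripped if query(word))

# Stand-in for: open('spanishList.txt', encoding='utf-8')
sample = io.StringIO(
    "adan\ncelular\ntomás\njusto\ntom\nátomo\ncamara\nrosa\navion\n"
)
print(sorted(matching_words(sample, "átomsjund")))
# ['justo', 'tom', 'tomás', 'átomo']
```

No explicit encode or decode calls are needed here because the text is decoded once, when the file is read.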

Source

2014-06-25 20:45:34 santybm

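A different strategy

I am not sure how to fix it within your current workflow, so I will suggest a different route.

This regex matches any character that is neither a whitespace character nor a letter in the extended ASCII range, such as A or é. In other words, the regex matches if one of your words contains an odd character that is not part of this set.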

(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S

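Of course this will also match punctuation, but I am assuming we are only looking at words in an unprocessed list. Otherwise, excluding punctuation is not hard.

As I see it, your challenge is to define your set.

In Python, you could do something like this: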

import re

if re.search(r"(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S", subject):
    pass  # Successful match
else:
    pass  # Match attempt failed
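
To see what the pattern above flags, here is a small sketch in Python 3 that runs it over a few words. The helper name has_odd_character and the test words are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# The pattern from the answer above, unchanged.
ODD_CHAR = re.compile(r"(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S")

def has_odd_character(word):
    # True if the word contains a non-space character outside the letter set.
    return ODD_CHAR.search(word) is not None

for word in ["tomás", "átomo", "avion", "cami$a", "tom2"]:
    print(word, has_odd_character(word))
```

Accented Spanish words pass through unflagged, while words containing symbols or digits are reported.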

Source

2014-06-25 05:50:00 zx81

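I feel your pain. Handling Unicode in Python 2.x is a headache.

The problem with this input is that Python sees "á" as the raw byte string '\xc3\xa1' rather than as the unicode character u'\xe1', so you need to clean up the input before passing the string into your regex.

To change a raw byte string into a unicode string: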

char = "á" 
## print char yields the infamous, and in python unparsable "\xc3\xa1". 
## which is probably what the regex is not registering. 
bytes_in_string = [byte for byte in char] 
string = ''.join([str(hex(ord(byte))).strip('0x') for byte in bytes_in_string]) 
new_unicode_string = unichr(int(string),16))

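There may be a better way, since this is a lot of work just to get something ready for a regex, although I think the regex should still be faster than iterating and 'if/else'ing. That said, I am not an expert.

I used something similar to isolate words with special characters when I was parsing Wiktionary, which is an evil mess. As far as I can tell, you will have to comb through the list to clean it up anyway, so you might as well just do: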
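
One more direct route is to decode the bytes as UTF-8 instead of joining their hex digits by hand. A minimal sketch in Python 3 syntax (in Python 2 the equivalent would be calling .decode('utf-8') on the byte string); the variable names are illustrative:

```python
raw = b"\xc3\xa1"            # the two UTF-8 bytes that encode 'á'

text = raw.decode("utf-8")   # one Unicode character
print(len(raw), len(text))   # 2 1
print(hex(ord(text)))        # 0xe1

# Encoding goes the other way and reproduces the original bytes.
print(text.encode("utf-8") == raw)  # True
```

The decoded character is code point 0xe1 (225), which is the value a Unicode-aware regex compares against.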

for word in file:
    try:
        word.encode('UTF-8')
    except UnicodeDecodeError:
        your_list_of_special_char_words.append(word)
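
The loop above relies on Python 2 behavior, where encoding a byte string triggers an implicit ASCII decode. In Python 3, where words read from a text file are already Unicode, the same idea can be sketched by attempting an ASCII encode instead. The function name non_ascii_words is hypothetical:

```python
def non_ascii_words(words):
    special = []
    for word in words:
        try:
            word.encode("ascii")
        except UnicodeEncodeError:
            # At least one character falls outside the ASCII range.
            special.append(word)
    return special

print(non_ascii_words(["adan", "tomás", "justo", "átomo"]))
# ['tomás', 'átomo']
```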

Hope this helps, and good luck!

On further research, I found this post:

Bytes in a unicode Python string

Source

2014-06-25 10:19:13

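So when I try to change from a raw byte string to unicode, I get an error. Given the input text 'áaceimsonñpórxül', 'bytes_in_string' gives me: `['\xc3', '\xa1', 'a', 'c', 'e', 'i', 'm', 's', 'x', '\xb3', 'p', '\xc3', '\xb3', 'r', 'x', '\xc3', '\xbc', 'l']` and then string prints 'c3a1616365696d736f6ec3b17c3b37278c3bc6c'. Now I can see that, for example, á is made up of \xc3 and \xa1. When I run 'new_unicode_string', the error I get is: `ValueError: invalid literal for int() with base 10: 'c3a1616365696d736f6ec3b17c3b37278c3bc6c'` ... because it is not made up of digits only. Any suggestions? – santybm

I was able to solve the problem: – santybm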
