Extracting non-English words in Python

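I have a text file that contains English characters as well as characters from other languages. Using the code below, I want to extract the words in this file that are not English, specifically Korean (Unicode range AC00 to D7AF; the file is encoded in UTF-8).

Is there a simple way to extract just those words in the code?

Or do I need to do something else?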

.... 
text = f.read() 
words = re.findall(r'\w+', text)
f.close() 
....

Source

2014-04-01 user3473222

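Use the uppercase \W, which matches any non-alphanumeric character. The underscore is not matched by \W, because it counts as a word character.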

>>> re.findall('[\W]+', u"# @, --►(Q1)-grijesh--b----►((Qf)), "); 
[u'# @, --\u25ba(', u')-', u'--', u'----\u25ba((', u')), ']

From the Unicode HOWTO: to read a Unicode-encoded text file, use:

import codecs 
f = codecs.open('unicode.rst', encoding='utf-8') 
for l in f: 
    # regex code here

I have a file:

:~$ cat file 
# @, --►(Q1)-grijesh--b----►((Qf)),

Reading it in Python:

>>> import re 
>>> import codecs 
>>> f = codecs.open('file', encoding='utf-8') 
>>> for l in f: 
...     print re.findall('[\W]+', l)
... 
[u'# @, --\u25ba(', u')-', u'--', u'----\u25ba((', u')),\n'] 
>>>

To read the alphanumeric words instead, try:

>>> f = codecs.open('file', encoding='utf-8') 
>>> for l in f: 
...     print re.findall('[^\W]+', l)
... 
[u'Q1', u'grijesh', u'b', u'Qf']

Note: the lowercase \w matches alphanumeric characters, including _.
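
For readers on Python 3, the same contrast between \W and \w can be sketched as follows. This is only an illustration, assuming Python 3, where str patterns are Unicode-aware by default and the print statement becomes a function; the variable names are made up for the example.

```python
import re

# Same sample text as in the answer above.
sample = "# @, --►(Q1)-grijesh--b----►((Qf)), "

# \W+ matches runs of non-word characters (punctuation, spaces, symbols).
non_word_runs = re.findall(r"\W+", sample)

# \w+ matches runs of word characters (letters, digits, underscore).
word_runs = re.findall(r"\w+", sample)

print(non_word_runs)
print(word_runs)  # ['Q1', 'grijesh', 'b', 'Qf']
```

Note that in Python 3 \w also matches non-English letters, so on its own it does not separate English words from Korean ones.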

Source

2014-04-01 15:30:57

Thank you very much *_* – user3473222

To find all runs of characters in the range AC00 to D7AF:

import re 

L = re.findall(u'[\uac00-\ud7af]+', data.decode('utf-8'))
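
A rough Python 3 equivalent of the range match is sketched below. It assumes the text is already a str (for example, read with open(path, encoding='utf-8')), so no decode call is needed. The helper name extract_hangul and the sample sentence are hypothetical, chosen only for illustration.

```python
import re

# U+AC00 to U+D7AF is the range named in the question (Hangul syllables).
HANGUL_RUN = re.compile("[\uac00-\ud7af]+")


def extract_hangul(text):
    """Return every run of consecutive characters in the AC00-D7AF range."""
    return HANGUL_RUN.findall(text)


# Hypothetical mixed-language input.
sample_text = "Hello 안녕하세요 world 파이썬 2014"
print(extract_hangul(sample_text))  # ['안녕하세요', '파이썬']
```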

To find all words that contain non-ASCII characters:

import re 

def isascii(word): 
    return all(ord(c) < 128 for c in word) 

words = re.findall(u'\w+', data.decode('utf-8')) 
non_ascii_words = [w for w in words if not isascii(w)]
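
The same filter could look like this in Python 3, again as a sketch under the assumption that the input is already decoded text. The function names and the sample sentence are illustrative only.

```python
import re


def is_ascii(word):
    """True when every character in the word is in the ASCII range."""
    return all(ord(ch) < 128 for ch in word)


def find_non_ascii_words(text):
    """Split text into \\w+ runs and keep those with a non-ASCII character."""
    return [w for w in re.findall(r"\w+", text) if not is_ascii(w)]


# Hypothetical input mixing English, Korean, and German.
print(find_non_ascii_words("Hello 안녕하세요 Straße world 파이썬"))
# ['안녕하세요', 'Straße', '파이썬']
```

This keeps any word with at least one non-ASCII character, so it is broader than the Korean-only range match: German or Cyrillic words pass the filter as well.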

Source

2014-04-01 18:32:29 jfs
