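How to handle differences in English spelling between countries in the NLTK tagger in Python

I am using the NLTK tagger with Python 2.7 to tag simple English text and extract each word's frequency and named-entity category. The following program is used for that purpose: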

    import re
    from collections import Counter

    from nltk.tag.stanford import NERTagger
    from nltk.corpus import stopwords

    stops = set(stopwords.words("english"))
    WORD = re.compile(r'\w+')


    def main():
        text = "title Optimal Play against Best Defence: Complexity and Heuristics"
        print text
        # Split the text into word tokens and count how often each one occurs.
        words = WORD.findall(text)
        print words
        word_frqc = Counter(words)
        tagger = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz",
                           "stanford-ner.jar")
        terms = []
        # Tag every token with its named-entity class.
        answer = tagger.tag(words)
        print answer
        for i, word_pos in enumerate(answer):
            word, pos = word_pos
            # Map the named-entity class to a numeric category ID.
            if pos == 'PERSON':
                cat_Id = 1
            elif pos == 'ORGANIZATION':
                cat_Id = 2
            elif pos == 'LOCATION':
                cat_Id = 3
            else:
                cat_Id = 4
            # Returns None when the tagger's token is not a key in word_frqc.
            frqc = word_frqc.get(word)
            terms.append((i, word, cat_Id, frqc))
        print terms


    if __name__ == '__main__':
        main()

The output of the program is as follows:

    text = "title Optimal Play against Best Defence: Complexity and Heuristics"
    [(u'title', u'O'), (u'Optimal', u'O'), (u'Play', u'O'), (u'against', u'O'),
    (u'Best', u'O'), (u'Defense', u'O'), (u'Complexity', u'O'), (u'and', u'O'),
    (u'Heuristics', u'O')]
    [(0, u'title', 4, 1), (1, u'Optimal', 4, 1), (2, u'Play', 4, 1), (3,
    u'against', 4, 1), (4, u'Best', 4, 1), (5, u'Defense', 4, None), (6,
    u'Complexity', 4, 1), (7, u'and', 4, 1), (8, u'Heuristics', 4, 1)]

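There is one problem, and it is caused by the tagger.tag() method. The method changes the word 'Defence' in the original text to 'Defense'. As a result, the program cannot find 'Defense' in word_frqc, so the frequency of that word is set to None.

Is there a way (in Python) to make the method leave the word unchanged?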
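
One possible workaround, sketched here rather than confirmed against the tagger itself, is to ignore the token text that tagger.tag() returns and pair each tag with the original word at the same position. This assumes the tagger returns exactly one (token, tag) pair per input token, in the same order, which is what the output in the question suggests. The helper name align_tags is hypothetical, and the tagged list is a hard-coded stand-in for tagger.tag(words) so that the example runs without Stanford NER installed.

```python
from collections import Counter


def align_tags(original_words, tagged):
    """Pair each original token with the tag found at the same position.

    Assumption: the tagger yields one (token, tag) pair per input token,
    in input order. If the lengths differ, that assumption is broken.
    """
    if len(original_words) != len(tagged):
        raise ValueError("tagger changed the number of tokens")
    return [(word, tag) for word, (_, tag) in zip(original_words, tagged)]


words = ["title", "Optimal", "Play", "against", "Best", "Defence",
         "Complexity", "and", "Heuristics"]

# Stand-in for tagger.tag(words), copied from the output in the question.
tagged = [("title", "O"), ("Optimal", "O"), ("Play", "O"), ("against", "O"),
          ("Best", "O"), ("Defense", "O"), ("Complexity", "O"), ("and", "O"),
          ("Heuristics", "O")]

word_frqc = Counter(words)
aligned = align_tags(words, tagged)

# Look up the frequency with the original spelling, not the tagger's.
terms = [(i, word, tag, word_frqc[word])
         for i, (word, tag) in enumerate(aligned)]
```

With this pairing, the entry at index 5 keeps the spelling 'Defence' and a frequency of 1. The length check is there because the approach silently produces wrong pairs if the tagger ever splits or merges tokens.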