2015-02-06 17 views

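How to handle differences in English spelling between countries in the NLTK tagger in Python

I am using the NLTK tagger with Python 2.7 to tag simple English text, in order to extract each word together with its frequency and its named-entity category. The following program is used for that purpose: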

import re
from collections import Counter
from nltk.tag.stanford import NERTagger
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))

WORD = re.compile(r'\w+')

def main():
    text = "title Optimal Play against Best Defence: Complexity and Heuristics"
    print text
    words = WORD.findall(text)
    print words
    # Frequencies are keyed by the spelling found in the original text.
    word_frqc = Counter(words)

    tagger = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz",
                       "stanford-ner.jar")
    terms = []
    answer = tagger.tag(words)
    print answer
    for i, word_pos in enumerate(answer):
        word, pos = word_pos
        if pos == 'PERSON':
            cat_Id = 1
        elif pos == 'ORGANIZATION':
            cat_Id = 2
        elif pos == 'LOCATION':
            cat_Id = 3
        else:
            cat_Id = 4
        # Returns None when the tagger has changed the spelling of the word.
        frqc = word_frqc.get(word)
        terms.append((i, word, cat_Id, frqc))
    print terms

if __name__ == '__main__':
    main()

The output of the program is as follows:

text = "title Optimal Play against Best **Defence:** Complexity and Heuristics"

[(u'title', u'O'), (u'Optimal', u'O'), (u'Play', u'O'), (u'against', u'O'),  
(u'Best', u'O'), (u'Defense', u'O'), (u'Complexity', u'O'), (u'and', u'O'), 
(u'Heuristics', u'O')] 

[(0, u'title', 4, 1), (1, u'Optimal', 4, 1), (2, u'Play', 4, 1), (3, 
    u'against', 4, 1), (4, u'Best', 4, 1), (5, u'**Defense**', 4, None), (6, 
    u'Complexity', 4, 1), (7, u'and', 4, 1), (8, u'Heuristics', 4, 1)] 

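There is one problem, and it is caused by the tagger.tag() method. The method changes the word 'Defence' in the original text to 'Defense'. As a result, the program cannot find 'Defense' in word_frqc, so word_frqc.get() returns None and the frequency of that word is set to None.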

Is there a way (in Python) to stop the method from changing the word?

Answers
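One possible workaround, sketched here rather than confirmed against the tagger itself: the output shown in the question has one (word, tag) pair per input token, in the same order, so the tags can be paired with the original tokens by position instead of looking up the tagger's rewritten spelling. The rewrite looks like the Stanford tokenizer normalizing British spellings (its PTBTokenizer documents an `americanize` option), but how to disable that through NLTK is not covered here. The helper name `build_terms` and the `CATEGORY_IDS` table are hypothetical, and the `tagged` list below is a hand-written stand-in for `tagger.tag(words)` so the example runs without the Stanford jar.

```python
from collections import Counter

# Hypothetical lookup table: the numeric category ids used in the question.
CATEGORY_IDS = {'PERSON': 1, 'ORGANIZATION': 2, 'LOCATION': 3}


def build_terms(words, tagged):
    """Pair each original token with the tag found at the same position.

    Assumes the tagger returns exactly one (word, tag) pair per input
    token, in input order. If that does not hold, the lengths differ and
    we fail loudly instead of silently misaligning tags.
    """
    if len(words) != len(tagged):
        raise ValueError("tagger output is not aligned with the input tokens")
    word_frqc = Counter(words)
    terms = []
    for i, (word, (_, label)) in enumerate(zip(words, tagged)):
        # Use the original spelling for both the output and the count;
        # the tagger's (possibly rewritten) word is ignored.
        cat_id = CATEGORY_IDS.get(label, 4)
        terms.append((i, word, cat_id, word_frqc[word]))
    return terms


words = ['Best', 'Defence', 'Complexity']
# Stand-in for tagger.tag(words), with the spelling change from the question.
tagged = [(u'Best', u'O'), (u'Defense', u'O'), (u'Complexity', u'O')]
terms = build_terms(words, tagged)
```

With this approach `terms` keeps 'Defence' as written in the text and its frequency is 1 rather than None. The length check is a cheap guard: if the tagger ever splits or merges tokens, positional pairing is no longer valid and a different alignment strategy would be needed.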
