How to handle differences in English spelling across countries in the NLTK tagger in Python

I am using the NLTK tagger with Python 2.7 to tag simple English text, in order to extract each word, its named-entity category, and its frequency. The following program is used for that purpose:

import re
from collections import Counter
from nltk.tag.stanford import NERTagger
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))

# Matches runs of word characters, so punctuation such as ':' is dropped.
WORD = re.compile(r'\w+')

def main():
    text = "title Optimal Play against Best Defence: Complexity and Heuristics"
    print text
    words = WORD.findall(text)
    print words
    # Frequencies are keyed by the words exactly as they appear in the text.
    word_frqc = Counter(words)

    tagger = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz",
                       "stanford-ner.jar")
    terms = []
    answer = tagger.tag(words)
    print answer
    for i, word_pos in enumerate(answer):
        word, pos = word_pos
        if pos == 'PERSON':
            cat_Id = 1
        elif pos == 'ORGANIZATION':
            cat_Id = 2
        elif pos == 'LOCATION':
            cat_Id = 3
        else:
            cat_Id = 4
        # Returns None when the tagger's token is not a key in word_frqc.
        frqc = word_frqc.get(word)
        terms.append((i, word, cat_Id, frqc))
    print terms

if __name__ == '__main__':
    main()

The output of the program is as follows:

text = "title Optimal Play against Best Defence: Complexity and Heuristics"

[(u'title', u'O'), (u'Optimal', u'O'), (u'Play', u'O'), (u'against', u'O'),
(u'Best', u'O'), (u'Defense', u'O'), (u'Complexity', u'O'), (u'and', u'O'),
(u'Heuristics', u'O')]

[(0, u'title', 4, 1), (1, u'Optimal', 4, 1), (2, u'Play', 4, 1),
(3, u'against', 4, 1), (4, u'Best', 4, 1), (5, u'Defense', 4, None),
(6, u'Complexity', 4, 1), (7, u'and', 4, 1), (8, u'Heuristics', 4, 1)]

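There is a problem, and it is caused by the tagger.tag() method. The method changes the word 'Defence' in the original text to 'Defense'. As a result, the program cannot find 'Defense' in word_frqc, so it sets the frequency of that word to None.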
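One possible workaround for the failed lookup, which leaves the tagger's behaviour untouched, is to pair each tag with the original word by position instead of using the token the tagger returns. The sketch below is only an illustration: build_terms and CATEGORY_IDS are hypothetical names, the tagger output is copied from the listing above rather than produced by a live call, and it assumes the tagger returns exactly one (token, label) pair per input word in the same order, which may not hold for every input.

```python
from collections import Counter

# Hypothetical mapping from NER label to the category ids used in the question.
CATEGORY_IDS = {'PERSON': 1, 'ORGANIZATION': 2, 'LOCATION': 3}

def build_terms(words, tagged):
    """Pair each original word with the label returned for the same position.

    Assumption: `tagged` holds one (token, label) pair per input word,
    in input order. The tagger's own token is ignored on purpose.
    """
    if len(words) != len(tagged):
        raise ValueError("tagger output is not aligned with the input words")
    word_frqc = Counter(words)
    terms = []
    for i, (word, (_token, label)) in enumerate(zip(words, tagged)):
        cat_id = CATEGORY_IDS.get(label, 4)  # 4 is the "other" category
        terms.append((i, word, cat_id, word_frqc[word]))
    return terms

# Input words and tagger output as shown in the question;
# the tagger returned 'Defense' where the text has 'Defence'.
words = ['title', 'Optimal', 'Play', 'against', 'Best', 'Defence',
         'Complexity', 'and', 'Heuristics']
tagged = [('title', 'O'), ('Optimal', 'O'), ('Play', 'O'), ('against', 'O'),
          ('Best', 'O'), ('Defense', 'O'), ('Complexity', 'O'), ('and', 'O'),
          ('Heuristics', 'O')]

terms = build_terms(words, tagged)
print(terms[5])
```

Because the frequency is looked up with the original spelling, the entry for position 5 keeps 'Defence' and gets a count of 1 rather than None. The length check is there so that a tagger that splits or merges tokens fails loudly instead of silently misaligning labels.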

Is there any way (in Python) to keep the method from changing the word?

Source

2015-02-06 user3422243

I ran into the same problem.

Try installing 地理 with

pip install 地理

Check the GitHub repo here.

Source

2016-01-22 03:53:49 llazzaro
