How to provide (or generate) tags for nltk lemmatizers

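I have a set of documents, and I would like to transform them into a form that lets me compute tf-idf for the words in those documents (so that each document is represented by a vector of tf-idf numbers).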

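I thought it would be enough to call WordNetLemmatizer.lemmatize(word) and then PorterStemmer, but 'have', 'has', 'had', etc. are not all reduced to 'have' by the lemmatizer, and the same goes for other words. Then I read that I am supposed to give the lemmatizer a hint: a tag representing the type of word, whether it is a noun, verb, adjective, etc.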

My question is: how do I get these tags? What should I run on those documents to obtain them?

I'm using Python 3.4, and I'm lemmatizing and stemming one word at a time. I tried WordNetLemmatizer and EnglishStemmer from nltk, and also stem() from stemming.porter2.


2016-11-12 Zbyszek M.

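OK, I googled some more and found out how to get these tags. First, some preprocessing is needed to make sure the file can be tokenized (in my case it was about removing some leftover stuff after conversion from pdf to txt).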
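The answer does not say what exactly had to be removed, so the following is only a sketch of what such a cleanup might look like. It assumes three typical kinds of PDF-to-text residue (form feeds at page breaks, lines holding only a page number, and words hyphenated across a line break); the function name `clean_pdf_text` is hypothetical, and your own files may need different rules.

```python
import re


def clean_pdf_text(raw):
    """Remove a few common kinds of PDF-to-text residue (illustrative only)."""
    # Assumption: the converter marks page breaks with form feeds.
    text = raw.replace('\f', '\n')
    # Assumption: a line containing nothing but digits is a page number.
    lines = [ln for ln in text.split('\n')
             if not re.fullmatch(r'\s*\d+\s*', ln)]
    text = '\n'.join(lines)
    # Rejoin words hyphenated across a line break: "lemma-\ntizer" -> "lemmatizer".
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r'[ \t]+', ' ', text)
    return text.strip()
```

Note that the de-hyphenation rule also joins genuine compounds that happen to break at a line end (for example "tf-\nidf" becomes "tfidf"), so it is worth checking the output on a sample before running it on the whole collection.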

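Then the file has to be tokenized into sentences, and each sentence into a list of words, which can then be tagged by the nltk tagger. With the tags available, lemmatization can be done, and stemming can be applied on top of it.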

from nltk.corpus import wordnet
from nltk.tokenize import sent_tokenize, word_tokenize
# use sent_tokenize to split text into sentences, and word_tokenize
# to split sentences into words
from nltk.tag import pos_tag
# use this to generate a list of (word, tag) tuples;
# each tag can then be translated into a wordnet tag as in
# [this response][1].
from nltk.stem.wordnet import WordNetLemmatizer
from stemming.porter2 import stem

# code from the response mentioned above
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''


# myInput: path to the preprocessed text file (set it before running)
with open(myInput, 'r') as f:
    data = f.read()

sentences = sent_tokenize(data)
ignoreTypes = ['TO', 'CD', '.', 'LS', '']  # my choice
lmtzr = WordNetLemmatizer()
for sent in sentences:
    words = word_tokenize(sent)
    tags = pos_tag(words)
    for (word, treebank_tag) in tags:
        if treebank_tag in ignoreTypes:
            continue
        tag = get_wordnet_pos(treebank_tag)
        if tag == '':
            # no wordnet equivalent for this tag, skip the word
            continue
        lema = lmtzr.lemmatize(word, tag)
        stemW = stem(lema)
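
The loop above drops every word whose Treebank tag has no WordNet counterpart. A possible variation, shown here only as a sketch, is to fall back to noun instead, which is the default `pos` of `WordNetLemmatizer.lemmatize`. The helper names `to_wordnet_pos` and `lemmatize_tagged` are hypothetical; the letters `'a'`, `'v'`, `'n'`, `'r'` are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`. The lemmatizer is passed in as a callable so the mapping can be tried without the WordNet data installed.

```python
def to_wordnet_pos(treebank_tag, default='n'):
    """Map a Penn Treebank tag to a WordNet POS letter.

    Tags with no WordNet counterpart get `default` (noun) instead of
    being skipped; that fallback is a design choice of this sketch.
    """
    prefix_to_pos = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    return prefix_to_pos.get(treebank_tag[:1], default)


def lemmatize_tagged(tagged_words, lemmatize):
    """Apply a lemmatize(word, pos) callable to (word, treebank_tag) pairs."""
    return [lemmatize(word, to_wordnet_pos(tag)) for word, tag in tagged_words]
```

With NLTK and its data available, this could be called as `lemmatize_tagged(pos_tag(word_tokenize(sent)), WordNetLemmatizer().lemmatize)`. Whether keeping untagged words helps or hurts the tf-idf vectors depends on the corpus, so the skip-versus-fallback choice is worth testing on real documents.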

At this point I get the stemmed word stemW, which I can then write to a file and use to compute a tf-idf vector for each document.
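
The tf-idf step itself is not shown in the answer. As a minimal sketch, assuming each document has already been reduced to a list of stemmed words, the vectors could be computed as below. The function name `tfidf_vectors` is hypothetical, and the weighting used here (term frequency divided by document length, times log(N / df)) is one plain variant among several; libraries often apply smoothing or normalization on top.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists (e.g. stemmed words).

    Returns (vocab, vectors), where vectors[i][j] is the tf-idf weight
    of vocab[j] in docs[i].
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Unsmoothed idf; a term present in every document gets weight 0.
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors
```

For example, `tfidf_vectors([['have', 'cat', 'have'], ['have', 'dog']])` yields the vocabulary `['cat', 'dog', 'have']`, and 'have' gets weight 0 in both documents because it occurs in every one of them.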


2016-11-13 22:05:00
