Python and NLTK: Baseline Tagger

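I'm writing code for a baseline tagger. Based on the Brown corpus, it assigns the most common tag to each word. So if the word "works" is tagged as a verb 23 times and as a plural noun 30 times, then in the user's input sentence it would be tagged as a plural noun. If the word is not found in the corpus, it is tagged as a noun by default. The code I have so far returns every tag for the word, not just the most frequent one. How can I make it return only the most frequent tag for each word?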

import nltk 
from nltk.corpus import brown 

def findtags(userinput, tagged_text):
    uinput = userinput.split()
    fdist = nltk.FreqDist(tagged_text)
    result = []
    for item in fdist.items():
        for u in uinput:
            if u==item[0][0]:
                t = (u,item[0][1])
                result.append(t)
        continue
        t = (u, "NN")
        result.append(t)
    return result

def main():
    tags = findtags("the quick brown fox", brown.tagged_words())
    print(tags)

if __name__ == '__main__': 
    main()

Source

2014-01-08 Helena

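Wahaha, I'm going to start asking for payment if I answer all your NLTK questions. Lol, just kidding, give me a minute to type. – alvas

Sorry, went for lunch. Below is the `most_frequent_pos_tagger()` you need. – alvas

Create a dictionary named word_tags whose keys are the (untagged) words and whose values are lists of that word's tags in descending order of frequency (based on your fdist).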

Then:

for u in uinput: 
    result.append(word_tags[u][0])
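
One way such a dictionary could be built is sketched below. It assumes a small hand-made list of (word, tag) pairs standing in for brown.tagged_words(); the helper names build_word_tags and tag_sentence are illustrative, and the "NN" fallback for unseen words is taken from the question.

```python
from collections import Counter, defaultdict

def build_word_tags(tagged_words):
    """Map each word to a list of its tags, most frequent first."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: [tag for tag, _ in c.most_common()]
            for word, c in counts.items()}

def tag_sentence(sentence, word_tags, default="NN"):
    """Pair each whitespace-separated word with its most frequent tag."""
    return [(w, word_tags[w][0] if w in word_tags else default)
            for w in sentence.split()]

# Hypothetical toy data standing in for brown.tagged_words().
toy = [("works", "VBZ"), ("works", "NNS"), ("works", "NNS"), ("the", "AT")]
word_tags = build_word_tags(toy)
result = tag_sentence("the works unicorn", word_tags)
print(result)
```

Checking membership before indexing avoids a KeyError for words that never occur in the tagged data.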

Source

2014-01-08 10:45:00 cyborg

If it's English, NLTK has a default POS tagger that a lot of people complain about, but it's a nice quick fix (more of a band-aid than a paracetamol); see POS tagging - NLTK thinks noun is adjective:

>>> from nltk.tag import pos_tag 
>>> from nltk.tokenize import word_tokenize 
>>> sent = "the quick brown fox" 
>>> pos_tag(word_tokenize(sent)) 
[('the', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]

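If you want to train a baseline tagger from scratch, I suggest you follow an example like this one, but change the corpus to an English one: https://github.com/alvations/spaghetti-tagger

By building a UnigramTagger as in spaghetti-tagger, you should automatically get the most common tag for each word.

However, if you want to do it the non-machine-learning way, you first need to count word:POS pairs; what you need is some sort of type-token ratio. See also Part-of-speech tag without context using nltk: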
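
As a rough sketch of that idea, a UnigramTagger can be trained on tagged sentences and given a DefaultTagger as backoff so that unseen words become nouns. The toy training sentences below are made up for illustration; with the Brown corpus downloaded, brown.tagged_sents() could be passed instead.

```python
import nltk

# Hypothetical toy training data: a list of sentences,
# each a list of (word, tag) pairs.
train_sents = [
    [("the", "AT"), ("works", "NNS"), ("are", "BER"), ("old", "JJ")],
    [("it", "PPS"), ("works", "VBZ")],
    [("the", "AT"), ("works", "NNS")],
]

# Words never seen in training fall back to the default tag "NN".
tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger("NN"))

tagged = tagger.tag("the works unicorn".split())
print(tagged)
```

Here "works" is seen twice as NNS and once as VBZ, so the unigram model picks NNS, while "unicorn" is handled by the backoff tagger.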

from nltk.tag import pos_tag 
from nltk.tokenize import word_tokenize 
from collections import Counter, defaultdict 
from itertools import chain 

def type_token_ratio(documentstream):
    # Collect every tag observed for each token across all sentences.
    ttr = defaultdict(list)
    for token, pos in chain.from_iterable(documentstream):
        ttr[token].append(pos)
    return ttr

def most_freq_tag(ttr, word):
    # Return the tag seen most often for the word.
    return Counter(ttr[word]).most_common()[0][0]

sent1 = "the quick brown fox quick me with a quick ."
sent2 = "the brown quick fox fox me with a brown ."
documents = [sent1, sent2]

# Calculates the TTR.
documents_ttr = type_token_ratio([pos_tag(word_tokenize(i)) for i in documents])

# Best tag for the word.
print(Counter(documents_ttr['quick']).most_common()[0])

# Best tags for a sentence
print([most_freq_tag(documents_ttr, i) for i in sent1.split()])

Note: a document stream can be defined as a list of sentences, where each sentence contains a list of tokens with or without tags.

Source

2014-01-08 10:45:55 alvas

You can simply use a Counter to find the most repeated item in a list:

from collections import Counter 
default_tag = Counter(tags).most_common(1)[0][0]
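
For a concrete illustration, here is the same call on a made-up list of tags (the list itself is hypothetical; in practice it would hold the tags observed for one word):

```python
from collections import Counter

# Hypothetical list of tags observed for a single word.
tags = ["NNS", "VBZ", "NNS", "NNS", "VBZ"]

# most_common(1) returns a one-element list of (item, count) pairs.
top = Counter(tags).most_common(1)
default_tag = top[0][0]
print(top, default_tag)
```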

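If your question is "how does a unigram tagger work?", you might be interested in reading more of the NLTK source code: http://nltk.org/_modules/nltk/tag/sequential.html#UnigramTagger

Anyway, I suggest you read chapter 5 of the NLTK book, specifically: http://nltk.org/book/ch05.html#the-lookup-tagger

As in the book's example, you can build a conditional frequency distribution that returns the best tag for any given word.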

cfd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())

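In this case cfd["fox"].max() will return the most likely tag for "fox" according to the Brown corpus. You can then build a dictionary of the most likely tag for each word of your sentence: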

likely_tags = dict((word, cfd[word].max()) for word in "the quick brown fox".split())

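Note that for words in your sentence that are not in the corpus, this will raise an error. But if you understand the idea, you can make your own tagger.

Source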
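
A minimal sketch of guarding against unseen words is shown below. It assumes a few hand-made (word, tag) pairs in place of nltk.corpus.brown.tagged_words(); the helper name likely_tag and the "NN" default (from the question) are illustrative.

```python
import nltk

# Hypothetical toy pairs standing in for nltk.corpus.brown.tagged_words().
pairs = [("the", "AT"), ("fox", "NN"),
         ("quick", "JJ"), ("quick", "JJ"), ("quick", "RB")]
cfd = nltk.ConditionalFreqDist(pairs)

def likely_tag(word, default="NN"):
    """Return the most frequent tag for word, or default if it was never seen."""
    dist = cfd[word]  # an empty FreqDist for unseen words
    return dist.max() if dist.N() > 0 else default

likely_tags = {word: likely_tag(word) for word in "the quick brown fox".split()}
print(likely_tags)
```

Checking the sample count with N() before calling max() avoids the error for a word such as "brown", which does not occur in the toy data.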

2014-01-12 23:55:05 Mehdi
