Python NLTK collocations for tagged text

I'm not sure whether this is possible, but I thought I'd ask just in case. Say you have a dataset of instances of the form "body | tags", for example:

"I went to the store and bought some bread" | shopping food

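I was wondering whether there is a way to use NLTK collocations to count how many times body words and tag words co-occur in the dataset. An example result might look like ("bread", "food", 598), where "bread" is a body word, "food" is a tag word, and 598 is the number of times they co-occur in the dataset.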
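
To make the desired output concrete, here is a minimal plain-Python sketch (no NLTK) of that kind of count. The sample rows and the helper name `body_tag_counts` are made up for illustration, and it assumes a pair should be counted at most once per row.

```python
from collections import Counter
from itertools import product

# Hypothetical sample rows in the "body | tags" format described above.
rows = [
    '"I went to the store and bought some bread" | shopping food',
    '"fresh bread and cheese" | food',
]

def body_tag_counts(rows):
    """Count how often each (body word, tag word) pair shares a row."""
    counts = Counter()
    for row in rows:
        body, _, tags = row.partition('|')
        # Assumption: set() counts a pair at most once per row,
        # even if a word is repeated in the body.
        body_words = set(body.strip().strip('"').lower().split())
        tag_words = set(tags.split())
        counts.update(product(body_words, tag_words))
    return counts

counts = body_tag_counts(rows)
print(counts[('bread', 'food')])  # 2
```

Dropping the `set()` calls would instead count every repeated occurrence within a row, which may or may not be what is wanted.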

Source

2013-11-02 user1893354

Without using NLTK, you can do it like this:

from collections import Counter
from itertools import product

documents = '''"foo bar is not a sentence" | tag1
"bar bar black sheep is not a real sheep" | tag2
"what the bar foo is not a foo bar" | tag1'''

# Keep only the body of each line (the text before '|'), without the quotes.
documents = [line.split('|')[0].strip('" ') for line in documents.split('\n')]

collocations = Counter()

for doc in documents:
    tokens = doc.split()
    # Get all the possible word collocations with product.
    # NOTE: this includes a token with itself, so we need
    #  to remove the count for the token with itself.
    pairs = Counter(product(tokens, tokens)) - Counter((t, t) for t in tokens)
    collocations += pairs

for pair in collocations:
    print(pair, collocations[pair])

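You will run into the question of how to count collocations of the same word within a sentence, for example:

bar bar black sheep is not a real sheep

What is the collocation count for ('bar', 'bar')? Is it 1 or 2? For this sentence, the code above gives 2, because the first bar collocates with the second bar, and the second bar collocates with the first.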
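
If a count of 1 is what you want, one option is to count each unordered pair of token positions once. This is only a sketch of that alternative, not a replacement for the code above; the function name `pair_counts` is made up for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_counts(sentence):
    """Count unordered word pairs, one per pair of token positions."""
    tokens = sentence.split()
    # combinations() yields each pair of positions exactly once;
    # sorting each pair makes ('bar', 'foo') and ('foo', 'bar') the same key.
    return Counter(tuple(sorted(pair)) for pair in combinations(tokens, 2))

counts = pair_counts("bar bar black sheep is not a real sheep")
print(counts[('bar', 'bar')])    # 1
print(counts[('bar', 'sheep')])  # 4
```

Here ('bar', 'sheep') is 4 because each of the two "bar" tokens pairs with each of the two "sheep" tokens.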

Source

2013-12-15 12:55:36 alvas
