
I have a number of documents which I have tokenized, converting each into a list of tokens; all of these lists are then nested inside an outer list, so that I have a list of token lists. I want to remove words that appear infrequently across these lists.

A simple example:

[["egg","apple","bread","milk","pear"], ["egg","apple","bread","milk"], ["egg","apple","bread","milk"]] 

I want to remove tokens that appear in fewer than X% of the documents ("pear" in the example above, since it only appears in one of the three documents). However, I'm not sure how to do this efficiently - I know the data structure is probably part of the problem, but the next part of my code needs the output in this format.

My current code looks like this, and it is obviously not very efficient when there are many documents:

min_docs = 0.05 * len(tokenized_document_list)
whitelist = []
for document in tokenized_document_list:  # Go through each document
    for token in list(document):  # Iterate over a copy so that remove() below is safe
        if token in whitelist:
            continue
        token_count = 0
        for document_t in tokenized_document_list:  # Go through each document looking for token
            if token in document_t:
                token_count += 1
                if token_count > min_docs:
                    whitelist.append(token)
                    break
        if token_count < min_docs:
            document.remove(token)

Any suggestions would be much appreciated!

Answer

from collections import defaultdict
import six


def calc_token_frequencies(doc_list):
    frequencies = defaultdict(int)  # Each dict item will start off as int(0)
    for token_set in doc_list:
        for token in token_set:
            frequencies[token] += 1
    return frequencies


if __name__ == '__main__':
    # Use a list of sets here in order to leverage set features
    tokenized_document_list = [{"egg", "apple", "bread", "milk", "pear"},
                               {"egg", "apple", "bread", "milk"},
                               {"egg", "apple", "bread", "milk"}]

    # Count the number of documents each token was in.
    token_frequencies = calc_token_frequencies(tokenized_document_list)

    # I used 50% here instead of the example 5% so that it would do something useful.
    token_min_docs = 0.5 * len(tokenized_document_list)

    # Calculate the blacklist via set comprehension.
    token_blacklist = {token for token, doc_count in six.iteritems(token_frequencies)
                       if doc_count < token_min_docs}

    # Remove blacklisted items from every document.
    for doc_tokens in tokenized_document_list:
        doc_tokens.difference_update(token_blacklist)

    print(tokenized_document_list)
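With the 50% threshold, "pear" has a document frequency of 1 out of 3 documents, which is below the cutoff of 1.5, so running the script prints three sets that each contain only "egg", "apple", "bread" and "milk". Note that this only walks the corpus once to build the counts, rather than rescanning every document for every token as the original nested loops do.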

I should clarify that I need to keep track of the number of times a term appears within a document, which makes using a set problematic. However, the change needed to this solution is quite simple - thanks! – user2853043
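A minimal sketch of that adaptation (mine, not from the original posts): keep each document as a collections.Counter so per-document term counts survive, and filter on document frequency exactly as the answer does. The function name filter_rare_tokens and the 0.5 cutoff are illustrative assumptions.

from collections import Counter


def filter_rare_tokens(doc_counters, min_fraction):
    # Illustrative sketch, not from the original answer.
    # Document frequency: how many documents each token appears in,
    # counted once per document regardless of how often it repeats.
    doc_freq = Counter()
    for counter in doc_counters:
        doc_freq.update(counter.keys())

    min_docs = min_fraction * len(doc_counters)
    blacklist = {token for token, df in doc_freq.items() if df < min_docs}

    # Drop rare tokens in place while keeping the surviving counts.
    for counter in doc_counters:
        for token in blacklist:
            counter.pop(token, None)  # pop with a default avoids KeyError for absent tokens


if __name__ == '__main__':
    documents = [Counter(["egg", "egg", "apple", "bread", "milk", "pear"]),
                 Counter(["egg", "apple", "bread", "milk"]),
                 Counter(["egg", "apple", "bread", "milk"])]
    filter_rare_tokens(documents, 0.5)
    print(documents)  # "pear" is gone; the count of 2 for "egg" in the first document survives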
