
I have a number of documents which I have tokenized, converting each into a list of tokens; all of these lists are then nested inside an outer list, so that I have a list of token lists. I want to remove words that appear infrequently across these lists.

A simple example:

[["egg","apple","bread","milk","pear"], ["egg","apple","bread","milk"], ["egg","apple","bread","milk"]] 

I want to remove tokens that appear in fewer than X% of the documents ("pear" in the example above, since it only appears in one of the three documents). However, I'm not sure how to do this efficiently - I know the data structure is probably part of the problem, but the next part of my code needs the output in this format.

My current code looks like this, and it is obviously not very efficient when there are many documents:

min_docs = 0.05 * len(tokenized_document_list)
whitelist = []
for document in tokenized_document_list:  # Go through each document
    for token in list(document):  # Iterate over a copy so that remove() below is safe
        if token in whitelist:
            continue
        token_count = 0
        for document_t in tokenized_document_list:  # Go through each document looking for token
            if token in document_t:
                token_count += 1
                if token_count > min_docs:
                    whitelist.append(token)
                    break
        if token_count < min_docs:
            document.remove(token)

Any suggestions would be much appreciated!

Answer

from collections import defaultdict
import six


def calc_token_frequencies(doc_list):
    frequencies = defaultdict(int)  # Each dict item will start off as int(0)
    for token_set in doc_list:
        for token in token_set:
            frequencies[token] += 1
    return frequencies


if __name__ == '__main__':
    # Use a list of sets here in order to leverage set features
    tokenized_document_list = [{"egg", "apple", "bread", "milk", "pear"},
                               {"egg", "apple", "bread", "milk"},
                               {"egg", "apple", "bread", "milk"}]

    # Count the number of documents each token was in.
    token_frequencies = calc_token_frequencies(tokenized_document_list)

    # I used 50% here instead of the example 5% so that it would do something useful.
    token_min_docs = 0.5 * len(tokenized_document_list)

    # Calculate the blacklist via set comprehension.
    token_blacklist = {token for token, doc_count in six.iteritems(token_frequencies)
                       if doc_count < token_min_docs}

    # Remove blacklisted items from every document.
    for doc_tokens in tokenized_document_list:
        doc_tokens.difference_update(token_blacklist)

    print(tokenized_document_list)
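With the 50% threshold, "pear" has a document frequency of 1 out of 3 documents, which is below the cutoff of 1.5, so running the script prints three sets that each contain only "egg", "apple", "bread" and "milk". Note that this only walks the corpus once to build the counts, rather than rescanning every document for every token as the original nested loops do.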

I should clarify that I need to keep track of the number of times a term appears within a document, which makes using a set problematic. However, the change needed to this solution is quite simple - thanks! – user2853043
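A minimal sketch of that adaptation (mine, not from the original posts): keep each document as a collections.Counter so per-document term counts survive, and filter on document frequency exactly as the answer does. The function name filter_rare_tokens and the 0.5 cutoff are illustrative assumptions.

from collections import Counter


def filter_rare_tokens(doc_counters, min_fraction):
    # Illustrative sketch, not from the original answer.
    # Document frequency: how many documents each token appears in,
    # counted once per document regardless of how often it repeats.
    doc_freq = Counter()
    for counter in doc_counters:
        doc_freq.update(counter.keys())

    min_docs = min_fraction * len(doc_counters)
    blacklist = {token for token, df in doc_freq.items() if df < min_docs}

    # Drop rare tokens in place while keeping the surviving counts.
    for counter in doc_counters:
        for token in blacklist:
            counter.pop(token, None)  # pop with a default avoids KeyError for absent tokens


if __name__ == '__main__':
    documents = [Counter(["egg", "egg", "apple", "bread", "milk", "pear"]),
                 Counter(["egg", "apple", "bread", "milk"]),
                 Counter(["egg", "apple", "bread", "milk"])]
    filter_rare_tokens(documents, 0.5)
    print(documents)  # "pear" is gone; the count of 2 for "egg" in the first document survives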
