Efficient indexing of the documents in which a vocabulary word appears

I have the following code to index the documents in which each word of a vocabulary appears:

def index(self): 
    """ 
    Build an index of the documents. 
    """ 
    print "Indexing..." 
    # ------------------------------------------------------------------ 
    # TODO: Create an inverted index. 
    #  Granted this may not be a linked list as in a proper 
    #  implementation. 

    inv_index = collections.defaultdict(lambda: []) 

    tam = len(self.docs) 
    for word in self.vocab: 
        for i in xrange(tam):
            if word in self.docs[i]:
                inv_index[word].append(i)

    print "End indexing..." 
    # ------------------------------------------------------------------ 
    self.inv_index = inv_index

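It builds the whole index, but far too slowly (about 20 minutes). How can I do this in less than 10 seconds?

self.vocab: a list of all the distinct (stemmed) words
self.docs: a list of lists, where the i-th document is
self.docs[i] => ['word1', 'word2', ..., 'wordN']

self.vocab is my own vocabulary, and for each of its words I need to index the numbers of the documents in which it appears.

Source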

2014-03-28 SerCrAsH

Use a dictionary and sets:

inv_index = collections.defaultdict(set) 

vocabulary = set(self.vocab) 
for i, document in enumerate(self.docs): 
    in_document = vocabulary & set(document) 
    for word in in_document: 
        inv_index[word].add(i)
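
To try this approach in isolation, here is a minimal self-contained sketch. The function name build_inverted_index and the toy vocab and docs are made up for illustration; they stand in for self.vocab and self.docs.

```python
import collections


def build_inverted_index(vocab, docs):
    """Map each vocabulary word to the set of indices of documents containing it."""
    inv_index = collections.defaultdict(set)
    vocabulary = set(vocab)
    for i, document in enumerate(docs):
        # One set intersection per document instead of one list scan per word.
        for word in vocabulary & set(document):
            inv_index[word].add(i)
    return inv_index


# Hypothetical toy data in the same shape as self.vocab and self.docs.
vocab = ["cat", "dog", "fish"]
docs = [["cat", "sat", "cat"], ["dog", "cat"], ["bird"]]
inv_index = build_inverted_index(vocab, docs)
print(sorted(inv_index["cat"]))  # [0, 1]
```

Vocabulary words that occur in no document (here "fish") never get a key, and words outside the vocabulary (here "bird") are ignored.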

Source

2014-03-28 14:47:04

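Making the defaultdict use sets looks like overkill, but looping over the documents only once and using set intersection is definitely the right way to go. – Midnighter

@Midnighter I don't see the point of using a list; it would only make appending and searching take longer. If you are building an inverted index, the point is that you will check many times whether something is in it. Membership checks are faster on a set, so again a set seems the better choice. –

I don't know how inv_index is going to be used, so you may be right. – Midnighter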

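You should definitely convert the elements of self.docs from lists to sets. Your test if word in self.docs[i] is an O(n) operation on a list but an O(1) operation (on average) on a set. By the way, you can initialise your defaultdict with list directly, i.e. defaultdict(list).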
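
A rough sketch of that suggestion, assuming toy data in place of self.vocab and self.docs (the names vocab, docs and doc_sets are illustrative only):

```python
import collections

# Hypothetical toy data standing in for self.vocab and self.docs.
vocab = ["cat", "dog", "fish"]
docs = [["cat", "sat", "cat"], ["dog", "cat"], ["bird"]]

# Convert every document to a set once, so each membership test below
# is an average O(1) hash lookup rather than an O(n) list scan.
doc_sets = [set(doc) for doc in docs]

inv_index = collections.defaultdict(list)
for word in vocab:
    for i, doc_set in enumerate(doc_sets):
        if word in doc_set:
            inv_index[word].append(i)

print(inv_index["cat"])  # [0, 1]
```

This keeps the original nested loops, so it still performs one lookup per (word, document) pair; only the cost of each lookup drops.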

Source

2014-03-28 14:40:34 Midnighter

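I can see two problems with your implementation:

For each word in the list, you check whether the word is in a document and, if it is, you add the document's index to it.

This is correct, but it essentially reads every document in full for each word. The time complexity is O(W x D x L), where W is the number of words, D is the number of documents, and L is the average document length.

We can assume your vocabulary consists of unique words (otherwise it would make no sense).

One improvement is to build a set of all the words. This can be done in O(W) amortized time.

Then, for each word in each document, check whether it is in the set and, if so, add it to the index. Both operations can be done in O(1) per word.

Overall, the algorithm now becomes O(W + (D x L)).

Also, if your documents can be compressed by removing duplicate words, the process speeds up by the compression factor.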
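
The steps described in this answer might look roughly like the following. This is only a sketch under assumed toy data; the function name index_by_scanning_documents is hypothetical, and set(doc) is one possible way to do the duplicate removal mentioned in the last point.

```python
import collections


def index_by_scanning_documents(vocab, docs):
    """Build the index in roughly O(W + D x L) time."""
    vocabulary = set(vocab)  # O(W) to build
    inv_index = collections.defaultdict(list)
    for doc_index, doc in enumerate(docs):
        # set(doc) removes duplicates, so each document index is
        # recorded at most once per word.
        for word in set(doc):
            if word in vocabulary:  # average O(1) lookup
                inv_index[word].append(doc_index)
    return inv_index


# Hypothetical toy data in the same shape as self.vocab and self.docs.
vocab = ["cat", "dog", "fish"]
docs = [["cat", "sat", "cat"], ["dog", "cat"], ["bird"]]
inv_index = index_by_scanning_documents(vocab, docs)
print(inv_index["cat"])  # [0, 1]
```

Because the documents are visited in order, each posting list comes out sorted by document index without an extra sort.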

Source

2014-03-28 14:41:33

I would like to accept this answer too :) – SerCrAsH

This is basically what you are trying to do:

for word in vocabulary:
    for doc_index, doc in enumerate(documents):
        if word in doc:
            index[word].append(doc_index)

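Let's say you have 1000 words in the vocabulary and 1000 documents. That means you will run if word in document: 1000 * 1000 times. I believe the word in document test reads the whole document, which is not cheap, especially if the documents are large.

A simpler logic: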

for doc_index, doc in enumerate(documents):
    for word in doc:
        index[word].append(doc_index)

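This way you eliminate the expensive word in document operation.

A few notes about the inner loop over each word in doc:
You need to tokenize the document to be able to iterate over its individual words. Think of something like splitting on whitespace or, if you want a more robust solution, I suggest using the NLTK tokenizer module; see http://text-processing.com/demo/tokenize/ for example: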

import nltk
sentence = """At eight o'clock on Thursday morning Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
# tokens is now:
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
#  'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
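
For comparison, the plain whitespace split mentioned above needs no extra library. A small sketch, assuming the documents start out as raw strings (raw_docs is a made-up name; in the question, self.docs is already tokenized):

```python
# Hypothetical raw documents; the question's self.docs is already tokenized.
raw_docs = [
    "At eight o'clock on Thursday morning",
    "Arthur didn't feel very good.",
]

# str.split() with no arguments splits on any run of whitespace.
docs = [text.split() for text in raw_docs]
print(docs[1])  # ['Arthur', "didn't", 'feel', 'very', 'good.']
```

Note that punctuation stays attached to the words ('good.'), which is one reason a real tokenizer can be the more robust choice.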

Source

2014-03-28 14:52:32 bpgergo
