Positional index in Python

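Python beginner here. I am trying to implement a positional index using a nested dictionary, but I am not sure whether this is the way to go. The index should contain the term, the term frequency, the document ID, and the term positions.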

Example:

dict = {term: {termfreq: {docid: {[pos1,pos2,...]}}}}

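My question is: am I on the right track here, or is there a better solution to my problem? If a nested dictionary is the way to go, I have one more question: how do I get a single item out of the dictionary, for example the term frequency of a term (without all the additional information about the term)? Help on this is much appreciated.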

2012-02-09 root

Each term appears to have a term frequency, a document ID, and a list of positions. Is that right? If so, you could use a dict of dicts:

dct = {
    'wassup': {
        'termfreq': 'daily',
        'docid': 1,
        'pos': [3, 4],
    }
}

Then, given a term like 'wassup', you can look up its term frequency with

dct['wassup']['termfreq'] 
# 'daily'
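
For a concrete picture of how such an index could be filled in, here is a minimal sketch. It assumes a slightly different layout from the one above, term -> {doc ID -> list of positions}, so that the term frequency in a document is simply the length of its position list. The function name `build_positional_index`, the sample documents, and the whitespace tokenization are all illustrative assumptions, not part of the question.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; docs is {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


# Hypothetical sample documents, keyed by document ID.
docs = {1: 'to be or not to be', 2: 'to do is to be'}
index = build_positional_index(docs)

positions = index['to'][1]       # positions of 'to' in document 1
term_freq = len(index['to'][1])  # term frequency of 'to' in document 1
doc_freq = len(index['to'])      # number of documents containing 'to'
print(positions, term_freq, doc_freq)
```

Storing only the positions keeps the structure smaller, since the frequencies can be derived on demand rather than kept in sync by hand.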

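Think of a dict as being like a telephone book. It is great at looking up a value (phone number) given a key (name). It is not so hot at looking up a key given a value. Use a dict when you know you only need to look things up in one direction. If your lookup patterns are more complex, you may need some other data structure (a database, perhaps?).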
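
To make the one-way nature of a dict concrete, here is a small sketch with a made-up phone book (the names and numbers are hypothetical). The forward lookup is a single key access, while the reverse lookup has to scan every entry unless a second, inverted dict is built first.

```python
from collections import defaultdict

# Hypothetical phone book: name -> number.
phone_book = {'alice': '555-0100', 'bob': '555-0199', 'carol': '555-0100'}

# Forward lookup (key -> value): one direct access.
number = phone_book['bob']

# Reverse lookup (value -> keys): has to walk through every entry.
names = [name for name, num in phone_book.items() if num == '555-0100']

# If reverse lookups are frequent, one option is to build an inverted dict once.
by_number = defaultdict(list)
for name, num in phone_book.items():
    by_number[num].append(name)

print(number, names, by_number['555-0199'])
```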

You might also want to check out the Natural Language Toolkit (nltk). It has a built-in method for calculating tf_idf:

import nltk 

# Given a corpus of texts 
text1 = 'Lorem ipsum FOO dolor BAR sit amet' 
text2 = 'Ut enim ad FOO minim veniam, ' 
text3 = 'Duis aute irure dolor BAR in reprehenderit ' 
text4 = 'Excepteur sint occaecat BAR cupidatat non proident' 

# We split the texts into tokens, and form a TextCollection 
mytexts = (
    [nltk.word_tokenize(text) for text in [text1, text2, text3, text4]]) 
mycollection = nltk.TextCollection(mytexts) 

# Given a new text 
text = 'et FOO tu BAR Brute' 
tokens = nltk.word_tokenize(text) 

# for each token (roughly, word) in the new text, we compute the tf_idf 
for word in tokens: 
    print('{w}: {s}'.format(w = word, 
          s = mycollection.tf_idf(word,tokens)))

yields

et: 0.0 
FOO: 0.138629436112 
tu: 0.0 
BAR: 0.0575364144904 
Brute: 0.0
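
For readers who want to see where those numbers come from, here is a plain-Python sketch with no nltk dependency. It assumes tf is the token count divided by the text length, and idf is the natural log of the number of texts divided by the number of texts containing the term (0.0 if no text contains it); those definitions reproduce the figures above. The helper names `tf`, `idf`, and `tf_idf` are illustrative, and tokenization is a simple whitespace split rather than `nltk.word_tokenize`.

```python
from math import log

corpus = [
    'Lorem ipsum FOO dolor BAR sit amet',
    'Ut enim ad FOO minim veniam, ',
    'Duis aute irure dolor BAR in reprehenderit ',
    'Excepteur sint occaecat BAR cupidatat non proident',
]
# Whitespace tokenization (an assumption; cruder than nltk.word_tokenize).
corpus_tokens = [text.split() for text in corpus]


def tf(term, tokens):
    """Relative frequency of term within one tokenized text."""
    return tokens.count(term) / len(tokens)


def idf(term, texts):
    """Log of (number of texts / number of texts containing term)."""
    matches = sum(1 for text in texts if term in text)
    return log(len(texts) / matches) if matches else 0.0


def tf_idf(term, tokens, texts):
    return tf(term, tokens) * idf(term, texts)


new_tokens = 'et FOO tu BAR Brute'.split()
for word in new_tokens:
    print('{w}: {s}'.format(w=word, s=tf_idf(word, new_tokens, corpus_tokens)))
```

FOO appears in 2 of the 4 texts and BAR in 3 of 4, and each makes up one fifth of the new text, which gives 0.2 * ln(2) and 0.2 * ln(4/3) respectively.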


2012-02-09 12:40:31 unutbu

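I want to build a pickle file that will let me implement textrank, tf-idf, and search over a corpus of legal documents, so I think it only needs to go one way: from key to value. Your solution seems to do the trick. Thanks a lot. (As this is my first week using Python (actually, programming in any language), I will probably be back here in the near future :)) – root 2012-02-09 13:04:34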

Thanks for the addition. I added bi- and trigrams to your code. I am not sure where to post it, since it is a bit off topic. – root 2012-02-09 18:55:52

If it is a new question, post a new question. If not, you could post it at https://gist.github.com/ and then add a link here. – unutbu 2012-02-09 19:00:48
