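Simple in-memory positional inverted index in Python
I am trying to build a simple positional index, but I am having trouble getting the correct output.
Given a list of strings (sentences), I want to use each string's position in the list as the document ID, then iterate over the words in the sentence and use each word's index within the sentence as its position. The dictionary entry for each word is then updated with a tuple of the document ID and the word's position in that document.
Code:
Main function: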
def doc_pos_index(alist):
    inv_index = {}
    words = [word for line in alist for word in line.split(" ")]
    for word in words:
        if word not in inv_index:
            inv_index[word] = []
    for item, index in enumerate(alist):  # find item and its index in list
        for item2, index2 in enumerate(alist[item]):  # for words in string find word and its index
            if item2 in inv_index:
                inv_index[i].append(tuple(index, index2))  # if word in index update its list with tuple of doc index and position
    return inv_index
Sample list:
doc_list= [
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed'
]
Expected output:
{'Delivered': [(0,1),(1,1),(2,1),(3,1),(4,1)],
'necessary': [(0,3),(1,3),(2,3),(3,3),(4,3)],
'dejection': [(0,2),(1,2),(2,2),(3,2),(4,2)],
etc...}
Current output:
{'Delivered': [],
'necessary': [],
'dejection': [],
'do': [],
'objection': [],
'prevailed': [],
'mr': [],
'hello': []}
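I know about the collections library and NLTK, but I am doing this mainly for learning/practice.
You have the order of the values from `enumerate` backwards. You want `for index, item in enumerate(alist):`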
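A minimal sketch of that fix, not necessarily what the commenter had in mind. `enumerate` yields `(index, item)` pairs, so the loop variables need swapping. The inner loop in the question also enumerates the characters of the string rather than its words, so the line has to be split first. The names `docs`, `doc_id`, and `pos` are illustrative, and `line.split()` (splitting on any whitespace) is an assumption; the question uses `split(" ")`.

```python
def doc_pos_index(docs):
    """Map each word to a list of (doc_id, position) tuples."""
    inv_index = {}
    # enumerate yields (index, item): the index is the document ID
    for doc_id, line in enumerate(docs):
        # split the sentence so we iterate over words, not characters
        for pos, word in enumerate(line.split()):
            # create the list on first sight, then append the tuple literal;
            # note that tuple(a, b) raises TypeError, so (a, b) is used instead
            inv_index.setdefault(word, []).append((doc_id, pos))
    return inv_index


doc_list = [
    'hello Delivered dejection',
    'hello Delivered dejection',
]
index = doc_pos_index(doc_list)
print(index['Delivered'])  # [(0, 1), (1, 1)]
```

Building the lists inside the same loop with `setdefault` also removes the need for the separate pass that pre-populates `inv_index` with empty lists.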