匹配分層關鍵字的文檔

我具有層次的關鍵字樹中，被表示爲元組的列表，其中第一個參數是「路徑」，第二個是相應的關鍵字：匹配分層關鍵字的文檔

keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')]

列表「連接路徑「和相應的文檔（DOC一個可以有一個以上的‘路徑’：

docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')]

我想每個文檔的關鍵字匹配，並且產生這樣的結果：

docdict={doc1:[('key1','key2','key3'),('key1','key4','key5')],doc2:[('key1','key2')]}

我的問題是如何最有效地獲得所有（父）關鍵字？先謝謝你！

來源

2012-05-16 root

一個更可讀的答案，如果你有很多這樣的答案，可能會更好地擴展。

docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')] 
keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')] 

keydict = dict(keys) 
resultDict = {} 

for doc in docs: 
    (path, docname) = doc 
    pathList = path.split(',') 
    keyPath = [] 
    for i in range(0, len(pathList)): 
     aPath = ','.join(pathList[:i+1]) 
     keyPath.append(keydict[aPath]) 

    if docname not in resultDict : 
     resultDict[docname] = [] 
    resultDict[docname].append(tuple(keyPath)) 

print resultDict

來源

2012-05-16 10:47:41

編輯：我首先創建了一個方法來從關鍵字中獲取直接父項，但是這不能滿足要求。從我的理解，讓所有家長的關鍵字從路徑要好得多：

>>> def get_parents(keys, path): 
    """ Get all parent keywords """ 
    # Get parent path (remove after last ',') 
    parent_paths = [path] 
    while ',' in path: 
     path = ','.join(path.split(',')[:-1]) 
     parent_paths.append(path) 
    return [key for path, key in keys if path in parent_paths] 

>>> get_parents(keys, '0,2') 
['key1', 'key4'] 
>>> from collections import defaultdict 
>>> def create_doc_dict(keys, docs): 
    """ Create the required dict """ 
    docdict = defaultdict(list) 
    for path, doc in docs: 
     docdict[doc].append(get_parents(keys, path)) 
    return docdict 

>>> create_doc_dict(keys, docs) 
defaultdict(<type 'list'>, {'doc2': [['key1', 'key2']], 'doc1': [['key1', 'key2', 'key3'], ['key1', 'key4', 'key5']]})

來源

2012-05-16 10:37:21 Emmanuel

這幾乎你想要做什麼：

>>> docdict = {doc[-1]:[key[-1] for key in keys if doc[0].startswith(key[0])] for doc in docs} 
>>> docdict 
{'doc2': ['key1', 'key2'], 'doc1': ['key1', 'key4', 'key5']}

，這不正是你指定什麼：

>>> docdict = {} 
>>> for doc in docs: 
    docdict.setdefault(doc[-1],[]).append(tuple(key[-1] for key in keys if doc[0].startswith(key[0]))) 
>>> docdict 
{'doc2': [('key1', 'key2')], 'doc1': [('key1', 'key2', 'key3'), ('key1', 'key4', 'key5')]}

都是O(n*m)。

來源

2012-05-16 10:37:38 ch3ka

這也是另一種解決方案：

keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')] 
docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')] 

def get_path(p): 
    # tuples so that you can use them as dict keys 
    return tuple(p.split(',')) 

# we need to find the keys based on the paths, so make the path the dict's key 
keypaths = {get_path(p): key for p, key in keys} 

docdict = {} 
for p, doc in docs: 
    path = get_path(p) # we need the path as a tuple or list, so that you can get the parents via slicing 
    # get all parents of the path and the path itself. 
    # we remove one part of the path at a time and keep the original path also 
    all_paths = [path]+[path[:-i] for i in range(1,len(path))] 
    # you need to keep each doc/path combination alone, so you need a list to store it 
    if doc not in docdict: 
     docdict[doc] = [] 
    # add the keys whose path is in one of the parents or the path itself 
    docdict[doc].append([keypaths[path] for path in all_paths if path in keypaths]) 

print docdict # now, you see what you expect. :)

坦率地說，所有這些單行，代碼變得不可讀。所以，如果你同意，你應該更喜歡這個解決方案。

來源

2012-05-16 10:53:54 jadkik94

匹配分層關鍵字的文檔

回答

相關問題