2012-05-16 119 views
2

我具有層次的關鍵字樹中,被表示爲元組的列表,其中第一個參數是「路徑」,第二個是相應的關鍵字:匹配分層關鍵字的文檔

keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')] 

列表「連接路徑「和相應的文檔(DOC一個可以有一個以上的‘路徑’:

docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')] 

我想每個文檔的關鍵字匹配,並且產生這樣的結果:

docdict={doc1:[('key1','key2','key3'),('key1','key4','key5')],doc2:[('key1','key2')]} 

我的問題是如何最有效地獲得所有(父)關鍵字?先謝謝你!

回答

1

一個更可讀的答案,如果你有很多這樣的答案,可能會更好地擴展。

docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')] 
keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')] 

keydict = dict(keys) 
resultDict = {} 

for doc in docs: 
    (path, docname) = doc 
    pathList = path.split(',') 
    keyPath = [] 
    for i in range(0, len(pathList)): 
     aPath = ','.join(pathList[:i+1]) 
     keyPath.append(keydict[aPath]) 

    if docname not in resultDict : 
     resultDict[docname] = [] 
    resultDict[docname].append(tuple(keyPath)) 

print resultDict 
0

編輯:我首先創建了一個方法來從關鍵字中獲取直接父項,但是這不能滿足要求。從我的理解,讓所有家長的關鍵字從路徑要好得多:

>>> def get_parents(keys, path): 
    """ Get all parent keywords """ 
    # Get parent path (remove after last ',') 
    parent_paths = [path] 
    while ',' in path: 
     path = ','.join(path.split(',')[:-1]) 
     parent_paths.append(path) 
    return [key for path, key in keys if path in parent_paths] 

>>> get_parents(keys, '0,2') 
['key1', 'key4'] 
>>> from collections import defaultdict 
>>> def create_doc_dict(keys, docs): 
    """ Create the required dict """ 
    docdict = defaultdict(list) 
    for path, doc in docs: 
     docdict[doc].append(get_parents(keys, path)) 
    return docdict 

>>> create_doc_dict(keys, docs) 
defaultdict(<type 'list'>, {'doc2': [['key1', 'key2']], 'doc1': [['key1', 'key2', 'key3'], ['key1', 'key4', 'key5']]}) 
1

這幾乎你想要做什麼:

>>> docdict = {doc[-1]:[key[-1] for key in keys if doc[0].startswith(key[0])] for doc in docs} 
>>> docdict 
{'doc2': ['key1', 'key2'], 'doc1': ['key1', 'key4', 'key5']} 

,這不正是你指定什麼:

>>> docdict = {} 
>>> for doc in docs: 
    docdict.setdefault(doc[-1],[]).append(tuple(key[-1] for key in keys if doc[0].startswith(key[0]))) 
>>> docdict 
{'doc2': [('key1', 'key2')], 'doc1': [('key1', 'key2', 'key3'), ('key1', 'key4', 'key5')]} 

都是O(n*m)

1

這也是另一種解決方案:

keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')] 
docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')] 

def get_path(p): 
    # tuples so that you can use them as dict keys 
    return tuple(p.split(',')) 

# we need to find the keys based on the paths, so make the path the dict's key 
keypaths = {get_path(p): key for p, key in keys} 

docdict = {} 
for p, doc in docs: 
    path = get_path(p) # we need the path as a tuple or list, so that you can get the parents via slicing 
    # get all parents of the path and the path itself. 
    # we remove one part of the path at a time and keep the original path also 
    all_paths = [path]+[path[:-i] for i in range(1,len(path))] 
    # you need to keep each doc/path combination alone, so you need a list to store it 
    if doc not in docdict: 
     docdict[doc] = [] 
    # add the keys whose path is in one of the parents or the path itself 
    docdict[doc].append([keypaths[path] for path in all_paths if path in keypaths]) 

print docdict # now, you see what you expect. :) 

坦率地說,所有這些單行,代碼變得不可讀。所以,如果你同意,你應該更喜歡這個解決方案。