Converting duplicate dictionary entries into unique entries with arrays of IDs

I have a list of dictionaries in which one of the values, name, contains duplicated data that I want to normalize. The list looks like this:

[ 
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8}, 
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5}, 
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9}, 
    {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11} 
]

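What I want to do is create a new list of dictionaries that contains only unique names. But I need to keep track of the document_ids and annotation_ids. Sometimes the document_ids are the same, but I only need to track which ones are associated with each name. So the list above would become: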

[ 
    {'name': 'Craig McKray', 'document_ids': [50], 'annotation_ids': [8, 9]}, 
    {'name': 'None on file', 'document_ids': [40], 'annotation_ids': [5]}, 
    {'name': 'Western Union', 'document_ids': [61], 'annotation_ids': [11]} 
]

Here is the code I have tried so far:

result = []
# resolve duplicate names
result_row = defaultdict(list)
for item in data:
    for double in data:
        if item['name'] == double['name']:
            result_row['name'] = item['name']
            result_row['record_ids'].append(item['document_id'])
            result_row['annotation_ids'].append(item['annotation_id'])
            result.append(result_row)

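The main problem with this code is that I compare and find the duplicates, but when I iterate to the next item it finds the duplicate again, creating an endless loop. How can I edit the code so that it doesn't keep comparing the same duplicates?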

2017-07-18 Casey

Please post the output you are getting. – perigon

new = dict()
for x in people:
    if x['name'] in new:
        new[x['name']].append({'document_id': x['document_id'], 'annotation_id': x['annotation_id']})
    else:
        new[x['name']] = [{'document_id': x['document_id'], 'annotation_id': x['annotation_id']}]

This isn't exactly what you asked for, but the format should let you do what you want.

Here is the output:

{'Craig McKray': [{'annotation_id': 8, 'document_id': 50}, {'annotation_id': 9, 'document_id': 50}], 'Western Union': [{'annotation_id': 11, 'document_id': 61}], 'None on file': [{'annotation_id': 5, 'document_id': 40}]}

Here, I think this is probably the best fit for you:

from collections import defaultdict
new = defaultdict(dict)

for x in people:
    if x['name'] in new:
        new[x['name']]['document_ids'].append(x['document_id'])
        new[x['name']]['annotation_ids'].append(x['annotation_id'])
    else:
        new[x['name']]['document_ids'] = [x['document_id']]
        new[x['name']]['annotation_ids'] = [x['annotation_id']]

2017-07-18 03:37:39

This works well, but how does defaultdict work in this case? For my own education. – Casey

We need a defaultdict with dict as the default value so that we can add the 'annotation_ids' key to it and then assign it a list. –
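
To illustrate the comment above, here is a minimal sketch of what defaultdict(dict) does when a missing name is indexed. The sample values are taken from the question; the variable name plain is only an assumption for the comparison.

```python
from collections import defaultdict

new = defaultdict(dict)

# A membership test does not call the default factory, so nothing is created.
assert 'Craig McKray' not in new

# Indexing a missing key calls dict() and stores the empty dict under that key,
# which is why the else branch can assign 'document_ids' straight away.
new['Craig McKray']['document_ids'] = [50]
new['Craig McKray']['annotation_ids'] = [8]

# On later rows the key exists, so the if branch can append to the lists.
new['Craig McKray']['annotation_ids'].append(9)
print(dict(new))

# For comparison (hypothetical name `plain`): a regular dict raises KeyError.
plain = {}
try:
    plain['Craig McKray']['document_ids'] = [50]
except KeyError:
    print('a plain dict has no default value for a missing key')
```

With a plain dict you would have to create the inner dict yourself before assigning into it; defaultdict does that step for you.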

A more functional approach using itertools.groupby might look like this. It's a bit cryptic, so I'll explain it.

from itertools import groupby 
from operator import itemgetter 

inp = [ 
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8}, 
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5}, 
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9}, 
    {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11} 
] 

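def groupvals(vals):
    namegetter = itemgetter('name')
    docanngetter = itemgetter('document_id', 'annotation_id')

    # groupby only merges adjacent items, so sort by name first
    for grouper, grps in groupby(sorted(vals, key=namegetter), key=namegetter):
        # transpose the (document_id, annotation_id) pairs into two columns,
        # then use set() to drop duplicates within each column
        docanns = [set(param) for param in zip(*(docanngetter(g) for g in grps))]
        # sorted() gives a deterministic order; plural keys match the requested output
        yield {'name': grouper, 'document_ids': sorted(docanns[0]), 'annotation_ids': sorted(docanns[1])}


for result in groupvals(inp):
    print(result)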

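To use groupby you need a sorted list, so first sort by name, then group by name. Next you pull out the document_id and annotation_id values and zip them together. This puts all the document_ids in one sequence and all the annotation_ids in another. Then you can call set to remove the duplicates, and use a generator to yield each element as a dict.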
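
The two less obvious steps in that explanation can be checked on their own. This is only a sketch, and the names rows, by_name and pairs are assumptions made for the illustration; the data is a subset of the question's list.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8},
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5},
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
]
by_name = itemgetter('name')

# Without sorting, groupby only merges *adjacent* equal keys,
# so 'Craig McKray' shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(rows, key=by_name)]
print(unsorted_keys)  # ['Craig McKray', 'None on file', 'Craig McKray']

# After sorting by the same key, each name forms exactly one group.
sorted_keys = [key for key, _ in groupby(sorted(rows, key=by_name), key=by_name)]
print(sorted_keys)  # ['Craig McKray', 'None on file']

# zip(*pairs) transposes (document_id, annotation_id) pairs into two columns.
pairs = [(50, 8), (50, 9)]
doc_ids, ann_ids = zip(*pairs)
print(doc_ids, ann_ids)  # (50, 50) (8, 9)
```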

I used a generator because it avoids having to build up a result list, although you could do that instead if you wanted.

2017-07-18 04:08:54

My take on this:

result = []
# resolve duplicate names
all_names = []
for i, item in enumerate(data):
    # skip names that have already been merged
    if item['name'] in all_names:
        continue
    result_row = {'name': item['name'], 'record_ids': [item['document_id']],
                  'annotation_ids': [item['annotation_id']]}
    all_names.append(item['name'])
    for j, double in enumerate(data):
        if item['name'] == double['name'] and i != j:
            result_row['record_ids'].append(double['document_id'])
            result_row['annotation_ids'].append(double['annotation_id'])
    # append once per unique name, after the inner loop has finished
    result.append(result_row)

2017-07-18 04:36:46

Another option:

from collections import defaultdict

catalog = defaultdict(lambda: defaultdict(list))

for d in dicts:
    entry = catalog[d['name']]
    for k in set(d) - {'name'}:
        entry[k].append(d[k])

Pretty printing:

>>> for name, e in catalog.items():
...     print("'{0}': {1}".format(name, e))

'Craig McKray': defaultdict(<type 'list'>, {'annotation_id': [8, 9], 'document_id': [50, 50]}) 
'Western Union': defaultdict(<type 'list'>, {'annotation_id': [11], 'document_id': [61]}) 
'None on file': defaultdict(<type 'list'>, {'annotation_id': [5], 'document_id': [40]})
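
The catalog above keeps repeated IDs (document_id is [50, 50] for 'Craig McKray') and is keyed by name rather than being a list. As a possible follow-up, here is a sketch that converts it into the list format the question asks for. The helper unique and the plural key names are assumptions added for illustration and are not part of the answer above; dict.fromkeys keeps first-seen order on Python 3.7+.

```python
from collections import defaultdict

dicts = [
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8},
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5},
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
    {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11},
]

# Same grouping step as in the answer above.
catalog = defaultdict(lambda: defaultdict(list))
for d in dicts:
    entry = catalog[d['name']]
    for k in set(d) - {'name'}:
        entry[k].append(d[k])


def unique(values):
    """Hypothetical helper: drop repeats while keeping first-seen order."""
    return list(dict.fromkeys(values))


# Flatten the nested defaultdicts into one plain dict per name.
result = [
    {
        'name': name,
        'document_ids': unique(entry['document_id']),
        'annotation_ids': unique(entry['annotation_id']),
    }
    for name, entry in catalog.items()
]

for row in result:
    print(row)
```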

2017-07-18 06:43:51
