Python dedupe records - dedupe

I want to use https://github.com/datamade/dedupe to deduplicate some records in Python. Looking at their examples:

data_d = {} 
for row in data: 
    clean_row = [(k, preProcess(v)) for (k, v) in row.items()] 
    row_id = int(row['id']) 
    data_d[row_id] = dict(clean_row)

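The resulting dictionary consumes quite a lot of memory compared with, say, a dictionary created by pandas from a pd.DataFrame, or even a plain pd.DataFrame.

If this format is required, how can I efficiently convert a pd.DataFrame into such a dictionary?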

Edit

Example of what pandas produces:

{'column1': {0: 1389225600000000000, 
    1: 1388707200000000000, 
    2: 1388707200000000000, 
    3: 1389657600000000000,....

Example of what dedupe expects:

{'1': {column1: 1389225600000000000, column2: "ddd"}, 
'2': {column1: 1111, column2: "ddd"} ...}

2016-09-18 Georg Heiler

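You can use DataFrame.to_dict() to convert a pandas DataFrame to a dictionary. Is that what you are looking for?

Indeed, but that gives column > index > value, whereas they seem to need index > column > value, which repeats the column keys for every record.

I think this would benefit greatly from an example of the data. – chthonicdaemon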

It looks like df.to_dict(orient='index') will produce the representation you are looking for:

import pandas

data = [[1, 2, 3], [4, 5, 6]] 
columns = ['a', 'b', 'c'] 

df = pandas.DataFrame(data, columns=columns) 

df.to_dict(orient='index')

results in

{0: {'a': 1, 'b': 2, 'c': 3}, 1: {'a': 4, 'b': 5, 'c': 6}}
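For the layout in the question, the record id would presumably need to become the index before the conversion. A minimal sketch, assuming a hypothetical 'id' column and made-up sample values (neither comes from the question's real data):

```python
import pandas as pd

# Hypothetical sample data; the 'id' column and the values are assumptions.
df = pd.DataFrame({
    "id": [1, 2],
    "column1": [1389225600000000000, 1111],
    "column2": ["ddd", "ddd"],
})

# Default orientation: column -> index -> value
by_column = df.to_dict()

# orient='index': index -> column -> value, keyed by the 'id' column
data_d = df.set_index("id").to_dict(orient="index")
print(data_d)
```

Any per-value cleaning (such as the preProcess step in the dedupe example) would still have to be applied separately, for instance on the DataFrame columns before the conversion.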

2016-09-18 07:35:38 chthonicdaemon

You can try something like this:

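import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
print(df)
#    A   B
# 0  1   6
# 1  2   7
# 2  3   8
# 3  4   9
# 4  5  10

# Transpose so the row index becomes the outer key, then convert to a dict
print(df.T.to_dict())
# {0: {'A': 1, 'B': 6}, 1: {'A': 2, 'B': 7}, 2: {'A': 3, 'B': 8}, 3: {'A': 4, 'B': 9}, 4: {'A': 5, 'B': 10}}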

This gives the same output as @chthonicdaemon's answer, so his answer is probably the better one. I am using pandas.DataFrame.T to transpose the index and columns.
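The equivalence of the two approaches can be checked directly. This is only a small sketch on an all-integer frame; with mixed column types the transpose has to hold every value in a single object-dtype frame, which is one likely reason to prefer orient='index'.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [6, 7, 8]})

# Transpose first, then use the default column -> index -> value layout
via_transpose = df.T.to_dict()

# Ask for the index -> column -> value layout directly
via_orient = df.to_dict(orient="index")

print(via_transpose == via_orient)
```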

2016-09-18 07:42:09

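A Python dictionary is not required; you only need an object that allows indexing by column name, i.e. row['col_name'].

So, assuming data is a pandas DataFrame, you should be able to just do something like: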

data_d = {} 
for row_id, row in data.iterrows(): 
    data_d[row_id] = row

That said, the memory overhead of Python dicts is not going to be where you hit memory bottlenecks in dedupe.
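As a self-contained illustration of the loop above, here is a sketch with a hypothetical two-row DataFrame (the values and index labels are made up). Each stored row is a pandas Series, which supports lookup by column name. Whether every code path in dedupe accepts a Series in place of a dict is not verified here.

```python
import pandas as pd

# Hypothetical data; the index labels stand in for record ids.
data = pd.DataFrame(
    {"column1": [1389225600000000000, 1111], "column2": ["ddd", "eee"]},
    index=[1, 2],
)

data_d = {}
for row_id, row in data.iterrows():
    # row is a pandas Series, so row['column2'] works like a dict lookup
    data_d[row_id] = row

print(data_d[1]["column2"])
```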

2016-09-18 13:30:07 fgregg
