Working with a large pandas dataframe (fuzzy matching)

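I want to do fuzzy matching, where I match strings from a column of a large dataframe (130,000 rows) against a list (400 rows). The code I wrote was tested on a small sample (matching 3,000 rows against 400 rows) and works fine. It is too large to copy here, but it roughly works like this: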

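1) Normalize the data in the columns
2) Create the Cartesian product of the columns and compute the Levenshtein distance
3) Select the highest-scoring matches and store them in a separate list, 'large_csv_names'
4) Compare the list 'large_csv_names' to 'large_csv', pull out all intersecting data and write it to a csv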
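As a rough illustration of steps 1 to 4 on toy data, here is a minimal sketch. It is not the original script: the column names `name` and `target`, the sample values, and the pure-Python helper `levenshtein` are all assumptions made for the example.

```python
import pandas as pd


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical stand-ins for the large frame and the 400-row list.
large = pd.DataFrame({"name": ["Acme Corp ", "globex inc", "Initech"]})
small = pd.DataFrame({"target": ["acme corp", "Globex Inc."]})

# Step 1: normalize both columns.
large["name_norm"] = large["name"].str.strip().str.lower()
small["target_norm"] = small["target"].str.strip().str.lower()

# Step 2: Cartesian product (via a constant join key) and distance per pair.
pairs = (large.assign(_key=1)
         .merge(small.assign(_key=1), on="_key")
         .drop(columns="_key"))
pairs["distance"] = [
    levenshtein(a, b) for a, b in zip(pairs["name_norm"], pairs["target_norm"])
]

# Step 3: keep the closest large-frame name for each target.
best = pairs.loc[pairs.groupby("target")["distance"].idxmin()]
large_csv_names = best["name"].tolist()

# Step 4: pull the matching rows out of the large frame
# (write them out with matched.to_csv(...) if needed).
matched = large[large["name"].isin(large_csv_names)]
```

The `pairs` frame is the part that grows with rows × targets, which is why this layout only works on small inputs.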

Since the Cartesian product contains more than 50 million records, I quickly ran into memory errors.
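To make the size concrete, the arithmetic can be sketched as follows. The row counts come from the question; `max_pairs` is an assumed memory budget chosen purely for illustration.

```python
import math

large_rows = 130_000   # rows in the large frame (from the question)
small_rows = 400       # strings to match against (from the question)
max_pairs = 1_000_000  # assumed number of pairs to hold in memory at once

total_pairs = large_rows * small_rows          # size of the full Cartesian product
chunk_rows = max_pairs // small_rows           # rows of the large frame per chunk
n_chunks = math.ceil(large_rows / chunk_rows)  # chunks needed to cover the frame
print(total_pairs, chunk_rows, n_chunks)       # 52000000 2500 52
```

Under that assumed budget, processing 2,500 rows at a time keeps each partial product at one million pairs instead of 52 million.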

That is why I would like to know how to split the large dataset into chunks and then run my script on each chunk.

So far I have tried:

df_split = np.array_split(df, x)  # x = number of chunks, e.g. 50 or 500
for i in df_split:
    pass  # steps 1-4 as above

As well as:

for chunk in pd.read_csv('large_csv.csv', chunksize=x):  # x = e.g. 50 or 500
    pass  # steps 1-4 as above

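Neither of these approaches seems to work. What I would like to know is how to run the fuzzy matching in chunks: cut the large csv into pieces, take a piece, run the code, take the next piece, run the code, and so on.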
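One possible shape for this, sketched under assumptions: read the csv with `chunksize`, score each chunk against the short list, and carry only the running best match per target between chunks. The in-memory `csv_source`, the `name` column, the sample targets, and the `score` helper are hypothetical. `difflib.SequenceMatcher` is used here as a standard-library stand-in for a Levenshtein-based score, not as the scorer from the original script.

```python
import io
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical data; in practice pass the path 'large_csv.csv' instead.
csv_source = io.StringIO(
    "name\nAcme Corp\nGlobex Inc\nInitech\nUmbrella\nAcme Corporation\n"
)
targets = ["acme corp", "umbrela"]  # stands in for the 400-row list


def score(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for a Levenshtein-based score."""
    return SequenceMatcher(None, a, b).ratio()


best = {}  # target -> (best score so far, original name)
for chunk in pd.read_csv(csv_source, chunksize=2):
    names = chunk["name"].astype(str)
    # Only this chunk's rows are compared against the targets at any time.
    for raw, norm in zip(names, names.str.strip().str.lower()):
        for target in targets:
            s = score(norm, target)
            if target not in best or s > best[target][0]:
                best[target] = (s, raw)

large_csv_names = [name for _, name in best.values()]
print(large_csv_names)  # ['Acme Corp', 'Umbrella']
```

Because only `best` survives from one chunk to the next, memory use depends on the chunk size rather than on the full 130,000 rows. A second chunked pass over the csv could then keep the rows whose name is in `large_csv_names` and append them to the output file.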

2017-09-03 Michiel V.

You might want to check out [dask](https://dask.pydata.org/en/latest/), which can lazily load dataframes from disk – Quickbeam2k1

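In the meantime I wrote a script that slices a dataframe into chunks, each of which can then be processed further. Since I am new to Python the code is probably a bit messy, but I still want to share it with anyone who might get stuck on the same problem.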

import math

import pandas as pd

# df is assumed to be the large dataframe, already loaded (e.g. with pd.read_csv)
partitions = 3  # number of ways to split df
length = len(df)
block_size = math.ceil(length / partitions)  # rows per chunk; the last chunk may be shorter

block_start = 0  # begin index of the current slice
while block_start < length:  # stop slicing when df ends
    block_end = min(block_start + block_size, length)  # end index of the current slice
    df1 = df.iloc[block_start:block_end]  # temp df that forms the chunk
    for i in range(len(df1)):
        row = df1.iloc[i]  # insert operations on this row of df1 here
    block_start = block_end  # when the for loop ends, move on to the next chunk
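The same slicing idea can also be wrapped in a generator, which keeps the index bookkeeping in one place. This is only a sketch: the function name `iter_chunks` and the 10-row sample frame are hypothetical.

```python
import math

import pandas as pd


def iter_chunks(frame: pd.DataFrame, partitions: int):
    """Yield consecutive row slices of frame, at most `partitions` of them."""
    block_size = max(1, math.ceil(len(frame) / partitions))
    for start in range(0, len(frame), block_size):
        yield frame.iloc[start:start + block_size]


# Hypothetical 10-row frame split three ways: chunks of 4, 4 and 2 rows.
df = pd.DataFrame({"name": [f"row{i}" for i in range(10)]})
sizes = [len(chunk) for chunk in iter_chunks(df, partitions=3)]
print(sizes)  # [4, 4, 2]
```

Each yielded chunk can be passed to the matching steps in turn, so only one partial Cartesian product exists at a time.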

2017-09-09 18:02:39
