Dask DataFrame Groupby Partitions

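I have some fairly large CSV files (~10 GB) and would like to use dask to analyze them. However, my groupby results change depending on how many partitions I set for the dask object being read in. My understanding was that dask uses partitions for out-of-core processing but would still return the correct groupby output. That does not seem to be the case, and I am struggling to work out which alternative settings are needed. Here is a small example: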

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': np.arange(100), 'B': np.random.randn(100), 'C': np.random.randn(100), 'Grp1': np.repeat([1, 2], 50), 'Grp2': np.repeat([3, 4, 5, 6], 25)})

test_dd1 = dd.from_pandas(df, npartitions=1) 
test_dd2 = dd.from_pandas(df, npartitions=2) 
test_dd5 = dd.from_pandas(df, npartitions=5) 
test_dd10 = dd.from_pandas(df, npartitions=10) 
test_dd100 = dd.from_pandas(df, npartitions=100) 

def test_func(x): 
    x['New_Col'] = len(x[x['B'] > 0.])/len(x['B']) 
    return x 

test_dd1.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head() 
    A    B    C Grp1 Grp2 New_Col 
0 0 -0.561376 -1.422286  1  3  0.48 
1 1 -1.107799 1.075471  1  3  0.48 
2 2 -0.719420 -0.574381  1  3  0.48 
3 3 -1.287547 -0.749218  1  3  0.48 
4 4 0.677617 -0.908667  1  3  0.48 

test_dd2.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head() 
    A    B    C Grp1 Grp2 New_Col 
0 0 -0.561376 -1.422286  1  3  0.48 
1 1 -1.107799 1.075471  1  3  0.48 
2 2 -0.719420 -0.574381  1  3  0.48 
3 3 -1.287547 -0.749218  1  3  0.48 
4 4 0.677617 -0.908667  1  3  0.48 

test_dd5.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head() 
    A    B    C Grp1 Grp2 New_Col 
0 0 -0.561376 -1.422286  1  3  0.45 
1 1 -1.107799 1.075471  1  3  0.45 
2 2 -0.719420 -0.574381  1  3  0.45 
3 3 -1.287547 -0.749218  1  3  0.45 
4 4 0.677617 -0.908667  1  3  0.45 

test_dd10.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head() 
    A    B    C Grp1 Grp2 New_Col 
0 0 -0.561376 -1.422286  1  3  0.5 
1 1 -1.107799 1.075471  1  3  0.5 
2 2 -0.719420 -0.574381  1  3  0.5 
3 3 -1.287547 -0.749218  1  3  0.5 
4 4 0.677617 -0.908667  1  3  0.5 

test_dd100.groupby(['Grp1', 'Grp2']).apply(test_func).compute().head() 
    A    B    C Grp1 Grp2 New_Col 
0 0 -0.561376 -1.422286  1  3  0 
1 1 -1.107799 1.075471  1  3  0 
2 2 -0.719420 -0.574381  1  3  0 
3 3 -1.287547 -0.749218  1  3  0 
4 4 0.677617 -0.908667  1  3  1 

df.groupby(['Grp1', 'Grp2']).apply(test_func).head() 
    A    B    C Grp1 Grp2 New_Col 
0 0 -0.561376 -1.422286  1  3  0.48 
1 1 -1.107799 1.075471  1  3  0.48 
2 2 -0.719420 -0.574381  1  3  0.48 
3 3 -1.287547 -0.749218  1  3  0.48 
4 4 0.677617 -0.908667  1  3  0.48

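Does the groupby step only operate within each partition rather than looking at the whole dataframe? In this case, setting npartitions=1 does not hurt performance, but since read_csv automatically sets a certain number of partitions, how do you set up the call to ensure the groupby results are accurate?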
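One way to probe the hypothesis in the question is to imitate partition-local grouping with plain pandas and compare it against grouping over the whole frame. The sketch below is only an illustration under assumptions: the helper names (`make_frame`, `frac_positive`, `partition_local`) and the fixed seed are made up here, the frame is split into contiguous row chunks, and nothing in it shows what dask does internally.

```python
import numpy as np
import pandas as pd

def make_frame(seed=0):
    # Same layout as the example frame, but seeded so runs are repeatable.
    rng = np.random.RandomState(seed)
    return pd.DataFrame({
        'A': np.arange(100),
        'B': rng.randn(100),
        'Grp1': np.repeat([1, 2], 50),
        'Grp2': np.repeat([3, 4, 5, 6], 25),
    })

def frac_positive(frame):
    # Fraction of rows with B > 0 per (Grp1, Grp2), broadcast back to each row.
    positive = frame['B'] > 0
    return positive.groupby([frame['Grp1'], frame['Grp2']]).transform('mean')

def partition_local(frame, npartitions):
    # Hypothetical stand-in for a groupby that only sees one chunk at a time.
    pieces = []
    for rows in np.array_split(np.arange(len(frame)), npartitions):
        pieces.append(frac_positive(frame.iloc[rows]))
    return pd.concat(pieces)

df = make_frame()
reference = frac_positive(df)  # grouping over the whole frame
for n in (1, 2, 4, 5, 10, 100):
    local = partition_local(df, n)
    print(n, 'matches whole-frame result:', np.allclose(local, reference))
```

With 1, 2, or 4 chunks no group is cut by a chunk boundary, so the chunked result equals the whole-frame one. With 100 chunks every row is its own group and the value collapses to 0 or 1. The values reported above (0.45 is a multiple of 1/20, and the 100-partition output contains only 0 and 1) are at least consistent with that pattern, though that alone does not prove the cause.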

Thanks!

Source

2016-02-06 Bhage

My first thought is that dask's groupby/apply may not guarantee the order of the results, but they are probably all there. – shoyer

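Yes, I wondered about that, but I took various unique slices, and the results within a group differ as the number of partitions increases. For example, a single unique 'grp1/grp2' group would contain 2 different values. – Bhage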

Is there any solution to this problem? – codingknob

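I'm surprised by this result. Groupby.apply should return the same result regardless of the number of partitions. If you can provide a reproducible example, I encourage you to raise an issue, and one of the developers will take a look.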
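When putting together such a reproducible example, a small order-independent check may help separate a row-ordering difference from a genuinely wrong value. The following is a minimal sketch in plain pandas: `inconsistent_groups` is a hypothetical helper, and the `good` and `bad` frames are made-up stand-ins for a computed result, not output from dask.

```python
import pandas as pd

def inconsistent_groups(result, keys=('Grp1', 'Grp2'), col='New_Col'):
    # Count distinct values of `col` per group; a per-group statistic
    # should have exactly one, whatever order the rows come back in.
    distinct = result.groupby(list(keys))[col].nunique()
    return distinct[distinct > 1]

# Made-up stand-ins for a computed result (values are illustrative only).
good = pd.DataFrame({
    'Grp1': [1, 1, 2, 2],
    'Grp2': [3, 3, 5, 5],
    'New_Col': [0.48, 0.48, 0.52, 0.52],
})
bad = good.copy()
bad.loc[1, 'New_Col'] = 0.45  # one row disagrees with the rest of its group

print(inconsistent_groups(good))  # no groups flagged
print(inconsistent_groups(bad))   # flags the (1, 3) group
```

Running the same check on the computed result for each partition count, with a fixed random seed, would show whether a group really carries more than one value.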

Source

2016-02-06 15:06:59 MRocklin
