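When I apply join to a dask dataframe generated by the .from_delayed method, I get unexpected results. I want to demonstrate this with the example below, which consists of three parts. The result of .join on dask dataframes appears to depend on how the dask dataframe was generated:
- Generate a dask dataframe via the from_delayed method and join it with a dask dataframe generated via from_pandas.
- Convert both dataframes to pandas with the compute method and join the resulting pandas dataframes.
- Convert the dask dataframe generated via the from_delayed method to pandas using compute, convert it back to dask using from_pandas, and then join it with the from_pandas dataframe from the first part.
Consider the following code: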
import dask
import dask.dataframe
import pandas as pd

# functions for generating a dask dataframe
def get_pdf(character):
    '''constructs a pandas dataframe with indexes [character]0, ..., [character]4'''
    index = [character + str(i) for i in range(5)]
    return pd.DataFrame({'A': [1, 2, 3, 4, 5]}, index=index)

def get_ddf():
    '''constructs a dask dataframe out of pandas dataframes via the from_delayed method, with indexes A0, A1, ..., F3, F4'''
    delayed_list = [dask.delayed(get_pdf)(x) for x in 'ABCDEF']
    return dask.dataframe.from_delayed(delayed_list)

# generate the dask dataframes that will be joined
ddf1 = get_ddf()
ddf2 = dask.dataframe.from_pandas(pd.DataFrame({'B': [1, 2, 3]}, index=['A0', 'B1', 'C3']), npartitions=2)

# recreate ddf1 by converting it to a pandas dataframe and then back to a dask dataframe
ddf1_from_pandas = dask.dataframe.from_pandas(ddf1.compute(), npartitions=3)

# compute joins
dask_from_delayed_join = ddf1.join(ddf2, how='inner')
pandas_join = ddf1.compute().join(ddf2.compute(), how='inner')
dask_from_pandas_join = ddf1_from_pandas.join(ddf2, how='inner')
I would expect all three results (dask_from_delayed_join, pandas_join, dask_from_pandas_join) to be identical.
However, the first result differs from the other two:
print(dask_from_delayed_join.compute()):
Empty DataFrame
Columns: [A, B]
Index: []
print(pandas_join):
    A  B
A0  1  1
B1  2  2
C3  4  3
print(dask_from_pandas_join.compute()):
    A  B
A0  1  1
B1  2  2
C3  4  3
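To make "identical" concrete, the comparison I have in mind can be sketched as follows. This is only a minimal sketch: frames_match is a hypothetical helper name, not part of dask or pandas, and the two frames are stand-ins built by hand to mirror the printed outputs above rather than the live join results.

```python
import pandas as pd

def frames_match(left, right):
    """Return True if two pandas DataFrames hold the same rows, ignoring row order.

    Hypothetical helper for this question; dtypes are not compared.
    """
    try:
        pd.testing.assert_frame_equal(
            left.sort_index(), right.sort_index(), check_dtype=False
        )
    except AssertionError:
        return False
    return True

# Stand-in for the result printed by the pandas join and the from_pandas join
expected = pd.DataFrame({'A': [1, 2, 4], 'B': [1, 2, 3]}, index=['A0', 'B1', 'C3'])

# Stand-in for the empty result printed by the from_delayed join
empty = pd.DataFrame({'A': [], 'B': []})

same_rows_reordered = frames_match(expected, expected.iloc[::-1])
empty_vs_expected = frames_match(expected, empty)
print(same_rows_reordered, empty_vs_expected)
```

With the real objects, the same helper would be called on pandas_join and on the computed dask results, for example frames_match(pandas_join, dask_from_delayed_join.compute()).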
What is going on here?
I'm looking into this now, by the way. I hope to have an answer within a day or three. – MRocklin