Compare 2 pandas DataFrames, row by row, cell by cell

I have 2 DataFrames, df1 and df2, and want to do the following, storing the results in df3:

for each row in df1:
    for each row in df2:
        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results
        for each cell (column) in df1:
            for the cell in df2 whose column name is the same as for the cell in df1:
                compare the cells (using some comparing function func(a,b)) and,
                depending on the result of the comparison, write the result into the
                appropriate column of the "df1-1, df2-1" row of df3

For example, something like:

df1
A    B     C       D
foo  bar   foobar  7
gee  whiz  herp    10

df2
A    B    C       D
zoo  car  foobar  8

df3
df1-df2  A              B               C                    D
foo-zoo  func(foo,zoo)  func(bar,car)   func(foobar,foobar)  func(7,8)
gee-zoo  func(gee,zoo)  func(whiz,car)  func(herp,foobar)    func(10,8)

I have started with this:

for r1 in df1.iterrows():
    for r2 in df2.iterrows():
        for c1 in r1:
            for c2 in r2:

but I am not sure how to proceed and would appreciate some help.

Source

2016-09-12 Zubo

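Since you are applying func to columns with the same name, could you iterate over the columns only and use vectorization, e.g. df3['A'] = func(df1['A'], df2['A']), and so on? – StarFox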

@StarFox Interesting, so I could do something like: for column in df3: df3[column] = func(df1[column], df2[column])? – Zubo

Sure! That is the power of pandas/numpy (vectorization, in general). I will provide some examples below, and we can go from there. – StarFox

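So, to continue the discussion from the comments: you can use vectorization, which is one of the selling points of libraries like pandas or numpy. Ideally, you should never need to call iterrows(). To make my suggestion a bit more explicit: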

# with df1 and df2 provided as above, an example 
df3 = df1['A'] * 3 + df2['A'] 

# recall that df2 only has the one row, so pandas aligns on the index
# and fills the unmatched row with NaN
df3
0    foofoofoozoo
1             NaN
Name: A, dtype: object

# more generally
import pandas as pd

# we know that df1 and df2 share column names, so we can initialize df3 with those names
df3 = pd.DataFrame(columns=df1.columns)
for colName in df1:
    # func is your comparison function; it receives two whole columns (Series)
    df3[colName] = func(df1[colName], df2[colName])
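To make the loop above concrete, here is a minimal, self-contained sketch using the sample data from the question. The comparison function `func` is a placeholder of my own (element-wise equality), not something the question specifies. Because df2 has a single row, the sketch passes that row's scalar value to `func`, so it is compared against every row of df1 instead of relying on index alignment (which would leave NaN in the second row).

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["foo", "gee"],
    "B": ["bar", "whiz"],
    "C": ["foobar", "herp"],
    "D": [7, 10],
})
df2 = pd.DataFrame({"A": ["zoo"], "B": ["car"], "C": ["foobar"], "D": [8]})


def func(a, b):
    # Hypothetical comparison: element-wise equality (True where values match).
    return a == b


df3 = pd.DataFrame(index=df1.index)
for col_name in df1.columns:
    # df2 has exactly one row here, so take its scalar value
    # and compare it with every row of df1.
    df3[col_name] = func(df1[col_name], df2[col_name].iloc[0])

print(df3)
```

With this placeholder `func`, only column C of the first row matches ("foobar" in both frames).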

Now, you can even apply different functions to different columns by, say, creating lambda functions and then zipping them with the column names:

# some example functions 
colAFunc = lambda x, y: x + y 
colBFunc = lambda x, y: x - y
.... 
columnFunctions = [colAFunc, colBFunc, ...] 

# initialize df3 as above 
df3 = pd.DataFrame(columns=df1.columns) 
for func, colName in zip(columnFunctions, df1.columns): 
    df3[colName] = func(df1[colName], df2[colName])
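The zip-based pairing above depends on the functions being listed in the same order as the columns. As a possible alternative, the sketch below keys the functions by column name in a dict. The sample data and the two functions are assumptions made for illustration, and df2 is given two rows here so that both frames line up on the index.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["foo", "gee"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo", "car"], "D": [8, 3]})

# Hypothetical per-column functions, keyed by column name.
column_functions = {
    "A": lambda x, y: x + "-" + y,  # join the two strings
    "D": lambda x, y: x - y,        # numeric difference
}

df3 = pd.DataFrame(index=df1.index)
for col_name, col_func in column_functions.items():
    df3[col_name] = col_func(df1[col_name], df2[col_name])

print(df3)
```

Looking functions up by name means a reordered or missing column shows up as a KeyError rather than a function silently being applied to the wrong column.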

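The only "gotcha" that comes to mind is that you need to make sure your function is applicable to the data in your columns. For example, if you were to do something like df1['A'] - df2['A'] (with df1 and df2 as you have provided them), that would raise an error (a TypeError in current pandas versions), because subtracting two strings is undefined. Just something to be aware of.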
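One way to guard against that kind of error is to check the column dtypes before applying a numeric operation. The sketch below is only an illustration: the sample data is made up (df2 has two rows so the frames align), and writing NaN for non-numeric columns is an assumption about what the fallback should be.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

df1 = pd.DataFrame({"A": ["foo", "gee"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo", "zoo"], "D": [8, 8]})

diff = pd.DataFrame(index=df1.index)
for col_name in df1.columns:
    if is_numeric_dtype(df1[col_name]) and is_numeric_dtype(df2[col_name]):
        # Subtraction is defined for numeric columns.
        diff[col_name] = df1[col_name] - df2[col_name]
    else:
        # Assumption: mark non-numeric columns as missing instead of raising.
        diff[col_name] = np.nan

print(diff)
```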

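Edit, re: your comment: that is doable as well. Iterate over whichever dfX.columns is larger, so that you do not run into a KeyError, and throw an if statement in there: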

# all the other jazz 
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']] 
# so iterate over df2 columns 
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan  # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])
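The question's pseudocode pairs every row of df1 with every row of df2, which the column loops above do not cover once df2 has more than one row. One possible way to get that pairing, which is not part of the original answer, is a cross join via `DataFrame.merge(how="cross")` (available in pandas 1.2 and later). In this sketch the sample data, the `_1`/`_2` suffixes, the row labels built from column A, and the equality-based `func` are all assumptions chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["foo", "gee"], "D": [7, 10]})
df2 = pd.DataFrame({"A": ["zoo", "moo"], "D": [8, 10]})


def func(a, b):
    # Hypothetical comparison: element-wise equality.
    return a == b


# Pair every row of df1 with every row of df2 (requires pandas >= 1.2).
pairs = df1.merge(df2, how="cross", suffixes=("_1", "_2"))

# Label each result row "df1 value-df2 value", as in the question's df3.
df3 = pd.DataFrame(index=pairs["A_1"] + "-" + pairs["A_2"])
for col_name in df1.columns:
    # .to_numpy() assigns by position instead of aligning on df3's string index.
    df3[col_name] = func(pairs[col_name + "_1"], pairs[col_name + "_2"]).to_numpy()

print(df3)
```

The cross join produces len(df1) * len(df2) rows, so this is only practical when that product fits comfortably in memory.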

Source

2016-09-12 21:12:13 StarFox

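Yes, this is very useful, and I have accepted it as the answer. Thanks a lot for taking the time! Could this be modified for the case where the number of columns is not equal? That is, there might be columns in df1 that do not exist in df2; the comparison function should then just output something like N/A. – Zubo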
