How to compare and then concatenate information from two different rows using Python pandas DataFrames

I have written this code to concatenate information from two different rows:

import pandas as pd 
import numpy as np 

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']), 
    'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']), 
    'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']), 
    'Z' : pd.Series(['First',' ','Last','First',' ','Last'], ['1','2','3','4','5','6'])} 

output_table = pd.DataFrame(input_table) 

output_table['Previous_Y'] = output_table['Y'] 

output_table.Previous_Y = output_table.Previous_Y.shift(1) 

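def Calc_flowpath(x):
    # The first step of an order starts a new path
    if x['Z'] == 'First':
        return x['Y']
    # Otherwise join the previous row's step with the current one
    else:
        return x['Previous_Y'] + x['Y']

output_table['Flowpath'] = output_table.apply(Calc_flowpath, axis=1)

print(output_table)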

And my output is (as expected):

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B       BC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       EE

However, what I want the Flowpath function to do is:

If Column Z is "First", Flowpath = Column Y

If Column Z is anything else, Flowpath = Previous Flowpath value + Column Y

Unless Column Y repeats the same value, in which case skip that row.
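
The three rules above can be sketched in plain Python before bringing pandas into it. This is only an illustration: the function name `build_flowpaths` and its two list arguments are assumptions made for the example, not part of the original code.

```python
def build_flowpaths(steps, markers):
    """Return the cumulative flow path for each row.

    steps   -- manufacturing steps (column Y), in row order
    markers -- 'First' / 'Last' / blank markers (column Z), same length
    """
    paths = []
    prev_step = None   # step seen on the previous row
    current = ""       # flow path accumulated so far for this order
    for step, marker in zip(steps, markers):
        if marker == "First":
            current = step            # rule 1: a new order starts here
        elif step != prev_step:
            current = current + step  # rule 2: extend the previous path
        # rule 3: a repeated step leaves the path unchanged
        paths.append(current)
        prev_step = step
    return paths


print(build_flowpaths(list("ABCDEE"), ["First", " ", "Last", "First", " ", "Last"]))
# ['A', 'AB', 'ABC', 'D', 'DE', 'DE']
```

The returned list has one entry per row, so it could be assigned directly to a new DataFrame column.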

My target output is:

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

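To give some context, these rows are steps in a manufacturing process, and I am trying to describe the path that material takes through a job shop. My data is a large number of customer orders and every step they went through in the manufacturing process. Y is the manufacturing step, and column Z marks the first and last step of each order. I am using KNIME for the analysis, but I couldn't find a node that does this, so I am trying to write a Python script myself, even though I am new to programming (as you can probably see). In my previous job I would have done this in Alteryx with a multi-row node, but I no longer have access to that software. I have spent a lot of time reading the pandas documentation, and I feel the solution is some combination of DataFrame.loc, DataFrame.shift, or DataFrame.cumsum, but I can't figure it out.

Any help would be greatly appreciated.

Source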

2016-08-13 user1673510

I encourage you to accept @Psidom's answer: it does exactly what you want, and in a very elegant way, certainly the most "pandorable" one.

Iterate over the rows of the DataFrame and set the value of the Flowpath column following the logic you outlined in the OP.

import pandas as pd

output_table = pd.DataFrame({'W': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
                             'X': [7., 8., 9., 10., 11., 12.],
                             'Y': ['A', 'B', 'C', 'D', 'E', 'E'],
                             'Z': ['First', ' ', 'Last', 'First', ' ', 'Last']},
                            index=range(1, 7))

output_table['Flowpath'] = ''

for idx in output_table.index:
    this_Z = output_table.loc[idx, 'Z']
    this_Y = output_table.loc[idx, 'Y']
    # The first row has no predecessor, so fall back to empty strings
    last_Y = output_table.loc[idx-1, 'Y'] if idx > 1 else ''
    last_Flowpath = output_table.loc[idx-1, 'Flowpath'] if idx > 1 else ''

    if this_Z == 'First':
        # A new order starts: the path is just this step
        output_table.loc[idx, 'Flowpath'] = this_Y
    elif this_Y != last_Y:
        # A new step: append it to the previous path
        output_table.loc[idx, 'Flowpath'] = last_Flowpath + this_Y
    else:
        # A repeated step: carry the previous path forward unchanged
        output_table.loc[idx, 'Flowpath'] = last_Flowpath

Source

2016-08-13 16:56:53

Bad things will happen if Z['1'] != 'First', but for your case this works. I understand you wanted something more pandas-ish, so I apologize that this answer is very plain Python...

import pandas as pd 
import numpy as np 

input_table = {'W' : pd.Series([1.1,2.1,3.1,4.1,5.1,6.1], index = ['1','2','3','4','5','6']), 
    'X' : pd.Series([7.,8.,9.,10.,11.,12.], index = ['1','2','3','4','5','6']), 
    'Y' : pd.Series(['A','B','C','D','E','E'], index = ['1','2','3','4','5','6']), 
    'Z' : pd.Series(['First',' ','Last','First',' ','Last'], index =['1','2','3','4','5','6'])} 

# Flowpath values are filled in row by row
ret = pd.Series([None,None,None,None,None,None], index = ['1','2','3','4','5','6'])
for k in [str(n) for n in range(1,7)]:
    prev = str(int(k)-1)  # index label of the previous row
    if input_table['Z'][k] == 'First':
        op = input_table['Y'][k]
    elif input_table['Y'][k] == input_table['Y'][prev]:
        op = ret[prev]  # repeated step: keep the previous path
    else:
        op = ret[prev] + input_table['Y'][k]
    ret[k] = op

input_table['Flowpath'] = ret
output_table = pd.DataFrame(input_table)
print(output_table)

This yields:

  Flowpath    W   X  Y      Z
1        A  1.1   7  A  First
2       AB  2.1   8  B
3      ABC  3.1   9  C   Last
4        D  4.1  10  D  First
5       DE  5.1  11  E
6       DE  6.1  12  E   Last
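
As noted in this answer, the loop looks up the previous row by index label, so it breaks when the first row is not marked 'First'. One possible way around that, sketched here under assumptions, is to carry the previous step and previous path in ordinary variables instead of looking them up. The sample frame below, which deliberately lacks a leading 'First' row, and the variable names are made up for the illustration.

```python
import pandas as pd

# Hypothetical data: the first row is NOT marked 'First'
df = pd.DataFrame(
    {"Y": ["B", "C", "D", "E", "E"],
     "Z": [" ", "Last", "First", " ", "Last"]},
    index=["2", "3", "4", "5", "6"],
)

flowpaths = []
prev_y, prev_path = None, ""   # nothing precedes the first row
for y, z in zip(df["Y"], df["Z"]):
    if z == "First":
        path = y               # a new order starts
    elif y == prev_y:
        path = prev_path       # repeated step: keep the previous path
    else:
        path = prev_path + y   # new step: extend the previous path
    flowpaths.append(path)
    prev_y, prev_path = y, path

df["Flowpath"] = flowpaths
print(df)
```

Because no index arithmetic is involved, this version also does not depend on the index labels being consecutive numbers.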

Source

2016-08-13 17:04:54 kpie

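You can calculate a group variable with cumsum on the condition vector where Z is 'First', which takes care of the first and second conditions. Then replace every value that is the same as the previous one with an empty string, so that a cumsum on the Y column gives the expected output: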

import pandas as pd 
# calculate the group variable
grp = (output_table.Z == "First").cumsum() 

# calculate a condition vector where the current Y column is the same as the previous one 
dup = output_table.Y.groupby(grp).apply(lambda g: g.shift() != g) 

# replace the duplicated process in Y as empty string, group the column by the group variable 
# calculated above and then do a cumulative sum 
output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).cumsum() 

output_table 

#      W   X  Y      Z flowPath
# 1  1.1   7  A  First        A
# 2  2.1   8  B              AB
# 3  3.1   9  C   Last      ABC
# 4  4.1  10  D  First        D
# 5  5.1  11  E              DE
# 6  6.1  12  E   Last       DE

Update: The code above works under 0.15.2 but not under 0.18.1; the following small adjustment to the last line saves it:

output_table['flowPath'] = output_table.Y.where(dup, "").groupby(grp).apply(pd.Series.cumsum)
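
Since the behaviour of a grouped cumsum on strings has varied between pandas versions, here is a sketch of the same grouping idea that avoids string cumsum altogether and builds the running concatenation with itertools.accumulate. It is an assumption-laden variant, not the answer's own code: `running_concat` is a hypothetical helper, not a pandas function, and the small frame is rebuilt so the example stands alone.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame(
    {"Y": ["A", "B", "C", "D", "E", "E"],
     "Z": ["First", " ", "Last", "First", " ", "Last"]},
    index=["1", "2", "3", "4", "5", "6"],
)

# One group id per order: the counter increases at every 'First' row.
grp = (df["Z"] == "First").cumsum()

# True where Y differs from the previous row of the same order
# (the first row of an order has no previous row, so it counts as new).
prev_y = df.groupby(grp)["Y"].shift()
is_new = prev_y.isna() | (df["Y"] != prev_y)


def running_concat(s):
    # Hypothetical helper: running string concatenation within one group.
    return pd.Series(list(accumulate(s)), index=s.index)


# Blank out repeated steps, then concatenate cumulatively within each order.
df["Flowpath"] = df["Y"].where(is_new, "").groupby(grp).transform(running_concat)
print(df)
```

Using transform keeps the result aligned with the original index, so it can be assigned straight back as a column.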

Source

2016-08-13 17:07:33 Psidom

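Beautiful logic! Unfortunately, the last '.cumsum()' raises a 'DataError: No numeric types to aggregate' in 0.18.1.

@AlbertoGarcia-Raboso It looks like the API has changed a bit; I just updated the answer with a method that works in '0.18.1'. Thanks. – Psidom

My pleasure. I think this may be a bug. I'll report it on GitHub.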

# DataFrame.set_value was removed in pandas 1.0; .at is its replacement
for index, row in output_table.iterrows():
    prev_index = str(int(index) - 1)
    if row['Z'] == 'First':
        output_table.at[index, 'Flowpath'] = row['Y']
    elif output_table['Y'][prev_index] == row['Y']:
        output_table.at[index, 'Flowpath'] = output_table['Flowpath'][prev_index]
    else:
        output_table.at[index, 'Flowpath'] = output_table['Flowpath'][prev_index] + row['Y']

print(output_table)

     W     X  Y      Z Previous_Y Flowpath
1  1.1   7.0  A  First        NaN        A
2  2.1   8.0  B                 A       AB
3  3.1   9.0  C   Last          B      ABC
4  4.1  10.0  D  First          C        D
5  5.1  11.0  E                 D       DE
6  6.1  12.0  E   Last          E       DE

Source

2016-08-13 18:25:47
