Looping an IF statement over each row in a pandas DataFrame

Hi, I'm new to pandas, coming from a SAS background, and I'm trying to split a continuous variable into bands using the following code.

var_range = df['BILL_AMT1'].max() - df['BILL_AMT1'].min() 
a= 10 
for i in range(1,a): 
    inc = var_range/a 
    lower_bound = df['BILL_AMT1'].min() + (i-1)*inc 
    print('Lower bound is '+str(lower_bound)) 
    upper_bound = df['BILL_AMT1'].max() + (i)*inc 
    print('Upper bound is '+str(upper_bound)) 
    if (lower_bound <= df['BILL_AMT1'] < upper_bound): 
     df['bill_class'] = i 
    i+=1

I expected the code to check whether each df['BILL_AMT1'] value falls within the bounds of the current loop iteration and to set df['bill_class'] accordingly.

I get the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

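I think the if condition is evaluated correctly, and that the error comes from assigning the loop counter's value to the new column.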

Can anyone explain what is going wrong or suggest an alternative?

Source

2016-11-28 Luke Nisbet

To avoid the ValueError, change

if (lower_bound <= df['BILL_AMT1'] < upper_bound): 
    df['bill_class'] = i

to

mask = (lower_bound <= df['BILL_AMT1']) & (df['BILL_AMT1'] < upper_bound) 
df.loc[mask, 'bill_class'] = i

The chained comparison (lower_bound <= df['BILL_AMT1'] < upper_bound) is equivalent to

(lower_bound <= df['BILL_AMT1']) and (df['BILL_AMT1'] < upper_bound)

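The and operator causes the two boolean Series, (lower_bound <= df['BILL_AMT1']) and (df['BILL_AMT1'] < upper_bound), to be evaluated in a boolean context, i.e. reduced to a single boolean value. pandas refuses to reduce a Series to a single boolean value, which is why the ValueError is raised by the if condition itself, not by the column assignment.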

Instead, to get back a boolean Series, use the & operator rather than and:

mask = (lower_bound <= df['BILL_AMT1']) & (df['BILL_AMT1'] < upper_bound)

and then, to assign values to the bill_class column only where mask is True, use df.loc:

df.loc[mask, 'bill_class'] = i
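
The difference between the two forms can be reproduced on a small frame. This is only an illustrative sketch: the column name comes from the question, while the values and the two bounds are made up.

```python
import pandas as pd

# Toy data; the column name matches the question, the values are invented.
df = pd.DataFrame({'BILL_AMT1': [0, 25, 50, 75, 100]})
lower_bound, upper_bound = 20, 80  # hypothetical bounds for one band

# Chained comparison: Python inserts an implicit `and`,
# which calls bool() on a whole Series and raises ValueError.
try:
    if lower_bound <= df['BILL_AMT1'] < upper_bound:
        pass
    raised = False
except ValueError:
    raised = True

# Element-wise version: `&` combines the two boolean Series row by row.
mask = (lower_bound <= df['BILL_AMT1']) & (df['BILL_AMT1'] < upper_bound)

# Assign only where mask is True; rows outside the band stay NaN.
df.loc[mask, 'bill_class'] = 1
print(df)
```

Note that the parentheses around each comparison are required, because & binds more tightly than <= and <.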

To bin the data in df['BILL_AMT1'], you can drop the Python for-loop entirely and, as DSM suggests, use pd.cut:

df['bill_class'] = pd.cut(df['BILL_AMT1'], bins=10, labels=False)+1
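
As a rough illustration of what that one-liner produces, here is the same call on made-up values (only the column name is taken from the question). With bins=10, pd.cut builds ten equal-width intervals between the minimum and the maximum, and labels=False returns the integer position of each interval instead of an interval label.

```python
import pandas as pd

# Invented values spanning 0..100, so each band is 10 wide.
df = pd.DataFrame({'BILL_AMT1': [0, 5, 15, 25, 55, 95, 100]})

# labels=False yields bin positions 0-9; adding 1 shifts them to 1-10.
df['bill_class'] = pd.cut(df['BILL_AMT1'], bins=10, labels=False) + 1
print(df)
```

By default pd.cut closes intervals on the right (right=True), so a value sitting exactly on an interior edge goes to the lower band, whereas the loop in the question uses lower <= x < upper. Passing right=False is one way to get closer to the loop's convention.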

Source

2016-11-28 22:11:49 unutbu

@DSM: Yes, that was entirely my mistake. – unutbu

Much better. :-) Though we should probably recommend a vectorized approach (whether pd.cut or np.digitize; I see you already have at least one pd.cut answer to reference..) – DSM

Thanks. I ended up using the method @DMS suggested, because I don't really understand .loc and the masking stuff yet. –

IIUC, this should fix your code:

mx, mn = df['BILL_AMT1'].max(), df['BILL_AMT1'].min() 
rng = mx - mn 
a = 10 

for i in range(a): 
    inc = rng/a 
    lower_bound = mn + i * inc 
    print('Lower bound is ' + str(lower_bound)) 
    upper_bound = mn + (i + 1) * inc if i + 1 < a else mx 
    print('Upper bound is ' + str(upper_bound)) 
    ge = df['BILL_AMT1'].ge(lower_bound) 
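    # Close the last band on the right; with a strict < the maximum value would stay unclassified (NaN)
    lt = df['BILL_AMT1'].lt(upper_bound) if i + 1 < a else df['BILL_AMT1'].le(upper_bound)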
    df.loc[ge & lt, 'bill_class'] = i

However, I would do this instead:

df['bill_class'] = pd.qcut(df['BILL_AMT1'], 10, labels=list(range(10)))
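
One thing worth keeping in mind is that pd.qcut is not a drop-in replacement for the loop: it splits at quantiles, so the bands hold roughly equal numbers of rows rather than covering equal widths of the value range. A small sketch with made-up, skewed data (two bands instead of ten, to keep it readable) shows the difference:

```python
import pandas as pd

# Skewed toy data: nine small values and one very large one.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Equal-width: the value range 1..1000 is cut in half, so only 1000 lands in the top band.
equal_width = pd.cut(s, bins=2, labels=False)

# Equal-frequency: the split is at the median, so each band gets five values.
equal_count = pd.qcut(s, 2, labels=False)

print(pd.DataFrame({'value': s, 'cut': equal_width, 'qcut': equal_count}))
```

Which of the two is appropriate depends on whether the bands are meant to be equal in width, as in the original loop, or equal in population.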

Source

2016-11-28 22:35:00 piRSquared
