2016-05-24 113 views
1

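Pandas element-wise comparison and create selection

In a DataFrame, I want to compare the elements of a column with a value and put the elements that pass the comparison into a new column.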

import pandas

df = pandas.DataFrame([{'A':3,'B':10},
                       {'A':2,'B':30},
                       {'A':1,'B':20},
                       {'A':2,'B':15},
                       {'A':2,'B':100}])

df['C'] = [x for x in df['B'] if x > 18] 

I can't figure out what is wrong or why I get:

ValueError: Length of values does not match length of index

Answers

2

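As Darren said, all columns in a DataFrame must have the same length.

If you print [x for x in df['B'] if x > 18], you get only three values: [30, 20, 100]. But the DataFrame has five index entries/rows. That is why you get the "Length of values does not match length of index" error.

You can change the code as follows: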
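The mismatch can be reproduced in a few lines. This is only a minimal sketch using the question's data; the variable name `filtered` is illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

# The filtering comprehension drops every value that fails the test,
# so the resulting list is shorter than the DataFrame.
filtered = [x for x in df['B'] if x > 18]
print(len(filtered), len(df))

# Assigning the shorter list as a column raises the error from the question.
try:
    df['C'] = filtered
except ValueError as exc:
    print(exc)
```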

df['C'] = [x if x > 18 else None for x in df['B']]
print(df)

You will get:

   A    B      C
0  3   10    NaN
1  2   30   30.0
2  1   20   20.0
3  2   15    NaN
4  2  100  100.0
2

I think you can use loc with boolean indexing:

print(df)
   A    B
0  3   10
1  2   30
2  1   20
3  2   15
4  2  100

print(df['B'] > 18)
0    False
1     True
2     True
3    False
4     True
Name: B, dtype: bool

df.loc[df['B'] > 18, 'C'] = df['B']
print(df)
   A    B      C
0  3   10    NaN
1  2   30   30.0
2  1   20   20.0
3  2   15    NaN
4  2  100  100.0
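The loc assignment works because pandas aligns the right-hand Series with the selected rows by index label, and rows outside the selection are left as NaN. A minimal sketch of the same idea with the condition stored in a variable (the name `mask` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

# Boolean Series: True for the rows that pass the comparison.
mask = df['B'] > 18

# Only the masked rows of the new column C are written;
# the remaining rows of C stay NaN.
df.loc[mask, 'C'] = df.loc[mask, 'B']
print(df)
```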

If you need to select rows by a condition, use boolean indexing:

print(df[df['B'] > 18])
   A    B
1  2   30
2  1   20
4  2  100

If you need something faster, you can use where:

df['C'] = df.B.where(df['B'] > 18) 
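By default, Series.where keeps the values where the condition holds and puts NaN everywhere else; its `other` argument sets a different fill value. A small sketch, assuming you would rather have 0 than NaN in the rows that fail the comparison (that choice is an assumption, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

# Keep B where B > 18, otherwise fill with 0 instead of the default NaN.
df['C'] = df['B'].where(df['B'] > 18, other=0)
print(df)
```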

Timings (len(df)=50k):

In [1367]: %timeit (a(df)) 
The slowest run took 8.34 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 1.14 ms per loop 

In [1368]: %timeit (b(df1)) 
100 loops, best of 3: 15.5 ms per loop 

In [1369]: %timeit (c(df2)) 
100 loops, best of 3: 2.93 ms per loop 

Code for timings:

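import pandas as pd

df = pd.DataFrame([{'A':3,'B':10},
                   {'A':2,'B':30},
                   {'A':1,'B':20},
                   {'A':2,'B':15},
                   {'A':2,'B':100}])
print(df)
# repeat the 5 rows 10000 times to get 50k rows
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()

def a(df):
    # vectorized where: keep B where the condition holds, NaN elsewhere
    df['C'] = df.B.where(df['B'] > 18)
    return df

def b(df1):
    # list comprehension; operate on the argument, not the global df
    df1['C'] = [x if x > 18 else None for x in df1['B']]
    return df1

def c(df2):
    # loc with a boolean mask; operate on the argument, not the global df
    df2.loc[df2['B'] > 18, 'C'] = df2['B']
    return df2

print(a(df))
print(b(df1))
print(c(df2))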
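Outside IPython, where %timeit is not available, a similar comparison can be sketched with the standard timeit module. The function names below are hypothetical, each one copies its input so the runs do not affect each other (so the copy cost is included in every measurement), and absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})
big = pd.concat([base] * 10000).reset_index(drop=True)  # 50k rows

def with_where(frame):
    out = frame.copy()
    out['C'] = out['B'].where(out['B'] > 18)
    return out

def with_list(frame):
    out = frame.copy()
    out['C'] = [x if x > 18 else None for x in out['B']]
    return out

def with_loc(frame):
    out = frame.copy()
    out.loc[out['B'] > 18, 'C'] = out['B']
    return out

# Average seconds per call over 10 calls for each approach.
for func in (with_where, with_list, with_loc):
    seconds = timeit.timeit(lambda: func(big), number=10)
    print(func.__name__, seconds / 10)
```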
+0

I added a new, faster method, please check it. Thanks. – jezrael

0

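All columns in a DataFrame must be the same length. Because you filter out some values, you are trying to insert fewer values into column C than there are in columns A and B.

So your two options are either to start a new object for C: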

dfC = [x for x in df['B'] if x > 18] 

or to put some dummy value in the column when x is not greater than 18. For example:

df['C'] = np.where(df['B'] > 18, True, False) 

Or even:

df['C'] = np.where(df['B'] > 18, 'Yay', 'Nay') 
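Both np.where snippets need NumPy imported as np. A self-contained sketch that runs the two variants side by side; the column name `label` is illustrative and not from the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

# np.where(condition, value_if_true, value_if_false) returns an array
# with one entry per row, so it always matches the length of the index.
df['C'] = np.where(df['B'] > 18, True, False)
df['label'] = np.where(df['B'] > 18, 'Yay', 'Nay')
print(df)
```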

P.S. See also: Pandas conditional creation of a series/dataframe column for other approaches.