2017-04-05 84 views
1

Sorting DataFrame values by string

I have a DataFrame that looks like this:

Name Net Worth 
A  100M 
B  200M 
C  5M 
D  40M 
E  10B 
F  2B 

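I want to sort by the Net Worth column. What is the most efficient way to sort values in this form? M stands for million and B for billion, so 10B is the highest value.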

Answers

2

You can use replace to build a new sorted Series, then reindex the original DataFrame by its index:

d = {'M': '0'*6, 'B': '0'*9} 
s = df['Net Worth'].replace(d, regex=True).astype(float).sort_values(ascending=False) 
print (df.reindex(s.index)) 
    Name Net Worth 
4 E  10B 
5 F  2B 
1 B  200M 
0 A  100M 
3 D  40M 
2 C  5M 

A more general solution, if there are floats in the data:

print (df) 
    Name Net Worth 
0 A   1 
1 B  200M 
2 C  5M 
3 D  40M 
4 E  1.0B 
5 F  2B 

# dict of multipliers
d = {'M': 10**6, 'B': 10**9}
# all keys of the dict separated by | (regex "or")
k = '|'.join(d.keys())

# replace by dict
a = df['Net Worth'].replace(d, regex=True).astype(float)
# remove M, B
b = df['Net Worth'].replace([k], '', regex=True).astype(float)
# multiply together, then sort
s = a.mul(b).sort_values(ascending=False)
# reindex to get the original rows in sorted order
print (df.reindex(s.index))
    Name Net Worth 
5 F  2B 
4 E  1.0B 
1 B  200M 
3 D  40M 
2 C  5M 
0 A   1 

Another similar solution using extract:

# dict of multipliers
_prefix = {'k': 1e3, # thousand
      'M': 1e6, # million
      'B': 1e9, # billion
}
# all keys of the dict separated by | (regex "or")
k = '|'.join(_prefix.keys())
# extract the number and the suffix into a new df
df1 = df['Net Worth'].str.extract('(?P<a>[0-9.]*)(?P<b>' + k +')*', expand=True)
# convert the numeric column to float
df1.a = df1.a.astype(float)
# map suffixes by the dictionary, replace NaN (no suffix) with 1
df1.b = df1.b.map(_prefix).fillna(1)
# multiply the columns together, then sort
s = df1.a.mul(df1.b).sort_values(ascending=False)
print (s)
# sort the original by reindexing
print (df.reindex(s.index))
    Name Net Worth 
5 F  2B 
4 E  1.0B 
1 B  200M 
3 D  40M 
2 C  5M 
0 A   1 
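
For comparison, the same "number times multiplier" idea can be sketched in plain Python, which makes the parsing step easy to test in isolation. This is only an illustration: the names `MULTIPLIERS`, `PATTERN` and `to_number` are hypothetical and not part of pandas, and the pattern assumes values are a plain decimal number followed by at most one of the suffixes k, M or B.

```python
import re

# Hypothetical multiplier table (an assumption); extend it if other suffixes appear.
MULTIPLIERS = {'k': 1e3, 'M': 1e6, 'B': 1e9}

# A decimal number, optionally followed by one known suffix.
PATTERN = re.compile(r'^\s*([0-9]*\.?[0-9]+)\s*([kMB]?)\s*$')


def to_number(text):
    """Convert strings such as '200M' or '1.0B' to a float."""
    match = PATTERN.match(str(text))
    if match is None:
        raise ValueError(f'unrecognised value: {text!r}')
    number, suffix = match.groups()
    # A missing suffix means the value is taken as-is (multiplier 1).
    return float(number) * MULTIPLIERS.get(suffix, 1)


values = ['1', '200M', '5M', '40M', '1.0B', '2B']
# Sort the original strings by their numeric value, largest first.
ordered = sorted(values, key=to_number, reverse=True)
print(ordered)
```

If your pandas version supports the `key` argument of `sort_values` (added in pandas 1.1), a helper like this could probably be applied directly, for example `df.sort_values('Net Worth', key=lambda col: col.map(to_number), ascending=False)`.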
+0

Thanks! I have a question: what does regex=True do in this case? Is it similar to 'df.str.replace()'?

+0

Yes, it is similar, but it works better with a 'dict'. If you need to replace substrings, you need 'regex=True' – jezrael
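
To make the difference concrete, here is a small sketch comparing the three calls on a toy Series. The sample values and variable names are made up for illustration; the point is that `Series.replace` without `regex=True` only swaps cells that equal the key exactly, while `regex=True` and `Series.str.replace` both work on substrings.

```python
import pandas as pd

s = pd.Series(['100M', '2B', 'M'])

# Default: only cells that are exactly 'M' are replaced.
exact = s.replace({'M': '000000'})

# regex=True: the pattern is substituted wherever it occurs inside a cell.
substring = s.replace({'M': '000000'}, regex=True)

# str.replace always works on substrings, but takes one pattern at a time.
via_str = s.str.replace('M', '000000', regex=False)

print(exact.tolist())      # ['100M', '2B', '000000']
print(substring.tolist())  # ['100000000', '2B', '000000']
print(via_str.tolist())    # ['100000000', '2B', '000000']
```

The dict form is convenient when several suffixes need different replacements in one call, which is why the answer uses `replace` rather than chaining `str.replace`.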