2017-10-06 52 views

My dataset has many columns whose values contain $ signs and commas, for example $150,000.50. Once I import the dataset:

datasets = pd.read_csv('salaries-by-college-type.csv') 

Because those columns hold string values such as $150,000.50, the imputer object fails. How can I correct this in my Python program?

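This is my dataset. Apart from School Type, all the other columns contain $ signs and commas. Is there a generic way to remove the $ signs and commas from the values in these columns?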

School Type       269 non-null object 
Starting Median Salary    269 non-null float64 
Mid-Career Median Salary    269 non-null float64 
Mid-Career 10th Percentile Salary 231 non-null float64 
Mid-Career 25th Percentile Salary 269 non-null float64 
Mid-Career 75th Percentile Salary 269 non-null float64 
Mid-Career 90th Percentile Salary 231 non-null float64 

Here is a sample of my dataset:

School Type  Starting Median Salary  Mid-Career Median Salary  Mid-Career 10th Percentile Salary  Mid-Career 25th Percentile Salary  Mid-Career 75th Percentile Salary  Mid-Career 90th Percentile Salary
Engineering  $72,200.00              $126,000.00               $76,800.00                         $99,200.00                         $168,000.00                        $220,000.00
Engineering  $75,500.00              $123,000.00               N/A                                $104,000.00                        $161,000.00                        N/A
Engineering  $71,800.00              $122,000.00               N/A                                $96,000.00                         $180,000.00                        N/A
Engineering  $62,400.00              $114,000.00               $66,800.00                         $94,300.00                         $143,000.00                        $190,000.00
Engineering  $62,200.00              $114,000.00               N/A                                $80,200.00                         $142,000.00                        N/A
Engineering  $61,000.00              $114,000.00               $80,000.00                         $91,200.00                         $137,000.00                        $180,000.00

df.column = df.column.str.strip('$') –

Thanks... what about the comma in 15,000.50? – Kda

... strip(',') – Fallenreaper
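
A note on the two suggestions above: Python's str.strip only removes the listed characters from the beginning and end of a string, so it leaves the comma inside 15,000.50 alone. A minimal sketch of the difference, using a made-up value rather than anything from the actual file:

```python
value = '$15,000.50'  # illustrative value, not taken from the dataset

# strip() only trims the listed characters from the two ends of the string,
# so the comma in the middle survives.
trimmed = value.strip('$').strip(',')

# replace() removes every occurrence, wherever it appears.
cleaned = value.replace('$', '').replace(',', '')

print(trimmed)         # 15,000.50
print(float(cleaned))  # 15000.5
```

The pandas equivalents behave the same way: .str.strip trims the ends, .str.replace works on the whole string.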

Answer


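Suppose you have a csv that looks like this.
Note: I don't really know what your csv looks like, so make sure to adjust the read_csv arguments accordingly, most specifically the sep parameter.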

h1|h2 
a|$1,000.99 
b|$500,000.00 

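Use the converters parameter of pd.read_csv.
Pass it a dictionary whose keys are the names of the columns you want to convert and whose values are the functions that do the conversion.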

pd.read_csv(
    'salaries-by-college-type.csv', sep='|', 
    converters=dict(h2=lambda x: float(x.strip('$').replace(',', ''))) 
) 

    h1   h2 
0 a 1000.99 
1 b 500000.00 
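
The question's data has several currency columns and some N/A cells, which float() would choke on. Here is a minimal sketch of how the converters idea might be extended to cover that. The helper parse_money, the extra column h3 and the list money_cols are all hypothetical names made up for illustration, and the inline sample stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical sample in the same pipe-separated layout as above,
# with a second currency column and an N/A cell added for illustration.
raw = io.StringIO(
    "h1|h2|h3\n"
    "a|$1,000.99|$76,800.00\n"
    "b|$500,000.00|N/A\n"
)

def parse_money(text):
    """Turn '$1,000.99' into 1000.99; blank or N/A cells become NaN."""
    cleaned = text.strip().replace('$', '').replace(',', '')
    if cleaned in ('', 'N/A'):
        return float('nan')
    return float(cleaned)

money_cols = ['h2', 'h3']  # assumed names of the currency columns

# One converter entry per currency column, built with a dict comprehension.
df = pd.read_csv(raw, sep='|',
                 converters={col: parse_money for col in money_cols})
print(df.dtypes)
```

The converter receives the raw cell text, so the N/A check has to live inside the function rather than relying on read_csv's own missing-value handling.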

Or, assuming you have already imported the dataframe

df = pd.read_csv(
    'salaries-by-college-type.csv', sep='|' 
) 

Then use pd.Series.str.replace

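# regex=True is required on pandas 2.0+, where str.replace matches literally by default
df.h2 = df.h2.str.replace(r'[^\d.]', '', regex=True).astype(float) 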

df 

    h1   h2 
0 a 1000.99 
1 b 500000.00 

Or pd.DataFrame.replace

df.replace(dict(h2=r'[^\d.]'), '', regex=True).astype(dict(h2=float)) 

    h1   h2 
0 a 1000.99 
1 b 500000.00 
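
For the generic case the question asks about (every column except School Type), the same regex can be applied to a whole block of columns at once. This is only a sketch under assumptions: the inline sample is a cut-down, pipe-separated stand-in for the real file, and money_cols is a name introduced here. Note that the pattern also strips minus signs, which should not matter for salaries.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt of the question's data, pipe-separated for clarity.
raw = io.StringIO(
    "School Type|Starting Median Salary|Mid-Career 10th Percentile Salary\n"
    "Engineering|$72,200.00|$76,800.00\n"
    "Engineering|$75,500.00|N/A\n"
)
df = pd.read_csv(raw, sep='|')  # read_csv already treats 'N/A' as missing

# Every column except the label column is assumed to hold currency strings.
money_cols = df.columns.drop('School Type')

# Remove everything except digits and the decimal point, then convert.
# errors='coerce' turns anything that still is not a number into NaN.
df[money_cols] = (
    df[money_cols]
    .replace(r'[^\d.]', '', regex=True)
    .apply(pd.to_numeric, errors='coerce')
)
print(df.dtypes)
```

After this the salary columns are numeric with NaN for the missing cells, which is the form an imputer expects.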
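
Here is my dataset; apart from the first column, all the others contain $ and comma values. How do I get a general fix? – Kda

School Type 269 non-null object; Starting Median Salary 269 non-null float64; Mid-Career Median Salary 269 non-null float64; Mid-Career 10th Percentile Salary 231 non-null float64; Mid-Career 25th Percentile Salary 269 non-null float64; Mid-Career 75th Percentile Salary 269 non-null float64; Mid-Career 90th Percentile Salary 231 non-null float64 – Kda

@Kda You need to edit your question and paste the data there. – piRSquared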