2016-12-17 40 views
0

我在包含‘No data’的csv文件上使用df= df.replace('No data', np.nan)而不是空白/空項,其中沒有數字數據。使用head()方法,我發現替換方法用NaN代替‘No data’條目。當我使用df.info()時,它說每個系列的數據類型都是一個對象。熊貓替換方法和對象數據類型

當我在Excel中打開CSV文件,手動使用find編輯數據並更換來改變‘No data’爲空白/空條目,雖然dataframes看,當我使用df.head()完全一樣的,當我使用df.info()它說,數據類型是浮點數。

我想知道爲什麼會這樣,我怎樣才能讓我的系列的數據類型是浮動的,而不必手動編輯csv文件。

+1

原因是爲列被導入爲對象,因爲'沒有數據'是一個字符串文字。沒有它,列導入爲浮動。替換不會轉換數據類型。所以在替換後用'astype'進行轉換。 – Parfait

回答

0
import pandas as pd 
import numpy as np 

# Create data for columns with strings in it 
column_data = [1,2,3] + ['no data'] 
# Costruct data frame with two columns 
df = pd.DataFrame({'col1':column_data, 'col2':column_data[::-1]}) 
df.info() 

<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 4 entries, 0 to 3 
Data columns (total 2 columns): 
col1 4 non-null object 
col2 4 non-null object 
dtypes: object(2) 
memory usage: 144.0+ bytes 

# Replace 'no data' with Nan 
df_nan = df.replace('no data', np.nan) 
# Set type of all columns to float 
df_result = df_nan.as_type({c:float for c in df_nan.columns}) 
df_result.info() 

<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 4 entries, 0 to 3 
Data columns (total 2 columns): 
col1 3 non-null float64 
col2 3 non-null float64 
dtypes: float64(2) 
memory usage: 144.0 bytes 
2

如果數據的列中的其餘部分是數字那麼你應該使用pd.to_numeric(df, errors='coerce')

0

考慮使用轉換器 ARG在pandas.read_csv(),你通過列號引用轉換函數的字典。下面檢查無數據串並有條件地替換np.nan值,否則保留原樣:

import numpy as np 
import pandas as pd 

c_fct = lambda x : float(x if 'No data' not in x else np.nan) 
convertdict = {1:c_fct, 2:c_fct, 3:c_fct, 4:c_fct, 5:c_fct} 

df = pd.read_csv('Input.csv', converters=convertdict) 

輸入CSV

ID  Col1   Col2   Col3   Col4   Col5 
TGG  0.634516647  0.900464347  0.998505978  0.170422713  0.893340128 
GRI  No data   0.045915333  0.718398939  0.924813864  No data 
NLB  0.921127268  0.614460813  0.677857676  0.343612947  0.559437744 
SEI  0.081852313  No data   0.890816385  0.943313021  0.874857844 
LOY  0.632556715  0.362855866  0.038702448  0.253762859  No data 
OPL  0.375088582  0.268283238  0.761552111  0.589547625  0.192223208 
CTK  0.349464541  0.844718987  No data   0.841439909  0.898093646 
EUE  0.629784261  0.982589843  0.315670377  0.832419474  0.950044814 
JLP  0.543942659  0.988380305  0.417191823  0.823857176  0.542514099 
RHK  0.728053447  0.521816539  0.402523435  No data   0.558226706 
AEM  0.005495116  0.715363776  0.075508356  0.959119268  0.844730368 
VLQ  0.21146319  0.558208766  0.501769554  0.226539046  0.795861461 
MDB  0.230514689  0.223163664  No data   0.324636384  0.700716246 
LPH  0.853433224  0.582678173  0.633109347  0.432191426  No data 
PEP  0.41096305  No data   .627776178  0.482359278  0.179863537 
UQK  0.252598809  0.497517585  0.276060768  No data   0.087985623 
KGJ  0.033985585  0.033702088  anNo data  0.286682709  0.543349787 
JUQ  0.25971543  0.142067155  0.597985191  0.219841249  0.699822866 
NYW  No data   0.17187907  0.157413049  0.209011772  0.592824483 

輸出

print(df) 

#  ID  Col1  Col2  Col3  Col4  Col5 
# 0 TGG 0.634517 0.900464 0.998506 0.170423 0.893340 
# 1 GRI  NaN 0.045915 0.718399 0.924814  NaN 
# 2 NLB 0.921127 0.614461 0.677858 0.343613 0.559438 
# 3 SEI 0.081852  NaN 0.890816 0.943313 0.874858 
# 4 LOY 0.632557 0.362856 0.038702 0.253763  NaN 
# 5 OPL 0.375089 0.268283 0.761552 0.589548 0.192223 
# 6 CTK 0.349465 0.844719  NaN 0.841440 0.898094 
# 7 EUE 0.629784 0.982590 0.315670 0.832419 0.950045 
# 8 JLP 0.543943 0.988380 0.417192 0.823857 0.542514 
# 9 RHK 0.728053 0.521817 0.402523  NaN 0.558227 
# 10 AEM 0.005495 0.715364 0.075508 0.959119 0.844730 
# 11 VLQ 0.211463 0.558209 0.501770 0.226539 0.795861 
# 12 MDB 0.230515 0.223164  NaN 0.324636 0.700716 
# 13 LPH 0.853433 0.582678 0.633109 0.432191  NaN 
# 14 PEP 0.410963  NaN 0.627776 0.482359 0.179864 
# 15 UQK 0.252599 0.497518 0.276061  NaN 0.087986 
# 16 KGJ 0.033986 0.033702  NaN 0.286683 0.543350 
# 17 JUQ 0.259715 0.142067 0.597985 0.219841 0.699823 
# 18 NYW  NaN 0.171879 0.157413 0.209012 0.592824 

print(df.types) 

# ID  object 
# Col1 float64 
# Col2 float64 
# Col3 float64 
# Col4 float64 
# Col5 float64 
# dtype: object 
+1

與所有正則表達式匹配的性能如何? – 2016-12-19 02:07:09

+1

好點。 OP應該真的在檢查*沒有數據*,並可以使用'in'運算符來實現,證明這是[更快](http://stackoverflow.com/questions/4901523/whats-a-faster-operation-re-match - 搜索 - 或-STR-查找/ 4901653#4901653)。我相應編輯。 – Parfait