2016-03-15

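Coercing non-numeric characters to NAs in numpy (when reading a CSV into a pandas DataFrame)

I have records in which two fields (called INDATUMA and UTDATUMA) should contain numbers in the range 20010101 to 20141231 (for obvious reasons). To allow missing values while keeping precision to the nearest day, I want to store them as floats (np.float64). I hoped this would force the occasional misformatted field (think 2oo41oo9) to become NAs, but instead it breaks the import in both pandas 0.18.0 and IOPro 1.7.2.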

Is there an undocumented option I could use? Or some other approach?

The key lines I tried with pandas:

import numpy as np 
import pandas as pd 
treatments = pd.read_table(filename,usecols=[0,3,4,6], engine='c', dtype={'LopNr':np.uint32,'INDATUMA':np.float64,'UTDATUMA':np.float64,'DIAGNOS':object}) 

This fails with the error ValueError: invalid literal for float(): 2003o730

I also tried the following in IOPro, just in case:

import iopro 
adapter = iopro.text_adapter(filename, parser='csv',delimiter='\t',output='dataframe',infer_types=False) 
adapter.set_field_types({0: 'u4',3:'f8', 4:'f8',6:'object'}) 
all_treatments.append(adapter[[0,3,4,6]][:]) 

But this fails as well, with iopro.lib.errors.DataTypeError: Could not convert token "2003o730" at record 1 field 3 to float64. Reason: unknown

The data file begins with:

LopNr SJUKHUS MVO INDATUMA UTDATUMA HDIA DIAGNOS OP PVARD EKOD1 EKOD2 EKOD3 EKOD4 EKOD5 ICD 
1562 21001 046 20030707 20030711 I489A I489A I509  2      10 
1562 21001 046 2003o730 20030801 I501 I501 I489A DG001 2      10 

Answer

You can use the converters parameter of read_table:

def converter(num):
    try:
        return float(num)
    except ValueError:
        # Not a valid number (e.g. 2003o730): treat it as missing
        return np.nan

# Apply the converter to each date column
converters = {'INDATUMA': converter, 'UTDATUMA': converter}

df = pd.read_table(filename, converters=converters)
print(df)
    LopNr SJUKHUS MVO INDATUMA UTDATUMA HDIA DIAGNOS  OP PVARD \ 
0 1562 21001 46 20030707 20030711 I489A I489A I509  2 
1 1562 21001 46  NaN 20030801 I501 I501 I489A DG001 

    EKOD1 EKOD2 EKOD3 EKOD4 EKOD5 ICD 
0  10 NaN NaN NaN NaN NaN 
1  2  10 NaN NaN NaN NaN 
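
To try the converters approach without the original file, here is a minimal self-contained sketch. The in-memory sample is a hypothetical three-column cut of the rows shown in the question, and the helper name to_float_or_nan is only illustrative; it uses read_csv with sep="\t", which should behave the same as read_table for this input.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical tab-separated sample modelled on the first rows of the data file.
SAMPLE = (
    "LopNr\tINDATUMA\tUTDATUMA\n"
    "1562\t20030707\t20030711\n"
    "1562\t2003o730\t20030801\n"
)


def to_float_or_nan(token):
    """Return the raw field as a float, or NaN if it is not a valid number."""
    try:
        return float(token)
    except ValueError:
        # Covers malformed tokens such as "2003o730" as well as empty fields.
        return np.nan


df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep="\t",
    converters={"INDATUMA": to_float_or_nan, "UTDATUMA": to_float_or_nan},
)
print(df)
```

Because the converter runs in Python on every field, this is likely to be slower on large files than a plain dtype-based read.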

Or post-process the column with to_numeric and the parameter errors='coerce':

df['INDATUMA'] = pd.to_numeric(df['INDATUMA'], errors='coerce') 
0 20030707 
1   NaN 
Name: INDATUMA, dtype: float64
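
The same post-processing can be applied to both date columns in one pass. The following is a sketch under the assumption that the file is first read without forcing a float dtype, so the malformed column simply comes in as text; the sample data is again a hypothetical cut of the rows from the question.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample modelled on the first rows of the data file.
SAMPLE = (
    "LopNr\tINDATUMA\tUTDATUMA\n"
    "1562\t20030707\t20030711\n"
    "1562\t2003o730\t20030801\n"
)

# Read without a float dtype so that a bad token does not abort the import.
df = pd.read_csv(io.StringIO(SAMPLE), sep="\t")

# Coerce each date column; anything that cannot be parsed becomes NaN.
for col in ["INDATUMA", "UTDATUMA"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df)
```

A column with no bad values keeps its integer type here, so an explicit cast may still be wanted if every date column should end up as a float.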