2017-05-20 71 views

Concatenation changes category dtypes to object/float64

If I read just one chunk of the CSV, I get the following data structure:

<class 'pandas.core.frame.DataFrame'> 
MultiIndex: 100000 entries, (2015-11-01 00:00:00, 4980770) to (2016-06-01 00:00:00, 8850573) 
Data columns (total 5 columns): 
CHANNEL   100000 non-null category 
MCC    92660 non-null category 
DOMESTIC_FLAG 100000 non-null category 
AMOUNT   100000 non-null float32 
CNT    100000 non-null uint8 
dtypes: category(3), float32(1), uint8(1) 
memory usage: 1.9+ MB 

If I read the whole CSV and concatenate the processed chunks, I get the following structure:

<class 'pandas.core.frame.DataFrame'> 
MultiIndex: 30345312 entries, (2015-11-01 00:00:00, 4980770) to (2015-08-01 00:00:00, 88838) 
Data columns (total 5 columns): 
CHANNEL   object 
MCC    float64 
DOMESTIC_FLAG category 
AMOUNT   float32 
CNT    uint8 
dtypes: category(1), float32(1), float64(1), object(1), uint8(1) 
memory usage: 784.6+ MB 

Why do the categorical columns change to object/float64? How can I avoid this type change, especially the change to float64?

This is the concatenation code:

df = pd.concat([process(chunk) for chunk in reader]) 

The process function just does some cleaning and dtype assignment.


Can you post the code you use to load and concatenate it? –


Categoricals also sometimes have problems with `NaN` –


Added to the question text now – snovik

Answer


Consider the following sample DataFrames:

In [93]: df1 
Out[93]: 
    A B 
0 a a 
1 b b 
2 c c 
3 a a 

In [94]: df2 
Out[94]: 
    A B 
0 b b 
1 c c 
2 d d 
3 e e 

In [95]: df1.info() 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 4 entries, 0 to 3 
Data columns (total 2 columns): 
A 4 non-null object 
B 4 non-null category 
dtypes: category(1), object(1) 
memory usage: 140.0+ bytes 

In [96]: df2.info() 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 4 entries, 0 to 3 
Data columns (total 2 columns): 
A 4 non-null object 
B 4 non-null category 
dtypes: category(1), object(1) 
memory usage: 148.0+ bytes 

Note: column B has different categories in the two DFs:

In [97]: df1.B.cat.categories 
Out[97]: Index(['a', 'b', 'c'], dtype='object') 

In [98]: df2.B.cat.categories 
Out[98]: Index(['b', 'c', 'd', 'e'], dtype='object') 

So when we concatenate them, pandas does not merge the categories; it creates an object column instead:

In [99]: m = pd.concat([df1, df2]) 

In [100]: m.info() 
<class 'pandas.core.frame.DataFrame'> 
Int64Index: 8 entries, 0 to 3 
Data columns (total 2 columns): 
A 8 non-null object 
B 8 non-null object 
dtypes: object(2) 
memory usage: 192.0+ bytes 
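
The fallback dtype appears to follow the dtype of the underlying categories, which would also explain a numeric column such as MCC ending up as float64: integer categories combined with missing values cannot stay integer. A minimal sketch of that behaviour, assuming a reasonably recent pandas; the series and their values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: integer categories, where code -1 marks a missing value.
s1 = pd.Series(pd.Categorical.from_codes([0, 1, -1], categories=[10, 20]))
s2 = pd.Series(pd.Categorical.from_codes([0, 1], categories=[30, 40]))

# String categories that differ between the two series.
t1 = pd.Series(["a", "b"], dtype="category")
t2 = pd.Series(["b", "c"], dtype="category")

numeric = pd.concat([s1, s2], ignore_index=True)
text = pd.concat([t1, t2], ignore_index=True)

print(numeric.dtype)  # numeric categories plus a missing value: a float dtype
print(text.dtype)     # string categories: a non-categorical dtype (object above)
```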

But if we concatenate two DFs that share the same categories, everything works as expected:

In [102]: m = pd.concat([df1.sample(frac=.5), df1.sample(frac=.5)]) 

In [103]: m 
Out[103]: 
    A B 
3 a a 
0 a a 
3 a a 
2 c c 

In [104]: m.info() 
<class 'pandas.core.frame.DataFrame'> 
Int64Index: 4 entries, 3 to 2 
Data columns (total 2 columns): 
A 4 non-null object 
B 4 non-null category 
dtypes: category(1), object(1) 
memory usage: 92.0+ bytes 
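
One way to get differing frames into that "same categories" situation is to give every frame the union of all categories before concatenating. The sketch below uses `union_categoricals` from `pandas.api.types` and `Series.cat.set_categories`; the helper name `concat_keep_categories` is hypothetical, not a pandas function, and the frames are rebuilt here so the example is self-contained:

```python
import pandas as pd
from pandas.api.types import union_categoricals

def concat_keep_categories(frames, columns):
    """Concatenate frames after giving each listed column a shared category set.

    Hypothetical helper for illustration; not part of pandas.
    """
    frames = list(frames)
    for col in columns:
        # Union of the categories across all frames, in order of appearance.
        shared = union_categoricals([f[col] for f in frames]).categories
        # Re-code each frame's column against the shared categories.
        frames = [f.assign(**{col: f[col].cat.set_categories(shared)}) for f in frames]
    return pd.concat(frames, ignore_index=True)

df1 = pd.DataFrame({"A": list("abca"), "B": pd.Categorical(list("abca"))})
df2 = pd.DataFrame({"A": list("bcde"), "B": pd.Categorical(list("bcde"))})

m = concat_keep_categories([df1, df2], columns=["B"])
print(m.dtypes)
```

Note that this needs all frames available as a list, so with a chunked reader every processed chunk is held in memory before the final concatenation.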

I see. So the only way is to reassign the dtypes of all the columns after concatenating... – snovik


@snovik, AFAIK this is currently the way to go – MaxU
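
The re-casting approach discussed in these comments could look roughly like the sketch below. The CSV content is made up, and `process` is only a stand-in for the question's function, whose body is not shown; the column names are borrowed from the question.

```python
import io
import pandas as pd

# Made-up CSV standing in for the real file; column names follow the question.
csv_text = "CHANNEL,MCC,AMOUNT\nweb,5411,1.5\npos,,2.0\natm,6011,3.0\nweb,5411,4.5\n"

def process(chunk):
    # Stand-in for the question's cleaning step (its real body is unknown).
    return chunk.astype({"CHANNEL": "category", "MCC": "category"})

reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
df = pd.concat([process(chunk) for chunk in reader], ignore_index=True)

# The chunks had different categories, so the categorical dtypes were lost.
# Re-cast the affected columns once, on the combined frame.
for col in ["CHANNEL", "MCC"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

The combined frame briefly holds the wider object/float64 columns before the re-cast, so peak memory is higher than the final footprint.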