2017-02-13 40 views

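pandas: get the string labels back from a factorized DataFrame column

I factorized a column of my pandas DataFrame, but in doing so I overwrote the original column values.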

Is there any way to get the original value mapping back for reference?

Example:

import pandas as pd

df_test = pd.DataFrame({'col1': pd.Series(['cat','dog','cat','mouse'])})
df_test['col1'] = pd.factorize(df_test['col1'])[0]
df_test


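However, I would like to be able to check later which label each integer maps to. Is there a way to inspect the mapping without re-initializing the DataFrame? Something like the call below: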

pd.factorize(df_test)[1] 

Answer


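I would suggest a slightly different approach: use the categorical dtype, which keeps the string labels and the integer codes together.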

In [40]: df_test['col1'] = df_test['col1'].astype('category') 

In [41]: df_test
Out[41]:
    col1
0    cat
1    dog
2    cat
3  mouse

In [42]: df_test.dtypes
Out[42]:
col1    category
dtype: object

And if you need the numbers:

In [44]: df_test['col1'].cat.codes
Out[44]:
0    0
1    1
2    0
3    2
dtype: int8
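
To go the other way, from codes back to labels, the categories are still available on the column through `.cat.categories`. The following is a minimal sketch of that lookup, not part of the original answer; the names `code_to_label`, `codes` and `restored` are made up for illustration.

```python
import pandas as pd

df_test = pd.DataFrame({'col1': ['cat', 'dog', 'cat', 'mouse']})
df_test['col1'] = df_test['col1'].astype('category')

# Position i in .cat.categories is the label for integer code i
code_to_label = dict(enumerate(df_test['col1'].cat.categories))

# The integer codes, one per row
codes = df_test['col1'].cat.codes

# Translate the codes back into their string labels
restored = codes.map(code_to_label)
```

Because the column itself stays categorical, the labels are never lost; the codes are just another view of the same data.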

Memory usage for a DataFrame with 400K rows:

In [74]: df_test = pd.DataFrame({'col1': pd.Series(['cat','dog','cat','mouse'])}) 

In [75]: df_test = pd.concat([df_test] * 10**5, ignore_index=True) 

In [76]: df_test.shape 
Out[76]: (400000, 1) 

In [77]: d1 = df_test.copy() 

In [78]: d2 = df_test.copy() 

In [79]: d1.col1 = pd.factorize(d1.col1)[0] 

In [80]: d2.col1 = d2.col1.astype('category') 

In [81]: df_test.info() 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 400000 entries, 0 to 399999 
Data columns (total 1 columns): 
col1 400000 non-null object 
dtypes: object(1) 
memory usage: 3.1+ MB 

In [82]: d1.info() 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 400000 entries, 0 to 399999 
Data columns (total 1 columns): 
col1 400000 non-null int64 
dtypes: int64(1) 
memory usage: 3.1 MB 

In [83]: d2.info() 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 400000 entries, 0 to 399999 
Data columns (total 1 columns): 
col1 400000 non-null category 
dtypes: category(1) 
memory usage: 390.7 KB   # the categorical column takes almost 8x less memory
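
The same comparison can be made programmatically with `Series.memory_usage`. This is a rough sketch rather than a benchmark; the variable names are illustrative and the exact byte counts depend on the pandas version and platform. Note that the `+` in `3.1+ MB` above means the figure for the string column is only a lower bound; `deep=True` also counts the string contents.

```python
import pandas as pd

# 400,000 rows, same four labels repeated
df = pd.DataFrame({'col1': ['cat', 'dog', 'cat', 'mouse'] * 10**5})

# Original string column, counting the string contents as well
original_bytes = df['col1'].memory_usage(index=False, deep=True)

# Plain integer codes from pd.factorize (platform integer, usually int64)
factorized = pd.Series(pd.factorize(df['col1'])[0])
factorized_bytes = factorized.memory_usage(index=False)

# Categorical column: small integer codes plus one copy of each label
category_bytes = df['col1'].astype('category').memory_usage(index=False, deep=True)

print(original_bytes, factorized_bytes, category_bytes)
```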

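What I am doing is overwriting the original column with the category codes: `df_test['col1'] = df_test['col1'].cat.codes`. So, to be able to map the codes back to the categories, should I create two DataFrames, one holding all the cat.codes and another that still has the original categories? Or is there a better way? – jxn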
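
One possible way to avoid a second DataFrame, sketched here as a suggestion rather than as the answerer's reply, is to keep both values that `pd.factorize` returns. The second one holds the unique labels, and the integer codes are positions into it. The names `codes`, `uniques` and `decoded` are illustrative.

```python
import pandas as pd

df_test = pd.DataFrame({'col1': ['cat', 'dog', 'cat', 'mouse']})

# Keep both return values: the integer codes and the unique labels
codes, uniques = pd.factorize(df_test['col1'])

# Overwrite the column with the codes, as in the question
df_test['col1'] = codes

# Each code is a position in `uniques`, so the labels can be looked up later.
# Caveat: missing values are coded as -1, which this lookup does not handle.
decoded = uniques.take(df_test['col1'].to_numpy())
```

As long as `uniques` is kept around, the overwritten column can be decoded at any point without rebuilding the DataFrame.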