Replacing categorical data (pandas)

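I have several large files with several category columns. "Category" is a generous word too, because these are basically descriptions / partial sentences.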

Here is the number of unique values per category:

Category 1 = 15 
Category 2 = 94 
Category 3 = 294 
Category 4 = 401 

Location 1 = 30 
Location 2 = 60

Then there is also recurring user data (first name, last name, IDs, etc.).

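I am thinking of the following solutions to make the files smaller:

1) Create a file that matches each category to a unique integer

2) Create a map (is there a way to do this by reading another file? Like, I would create a .csv, load it as another dataframe, and then match it? Or do I literally have to type it out initially?)

3) Basically do a join (VLOOKUP) and then delete the old column with the long object names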

# merge returns a new dataframe, so keep the result
df1 = pd.merge(df1, categories, on='Category1', how='left')
del df1['Category1']
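
On the question of whether the mapping has to be typed out by hand: it can be derived from the data itself. The following is a minimal sketch, not the asker's code. It assumes a toy `df1` and a made-up `Category1_id` column name, and uses `pd.factorize` to assign the integers.

```python
import pandas as pd

# Hypothetical stand-in for one of the large files.
df1 = pd.DataFrame({
    "Category1": ["long description A", "long description B", "long description A"],
    "user": ["u1", "u2", "u3"],
})

# factorize assigns 0, 1, 2, ... in order of first appearance.
codes, labels = pd.factorize(df1["Category1"])

# Lookup table: one row per distinct description (could be saved as its own .csv).
categories = pd.DataFrame({"Category1_id": range(len(labels)), "Category1": labels})

# Swap the long text column for its integer id.
df1["Category1_id"] = codes
df1 = df1.drop(columns="Category1")

# Joining the lookup table back restores the text when it is needed.
restored = df1.merge(categories, on="Category1_id", how="left")
print(restored)
```

The lookup table only has as many rows as there are distinct descriptions, so it stays small even when the main file is huge.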

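What do people typically do in this situation? The files are pretty huge: 60 columns, and most of the data is long, repeating categories and timestamps. There is no numerical data at all. That's fine for me, but because of shared drive space allocations, sharing these files for more than a few months is almost impossible.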

2015-05-11 trench

So you want to replace the categorical text with integers, hoping it will take up less space? –

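To benefit from the Categorical dtype when saving to csv, you might want to follow this process:

Extract your category definitions into separate dataframes / files
Convert your categorical data to int codes
Save the converted dataframes to csv along with the definitions dataframes

When you need to use them again:

Restore the dataframes from the csv files
Map the dataframes with int codes to the category definitions
Convert the mapped columns to Categorical

To illustrate the process:

Make a sample dataframe: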

import numpy as np
import pandas as pd

df = pd.DataFrame(index=np.arange(0, 100000))
df.index.name = 'index'
df['categories'] = 'Category'
df['locations'] = 'Location'
# Repeat 1..4 and 1..2 down the rows; // keeps the repeat count an integer
n1 = np.tile(np.arange(1, 5), df.shape[0] // 4)
n2 = np.tile(np.arange(1, 3), df.shape[0] // 2)
df['categories'] = df['categories'] + pd.Series(n1).astype(str)
df['locations'] = df['locations'] + pd.Series(n2).astype(str)
print(df.info())

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 100000 entries, 0 to 99999 
Data columns (total 2 columns): 
categories 100000 non-null object 
locations  100000 non-null object 
dtypes: object(2) 
memory usage: 2.3+ MB 
None

Note the size: 2.3+ MB. This would be roughly the size of your csv file. Now let's convert the data to Categorical:

df['categories'] = df['categories'].astype('category') 
df['locations'] = df['locations'].astype('category') 
print(df.info())

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 100000 entries, 0 to 99999 
Data columns (total 2 columns): 
categories 100000 non-null category 
locations  100000 non-null category 
dtypes: category(2) 
memory usage: 976.6 KB 
None

Note that memory usage dropped to 976.6 KB. But if you save it to csv now:

df.to_csv('test1.csv')

... you will see this inside the file:

index,categories,locations 
0,Category1,Location1 
1,Category2,Location2 
2,Category3,Location1 
3,Category4,Location2

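This means the Categorical data was converted back to strings when it was saved to csv. So let's get rid of the labels in the Categorical data, after we save the definitions: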

# Take the labels from .cat.categories so each row position equals the integer code
categories_details = pd.DataFrame({'categories': df['categories'].cat.categories}).rename_axis('index')
print(categories_details)

     categories 
index   
0  Category1 
1  Category2 
2  Category3 
3  Category4 

# Same for locations: row position equals the integer code
locations_details = pd.DataFrame({'locations': df['locations'].cat.categories}).rename_axis('index')
print(locations_details)

     locations 
index   
0  Location1 
1  Location2

Now convert the Categorical columns to an int dtype:

for col in df.select_dtypes(include=['category']).columns: 
    df[col] = df[col].cat.codes 
print(df.head())

     categories locations 
index      
0    0   0 
1    1   1 
2    2   0 
3    3   1 
4    0   0 

print(df.info())

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 100000 entries, 0 to 99999 
Data columns (total 2 columns): 
categories 100000 non-null int8 
locations  100000 non-null int8 
dtypes: int8(2) 
memory usage: 976.6 KB 
None

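Save the converted data to csv and note that the file now contains only numbers, with no labels. The file size will also reflect this change.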

df.to_csv('test2.csv') 

index,categories,locations 
0,0,0 
1,1,1 
2,2,0 
3,3,1
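
To get a feel for the saving without writing any files, the two csv texts can be compared in memory. This is only a rough sketch on a small made-up frame (the 1000-row size and the variable names are assumptions). The real ratio depends on how long your labels are.

```python
import numpy as np
import pandas as pd

n = 1000  # small illustrative size, not the 100000 rows used above
labels = pd.Series(
    np.tile(["Category1", "Category2", "Category3", "Category4"], n // 4),
    dtype="category",
)

with_labels = pd.DataFrame({"categories": labels})
with_codes = pd.DataFrame({"categories": labels.cat.codes})

# to_csv() with no path returns the csv text, so its encoded length
# approximates the size the file would have on disk.
size_labels = len(with_labels.to_csv().encode("utf-8"))
size_codes = len(with_codes.to_csv().encode("utf-8"))
print(size_labels, size_codes)
```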

Save the definitions as well:

categories_details.to_csv('categories_details.csv') 
locations_details.to_csv('locations_details.csv')

When you need to restore the files, load them from the csv files:

df2 = pd.read_csv('test2.csv', index_col='index') 
print(df2.head())

     categories locations 
index      
0    0   0 
1    1   1 
2    2   0 
3    3   1 
4    0   0 

print(df2.info())

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 100000 entries, 0 to 99999 
Data columns (total 2 columns): 
categories 100000 non-null int64 
locations  100000 non-null int64 
dtypes: int64(2) 
memory usage: 2.3 MB 
None 

categories_details2 = pd.read_csv('categories_details.csv', index_col='index') 
print(categories_details2.head())

     categories 
index   
0  Category1 
1  Category2 
2  Category3 
3  Category4 

print(categories_details2.info())

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 4 entries, 0 to 3 
Data columns (total 1 columns): 
categories 4 non-null object 
dtypes: object(1) 
memory usage: 64.0+ bytes 
None 

locations_details2 = pd.read_csv('locations_details.csv', index_col='index') 
print(locations_details2.head())

     locations 
index   
0  Location1 
1  Location2 

print(locations_details2.info())

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 2 entries, 0 to 1 
Data columns (total 1 columns): 
locations 2 non-null object 
dtypes: object(1) 
memory usage: 32.0+ bytes 
None

Now use map to replace the int-coded data with the category descriptions and convert them to Categorical:

df2['categories'] = df2.categories.map(categories_details2.to_dict()['categories']).astype('category') 
df2['locations'] = df2.locations.map(locations_details2.to_dict()['locations']).astype('category') 
print(df2.head())

     categories locations 
index      
0  Category1 Location1 
1  Category2 Location2 
2  Category3 Location1 
3  Category4 Location2 
4  Category1 Location1 

print(df2.info())

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 100000 entries, 0 to 99999 
Data columns (total 2 columns): 
categories 100000 non-null category 
locations  100000 non-null category 
dtypes: category(2) 
memory usage: 976.6 KB 
None

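Note that the memory usage is back to what it was when we first converted the data to Categorical. If you need to repeat this process many times, it should not be hard to automate.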
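
One way the automation could look is a pair of helpers that do the round trip for every Categorical column at once. This is a sketch under assumptions, not part of the answer above. The function names `encode_categoricals` and `decode_categoricals` are made up, and the definitions are kept in a plain dict that you would still have to write to and read from files yourself.

```python
import pandas as pd


def encode_categoricals(df):
    """Return (copy of df with int codes, {column: labels in code order})."""
    coded = df.copy()
    definitions = {}
    for col in df.select_dtypes(include=["category"]).columns:
        definitions[col] = list(df[col].cat.categories)
        coded[col] = df[col].cat.codes
    return coded, definitions


def decode_categoricals(coded, definitions):
    """Rebuild the Categorical columns from int codes and their label lists."""
    restored = coded.copy()
    for col, labels in definitions.items():
        # from_codes treats -1 as a missing value, matching what .cat.codes emits.
        restored[col] = pd.Categorical.from_codes(coded[col], categories=labels)
    return restored


# Tiny illustrative frame; the column names follow the example above.
df = pd.DataFrame({
    "categories": pd.Series(["Category1", "Category2", "Category1"], dtype="category"),
    "locations": pd.Series(["Location2", "Location1", "Location2"], dtype="category"),
})
coded, definitions = encode_categoricals(df)
restored = decode_categoricals(coded, definitions)
print(coded)
print(restored)
```

Taking the labels from `.cat.categories` means the position of each label is its code by construction, so the mapping does not depend on the order in which values happen to appear in the rows.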

2015-05-11 19:57:20 Primer

This looks like exactly what I want to do. Let me test it tomorrow; if it works, I will upvote it and accept it as the answer. – trench

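Pandas has a Categorical data type that does just that. It basically maps the categories to integers behind the scenes.

Internally, the data structure consists of a categories array and an integer array of codes that point to the real values in the categories array.

The documentation is here.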
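
The two arrays can be inspected through the `.cat` accessor. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["Location1", "Location2", "Location1"], dtype="category")

# The distinct labels, stored once.
print(list(s.cat.categories))

# One small integer per row, pointing into the labels above.
print(s.cat.codes.tolist())

# Looking each code up in the categories array gives back the original values.
rebuilt = [s.cat.categories[code] for code in s.cat.codes]
print(rebuilt)
```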

2015-05-11 16:49:18 Alexander

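Sorry, I should have been clearer. I use the categorical type for analysis and it works perfectly, but when I store these files as .csvs they are still very large. I am also connecting these .csvs to Tableau, so I suppose I could do the join in Tableau as well. – trench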

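I am not aware of any way to store categoricals in other storage formats (other than pickle...). One thing to keep in mind is versioning of the dictionary. If you change or modify the categories, the previous encoding will no longer be valid. Take care to store the dictionary and keep it in sync with the csv. You can create the dictionary by enumerating the categories. – Alexander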
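
A sketch of what enumerating the categories might look like, and of why an old dictionary goes stale. The sample values are invented for illustration.

```python
import pandas as pd

s = pd.Series(["b", "a", "b"], dtype="category")

# Code -> label dictionary built by enumerating the categories.
code_to_label = dict(enumerate(s.cat.categories))
decoded = s.cat.codes.map(code_to_label)
print(code_to_label, decoded.tolist())

# If a new value appears later, the categories are re-derived and the
# codes of existing labels can shift, so the old dictionary no longer fits.
s_new = pd.Series(["b", "a", "b", "0"], dtype="category")
new_code_to_label = dict(enumerate(s_new.cat.categories))
print(new_code_to_label)
```

Here the label "a" moves from code 0 to code 1 once "0" is added, which is why the dictionary has to be saved together with the exact csv it was built for.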

Here is a way to save a dataframe with categorical columns in a single .csv:

Example: 
------  ------- 
Fatcol  Thincol: unique strings once, then numbers 
------  ------- 
"Alberta" "Alberta" 
"BC"  "BC" 
"BC"  2 -- string 2 
"Alberta" 1 -- string 1 
"BC"  2 
... 

The "Thincol" on the right can be saved as is in a .csv file, 
and expanded to the "Fatcol" on the left after reading it in; 
this can halve the size of big .csv s with repeated strings. 

Functions 
--------- 
fatcol(col: Thincol) -> Fatcol, list[ unique str ] 
thincol(col: Fatcol) -> Thincol, dict(unique str -> int), list[ unique str ] 

Here "Fatcol" and "Thincol" are type names for iterators, e.g. lists: 
    Fatcol: list of strings 
    Thincol: list of strings or ints or NaN s 
If a `col` is a `pandas.Series`, its `.values` are used.

This cut a 700M .csv to 248M, but write_csv runs at ~1 MB/sec on my iMac.
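
The implementation itself is not shown here, so the following is only one possible reading of the description above, for plain Python lists. It assumes 1-based numbering (as in the example, where 2 means "string 2"), assumes the strings themselves are never ints, and leaves out the NaN handling and `pandas.Series` support that the description mentions.

```python
def thincol(col):
    """Fatcol -> (Thincol, dict unique str -> int, list of unique strs)."""
    number_of = {}   # unique string -> its 1-based number
    uniques = []     # unique strings in order of first appearance
    thin = []
    for value in col:
        if value in number_of:
            # Repeat: store the short number instead of the string.
            thin.append(number_of[value])
        else:
            # First occurrence: keep the string itself.
            uniques.append(value)
            number_of[value] = len(uniques)
            thin.append(value)
    return thin, number_of, uniques


def fatcol(col):
    """Thincol -> (Fatcol, list of unique strs)."""
    uniques = []
    fat = []
    for value in col:
        if isinstance(value, int):
            # A number refers back to a string seen earlier in the column.
            fat.append(uniques[value - 1])
        else:
            uniques.append(value)
            fat.append(value)
    return fat, uniques


fat_in = ["Alberta", "BC", "BC", "Alberta", "BC"]
thin, number_of, uniques = thincol(fat_in)
fat_out, _ = fatcol(thin)
print(thin)
print(fat_out)
```

A column read back from a .csv would arrive as text, so a real version would also need to tell number cells from string cells after parsing. That step is not covered by this sketch.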

2017-11-07 17:50:06 denis
