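- restore the dataframes from the CSV files
- map the dataframes' int codes to the category definitions
- convert the mapped columns to Categorical
To illustrate the process:
Make a sample dataframe: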
import numpy as np
import pandas as pd

df = pd.DataFrame(index=np.arange(0, 100000))
df.index.name = 'index'
df['categories'] = 'Category'
df['locations'] = 'Location'
# Repeat 1..4 and 1..2 down the frame; integer division keeps the repeat count an int
n1 = np.tile(np.arange(1, 5), df.shape[0] // 4)
n2 = np.tile(np.arange(1, 3), df.shape[0] // 2)
df['categories'] = df['categories'] + pd.Series(n1).astype(str)
df['locations'] = df['locations'] + pd.Series(n2).astype(str)
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null object
locations 100000 non-null object
dtypes: object(2)
memory usage: 2.3+ MB
None
Note the size: 2.3+ MB. This is roughly the size your CSV file would be. Now convert the data to Categorical:
df['categories'] = df['categories'].astype('category')
df['locations'] = df['locations'].astype('category')
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null category
locations 100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
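The `+` in `2.3+ MB` marks a lower bound: `info()` does not count the string objects themselves unless asked to. A minimal sketch of measuring both representations with `memory_usage(deep=True)` follows. The `labels` array is a hypothetical stand-in for the `categories` column, and the exact byte counts depend on the platform and pandas version, so none are quoted here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'categories' column of the example frame.
labels = np.tile(['Category1', 'Category2', 'Category3', 'Category4'], 25000)
as_object = pd.Series(labels, dtype=object)
as_category = as_object.astype('category')

# deep=True also counts the Python string objects, not only the pointers to them.
object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(object_bytes, category_bytes)
print(as_category.cat.codes.dtype)  # the integer type used to store the codes
```

With only four distinct labels the codes fit in `int8`, which is where most of the saving comes from.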
Note that memory usage has dropped to 976.6 KB. However, if you save the frame to CSV now:
df.to_csv('test1.csv')
...you will see this inside the file:
index,categories,locations
0,Category1,Location1
1,Category2,Location2
2,Category3,Location1
3,Category4,Location2
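This means the Categorical columns were converted back to strings when saved to CSV. So let's strip the labels from the Categorical data, after first saving the definitions: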
categories_details = pd.DataFrame(df.categories.drop_duplicates(), columns=['categories'])
print(categories_details)
categories
index
0 Category1
1 Category2
2 Category3
3 Category4
locations_details = pd.DataFrame(df.locations.drop_duplicates(), columns=['locations'])
print(locations_details)
locations
index
0 Location1
1 Location2
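In the sample data the index produced by `drop_duplicates()` happens to equal the category codes, because the first occurrences appear in sorted order. That is not guaranteed in general. A hedged alternative is to build the definitions from `.cat.categories`, whose positions are the values `.cat.codes` stores. The series `s` and the column names `label` and `code` below are hypothetical.

```python
import pandas as pd

# Hypothetical column whose first occurrences are NOT in sorted order.
s = pd.Series(['b', 'a', 'b', 'c'], dtype='category')

# The position of a label in .cat.categories is the integer .cat.codes stores.
definitions = pd.DataFrame({'label': s.cat.categories})
definitions.index.name = 'code'

codes = s.cat.codes
restored = codes.map(definitions['label'])

print(definitions)
print(restored.tolist())
```

Here `drop_duplicates()` would keep index labels 0, 1 and 3, which do not match the codes. Missing values are stored as code `-1`, which has no entry in the definitions and therefore maps back to a missing value.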
Now convert the Categorical columns to an int dtype:
for col in df.select_dtypes(include=['category']).columns:
    df[col] = df[col].cat.codes
print(df.head())
categories locations
index
0 0 0
1 1 1
2 2 0
3 3 1
4 0 0
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null int8
locations 100000 non-null int8
dtypes: int8(2)
memory usage: 976.6 KB
None
Save the converted data to CSV and note that the file now contains only numbers, with no labels. The file size will also reflect this change.
df.to_csv('test2.csv')
index,categories,locations
0,0,0
1,1,1
2,2,0
3,3,1
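One way to check the file-size claim is to write both variants to a temporary directory and compare them with `os.path.getsize`. This is only a sketch: `df_demo` and the file names `labels.csv` and `codes.csv` are hypothetical, and the actual ratio depends on how long the labels are.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame shaped like the example: string labels in a category column.
df_demo = pd.DataFrame({'categories': np.tile(['Category1', 'Category2'], 500)})
df_demo['categories'] = df_demo['categories'].astype('category')
df_demo.index.name = 'index'

with tempfile.TemporaryDirectory() as tmp:
    labels_path = os.path.join(tmp, 'labels.csv')
    codes_path = os.path.join(tmp, 'codes.csv')

    # Categorical columns are written out as their text labels.
    df_demo.to_csv(labels_path)
    # The same frame with the labels replaced by their integer codes.
    df_demo.assign(categories=df_demo['categories'].cat.codes).to_csv(codes_path)

    labels_size = os.path.getsize(labels_path)
    codes_size = os.path.getsize(codes_path)

print(labels_size, codes_size)
```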
Save the definitions as well:
categories_details.to_csv('categories_details.csv')
locations_details.to_csv('locations_details.csv')
When you need to restore the data, load it from the CSV files:
df2 = pd.read_csv('test2.csv', index_col='index')
print(df2.head())
categories locations
index
0 0 0
1 1 1
2 2 0
3 3 1
4 0 0
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null int64
locations 100000 non-null int64
dtypes: int64(2)
memory usage: 2.3 MB
None
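The codes come back as `int64` because that is what `read_csv` infers for integer columns, which is why the frame is at 2.3 MB again. If that intermediate size matters, the `dtype` argument of `read_csv` can request a narrower type. A small sketch, assuming every code fits in `int8`; `csv_text` is a hypothetical stand-in for `test2.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for test2.csv so the sketch runs on its own.
csv_text = "index,categories,locations\n0,0,0\n1,1,1\n2,2,0\n3,3,1\n"

small = pd.read_csv(
    io.StringIO(csv_text),
    index_col='index',
    dtype={'categories': 'int8', 'locations': 'int8'},
)

print(small.dtypes)
```

`int8` only holds values up to 127, so a column with more categories than that needs a wider integer type.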
categories_details2 = pd.read_csv('categories_details.csv', index_col='index')
print(categories_details2.head())
categories
index
0 Category1
1 Category2
2 Category3
3 Category4
print(categories_details2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
categories 4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes
None
locations_details2 = pd.read_csv('locations_details.csv', index_col='index')
print(locations_details2.head())
locations
index
0 Location1
1 Location2
print(locations_details2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
locations 2 non-null object
dtypes: object(1)
memory usage: 32.0+ bytes
None
Now use map to replace the int-coded data with the category descriptions, and convert the result to Categorical:
df2['categories'] = df2.categories.map(categories_details2.to_dict()['categories']).astype('category')
df2['locations'] = df2.locations.map(locations_details2.to_dict()['locations']).astype('category')
print(df2.head())
categories locations
index
0 Category1 Location1
1 Category2 Location2
2 Category3 Location1
3 Category4 Location2
4 Category1 Location1
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null category
locations 100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
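Note that memory usage is back to what it was when we first converted the data to Categorical. If you need to repeat this process many times, it should not be hard to automate.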
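As a rough illustration of what that automation could look like, the sketch below wraps the steps above in two helpers. The names `save_with_codes` and `load_with_codes`, the `<column>_details.csv` naming scheme and the `code` index name are all assumptions made for this example. The definitions are taken from `.cat.categories` rather than `drop_duplicates()`, so the saved index matches the codes whatever order the rows are in.

```python
import os
import tempfile

import pandas as pd


def save_with_codes(df, data_path, defs_dir):
    """Write categorical columns as int codes, plus one definitions CSV per column."""
    out = df.copy()
    for col in df.select_dtypes(include=['category']).columns:
        defs = pd.DataFrame({col: df[col].cat.categories})
        defs.index.name = 'code'
        defs.to_csv(os.path.join(defs_dir, col + '_details.csv'))
        out[col] = df[col].cat.codes
    out.to_csv(data_path)


def load_with_codes(data_path, defs_dir, columns, index_col=0):
    """Read the codes back and map them to labels as categorical columns."""
    df = pd.read_csv(data_path, index_col=index_col)
    for col in columns:
        defs = pd.read_csv(os.path.join(defs_dir, col + '_details.csv'), index_col='code')
        df[col] = df[col].map(defs[col]).astype('category')
    return df


# Usage on a tiny hypothetical frame.
original = pd.DataFrame({
    'categories': ['Category2', 'Category1', 'Category2'],
    'locations': ['Location1', 'Location2', 'Location1'],
}).astype('category')
original.index.name = 'index'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.csv')
    save_with_codes(original, path, tmp)
    restored = load_with_codes(path, tmp, ['categories', 'locations'], index_col='index')

print(restored)
```

The caller has to pass the column names to `load_with_codes` because the CSV itself no longer records which columns were categorical.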
So you want to replace the categorical text with integers, hoping it will take up less space?