Efficiently reading and writing pandas DataFrames

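I have a pandas DataFrame that I want to split into several smaller pieces of 100k rows each and save to disk, so that I can read the data back in and process it piece by piece. I have tried dill and HDF storage, because CSV and raw text seem to take a long time.

I am trying this on a subset of the data: about 500k rows and five columns of mixed data. Two columns contain strings, one an integer, one a float, and the last contains bigram counts from sklearn.feature_extraction.text.CountVectorizer, stored as a scipy.sparse.csr.csr_matrix sparse matrix.

It is this last column that gives me trouble. Dumping and loading the data works without problems, but when I try to actually access the data, it is a pandas.Series object instead. Moreover, each row in that Series is a tuple that contains the whole data set.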

import dill
import numpy as np
import pandas
import sklearn.feature_extraction.text

# Before dumping, the original df has 100k rows.
# Each column has one value except for 'counts' which has 1400.
# Meaning that df['counts'] gives me a sparse matrix that is 100k x 1400.

vectorizer = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2, 2))
counts = vectorizer.fit_transform(df['string_data'])
df['counts'] = counts

df_split = pandas.DataFrame(np.column_stack([df['string1'][0:100000],
                                             df['string2'][0:100000],
                                             df['float'][0:100000],
                                             df['integer'][0:100000],
                                             df['counts'][0:100000]]),
                            columns=['string1', 'string2', 'float', 'integer', 'counts'])
# Pickle streams are binary, so the file is opened in binary mode.
with open(file[i], 'wb') as f:
    dill.dump(df_split, f)

with open(file[i], 'rb') as f:
    df = dill.load(f)
print(type(df['counts']))
> <class 'pandas.core.series.Series'>
print(np.shape(df['counts']))
> (100000,)
print(np.shape(df['counts'][0]))
> (496718, 1400) # 496718 is the number of rows in my complete data set.
print(type(df['counts'][0]))
> <type 'tuple'>

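Am I doing anything obviously wrong, or is there a better way to store data in this format that is not so time-consuming? It has to scale to my full data set of 100 million rows.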

Source: Tobias, 2017-05-16

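How are you creating/appending the 'counts' column? – MaxU

I have added that to the code – Tobias

I think storing sparse matrices in a pandas column is not a good idea; IMO it is an error-prone approach. I would store them separately... – MaxU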

df['counts'] = counts

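This produces a pandas Series (column) whose number of elements equals the number of rows in df, and where each element is the sparse matrix returned by vectorizer.fit_transform(df['string_data']).

You can try the following instead: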

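# On scikit-learn >= 1.0 use vectorizer.get_feature_names_out(); counts.toarray() is equivalent to counts.A
df = df.join(pd.DataFrame(counts.A, columns=vectorizer.get_feature_names(), index=df.index))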

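NOTE: be aware that this will explode your sparse matrix into a dense (non-sparse) DataFrame, so it will use much more memory and you can end up with a MemoryError.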
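To get a feel for how large that blow-up can be, here is a rough sketch that compares the bytes held by a CSR matrix with the bytes its dense equivalent would need. The matrix is a hypothetical stand-in for the CountVectorizer output (1,400 columns as in the question, int64 values, and an assumed 1% of entries non-zero); the real density of the bigram counts is not known from the question.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the vectorizer output: 10,000 rows x 1,400 columns.
# The 1% density is an assumption, not a figure from the question.
rows, cols = 10_000, 1_400
counts = sparse.random(rows, cols, density=0.01, format="csr", random_state=0)
counts.data[:] = 1                # turn the random floats into counts of 1
counts = counts.astype(np.int64)  # CountVectorizer returns int64 by default

# A CSR matrix stores three arrays: the values, their column indices, and row pointers.
sparse_bytes = counts.data.nbytes + counts.indices.nbytes + counts.indptr.nbytes
# A dense array stores every cell, zero or not.
dense_bytes = rows * cols * counts.dtype.itemsize

print(f"sparse: {sparse_bytes / 1e6:.1f} MB, dense: {dense_bytes / 1e6:.1f} MB")

# The dense size does not depend on density, so it can be extrapolated directly
# to the 100 million rows mentioned in the question.
full_rows = 100_000_000
full_dense_bytes = full_rows * cols * counts.dtype.itemsize
print(f"dense at {full_rows:,} rows: {full_dense_bytes / 1e12:.2f} TB")
```

The dense figure is plain arithmetic (rows x columns x 8 bytes), so it holds whatever the actual sparsity is; only the sparse figure depends on the assumed density.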

CONCLUSION: that is why I would recommend storing your original DF and the counts sparse matrix separately.
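A minimal sketch of what storing the two separately could look like, assuming the sparse matrix is row-aligned with the DataFrame. The helper names `save_chunks` and `load_chunk`, the file naming scheme, and the choice of pickle for the frame are all illustrative assumptions; `scipy.sparse.save_npz` and `scipy.sparse.load_npz` keep the matrix sparse on disk.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse


def save_chunks(df, counts, out_dir, chunk_size=100_000):
    """Write df and the row-aligned sparse matrix in chunks; return the chunk count."""
    if len(df) != counts.shape[0]:
        raise ValueError("df and counts must have the same number of rows")
    counts = counts.tocsr()  # CSR supports efficient row slicing
    n_chunks = 0
    for i, start in enumerate(range(0, len(df), chunk_size)):
        stop = start + chunk_size
        # The ordinary columns go into one file, the sparse rows into another.
        df.iloc[start:stop].to_pickle(os.path.join(out_dir, f"frame_{i}.pkl"))
        sparse.save_npz(os.path.join(out_dir, f"counts_{i}.npz"), counts[start:stop])
        n_chunks += 1
    return n_chunks


def load_chunk(out_dir, i):
    """Read back chunk i as a (DataFrame, csr_matrix) pair."""
    frame = pd.read_pickle(os.path.join(out_dir, f"frame_{i}.pkl"))
    block = sparse.load_npz(os.path.join(out_dir, f"counts_{i}.npz"))
    return frame, block


# Small synthetic example: 250 rows split into chunks of 100 rows.
df = pd.DataFrame({
    "string1": [f"a{i}" for i in range(250)],
    "integer": np.arange(250),
    "float": np.linspace(0.0, 1.0, 250),
})
counts = sparse.random(250, 20, density=0.1, format="csr", random_state=0)

out_dir = tempfile.mkdtemp()
n_chunks = save_chunks(df, counts, out_dir, chunk_size=100)
frame, block = load_chunk(out_dir, 2)  # the last, shorter chunk
print(n_chunks, frame.shape, block.shape)
```

Each chunk can then be processed on its own, and row k of `block` corresponds to row k of `frame`. Pickle is used here only to avoid extra dependencies; any DataFrame format that handles the string, integer and float columns would do.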

Source: MaxU, 2017-05-16 13:38:52

Thanks, the size does indeed explode. I will follow your suggestion and keep the two separate. – Tobias

@Tobias, glad I could help :) – MaxU
