HDFStore: table.select and RAM usage

I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.

I am using pandas 0.11-dev, Python 2.7, linux64.

In this first case, RAM usage matches the size of one chunk:

with pd.get_store("train.h5",'r') as train:
    for chunk in train.select('train',chunksize=50):
        pass

In this second case, it seems the whole table is loaded into RAM:

r=random.choice(400000,size=40,replace=False) 
train.select('train',pd.Term("index",r))

In this last case, RAM usage matches the equivalent chunk size:

r=random.choice(400000,size=30,replace=False)  
train.select('train',pd.Term("index",r))

I am puzzled as to why going from 30 to 40 random rows causes such a dramatic increase in RAM usage.

Note that the table was indexed when it was created, so that index=range(nrows(table)), using the following code:

def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store(storefile,'w') as store:
        for i, chunk in enumerate(pd.read_table(infile,header=header,sep=sep,chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index= range(chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name,chunk, min_itemsize={'values':max_len})

Thanks for any insight.

EDIT in answer to Zelazny7

Here is the file I used to write Train.csv to train.h5. I wrote it using elements of Zelazny7's code from How to trouble-shoot HDFStore Exception: cannot find the correct atom type

import pandas as pd 
import numpy as np 
from sklearn.feature_extraction import DictVectorizer 


def object_max_len(x):
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))

def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000):
    max_len = pd.read_table(infile,header=header, sep=sep,nrows=5).apply(object_max_len).max()
    dtypes0 = pd.read_table(infile,header=header, sep=sep,nrows=5).dtypes

    for chunk in pd.read_table(infile,header=header, sep=sep, chunksize=chunksize):
        max_len = max((pd.DataFrame(chunk.apply(object_max_len)).max(),max_len))
        for i,k in enumerate(zip(dtypes0[:], chunk.dtypes)):
            if (k[0] != k[1]) and (k[1] == 'object'):
                dtypes0[i] = k[1]
    #as of pandas-0.11 nan requires a float64 dtype
    dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
    return max_len, dtypes0


def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store(storefile,'w') as store:
        for i, chunk in enumerate(pd.read_table(infile,header=header,sep=sep,chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index= range(chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name,chunk, min_itemsize={'values':max_len})

Applied as:

txtfile2hdfstore('Train.csv','train.h5','train',sep=',')

Source

2013-04-09 user17375

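You seem to be using HDFStore in much the same way I would like to use it. I haven't had the time to write the wrapper code that handles a lot of the storing and retrieving. Would you mind sharing your 'txtfile2dtypes' code? Also, does your data contain a lot of character data? I run into problems when storing a csv file with variable-length character data into an HDFStore. The file size blows up because I have to set 'min_itemsize' to such a large value. I eagerly await the addition of a 'truncate' option. – Zelazny7 2013-04-09 13:54:09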

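@Zelazny7 I've updated the thread with the code. Actually, I am using it on the same data as you, the Kaggle bulldozer thing. I haven't yet dummified the categorical variables for use with 'sklearn'. – user17375 2013-04-09 14:44:40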

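Thank you very much! It looks like your file size balloons the same way mine does. The ~120MB file ends up over 1GB. I wonder whether you or Jeff would know if it's better to store variable-length 'object' columns (really just strings) using 'put', keeping each text column as its own HDFStore object. – Zelazny7 2013-04-09 14:53:09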

This is a known issue; see the reference here: https://github.com/pydata/pandas/pull/2755

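Essentially, the query is turned into a numexpr expression for evaluation. There is an issue where I can't pass a lot of 'or' conditions to numexpr (the limit depends on the total length of the generated expression).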

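So I just limit the expression that we pass to numexpr. If it exceeds a certain number of 'or' conditions, the query is done as a filter rather than as an in-kernel selection. Basically, this means the whole table is read and then reindexed.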

This is on my enhancements list: https://github.com/pydata/pandas/issues/2391 (17).

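As a workaround, just split your query into several smaller ones and concat the results. That should be much faster and use a constant amount of memory.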
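
A minimal sketch of that workaround, for illustration only: the helper name 'batched' is hypothetical, and the default batch size of 30 is an assumption taken from the question (30 rows stayed small, 40 did not), not a documented threshold.

```python
def batched(ids, batch_size=30):
    """Yield consecutive slices of `ids`, each with at most `batch_size` items.

    batch_size=30 is an assumption based on the question's observation;
    the real limit depends on the length of the generated expression.
    """
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Hypothetical usage with the store and row ids `r` from the question
# (not executed here; requires pandas and the train.h5 file):
#   parts = [train.select('train', pd.Term("index", batch)) for batch in batched(r)]
#   result = pd.concat(parts)
```

Each small query should stay on the in-kernel path, so only the selected rows are read per batch and the results are concatenated at the end.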

Source

2013-04-09 12:15:48 Jeff

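OK, thanks. I missed that issue; I should have searched the GitHub forum first. By the way, I just realised you are the HDFStore developer, so thanks for the great work! – user17375 2013-04-09 12:59:26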

It's pretty obscure and unfortunately easy to miss :) – Jeff 2013-04-09 13:20:02
