Pandas: sampling a DataFrame

I'm trying to read a fairly large CSV file with pandas and split it into two random chunks, one containing 10% of the data and the other 90%.

Here is my current attempt:

rows = data.index 
row_count = len(rows) 
random.shuffle(list(rows)) 

data.reindex(rows) 

training_data = data[row_count // 10:] 
testing_data = data[:row_count // 10]

For some reason, sklearn throws this error when I try to use one of the resulting DataFrame objects in an SVM classifier:

IndexError: each subindex must be either a slice, an integer, Ellipsis, or newaxis

I think I'm doing something wrong. Is there a better way?

Source

2012-08-30 Blender

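By the way, this won't shuffle anything anyway: the problem is 'random.shuffle(list(rows))'. 'shuffle' modifies the data it operates on in place, but when you call 'list(rows)' you create a copy of 'rows' that is shuffled and then thrown away; the underlying pandas Index, 'rows', is not changed. One solution is to call 'rows = list(rows)' first, then 'random.shuffle(rows)' and 'data.reindex(rows)'. –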
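A minimal sketch of the fix described in this comment, assuming a small stand-in DataFrame named `data` (the real one would come from the CSV file). One extra detail worth noting: `reindex` returns a new DataFrame rather than modifying `data` in place, so its result has to be assigned. The name `shuffled` is introduced here for illustration only.

```python
import random

import pandas as pd

# Stand-in for the DataFrame read from the CSV file (hypothetical data).
data = pd.DataFrame({"x": range(100), "y": range(100, 200)})

rows = list(data.index)          # a real list, so the shuffle is not lost
random.shuffle(rows)             # shuffles the list in place
shuffled = data.reindex(rows)    # reindex returns a new DataFrame

row_count = len(shuffled)
testing_data = shuffled.iloc[:row_count // 10]    # first 10% of shuffled rows
training_data = shuffled.iloc[row_count // 10:]   # remaining 90%
```

The two pieces are disjoint because they are positional slices of the same shuffled frame.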

What version of pandas are you using? Your code works fine for me (I'm on git master).

Another approach could be:

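import random

import numpy as np
import pandas

# 100 rows of random data in four columns
df = pandas.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# random.sample needs a sequence, so convert the index to a list first
rows = random.sample(list(df.index), 10)

df_10 = df.loc[rows]   # .ix was removed in pandas 1.0; .loc selects by label

df_90 = df.drop(rows)  # everything that was not sampled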

Newer versions (from 0.16.1) support this directly: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html

Source

2012-08-30 07:36:18

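Another approach is to use 'np.random.permutation' –

@WesMcKinney: I noticed that 'np.random.permutation' strips the column names from the DataFrame. Is there a pandas way to shuffle the DataFrame while keeping the column names? – hlin117

@hlin df.loc[np.random.permutation(df.index)] will shuffle the DataFrame and keep the column names. –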
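The suggestion in this comment can be sketched as follows, assuming the same kind of 100-row example frame used in the answer above. The key point is that the index labels are permuted, not the DataFrame itself, so the result is still a DataFrame with its columns intact.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4), columns=list("ABCD"))

# Permute the index labels, then select rows in that order with .loc.
shuffled = df.loc[np.random.permutation(df.index)]
```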

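I found that np.random.choice(), new in NumPy 1.7.0, works quite well for this.

For example, you can pass the index values of a DataFrame and the integer 10 to select 10 rows sampled uniformly at random.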

import numpy as np

rows = np.random.choice(df.index.values, 10)
sampled_df = df.loc[rows]  # .loc replaces .ix, which was removed in pandas 1.0

Source

2013-06-18 14:41:39 dragoljub

It takes half the time of 'random.sample'.. awesome – gc5

+1 for using np.random.choice. Also, if you have a pd.Series of probabilities, you can choose from its index: 'np.random.choice(prob.index.values, p=prob.values)' – LondonRob
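A small sketch of the weighted variant mentioned in this comment. The Series `prob` and its labels are made up for illustration; the only requirement from NumPy is that the probabilities sum to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical per-label probabilities; they must sum to 1.
prob = pd.Series([0.1, 0.2, 0.3, 0.4], index=["a", "b", "c", "d"])

# Draw 3 distinct labels, favouring the more probable ones.
picked = np.random.choice(prob.index.values, 3, replace=False, p=prob.values)
```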

Don't forget to specify replace=False if you want to sample without replacement. Otherwise, this method may sample the same row more than once. –
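To illustrate the point about replacement, here is a sketch that assumes a 100-row example frame with a unique index. With `replace=False` the 10 labels are guaranteed to be distinct, so dropping them leaves exactly the other 90 rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4), columns=list("ABCD"))

# replace=False guarantees that each index label is drawn at most once.
rows = np.random.choice(df.index.values, 10, replace=False)

df_10 = df.loc[rows]    # the 10 sampled rows
df_90 = df.drop(rows)   # the remaining 90 rows
```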

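If you load the data with pandas.read_csv, you can sample directly while loading by using the skiprows parameter. Here is a short article I wrote about it - https://nikolaygrozev.wordpress.com/2015/06/16/fast-and-simple-sampling-in-pandas-when-loading-data-from-files/

Source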
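One way the skiprows idea could look, as a sketch rather than a reproduction of the linked article. It assumes the CSV has a single header line and uses an in-memory string in place of a real file; with a file on disk you would count the lines in a first pass and then pass the file name to read_csv.

```python
import io
import random

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

n_rows = csv_text.count("\n")   # number of data rows (header excluded)
n_keep = n_rows // 10           # keep 10% of the rows

# Line 0 is the header, so the data rows are lines 1..n_rows.
# Pick the line numbers to skip; everything else is parsed.
skip = sorted(random.sample(range(1, n_rows + 1), n_rows - n_keep))

sample_df = pd.read_csv(io.StringIO(csv_text), skiprows=skip)
```

Only the rows that are kept are parsed into the DataFrame, so the full data set never has to fit in memory.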

2015-06-16 04:24:11 Nikolay

Take a look at itertools.islice – Merlin

This is the right answer to the question. – redreamality

pandas 0.16.1 has a sample method.

Source

2015-06-22 03:13:46 hurrial

Nice! But you still need to load all the data into memory, right? – Nikolay

I do this after loading the data into memory. – hurrial

As of version 0.16.1:

sample_dataframe = your_dataframe.sample(n=how_many_rows_you_want)

Docs here: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.sample.html

Source

2015-11-17 22:53:28 dval

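After getting sample_dataframe, how do I subtract it from your_dataframe? –

@ChrisNielsen Are you asking so that you can do cross-validation? If so, I recommend http://scikit-learn.org/stable/modules/cross_validation.html, since it gives you all the training and test datasets directly (X_train, X_test, y_train, y_test) – dval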
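For the narrower question of subtracting the sample, a pandas-only sketch is to drop the sampled index labels. This assumes the index is unique; `rest_dataframe` is a name introduced here, and `random_state` is set only to make the example repeatable.

```python
import numpy as np
import pandas as pd

your_dataframe = pd.DataFrame(np.random.randn(100, 4), columns=list("ABCD"))

# Draw 10% of the rows.
sample_dataframe = your_dataframe.sample(frac=0.1, random_state=0)

# Dropping the sampled index labels leaves the remaining 90%.
rest_dataframe = your_dataframe.drop(sample_dataframe.index)
```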
