2017-02-04

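python: removing records from a dataset by comparing two histograms

I have two datasets with many columns (on the order of 10) and different lengths (each row is a record), and they must end up with the same number of rows. The condition is a binning over several columns, from 2 to 4, followed by removing the excess records from one of the two datasets (picked at random among all the records in that bin).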

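I currently use numpy, but pandas would be fine too. Since I know in advance that one dataset is smaller than the other, my (admittedly naive) idea is to compute two histograms (the smaller one first), subtract one from the other to get the difference in each bin, and then walk through the larger dataset removing the excess records. But for that I have to know which record is in which bin!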

A piece of code that computes the histogram in python (using two columns of the dataset for simplicity):

import numpy as np 
import numpy.random as rd 
x = 50*rd.random((100, 5)) 
np.histogram2d(x[:, 0], x[:, 1], bins=[10, 5]) 
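
The subtraction step described above only works if both histograms are built on the same bin edges. A minimal sketch of that step, assuming two hypothetical arrays `small` and `large` and hand-picked edges covering 0 to 50 (these names and values are illustrative, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded only so the sketch is repeatable
small = 50 * rng.random((100, 5))    # hypothetical smaller dataset
large = 50 * rng.random((160, 5))    # hypothetical larger dataset

# Shared edges, so that bin (i, j) covers the same region in both histograms.
x_edges = np.linspace(0, 50, 11)     # 10 bins on column 0
y_edges = np.linspace(0, 50, 6)      # 5 bins on column 1

h_small, _, _ = np.histogram2d(small[:, 0], small[:, 1], bins=[x_edges, y_edges])
h_large, _, _ = np.histogram2d(large[:, 0], large[:, 1], bins=[x_edges, y_edges])

# Positive entries: records the larger dataset has in excess in that bin.
# Negative entries: bins where the smaller dataset actually has more records.
excess = (h_large - h_small).astype(int)
```

Note that a smaller total does not guarantee a smaller count in every bin, so `excess` may contain negative entries.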

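Is there a way to keep track of the dataset indices while binning? I know pandas DataFrames have an index, so they could be a natural choice, as long as I stick with this algorithm.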

Is there a smarter way to do this, changing the algorithm but sticking with python?

Does [numpy.digitize](https://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html) help?
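
For what it's worth, here is a rough sketch of how `numpy.digitize` could answer the "which record is in which bin" part while staying in numpy. The edges and the chosen bin `(3, 2)` are assumptions made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 50 * rng.random((100, 5))

x_edges = np.linspace(0, 50, 11)   # 10 bins on column 0
y_edges = np.linspace(0, 50, 6)    # 5 bins on column 1

# np.digitize returns 1-based bin numbers for values inside the edges,
# so subtracting 1 gives 0-based bin indices, one per record.
ix = np.digitize(x[:, 0], x_edges) - 1
iy = np.digitize(x[:, 1], y_edges) - 1

# Row numbers of the records that fall in bin (3, 2).
rows_in_bin = np.flatnonzero((ix == 3) & (iy == 2))
```

The arrays `ix` and `iy` have one entry per record, so any bin can be mapped back to row numbers with a boolean mask.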

Answer

I found a nice solution using pandas.

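import numpy as np
import pandas as pd
x = 50 * np.random.randn(50, 5)
dfx = pd.DataFrame(x)
# 10 edges spanning column 0; pd.cut intervals are right-closed, so the
# minimum value itself lands in no bin unless include_lowest=True is passed
bins = np.linspace(min(dfx[0]), max(dfx[0]), 10)
first_binning = pd.cut(dfx[0], bins)
# 5 edges spanning column 1
bins = np.linspace(min(dfx[1]), max(dfx[1]), 5)
second_binning = pd.cut(dfx[1], bins)
# group the records by their (column 0 bin, column 1 bin) pair
groups = dfx.groupby([first_binning, second_binning])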

Now you can run the following (the output depends on your data):

In [160]: groups.size() 
Out[160]: 
0     1 
(-101.273, -71.403] (50.481, 109.902]  2 
(-71.403, -41.532] (-68.362, -8.94]  4 
        (-8.94, 50.481]  3 
        (50.481, 109.902]  1 
(-41.532, -11.661] (-68.362, -8.94]  4 
        (-8.94, 50.481]  3 
        (50.481, 109.902]  2 
(-11.661, 18.21]  (-127.783, -68.362] 2 
        (-8.94, 50.481]  6 
        (50.481, 109.902]  1 
(18.21, 48.0806]  (-127.783, -68.362] 2 
        (-68.362, -8.94]  5 
        (-8.94, 50.481]  3 
        (50.481, 109.902]  3 
(48.0806, 77.951] (-68.362, -8.94]  2 
        (-8.94, 50.481]  4 
(77.951, 107.822] (-68.362, -8.94]  1 
dtype: int64 

to see the counts, and

In [163]: groups.indices 
Out[163]: 
{('(-101.273, -71.403]', '(50.481, 109.902]'): array([20, 37]), 
('(-11.661, 18.21]', '(-127.783, -68.362]'): array([26, 39]), 
('(-11.661, 18.21]', '(-8.94, 50.481]'): array([ 4, 14, 18, 34, 35,  45]), 
('(-11.661, 18.21]', '(50.481, 109.902]'): array([17]), 
('(-41.532, -11.661]', '(-68.362, -8.94]'): array([ 3, 13, 16, 30]), 
('(-41.532, -11.661]', '(-8.94, 50.481]'): array([25, 38, 48]), 
('(-41.532, -11.661]', '(50.481, 109.902]'): array([0, 5]), 
('(-71.403, -41.532]', '(-68.362, -8.94]'): array([ 1, 24, 32, 47]), 
('(-71.403, -41.532]', '(-8.94, 50.481]'): array([ 6, 19, 31]), 
('(-71.403, -41.532]', '(50.481, 109.902]'): array([12]), 
('(18.21, 48.0806]', '(-127.783, -68.362]'): array([21, 46]), 
('(18.21, 48.0806]', '(-68.362, -8.94]'): array([ 2, 15, 22, 33, 40]), 
('(18.21, 48.0806]', '(-8.94, 50.481]'): array([ 7, 28, 36]), 
('(18.21, 48.0806]', '(50.481, 109.902]'): array([ 9, 23, 49]), 
('(48.0806, 77.951]', '(-68.362, -8.94]'): array([41, 42]), 
('(48.0806, 77.951]', '(-8.94, 50.481]'): array([27, 29, 43, 44]), 
('(77.951, 107.822]', '(-68.362, -8.94]'): array([11])} 

to see the indices of the dataset records in each bin.
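
From there, the random removal asked about in the question could look roughly like the sketch below. It is not part of the original answer: the datasets `small` and `large`, the helper `bin_keys`, and the fixed edges are all hypothetical, and it uses `.groups` (index labels) rather than `.indices` (positions) so the result can be passed straight to `drop`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
small = pd.DataFrame(50 * rng.random((100, 5)))   # hypothetical smaller dataset
large = pd.DataFrame(50 * rng.random((160, 5)))   # hypothetical larger dataset

# Shared edges so both datasets are binned identically.
x_edges = np.linspace(0, 50, 11)
y_edges = np.linspace(0, 50, 6)

def bin_keys(df):
    # Hypothetical helper: integer bin codes per record for columns 0 and 1.
    bx = pd.cut(df[0], x_edges, labels=False, include_lowest=True)
    by = pd.cut(df[1], y_edges, labels=False, include_lowest=True)
    return bx, by

sx, sy = bin_keys(small)
# Records per bin in the smaller dataset, as {(bin_x, bin_y): count}.
target = small.groupby([sx, sy]).size().to_dict()

lx, ly = bin_keys(large)
drop = []
for key, idx in large.groupby([lx, ly]).groups.items():
    n_excess = len(idx) - target.get(key, 0)
    if n_excess > 0:
        # Pick the excess records at random among those in this bin.
        drop.extend(rng.choice(idx.to_numpy(), size=n_excess, replace=False))

trimmed = large.drop(index=drop)
```

The two datasets end up with the same number of rows only if the larger one has at least as many records as the smaller one in every bin. Otherwise the same trimming would have to be applied in the other direction for the remaining bins.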