Binning a DataFrame in pandas in Python

Given the following DataFrame in pandas:

import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

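where id is an ID for each point consisting of an a and a b value, how can I bin a and b into a specified set of bins (so that I can then take the median/mean of a and b in each bin)? For any given row in df, df might have NaN values for a or b (or both). Thanks.

Here is a better example using Joe Kington's solution with a more realistic df. What I am unsure about is how to access the df.b elements for each df.a group below: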

a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})
# bins for df.a
bins = np.linspace(0, 1, 10)
# bin df according to a
groups = df.groupby(np.digitize(df.a, bins))
# Get the mean of a in each group
print(groups.mean())
## But how to get the mean of b for each group of a?
# ...


2013-06-05 user248237dfsf

There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here is how I would do it:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" using 10 evenly spaced bin edges...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"

# Similarly, the median:
print(groups.median())

# Apply some arbitrary function to aggregate binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))

Edit: Since the OP asked specifically for the means of b binned by the values in a, just do

groups.mean().b
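
If only b is of interest, the column can also be selected before aggregating, which skips the statistics for the other columns. The following is a minimal sketch assuming a reasonably recent pandas; the fixed seed and the names `edges` and `b_stats` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, used only to make the sketch reproducible
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100) + 10})

# Same binning as above: 10 evenly spaced edges over the range of "a"
edges = np.linspace(df["a"].min(), df["a"].max(), 10)
groups = df.groupby(np.digitize(df["a"], edges))

# Select "b" first, then compute several statistics per bin of "a"
b_stats = groups["b"].agg(["mean", "median"])
print(b_stats)
```

The "mean" column here should match groups.mean().b; the difference is only that nothing is computed for "a".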

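Also, if you want the index to look nicer (e.g. display the intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I did not realize pandas.cut existed.)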

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" using 10 evenly spaced bin edges (9 intervals)...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

This results in:

a 
(0.00186, 0.111] 10.421839 
(0.111, 0.22]  10.427540 
(0.22, 0.33]  10.538932 
(0.33, 0.439]  10.445085 
(0.439, 0.548]  10.313612 
(0.548, 0.658]  10.319387 
(0.658, 0.767]  10.367444 
(0.767, 0.876]  10.469655 
(0.876, 0.986]  10.571008 
Name: b
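
The question also mentions that a or b may contain NaN. As a rough sketch of how the pandas.cut approach behaves in that case (the NaN positions, the seed and the fixed 0 to 1 edges are assumptions made for illustration): rows whose a is NaN receive no bin and drop out of the groups, while NaN values of b are skipped by mean().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed, for reproducibility only
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100) + 10})

# Illustrative assumption: knock out a few values in each column
df.loc[[3, 17], "a"] = np.nan
df.loc[[5, 17, 42], "b"] = np.nan

edges = np.linspace(0, 1, 11)  # ten equal-width bins covering (0, 1]
binned = pd.cut(df["a"], edges)

groups = df.groupby(binned, observed=False)
n_binned = groups.size().sum()   # rows with NaN in "a" belong to no bin
b_mean = groups["b"].mean()      # NaN values of "b" are skipped
b_count = groups["b"].count()    # number of non-NaN "b" values per bin
print(n_binned, b_count.sum())
```

Comparing size() with count() per bin is one way to see how many b values were actually available for each mean.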


2013-06-05 20:42:45

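Excellent, elegant! Exactly what I was looking for. No need to sort the data frame at all. – user248237dfsf

What if you want to access the 'b' values based on the groups? 'groups.mean()' gives you the means of 'a', I believe. – user248237dfsf

@user248237dfsf - No, it gives the means of both 'a' and 'b' (or rather, it gives the mean of b for the values binned by 'a', which is what I thought you were asking for). –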

Not 100% sure if this is what you are looking for, but here is what I think you are getting at:

In [144]: df = DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)}) 

In [145]: bins = [0, .25, .5, .75, 1] 

In [146]: a_bins = df.a.groupby(cut(df.a,bins)) 

In [147]: b_bins = df.b.groupby(cut(df.b,bins)) 

In [148]: a_bins.agg([mean,median]) 
Out[148]: 
       mean median 
a 
(0, 0.25] 0.124173 0.114613 
(0.25, 0.5] 0.367703 0.358866 
(0.5, 0.75] 0.624251 0.626730 
(0.75, 1] 0.875395 0.869843 

In [149]: b_bins.agg([mean,median]) 
Out[149]: 
       mean median 
b 
(0, 0.25] 0.147936 0.166900 
(0.25, 0.5] 0.394918 0.386729 
(0.5, 0.75] 0.636111 0.655247 
(0.75, 1] 0.851227 0.838805

Of course, I don't know what kind of binning you have in mind, so you will have to swap in bins that suit your situation.
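
The session above relies on names such as DataFrame, cut, mean and median already being present in the interactive namespace. A self-contained version might look like the sketch below, which assumes a recent pandas and uses the string names "mean" and "median" in agg; the seed and the names `a_stats` and `b_stats` are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # fixed seed, for reproducibility only
df = pd.DataFrame({"a": rng.random(100),
                   "b": rng.random(100),
                   "id": np.arange(100)})

bins = [0, 0.25, 0.5, 0.75, 1]

# Each column is binned by its own values
a_bins = df["a"].groupby(pd.cut(df["a"], bins), observed=False)
b_bins = df["b"].groupby(pd.cut(df["b"], bins), observed=False)

a_stats = a_bins.agg(["mean", "median"])
b_stats = b_bins.agg(["mean", "median"])
print(a_stats)
print(b_stats)
```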


2013-06-05 20:42:58 bdiamante

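Nice! I thought the OP wanted to bin 'b' by 'a', but in retrospect, your answer may be what they wanted. I'll leave mine up, since our answers do slightly different things. –

It may be worth mentioning that it is 'pandas.DataFrame({..})' and 'a_bins.agg([numpy.mean, numpy.median])' – Guido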

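Joe Kington's answer was very helpful; however, I noticed that it does not bin all of the data. It actually leaves out the row where a = a.min(). Summing groups.size() gave 99 instead of 100.

To guarantee that all of the data is binned, just pass the number of bins to cut(); that function will automatically pad the first [last] bin by 0.1% to ensure that all of the data is included.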

df = pandas.DataFrame({"a": np.random.random(100), 
        "b": np.random.random(100) + 10}) 

# Bin the data frame by "a" with 10 bins... 
groups = df.groupby(pandas.cut(df.a, 10)) 

# Get the mean of b, binned by the values in a 
print(groups.mean().b)

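In this case, summing groups.size() gave 100.

I know this is a bit picky for this particular question, but for a similar problem I was trying to solve, it was crucial for obtaining the correct answer.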
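
The difference in counts can be checked directly. The sketch below compares explicit edges, explicit edges with include_lowest=True (a pandas.cut option not used in the answers above), and an integer bin count. The seed and the `counts` dictionary are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # fixed seed, for reproducibility only
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100) + 10})

edges = np.linspace(df["a"].min(), df["a"].max(), 10)

# Explicit edges: intervals are open on the left, so a.min() gets no bin
explicit = pd.cut(df["a"], edges)
# Same edges, but with the first interval closed on the left
inclusive = pd.cut(df["a"], edges, include_lowest=True)
# Integer bin count: pandas chooses the edges and pads the range slightly
automatic = pd.cut(df["a"], 10)

counts = {
    "explicit": int(explicit.notna().sum()),
    "inclusive": int(inclusive.notna().sum()),
    "automatic": int(automatic.notna().sum()),
}
print(counts)
```

Note that 10 edges produce 9 intervals, whereas passing the integer 10 produces 10 intervals, so the two approaches do not yield identical bins even when every row is included.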


2014-05-16 02:26:51 Perk

If you are not set on pandas grouping, you can use scipy.stats.binned_statistic:

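from scipy.stats import binned_statistic

# binned_statistic returns (statistic, bin_edges, binnumber); keep the per-bin means of b
means = binned_statistic(df.a, df.b, bins=np.linspace(min(df.a), max(df.a), 10)).statistic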
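
For completeness, here is a self-contained sketch of the same idea that also computes the median by passing statistic="median". The seed and variable names are assumptions for illustration, and the data is assumed to contain no NaN values.

```python
import numpy as np
import pandas as pd
from scipy.stats import binned_statistic

rng = np.random.default_rng(4)  # fixed seed, for reproducibility only
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100) + 10})

x = df["a"].to_numpy()
values = df["b"].to_numpy()
edges = np.linspace(x.min(), x.max(), 10)  # 10 edges, i.e. 9 bins

# The result carries the statistic per bin, the bin edges and each point's bin number
result = binned_statistic(x, values, statistic="mean", bins=edges)
means = result.statistic
medians = binned_statistic(x, values, statistic="median", bins=edges).statistic
print(means)
print(medians)
```

The returned arrays are plain NumPy arrays indexed by bin position, so the interval labels that pandas.cut provides have to be reconstructed from result.bin_edges if they are needed.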


2017-10-30 10:46:25 bio
