2017-01-14 88 views

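Grouping data in a dataframe from multiple columns (including an auto-generated column)

I am trying to create an index from two columns in a pandas DataFrame. However, I first want to "bucket" the values in one of the columns, and then use the bucketed values in the index.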

The code below should help explain further:

import numpy as np
import pandas as pd

# No error checking, pseudocode ...
def bucket_generator(source_data, colname, step_size):
    # create bucket column (string)
    source_data['bucket'] = ''

    # obtain the series to operate on
    series = source_data[colname]

    # determine which bucket number each cell in series would belong to,
    # by dividing the cell value by the step_size

    # Naive way would be to iterate over cells in series, generating a
    # bucket label like "bucket_{0:+}".format(cell_value/step_size),
    # then stick it in a cell in the bucket column, but there must be a more
    # 'dataframe' way of doing it, rather than looping

data = {'a': (10,3,5,7,15,20,10,3,5,7,19,5,7,5,10,5,3,7,20,20),
     'b': (98.5,107.2,350,211.2,120.5,-70.8,135.9,205.1,-12.8,280.5,-19.7,77.2,88.2,69.2,101.2,-302.4,-79.8,-257.6,89.6,95.7),
     'c': (12.5,23.4,11.5,45.2,17.6,19.5,0.25,33.6,18.9,6.5,12.5,26.2,5.2,0.3,7.2,8.9,2.1,3.1,19.1,20.2)
     }

df = pd.DataFrame(data) 

df 

     a       b      c
0   10    98.5  12.50
1    3   107.2  23.40
2    5   350.0  11.50
3    7   211.2  45.20
4   15   120.5  17.60
5   20   -70.8  19.50
6   10   135.9   0.25
7    3   205.1  33.60
8    5   -12.8  18.90
9    7   280.5   6.50
10  19   -19.7  12.50
11   5    77.2  26.20
12   7    88.2   5.20
13   5    69.2   0.30
14  10   101.2   7.20
15   5  -302.4   8.90
16   3   -79.8   2.10
17   7  -257.6   3.10
18  20    89.6  19.10
19  20    95.7  20.20

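This is what I want to do:

  1. Correctly implement the function bucket_generator.
  2. Group the DataFrame by column 'a', THEN by the 'bucket' label.
  3. Select rows from the DataFrame for a given value (integer) in column 'a' and a given 'bucket' label in the bucket column.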

Answers

New Answer

Focusing on what the OP asked for

import pandas as pd

def bucket_generator(source_data, colname, step_size):
    # Floor-divide each value by step_size to get its bucket number,
    # then build the label as a string (modifies source_data in place)
    series = source_data[colname]
    source_data['bucket'] = 'bucket_' + (series // step_size).astype(int).astype(str)

data = {'a': (10,3,5,7,15,20,10,3,5,7,19,5,7,5,10,5,3,7,20,20),
     'b': (98.5,107.2,350,211.2,120.5,-70.8,135.9,205.1,-12.8,280.5,-19.7,77.2,88.2,69.2,101.2,-302.4,-79.8,-257.6,89.6,95.7),
     'c': (12.5,23.4,11.5,45.2,17.6,19.5,0.25,33.6,18.9,6.5,12.5,26.2,5.2,0.3,7.2,8.9,2.1,3.1,19.1,20.2)
     }

df = pd.DataFrame(data)
bucket_generator(df, 'a', 5)

# Option 1: index by 'a' then 'bucket', and select with a cross-section
df1 = df.set_index(['a', 'bucket']).sort_index(kind='mergesort')
print(df1.xs((3, 'bucket_0')).reset_index())

# Option 2: build a dict of groups keyed by (a, bucket) tuples
dob = {bucket: group for bucket, group in df.groupby(['a', 'bucket'])}
print(dob[(3, 'bucket_0')])

   a    bucket      b     c
0  3  bucket_0  107.2  23.4
1  3  bucket_0  205.1  33.6
2  3  bucket_0  -79.8   2.1
    a      b     c    bucket
1   3  107.2  23.4  bucket_0
7   3  205.1  33.6  bucket_0
16  3  -79.8   2.1  bucket_0
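
The question's pseudocode mentions signed labels of the form "bucket_{0:+}", and a single group can also be pulled out without building a dictionary first. The following is only a sketch under assumptions: `add_bucket` is a hypothetical helper name (not part of pandas), the frame `small` is made-up data, and bucketing column 'b' with a step of 50 is chosen purely for illustration.

```python
import pandas as pd

def add_bucket(frame, colname, step_size):
    # Hypothetical helper: returns a copy with a signed bucket label column
    out = frame.copy()
    bucket_no = (out[colname] // step_size).astype(int)
    # '{:+d}' keeps the sign in the label, e.g. bucket_+2 or bucket_-1
    out['bucket'] = bucket_no.map('bucket_{:+d}'.format)
    return out

# Made-up sample data, not the question's full frame
small = pd.DataFrame({'a': [10, 3, 5, 3], 'b': [98.5, 107.2, -12.8, 205.1]})
bucketed = add_bucket(small, 'b', 50)

# Select one (a, bucket) group directly from the GroupBy object
grouped = bucketed.groupby(['a', 'bucket'])
rows = grouped.get_group((3, 'bucket_+2'))
print(rows)
```

Returning a copy instead of modifying the input is a design choice of this sketch; the answer's `bucket_generator` modifies the frame in place.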

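Old Answer

  • Assign to df.index a list of the columns you want as levels of the index.
  • Use pd.qcut to help with the bucketizing.
  • Use a list comprehension to help with the labels.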

def enlabeler(s, n): 
    return ['{}_{}'.format(s, i) for i in range(n)] 

df.index = [ 
    pd.qcut(df.a, 3, enlabeler('a', 3)), 
    pd.qcut(df.b, 3, enlabeler('b', 3)), 
    pd.qcut(df.c, 3, enlabeler('c', 3)) 
] 

print(df) 
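
One thing worth keeping in mind: pd.qcut splits on quantiles, so each bucket holds roughly the same number of rows, whereas the fixed step_size in the question corresponds more closely to equal-width bins. A small sketch contrasting pd.qcut with pd.cut; the series `s` and its values are made up for illustration.

```python
import pandas as pd

# Made-up values with one large outlier
s = pd.Series([1, 2, 3, 4, 5, 100], name='x')
labels = ['x_0', 'x_1', 'x_2']

# Quantile-based: bin edges are chosen so the buckets are equally populated
by_quantile = pd.qcut(s, 3, labels=labels)

# Equal-width: the value range is split into 3 intervals of the same width
by_width = pd.cut(s, 3, labels=labels)

print(by_quantile.value_counts().sort_index())
print(by_width.value_counts().sort_index())
```

With the outlier present, the equal-width version puts almost every value in the first bucket, while the quantile version spreads them evenly.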



A bit more dynamic, and with a subset of columns

def enlabeler(s, n): 
    return ['{}_{}'.format(s, i) for i in range(n)] 

def cutcol(c, n): 
    return pd.qcut(c, n, enlabeler(c.name, n)) 

df.index = df[['a', 'b']].apply(cutcol, n=3).values.T.tolist() 
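
Once the index has been built from bucket labels like this, rows for a given pair of labels can be selected with a tuple key. A sketch on a small made-up frame (`demo` and its values are assumptions for illustration; the helper functions repeat the ones above so the block is self-contained):

```python
import pandas as pd

def enlabeler(s, n):
    return ['{}_{}'.format(s, i) for i in range(n)]

def cutcol(c, n):
    return pd.qcut(c, n, labels=enlabeler(c.name, n))

# Made-up data: 'a' ascends while 'b' descends
demo = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [60, 50, 40, 30, 20, 10]})

# Two index levels: the bucket label of 'a', then the bucket label of 'b'
demo.index = demo[['a', 'b']].apply(cutcol, n=2).values.T.tolist()

# Select the rows in the lower half of 'a' and the upper half of 'b'
picked = demo.loc[('a_0', 'b_1'), :]
print(picked)
```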


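I think you may have slightly misunderstood the problem I am trying to solve. I am trying to create a bucket label (I have now learnt how to do that using 'apply'), but once I have the new label column (column 'bucket_id'), I want to index the data in the DataFrame by column **a** THEN column **bucket_id** –

@HomunculusReticulli Yes, I understand. I have given you enough information to do that. I can explain further. – piRSquared

@HomunculusReticulli I have updated my post – piRSquared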