2016-08-14 109 views

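I am trying to take one value from Input 1, `c_med`, as a threshold, and use it to split the rows of Input 2 into two outputs: the rows above the threshold and the rows below it. The comparison is made on column `c_total`, and the results should be written to above.csv and below.csv.

I then want to read above.csv back in and classify its rows by percentage, as in the pure-Python code shown in the second point below.

Input 1: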

date_count,all_hours,c_min,c_max,c_med,c_med_med,u_min,u_max,u_med,u_med_med 
2,12,2309,19072,12515,13131,254,785,686,751 

Input 2, with columns ['date','startTime','endTime','day','c_total','u_total']:

2004-01-05,22:00:00,23:00:00,Mon,18944,790 
2004-01-05,23:00:00,00:00:00,Mon,17534,750 
2004-01-06,00:00:00,01:00:00,Tue,17262,747 
2004-01-06,01:00:00,02:00:00,Tue,19072,777 
2004-01-06,02:00:00,03:00:00,Tue,18275,785 
2004-01-06,03:00:00,04:00:00,Tue,13589,757 
2004-01-06,04:00:00,05:00:00,Tue,16053,735 
2004-01-06,05:00:00,06:00:00,Tue,11440,636 
2004-01-06,06:00:00,07:00:00,Tue,5972,513 
2004-01-06,07:00:00,08:00:00,Tue,3424,382 
2004-01-06,08:00:00,09:00:00,Tue,2696,303 
2004-01-06,09:00:00,10:00:00,Tue,2350,262 
2004-01-06,10:00:00,11:00:00,Tue,2309,254 
  1. I am trying to read the threshold `c_med` from the other input CSV.

Reading the threshold gives me the following error:

Traceback (most recent call last): 
    File "class_med.py", line 10, in <module> 
    above_median = df_data['c_total'] > df_med['c_med'] 
    File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 735, in wrapper 
    raise ValueError('Series lengths must match to compare') 
ValueError: Series lengths must match to compare 
  • Classify the rows by percentage on column `c_total`. A pure-Python solution is given below, but I am looking for a pandas solution, like the one in Reference one.

    for row in csv.reader(inp):
        if int(row[1]) < (.20 * max_value):
            val = 'viewers'
        elif int(row[1]) >= (0.20 * max_value) and int(row[1]) < (0.40 * max_value):
            val = 'event based'
        elif int(row[1]) >= (0.40 * max_value) and int(row[1]) < (0.60 * max_value):
            val = 'situational'
        elif int(row[1]) >= (0.60 * max_value) and int(row[1]) < (0.80 * max_value):
            val = 'active'
        else:
            val = 'highly active'
        writer.writerow([row[0], row[1], val])

  • Code:

    import pandas as pd 
    import numpy as np 
    
    df_med = pd.read_csv('stat_result.csv') 
    df_med.columns = ['date_count', 'all_hours', 'c_min', 'c_max', 'c_med', 'c_med_med', 'u_min', 'u_max', 'u_med', 'u_med_med'] 
    
    df_data = pd.read_csv('mini_out.csv') 
    df_data.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total'] 
    
    above = df_data['c_total'] > df_med['c_med'] 
    
    #print above_median 
    
    above.to_csv('above.csv', index=None, header=None) 
    
    df_above = pd.readcsv('above_median.csv') 
    df_above.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total'] 
    
    #Percentage block should come here 
    

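Edit: when the values come from a single column, qcut is the simplest solution. But when the classification depends on two values from two different columns, how can this be done in pandas?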

    for row in csv.reader(inp): 
         if int(row[1])>(0.80*max_user) and int(row[2])>(0.80*max_key): 
          val='highly active' 
         elif int(row[1])>=(0.60*max_user) and int(row[2])<=(0.60*max_key): 
          val='active' 
         elif int(row[1])<=(0.40*max_user) and int(row[2])>=(0.40*max_key): 
          val='event based' 
         elif int(row[1])<(0.20*max_user) and int(row[2])<(0.20*max_key): 
          val ='situational' 
         else: 
          val= 'viewers' 
    

Answer


Assume you have the following DataFrames:

    In [7]: df1 
    Out[7]: 
        date_count all_hours c_min c_max c_med c_med_med u_min u_max u_med u_med_med 
    0   2   12 2309 19072 12515  13131 254 785 686  751 
    
    In [8]: df2 
    Out[8]: 
          date startTime endTime day c_total u_total 
    0 2004-01-05 22:00:00 23:00:00 Mon 18944  790 
    1 2004-01-05 23:00:00 00:00:00 Mon 17534  750 
    2 2004-01-06 00:00:00 01:00:00 Tue 17262  747 
    3 2004-01-06 01:00:00 02:00:00 Tue 19072  777 
    4 2004-01-06 02:00:00 03:00:00 Tue 18275  785 
    5 2004-01-06 03:00:00 04:00:00 Tue 13589  757 
    6 2004-01-06 04:00:00 05:00:00 Tue 16053  735 
    7 2004-01-06 05:00:00 06:00:00 Tue 11440  636 
    8 2004-01-06 06:00:00 07:00:00 Tue  5972  513 
    9 2004-01-06 07:00:00 08:00:00 Tue  3424  382 
    10 2004-01-06 08:00:00 09:00:00 Tue  2696  303 
    11 2004-01-06 09:00:00 10:00:00 Tue  2350  262 
    12 2004-01-06 10:00:00 11:00:00 Tue  2309  254 
    

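Splitting by the threshold: a Series can be compared either with another Series of the same length or with a scalar value. I assume you want to split your second dataset by comparing `c_total` with a scalar, namely the value of the `c_med` column from your first dataset (`.loc` is used here because `.ix` has been removed from current pandas):

    In [22]: above = df2[df2.c_total > df1.loc[0, 'c_med']] 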
    
    In [23]: above 
    Out[23]: 
         date startTime endTime day c_total u_total 
    0 2004-01-05 22:00:00 23:00:00 Mon 18944  790 
    1 2004-01-05 23:00:00 00:00:00 Mon 17534  750 
    2 2004-01-06 00:00:00 01:00:00 Tue 17262  747 
    3 2004-01-06 01:00:00 02:00:00 Tue 19072  777 
    4 2004-01-06 02:00:00 03:00:00 Tue 18275  785 
    5 2004-01-06 03:00:00 04:00:00 Tue 13589  757 
    6 2004-01-06 04:00:00 05:00:00 Tue 16053  735 
    

You can use the qcut() method to categorize the data:

    In [29]: df2['cat'] = pd.qcut(df2.c_total, 
        ....:      q=[0, .2, .4, .6, .8, 1.], 
        ....:      labels=['viewers','event based','situational','active','highly active']) 
    
    In [30]: df2 
    Out[30]: 
          date startTime endTime day c_total u_total   cat 
    0 2004-01-05 22:00:00 23:00:00 Mon 18944  790 highly active 
    1 2004-01-05 23:00:00 00:00:00 Mon 17534  750   active 
    2 2004-01-06 00:00:00 01:00:00 Tue 17262  747   active 
    3 2004-01-06 01:00:00 02:00:00 Tue 19072  777 highly active 
    4 2004-01-06 02:00:00 03:00:00 Tue 18275  785 highly active 
    5 2004-01-06 03:00:00 04:00:00 Tue 13589  757 situational 
    6 2004-01-06 04:00:00 05:00:00 Tue 16053  735 situational 
    7 2004-01-06 05:00:00 06:00:00 Tue 11440  636 situational 
    8 2004-01-06 06:00:00 07:00:00 Tue  5972  513 event based 
    9 2004-01-06 07:00:00 08:00:00 Tue  3424  382 event based 
    10 2004-01-06 08:00:00 09:00:00 Tue  2696  303  viewers 
    11 2004-01-06 09:00:00 10:00:00 Tue  2350  262  viewers 
    12 2004-01-06 10:00:00 11:00:00 Tue  2309  254  viewers 
    

Check:

    In [32]: df2.assign(pct=df2.c_total/df2.c_total.max())[['c_total','pct','cat']] 
    Out[32]: 
        c_total  pct   cat 
    0  18944 0.993289 highly active 
    1  17534 0.919358   active 
    2  17262 0.905096   active 
    3  19072 1.000000 highly active 
    4  18275 0.958211 highly active 
    5  13589 0.712510 situational 
    6  16053 0.841705 situational 
    7  11440 0.599832 situational 
    8  5972 0.313129 event based 
    9  3424 0.179530 event based 
    10  2696 0.141359  viewers 
    11  2350 0.123217  viewers 
    12  2309 0.121068  viewers 
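
One thing the check makes visible: `qcut` bins by quantile (roughly equal numbers of rows per bin), so a row at 84% of the maximum can still be labelled 'situational'. If the intent is the fixed fractions of the maximum used in the pure-Python loop, `pd.cut` with explicit edges may be closer. The following is a sketch under that assumption; the column name `cat_pct` is made up for illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'c_total': [18944, 17534, 17262, 19072, 18275, 13589, 16053,
                                11440, 5972, 3424, 2696, 2350, 2309]})

max_value = df2['c_total'].max()

# Edges at 20/40/60/80% of the maximum. The open-ended outer edges mirror the
# loop's first "< 20%" branch and its final "else" branch.
edges = [-np.inf, 0.2 * max_value, 0.4 * max_value,
         0.6 * max_value, 0.8 * max_value, np.inf]
labels = ['viewers', 'event based', 'situational', 'active', 'highly active']

# right=False gives half-open bins [low, high), matching the ">= low and < high"
# tests in the loop.
df2['cat_pct'] = pd.cut(df2['c_total'], bins=edges, labels=labels, right=False)

print(df2)
```

With five labels there must be six edges, one more edge than labels.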
    
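Thank you so much. Let me put the whole code together and try it.

Thank you for the best and simplest solution. Could I also ask how to use `qcut` with two values? Please check the edit section of the question. Thanks again!

Error:

    Traceback (most recent call last):
      File "class_med.py", line 13, in <module>
        df2['cat'] = pd.qcut(df2.c_total, q=[0, .4, .7, 1.], labels=['not populated', 'fairly populated', 'populated', 'highly populated'])
      File "/usr/local/lib/python2.7/dist-packages/pandas/tools/tile.py", line 173, in qcut
        precision=precision, include_lowest=True)
      File "/usr/local/lib/python2.7/dist-packages/pandas/tools/tile.py", line 217, in _bins_to_cuts
        raise ValueError('Bin labels must be one fewer than '
    ValueError: Bin labels must be one fewer than the number of bin edges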
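
The two follow-ups in the comments could be approached roughly as follows. This is a sketch, not part of the original answer. For the error, `q=[0, .4, .7, 1.]` defines four edges and therefore three bins, so `qcut` needs exactly three labels; the labels 'low', 'medium', 'high' below are placeholders, since which of the four original labels to drop is the asker's call. For the two-column rule in the question's edit, `numpy.select` picks the first matching condition, which mirrors an if/elif chain. It is assumed here that `row[1]` and `row[2]` correspond to `c_total` and `u_total`.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'c_total': [18944, 17534, 17262, 19072, 18275, 13589, 16053,
                11440, 5972, 3424, 2696, 2350, 2309],
    'u_total': [790, 750, 747, 777, 785, 757, 735,
                636, 513, 382, 303, 262, 254],
})

# 1) Four quantile edges -> three bins -> three labels (placeholder names).
df2['cat3'] = pd.qcut(df2['c_total'], q=[0, .4, .7, 1.],
                      labels=['low', 'medium', 'high'])

# 2) Two-column classification. Assumption: c_total plays the role of row[1]
#    (max_user) and u_total the role of row[2] (max_key) in the question's loop.
max_user = df2['c_total'].max()
max_key = df2['u_total'].max()
user, key = df2['c_total'], df2['u_total']

# Conditions are checked in order and the first True one wins, like if/elif.
conditions = [
    (user > 0.80 * max_user) & (key > 0.80 * max_key),
    (user >= 0.60 * max_user) & (key <= 0.60 * max_key),
    (user <= 0.40 * max_user) & (key >= 0.40 * max_key),
    (user < 0.20 * max_user) & (key < 0.20 * max_key),
]
choices = ['highly active', 'active', 'event based', 'situational']

# default covers the final "else" branch.
df2['cat2'] = np.select(conditions, choices, default='viewers')

print(df2)
```

Note the parentheses around each comparison: `&` binds more tightly than `>` in Python, so they are required when combining boolean Series.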