2015-10-23 98 views
0

Pandas DataFrame: find all rows with a specific column value?

I have a dataset in columns:

1445544152817 SEND_MSG 123 
1445544152817 SEND_MSG 123 
1445544152829 SEND_MSG 135 
1445544152829 SEND_MSG 135 
1445544152830 SEND_MSG 135 
1445544152830 GET_QUEUE 12 
1445544152830 SEND_MSG 136 
1445544152830 SEND_MSG 136 
1445544152892 GET_LATEST_MSG_DELETE 26 

My column names are timestamp, type and response_time. I do:

df = read_csv(output_path,names=header_row, sep=' ') 

and that works fine: when I print the df it gives me all the values from the file. The problem? When I do

df = df[df['type'] == 'SEND_MSG'] 

the df has 0 rows! How come? That can't be right, because both the file and the df have rows with type = SEND_MSG.

Here is my program:

warm_up = 100 
cool_down = 100 


def refine(df): 
    start_time = np.min(df['timestamp']) 
    #print start_time.columns[0] 
    end_time = np.max(df['timestamp']) 
    #print end_time.columns[0] 
    new_start_time = start_time + (10 * 1000) 
    #new_end_time = 0 
    df = df[df['timestamp'] > new_start_time] 
    #df = df[df['timestamp'] < new_end_time] 
    return df 


def ci(data): 
    n, min_max, mean, var, skew, kurt = scipy.stats.describe(data) 
    std = math.sqrt(var) 
    error_margin = 1.96 * (std/np.sqrt(n)) 
    l, h = mean - error_margin, mean + error_margin 
    return (l, h) 


MSG_TYPE = { 
    'SEND_MSG', 'GET_QUEUE', 'GET_LATEST_MSG_DELETE' 
} 
COLORS = ['r','g','b'] 


def main(): 
    output_path = "/Users/ramapriyasridharan/Documents/SystemsLabExperiements/merged.txt" 

    xlabel = "Time in minutes" 
    ylabel = "Response time in ms" 
    header_row = ['timestamp','type','response_time'] 
    df = read_csv(output_path,names=header_row, sep=' ') 
    #df = refine(df) 
    min_timestamp = np.min(df['timestamp']) 




    df['timestamp'] = df['timestamp'] - min_timestamp 
    # convert time to minutes 
    df['timestamp'] = np.round(df['timestamp']/60000) 
    # filter out all outliers above 70 seconds response time
    #df = df[df['response_time'] < 70 ] 
    df['type'] = df['type'] 
    i = 0 
    print df['type'] 
    for msg in MSG_TYPE: 
     print msg 
     df = df[df['type'] == msg] 
     print len(df) 
     response_mean = np.mean(df['response_time']) 
     response_median = np.median(df['response_time']) 
     response_std = np.std(df['response_time']) 
     l,h = ci(df['response_time']) 
     max_resp = np.max(df['response_time']) 
     print "For msg_type = %s maximum response time %s"%(msg,max_resp) 
     print "For msg_type = %s Response time avg = %.3f +- %.3f std = %.3f and Median = %.3f "%(msg,np.round(response_mean,3),np.round(h-response_mean,3),np.round(response_median,3),np.round(response_std,3)) 
     # round to nearest minute 
     #find number of timestamps greater than 100 
     #print df[df['response_time'] > 70] 
     grp_by_timestamp_df = df.groupby('timestamp') 
     mean_resp_per_min = grp_by_timestamp_df['response_time'].mean() 
     #print mean_resp_per_min[0:36] 
     plt.plot(mean_resp_per_min, 'x-', color=COLORS[i], label='%s requests'%msg, lw=0.5) 
     i += 1 

    response_mean = np.mean(df['response_time']) 
    response_median = np.median(df['response_time']) 
    response_std = np.std(df['response_time']) 
    l,h = ci(df['response_time']) 
    max_resp = np.max(df['response_time']) 
    print "For msg_type = %s maximum response time %s"%('ALL',max_resp) 
    print "For msg_type = %s Response time avg = %.3f +- %.3f std = %.3f and Median = %.3f "%('ALL',np.round(response_mean,3),np.round(h-response_mean,3),np.round(response_median,3),np.round(response_std,3)) 
    # round to nearest minute 
    #find number of timestamps greater than 100 
    #print df[df['response_time'] > 70] 
    grp_by_timestamp_df = df.groupby('timestamp') 
    mean_resp_per_min = grp_by_timestamp_df['response_time'].mean() 
    #print mean_resp_per_min[0:36] 

    plt.plot(mean_resp_per_min, 'x-', color='k', label='ALL requests', lw=0.5) 
    plt.xlim(xmin=0.0,xmax=30) 
    plt.ylim(ymin=0.0,ymax=20) 
    plt.xlabel(xlabel) 
    plt.ylabel(ylabel) 
    plt.legend(loc="best", fancybox=True, framealpha=0.5) 
    plt.grid() 
    plt.show() 

    #print df['response_time'] 

EDIT: I found the problem, but not a solution.

My actual data looks like what I pasted earlier, but when I put it in a DataFrame it looks like this, with spaces in the type column:

22059 GET_LATEST_MSG_DELETE 
22060 GET_LATEST_MSG_DELETE 
22061 GET_LATEST_MSG_DELETE 
22062 GET_LATEST_MSG_DELETE 
22063    GET_QUEUE 
22064    GET_QUEUE 
22065    GET_QUEUE 
22066    GET_QUEUE 
22067    GET_QUEUE 
22068    GET_QUEUE 
22069    GET_QUEUE 
22070    GET_QUEUE 
22071    GET_QUEUE 
22072 GET_LATEST_MSG_DELETE 
22073 GET_LATEST_MSG_DELETE 
22074 GET_LATEST_MSG_DELETE 
22075 GET_LATEST_MSG_DELETE 
22076 GET_LATEST_MSG_DELETE 
22077 GET_LATEST_MSG_DELETE 
22078 GET_LATEST_MSG_DELETE 
22079 GET_LATEST_MSG_DELETE 
22080 GET_LATEST_MSG_DELETE 
22081 GET_LATEST_MSG_DELETE 
22082 GET_LATEST_MSG_DELETE 

There is a leading space in front of GET_QUEUE. How come? I don't think that space exists in my actual data.

EDIT: The problem is the fact that the elements in type have variable sizes. How do I fix it?

+0

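Sorry, about your first text block: are you saying your column holds all of this information: '1445544152817 SEND_MSG 123'? Isn't that 3 columns? If so, it should be obvious why 'df = df[df['type'] == 'SEND_MSG']' doesn't work. Are you looking for 'df = df[df['type'].str.contains('SEND_MSG')]'? – EdChum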

+1

Can you check the 'type' column values for trailing and leading whitespace? – Zero

+0

@JohnGalt How do I do that? Do you mean visually? – LoveMeow
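
One way to check this programmatically rather than by eye is sketched below. Note that pandas right-aligns string values when it prints a frame, so apparent leading spaces in the printed table can be pure alignment; `repr()` shows what is really stored. The small frame here is a made-up stand-in for the real file, and the padded ' GET_QUEUE' value is an assumption added only to show what genuine padding looks like.

```python
import pandas as pd

# Hypothetical sample; the leading space in ' GET_QUEUE' is deliberate.
df = pd.DataFrame({
    "timestamp": [1445544152817, 1445544152830, 1445544152892],
    "type": ["SEND_MSG", " GET_QUEUE", "GET_LATEST_MSG_DELETE"],
    "response_time": [123, 12, 26],
})

# repr() makes any padding visible, unlike the aligned table print(df) shows.
unique_types = [repr(t) for t in df["type"].unique()]
print(unique_types)

# Flag the rows whose value changes when surrounding whitespace is stripped.
has_padding = df["type"] != df["type"].str.strip()
print(has_padding.sum(), "padded value(s)")

# Strip the padding so equality comparisons behave as expected.
df["type"] = df["type"].str.strip()
get_queue_rows = df[df["type"] == "GET_QUEUE"]
print(len(get_queue_rows))
```

If `has_padding.sum()` comes back as 0 on the real data, whitespace is not the cause and the empty result has to come from somewhere else.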

Answers

2

Since you are looking for just a single value (SEND_MSG), you can do this:

import pandas as pd 

df = pd.read_clipboard() 
df.columns = ['timestamp', 'type', 'response_time'] 
print df.loc[df['type'] == 'SEND_MSG'] 

Output:

 timestamp  type response_time 
0 1445544152817 SEND_MSG   123 
1 1445544152829 SEND_MSG   135 
2 1445544152829 SEND_MSG   135 
3 1445544152830 SEND_MSG   135 
5 1445544152830 SEND_MSG   136 
6 1445544152830 SEND_MSG   136 

The important line is:

df.loc[df['type'] == 'SEND_MSG'] 
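
A separate thing worth ruling out in the posted program: the loop assigns each filtered result back to `df`, so after the first message type only that type's rows remain, and every later comparison returns an empty frame. Below is a minimal sketch of that effect and of filtering into a separate name instead. The tiny frame and the names `shrinking` and `subset` are illustrative and not from the original post.

```python
import pandas as pd

# Made-up data with the three message types from the question.
df = pd.DataFrame({
    "type": ["SEND_MSG", "GET_QUEUE", "SEND_MSG", "GET_LATEST_MSG_DELETE"],
    "response_time": [123, 12, 135, 26],
})
msg_types = ["SEND_MSG", "GET_QUEUE", "GET_LATEST_MSG_DELETE"]

# Reassigning the filtered frame to the same name discards all other types.
shrinking = df
overwritten_counts = []
for msg in msg_types:
    shrinking = shrinking[shrinking["type"] == msg]
    overwritten_counts.append(len(shrinking))

# Filtering into a separate name leaves the full frame intact for the next pass.
separate_counts = []
for msg in msg_types:
    subset = df.loc[df["type"] == msg]
    separate_counts.append(len(subset))

print(overwritten_counts)
print(separate_counts)
```

With a separate name per type, the statistics computed after the loop for 'ALL' also see the complete frame rather than whatever the last filter left behind.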
+0

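It tells me there are 5 columns and gives me an error when I assign the headers your way, but visually I can only see 3 columns. The spacing may not be uniform; is there a way to correct that? – LoveMeow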
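
If the spacing between fields really is uneven, one option is to let `read_csv` treat any run of whitespace as a single delimiter by passing a regular expression as `sep`. With `sep=' '`, each extra space would otherwise count as an additional empty column. The sketch below uses an in-memory string as a hypothetical stand-in for the file; the uneven spacing in it is an assumption, not something confirmed in the question.

```python
import io
import pandas as pd

# Hypothetical input with uneven spacing between fields.
raw = (
    "1445544152817 SEND_MSG 123\n"
    "1445544152830   GET_QUEUE  12\n"
    "1445544152892 GET_LATEST_MSG_DELETE 26\n"
)
header_row = ["timestamp", "type", "response_time"]

# sep=r"\s+" splits on one or more whitespace characters.
df = pd.read_csv(io.StringIO(raw), names=header_row, sep=r"\s+")

print(df.shape)
print(df[df["type"] == "GET_QUEUE"])
```

Reading the real file would use the same call with `output_path` in place of the `io.StringIO` object.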

+0

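I copied the data frame you pasted into the original question, including the leading space before the first column. Before assigning the columns, do a 'print df' and see what the data frame looks like. If your 'read_csv' is working, don't worry about the 'read_clipboard' I used. Instead, adjust the 'df = df[df['type'] == 'SEND_MSG']' line to the one I mentioned. – Andy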

+0

I have figured out the problem above; I updated the question. – LoveMeow
