2014-10-07 86 views
0

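Python conditional filtering in a csv file

Please help! I have tried different things/packages to write a program that takes 4 inputs and returns the writing score statistics for a group, based on the combination of inputs, from a csv file. This is my first project, so I would appreciate any insights/tips/hints!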

Here is a sample of the CSV (200 rows in total):

id gender ses schtyp prog  write 
70 male low public general  52 
121 female middle public vocation 68 
86 male high public general  33 
141 male high public vocation 63  
172 male middle public academic 47 
113 male middle public academic 44 
50 male middle public general  59 
11 male middle public academic 34  
84 male middle public general  57  
48 male middle public academic 57  
75 male middle public vocation 60  
60 male middle public academic 57 

Here is what I have so far:

import csv 
import numpy 
csv_file_object=csv.reader(open('scores.csv', 'rU')) #reads file 
header=csv_file_object.next() #skips header 
data=[] #loads data into array for processing 
for row in csv_file_object: 
    data.append(row) 
data=numpy.array(data) 

#asks for inputs 
gender=raw_input('Enter gender [male/female]: ') 
schtyp=raw_input('Enter school type [public/private]: ') 
ses=raw_input('Enter socioeconomic status [low/middle/high]: ') 
prog=raw_input('Enter program status [general/vocation/academic]: ') 

#makes them lower case and strings 
prog=str(prog.lower()) 
gender=str(gender.lower()) 
schtyp=str(schtyp.lower()) 
ses=str(ses.lower()) 

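What I'm missing is how to filter the data and get statistics for one specific group only. For example, say I enter male, public, middle, and academic: I want the average writing score for that subset. I tried the groupby function from pandas, but that only gave me statistics for broad groups (e.g. public vs. private). I also tried a pandas DataFrame, but I could only filter on one input and wasn't sure how to get the writing scores. Any tips would be greatly appreciated!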

+0

Have a read of this [section](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing) and see how you get on; basically, what you're asking can be done – EdChum 2014-10-07 14:12:18

+0

This looks like a typical case of boolean indexing on multiple columns of a DataFrame. You could try the approach listed [here](http://stackoverflow.com/questions/8916302/selecting-across-multiple-columns-with-python-pandas) – 2014-10-07 17:57:02

Answers

1

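Agreed with Ramon on the subsetting functionality: pandas is definitely the way to go, and it has extraordinary filtering/subsetting capabilities once you get used to it. It can be hard to wrap your head around at first, though (or at least it was for me!), so I dug up some examples of the kind of subsetting you need from some old code of mine. The variable itu below is a pandas DataFrame of data for different countries over time.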

# Subsetting by using True/False: 
subset = itu['CntryName'] == 'Albania' # returns True/False values 
itu[subset] # returns 1x144 DataFrame of only data for Albania 
itu[itu['CntryName'] == 'Albania'] # one-line command, equivalent to the above two lines 

# Pandas has many built-in functions like .isin() to provide params to filter on  
itu[itu.cntrycode.isin(['USA','FRA'])] # returns where itu['cntrycode'] is 'USA' or 'FRA' 
itu[itu.year.isin([2000,2001,2002])] # Returns all of itu for only years 2000-2002 
# Advanced subsetting can include logical operations: 
itu[itu.cntrycode.isin(['USA','FRA']) & itu.year.isin([2000,2001,2002])] # Both of above at same time 

# Use .loc with two elements to simultaneously select by row/index & column: 
itu.loc['USA','CntryName'] 
itu.iloc[204,0] 
itu.loc[['USA','BHS'], ['CntryName', 'Year']] 
itu.iloc[[204, 13], [0, 1]] 

# Can do many operations at once, but this reduces "readability" of the code 
itu[itu.cntrycode.isin(['USA','FRA']) & 
    itu.year.isin([2000,2001,2002])].loc[:, ['cntrycode','cntryname','year','mpen','fpen']] 

# Finally, if you're comfortable with using map() and list comprehensions, 
# you can do some advanced subsetting that includes evaluations & functions 
# to determine what elements you want to select from the whole, such as all 
# countries whose name begins with "United": 
criterion = itu['CntryName'].map(lambda x: x.startswith('United')) 
itu[criterion]['CntryName'] # gives us UAE, UK, & US 
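
Applied to the data in the question, the same True/False subsetting might look like the sketch below. It is only an illustration under a few assumptions: the DataFrame name `scores` is made up, the handful of rows is copied from the sample in the question (the real file has 200), and the four inputs are hard-coded instead of read from the user.

```python
import io
import pandas as pd

# A few rows copied from the question's sample; the real file has 200 rows.
sample = """id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
172 male middle public academic 47
113 male middle public academic 44
11 male middle public academic 34
48 male middle public academic 57
60 male middle public academic 57
"""
scores = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# Hard-coded for the sketch; in the question these come from user input.
gender, schtyp, ses, prog = "male", "public", "middle", "academic"

# Each comparison gives a True/False Series; & keeps rows where all four hold.
# The parentheses matter because & binds tighter than ==.
mask = (
    (scores["gender"] == gender)
    & (scores["schtyp"] == schtyp)
    & (scores["ses"] == ses)
    & (scores["prog"] == prog)
)

# Select the matching rows and only the 'write' column, then summarise.
subset_write = scores.loc[mask, "write"]
mean_write = subset_write.mean()
print(len(subset_write), mean_write)
```

For these seven rows, five match and the mean is 47.8; on the full file the numbers will of course differ. Once you have `subset_write`, methods such as `.describe()`, `.median()` or `.std()` give the other statistics.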
+0

Thanks TC Allen! That worked. Thank you for the key tips and hints, since I'm just starting to learn this language :) – Mikaz 2014-10-07 22:18:52

0

Take a look at pandas. I think it will shorten your CSV parsing work and give you what you're asking for...

import pandas as pd 
data = pd.read_csv('fileName.txt', sep=r'\s+')  # whitespace-delimited; delim_whitespace is deprecated in recent pandas 

#get all of the male students 
data[data['gender'] == 'male'] 
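
Since the question mentions that groupby only gave statistics for broad groups, here is a rough sketch of grouping on all four columns at once, which yields one row of statistics per combination. The sample rows are copied from the question, and the names `keys`, `stats` and `wanted` are just placeholders for this illustration.

```python
import io
import pandas as pd

# A few rows copied from the question's sample; the real file has 200 rows.
sample = """id gender ses schtyp prog write
172 male middle public academic 47
113 male middle public academic 44
50 male middle public general 59
11 male middle public academic 34
84 male middle public general 57
48 male middle public academic 57
60 male middle public academic 57
"""
data = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# Group on all four columns so every combination gets its own statistics.
keys = ["gender", "schtyp", "ses", "prog"]
stats = data.groupby(keys)["write"].agg(["count", "mean", "min", "max"])

# Look up one combination; the tuple follows the same order as `keys`.
wanted = ("male", "public", "middle", "academic")
group_stats = stats.loc[wanted]
print(group_stats)
```

If a combination does not occur in the file, `stats.loc[wanted]` raises a `KeyError`, so you may want to check `wanted in stats.index` before looking it up.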
+0

Thanks for the tip! – Mikaz 2014-10-07 22:19:14