2016-09-22

Slicing a pandas DataFrame based on logical conditions?

I have this DataFrame called data:

       Subjects Professor  StudentID
8     Chemistry      Jane        999
1     Chemistry      Jane       3455
0     Chemistry    Joseph       1234
2       History      Jane       3455
6       History     Smith        323
7       History     Smith        999
3   Mathematics       Doe      56767
10  Mathematics  Einstein       3455
5       Physics  Einstein       2834
9       Physics     Smith        323
4       Physics     Smith        999

I want to run this query: "professors who teach at least 2 classes with 2 or more of the same students". The desired output is:

Smith: Physics, History, 323, 999 

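I am familiar with SQL and could do this easily there, but I am still a beginner in Python. How can I produce this output in Python? Another idea is to convert this DataFrame into a SQL database and run the query through a SQL interface from Python. Is there a way to do that?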
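The SQL route mentioned in the question can be sketched with the standard-library `sqlite3` module together with `DataFrame.to_sql` and `pandas.read_sql_query`. This is only a minimal sketch under stated assumptions: the table name `data`, the in-memory database, and the reading of the query ("at least 2 students who each appear in at least 2 of the professor's subjects") are choices made for illustration, not something the question specifies.

```python
import sqlite3

import pandas as pd

# Sample frame rebuilt from the question (row order as displayed, default index).
data = pd.DataFrame({
    "Subjects": ["Chemistry", "Chemistry", "Chemistry", "History", "History",
                 "History", "Mathematics", "Mathematics", "Physics", "Physics",
                 "Physics"],
    "Professor": ["Jane", "Jane", "Joseph", "Jane", "Smith", "Smith",
                  "Doe", "Einstein", "Einstein", "Smith", "Smith"],
    "StudentID": [999, 3455, 1234, 3455, 323, 999, 56767, 3455, 2834, 323, 999],
})

# Assumption: an in-memory SQLite database with a table named "data".
conn = sqlite3.connect(":memory:")
data.to_sql("data", conn, index=False)

# Inner query: (professor, student) pairs where the student takes at least
# two distinct subjects with that professor.
# Outer query: professors who have at least two such students.
query = """
SELECT Professor
FROM (
    SELECT Professor, StudentID
    FROM data
    GROUP BY Professor, StudentID
    HAVING COUNT(DISTINCT Subjects) >= 2
) AS shared
GROUP BY Professor
HAVING COUNT(*) >= 2
"""
professors = pd.read_sql_query(query, conn)
conn.close()

print(professors)
```

With the sample data, only Smith satisfies the query. The result can then be used to slice the original frame, for example with `data[data.Professor.isin(professors.Professor)]`.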

Answers

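# df is the DataFrame from the question (called data there).
# Flag each (Professor, Subjects) pair that has at least 2 distinct students,
# then keep the professors with at least 2 flagged subjects.
students_and_subjects = (
    df.groupby(['Professor', 'Subjects'])
      .StudentID.nunique().ge(2)
      .groupby(level='Professor').sum().ge(2)
)

df[df.Professor.map(students_and_subjects)]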
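To try this answer without the original data file, the sample frame can be rebuilt by hand. The sketch below is an illustration rather than part of the answer: the intermediate name `has_two_students` is introduced here, and the frame uses a default index instead of the index shown in the question. One caveat worth stating: this approach counts distinct students per subject and does not check that the same students recur across subjects, although for the sample data both readings select Smith.

```python
import pandas as pd

# Sample frame rebuilt from the question (row order as displayed, default index).
df = pd.DataFrame({
    "Subjects": ["Chemistry", "Chemistry", "Chemistry", "History", "History",
                 "History", "Mathematics", "Mathematics", "Physics", "Physics",
                 "Physics"],
    "Professor": ["Jane", "Jane", "Joseph", "Jane", "Smith", "Smith",
                  "Doe", "Einstein", "Einstein", "Smith", "Smith"],
    "StudentID": [999, 3455, 1234, 3455, 323, 999, 56767, 3455, 2834, 323, 999],
})

# Step 1: True for each (Professor, Subjects) pair with at least 2 distinct students.
has_two_students = (
    df.groupby(["Professor", "Subjects"])["StudentID"].nunique().ge(2)
)

# Step 2: True for each professor with at least 2 such subjects.
students_and_subjects = has_two_students.groupby(level="Professor").sum().ge(2)

# Step 3: map the per-professor flag back onto the rows and slice.
result = df[df["Professor"].map(students_and_subjects)]
print(result)
```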

Could you include a snippet that prints the desired output? – GKS

Solution with filter and value_counts:

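# Keep professors with more than one subject and more than one student
# who appears at least twice in their rows.
df1 = df.groupby('Professor').filter(
    lambda x: (x.Subjects.nunique() > 1) and
              ((x.StudentID.value_counts() > 1).sum() > 1))
print(df1)
  Subjects Professor  StudentID
6  History     Smith        323
7  History     Smith        999
9  Physics     Smith        323
4  Physics     Smith        999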

Solution with duplicated (following the comments):

# Same idea, counting repeated student IDs with duplicated instead of value_counts.
df1 = df.groupby('Professor').filter(
    lambda x: (x.Subjects.nunique() > 1) and
              (x.StudentID.duplicated().sum() > 1))
print(df1)
  Subjects Professor  StudentID
6  History     Smith        323
7  History     Smith        999
9  Physics     Smith        323
4  Physics     Smith        999

Edit

You can return custom output from a custom function and then remove the NaN rows with Series.dropna:

# Cast the IDs to strings so they can be joined with the subject names.
df.StudentID = df.StudentID.astype(str)

def f(x):
    # Groups that fail the test return None implicitly and are dropped below.
    if (x.Subjects.nunique() > 1) and (x.StudentID.duplicated().sum() > 1):
        return ', '.join(x.Subjects.unique().tolist() + x.StudentID.unique().tolist())

df1 = df.groupby('Professor').apply(f).dropna()
df1 = df1.index.to_series() + ': ' + df1
print(df1)
Professor
Smith    Smith: History, Physics, 323, 999
dtype: object
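The printed Series above still carries its index label, so the professor's name shows twice. One way to get plain lines in the requested "Professor: subjects, students" shape is to loop over the groups and build the strings directly. This is a hedged sketch, not the answer's method: the helper name `summarize` and the list `lines` are introduced here, and the check is the stricter reading of the question (at least 2 students who each appear in at least 2 of the professor's subjects).

```python
import pandas as pd

# Sample frame rebuilt from the question (row order as displayed, default index).
df = pd.DataFrame({
    "Subjects": ["Chemistry", "Chemistry", "Chemistry", "History", "History",
                 "History", "Mathematics", "Mathematics", "Physics", "Physics",
                 "Physics"],
    "Professor": ["Jane", "Jane", "Joseph", "Jane", "Smith", "Smith",
                  "Doe", "Einstein", "Einstein", "Smith", "Smith"],
    "StudentID": [999, 3455, 1234, 3455, 323, 999, 56767, 3455, 2834, 323, 999],
})


def summarize(group):
    """Return 'subjects, students' for one professor, or None if the test fails."""
    # Number of distinct subjects each student takes with this professor.
    subjects_per_student = group.groupby("StudentID")["Subjects"].nunique()
    shared = subjects_per_student[subjects_per_student >= 2].index
    if len(shared) < 2:
        return None
    # Restrict to the rows of the shared students before listing values.
    rows = group[group["StudentID"].isin(shared)]
    subjects = rows["Subjects"].unique().tolist()
    students = [str(s) for s in rows["StudentID"].unique()]
    return ", ".join(subjects + students)


lines = []
for professor, group in df.groupby("Professor"):
    summary = summarize(group)
    if summary is not None:
        lines.append(f"{professor}: {summary}")

print("\n".join(lines))
```

Subjects and student IDs are listed in order of first appearance in the frame, so the subject order here is History, Physics rather than the Physics, History shown in the question.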
Hi, this does the job, but is there a way I can display it in the desired format? – GKS

Ouch, the same problem as before, but please check my solution. Thanks. – jezrael