How do I take rows in a pandas DataFrame and turn their values into columns?

I'm not sure how to describe this problem precisely, so I'll add more detail below and give a reproducible example.

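Basically, I have a pandas DataFrame with two columns and many rows, and I want a transformation that builds new columns indicating whether at least one value exists for a given unit.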

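For example, suppose I have a pandas DataFrame with two columns: students and classes. Suppose I also have a dictionary that maps each class to a subject. I want to create a new DataFrame with one column for student_id and one column per subject. Each subject column tells me whether the student has taken at least one class in that subject (so the final table is unique at the student level). For example: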

import pandas as pd
s = {'student_id' : pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']),
     'classes' : pd.Series(['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology', 'Chemistry', 'Algebra',
                            'Intro to Java', 'Chinese 101'])}
c = {'subject' : pd.Series(['Math', 'Math', 'Math', 'CS', 'Science', 'Science', 'CS', 'Languages']),
     'classes' : pd.Series(['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology', 'Chemistry',
                            'Intro to Java', 'Chinese 101'])}
students = pd.DataFrame(s, columns = ['student_id', 'classes'])
# The classes table shown below is built the same way from c
classes = pd.DataFrame(c, columns = ['subject', 'classes'])

The output of this code is (sorry, I don't know how to make tables on StackOverflow, so I'm just posting it as code).

students 

  student_id          classes
0          A          Algebra
1          A         Geometry
2          A         Topology
3          B  Intro to Python
4          B          Biology
5          B        Chemistry
6          C          Algebra
7          C    Intro to Java
8          C      Chinese 101

classes 

     subject          classes
0       Math          Algebra
1       Math         Geometry
2       Math         Topology
3         CS  Intro to Python
4    Science          Biology
5    Science        Chemistry
6         CS    Intro to Java
7  Languages      Chinese 101

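Now I want to create a new DataFrame that is essentially the students DataFrame with a new column added for each subject in the classes DataFrame. More precisely, I want a new DataFrame, perhaps called student_classes, that is unique at the student_id level and whose value in a subject column is 1 if the student has taken at least one class in that subject. For the example above, I want: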

  student_id  Math  CS  Science  Languages
0          A     1   0        0          0
1          B     0   1        1          0
2          C     1   1        0          1

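Here is what I did, which solves this particular example. The problem is that my real data has nothing to do with students and the DataFrames are much larger, which makes the following solution very slow and memory-intensive. In fact, my IPython Notebook raises a memory error on my large table.

So, what I actually did was create a dictionary of dictionaries: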

classes_subject_dict = {
    'Math': {'Algebra': 1, 'Geometry': 1, 'Topology': 1},
    'CS': {'Intro to Python': 1, 'Intro to Java': 1},
    'Science': {'Biology': 1, 'Chemistry': 1},
    'Languages': {'Chinese 101': 1},
}

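Then I loop over the keys of the dictionary and use the map method (function? I'm not sure what the technical term is here) to put a value of 1 in the column named by the key whenever a matching class appears:

for key in classes_subject_dict.keys():
    students[key] = students.classes.map(classes_subject_dict[key])

Then I take the maximum within each column, drop the classes column, and drop duplicates to get my final table: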

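for key in classes_subject_dict.keys():
    # Per-student maximum: 1 if any of the student's rows matched this subject
    students[key] = students.groupby(['student_id'])[key].transform('max')

# Pass axis by keyword; newer pandas versions reject the positional form
students = students.drop('classes', axis=1)
students = students.drop_duplicates()
students = students.fillna(0)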

students 

  student_id  CS  Languages  Math  Science
0          A   0          0     1        0
3          B   1          0     0        1
6          C   1          1     1        0

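Again, this works fine for this particular simple example, but my actual data is very, very large in both length and width. Although my real data has nothing to do with students, an analogous description would be that I have 300 "subjects" and hundreds of thousands of "students". I noticed that using the map method actually slows my code down, and I'd like to know whether there is a more efficient way to do this.

Thanks in advance for your help!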

Source

2016-01-27 Vincent

There is an answered question that is very similar to this. Let me find it... – Kartik

Here, found it: http://stackoverflow.com/questions/33553765/generate-new-dataframe-with-values-in-old-dataframe-as-new-features-in-python – Kartik

You can use merge, then crosstab, then astype:

import pandas as pd 
s = {'student_id' : pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']), 
    'classes' : pd.Series(['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology', 'Chemistry', 'Algebra', 
          'Intro to Java', 'Chinese 101'])} 
c = {'subject' : pd.Series(['Math', 'Math', 'Math', 'CS', 'Science', 'Science', 'CS', 'Languages']), 
    'classes' : pd.Series(['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology', 'Chemistry', 
          'Intro to Java', 'Chinese 101'])} 
students = pd.DataFrame(s, columns = ['student_id', 'classes']) 
classes = pd.DataFrame(c, columns = ['subject', 'classes']) 
print(students)
  student_id          classes
0          A          Algebra
1          A         Geometry
2          A         Topology
3          B  Intro to Python
4          B          Biology
5          B        Chemistry
6          C          Algebra
7          C    Intro to Java
8          C      Chinese 101

print(classes)
     subject          classes
0       Math          Algebra
1       Math         Geometry
2       Math         Topology
3         CS  Intro to Python
4    Science          Biology
5    Science        Chemistry
6         CS    Intro to Java
7  Languages      Chinese 101

df = pd.merge(students, classes, on=['classes']) 
print(df)
  student_id          classes    subject
0          A          Algebra       Math
1          C          Algebra       Math
2          A         Geometry       Math
3          A         Topology       Math
4          B  Intro to Python         CS
5          B          Biology    Science
6          B        Chemistry    Science
7          C    Intro to Java         CS
8          C      Chinese 101  Languages

df = pd.crosstab(df['student_id'], df['subject']) 
print(df)
subject     CS  Languages  Math  Science
student_id
A            0          0     3        0
B            1          0     0        2
C            1          1     1        0

df = (df > 0)
print(df)
subject        CS  Languages   Math  Science
student_id
A           False      False   True    False
B            True      False  False     True
C            True       True   True    False

df = (df > 0).astype(int)
print(df)
subject     CS  Languages  Math  Science
student_id
A            0          0     1        0
B            1          0     0        1
C            1          1     1        0
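
The same merge, crosstab, astype steps could be wrapped in one function, roughly as sketched below. This is only an illustration under stated assumptions: `subject_indicators` is a hypothetical helper name (not a pandas function), the column names are taken from the example, and dropping duplicate (student, subject) pairs before the crosstab and storing the flags as int8 are guesses aimed at the large-data case from the question, not something measured here.

```python
import pandas as pd


def subject_indicators(students, classes):
    """Return one row per student_id with a 0/1 column per subject.

    Hypothetical helper for illustration; assumes the column names
    'student_id', 'classes' and 'subject' used in the example.
    """
    # Attach the subject to every (student, class) row.
    merged = students.merge(classes, on="classes", how="inner")
    # Keep one row per (student, subject) pair, so the crosstab counts
    # are already 0 or 1 and no separate "> 0" step is needed.
    pairs = merged[["student_id", "subject"]].drop_duplicates()
    table = pd.crosstab(pairs["student_id"], pairs["subject"])
    # int8 is enough for a 0/1 flag (assumption: memory matters more
    # than dtype uniformity with the rest of the pipeline).
    return table.astype("int8")


students = pd.DataFrame({
    "student_id": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "classes": ["Algebra", "Geometry", "Topology", "Intro to Python",
                "Biology", "Chemistry", "Algebra", "Intro to Java",
                "Chinese 101"],
})
classes = pd.DataFrame({
    "subject": ["Math", "Math", "Math", "CS", "Science", "Science",
                "CS", "Languages"],
    "classes": ["Algebra", "Geometry", "Topology", "Intro to Python",
                "Biology", "Chemistry", "Intro to Java", "Chinese 101"],
})

result = subject_indicators(students, classes)
print(result)
```

Note that the inner merge silently drops any class that has no entry in the classes table, so on real data it may be worth checking for unmapped values first.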

Source

2016-01-27 08:08:50 jezrael

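Thanks for the help, jezrael! Maybe a silly question, but how do I keep the student_id column? For example, I'd like to just do `df['student_id']`, but that no longer works after the crosstab – Vincent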
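
One possible way to get student_id back as an ordinary column, assuming a crosstab result shaped like the one in the answer: after `pd.crosstab`, student_id is the index of the table, and `reset_index()` moves it back into the columns. The small `pairs` frame below is made-up illustration data, not the data from the question.

```python
import pandas as pd

# Made-up (student, subject) pairs, standing in for the merged table.
pairs = pd.DataFrame({
    "student_id": ["A", "B", "B", "C"],
    "subject": ["Math", "CS", "Science", "CS"],
})
table = (pd.crosstab(pairs["student_id"], pairs["subject"]) > 0).astype(int)

# After crosstab, student_id lives in the index, not in the columns.
ids_from_index = table.index.tolist()

# reset_index() turns the index back into a regular column.
flat = table.reset_index()
flat.columns.name = None  # drop the leftover "subject" label on the columns
print(flat["student_id"].tolist())
```

If only the ids are needed, reading `table.index` directly avoids building a second table.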
