2017-07-17 243 views
1

Pandas groupby with a dictionary

Pandas newbie here, so apologies if the solution is obvious.

I have a DataFrame (see below) of scenes from different movies, together with the environment each scene takes place in:

import pandas as pd 
data = [{'movie' : 'movie_X', 'scene' : '1', 'environment' : 'home'}, 
     {'movie' : 'movie_X', 'scene' : '2', 'environment' : 'car'}, 
     {'movie' : 'movie_X', 'scene' : '3', 'environment' : 'home'}, 
     {'movie' : 'movie_Y', 'scene' : '1', 'environment' : 'home'}, 
     {'movie' : 'movie_Y', 'scene' : '2', 'environment' : 'office'}, 
     {'movie' : 'movie_Z', 'scene' : '1', 'environment' : 'boat'}, 
     {'movie' : 'movie_Z', 'scene' : '2', 'environment' : 'beach'}, 
     {'movie' : 'movie_Z', 'scene' : '3', 'environment' : 'home' }] 
myDF = pd.DataFrame(data) 

In this case, each movie can belong to several genres. I have a dictionary (below) that states which genres each movie belongs to:

genreDict = {'movie_X' : ['romance', 'action'], 
      'movie_Y' : ['comedy', 'romance', 'action'], 
      'movie_Z' : ['horror', 'thriller', 'romance']} 

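I would like to group myDF by this dictionary for each movie, specifically to be able to tell how many times a particular environment turns up in a particular genre (for example, in the genre horror, 'boat' is counted once, 'beach' is counted once, and 'home' is counted once). What is the best and most efficient way to go about this? I tried mapping the dictionary onto the DataFrame and then grouping by the resulting list column: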

myDF['genres'] = myDF['movie'].map(genreDict) 

which returns:

     movie scene environment                       genres
0  movie_X     1        home            [romance, action]
1  movie_X     2         car            [romance, action]
2  movie_X     3        home            [romance, action]
3  movie_Y     1        home    [comedy, romance, action]
4  movie_Y     2      office    [comedy, romance, action]
5  movie_Z     1        boat  [horror, thriller, romance]
6  movie_Z     2       beach  [horror, thriller, romance]
7  movie_Z     3        home  [horror, thriller, romance]

However, when I group by that column I get an error saying that lists are unhashable. Hope you can all help :)

+0

Can you post your desired data set? – MaxU

Answers

0

If speed matters on a larger DataFrame, use NumPy to repeat each row by the length of its list, with numpy.repeat, numpy.concatenate and values:

import numpy as np

# length of each list in the genres column
l = myDF['genres'].str.len()
# the genres column as a NumPy array of lists
vals = myDF['genres'].values
# repeat each index label by the length of its list
idx = np.repeat(myDF.index, l)
# expand the rows by selecting the duplicated index labels
myDF = myDF.loc[idx]
# flatten the lists into a single column of genres
myDF['genres'] = np.concatenate(vals)
# restore a default monotonic index (0, 1, 2, ...)
myDF = myDF.reset_index(drop=True)
print(myDF)
    environment    movie scene    genres
0          home  movie_X     1   romance
1          home  movie_X     1    action
2           car  movie_X     2   romance
3           car  movie_X     2    action
4          home  movie_X     3   romance
5          home  movie_X     3    action
6          home  movie_Y     1    comedy
7          home  movie_Y     1   romance
8          home  movie_Y     1    action
9        office  movie_Y     2    comedy
10       office  movie_Y     2   romance
11       office  movie_Y     2    action
12         boat  movie_Z     1    horror
13         boat  movie_Z     1  thriller
14         boat  movie_Z     1   romance
15        beach  movie_Z     2    horror
16        beach  movie_Z     2  thriller
17        beach  movie_Z     2   romance
18         home  movie_Z     3    horror
19         home  movie_Z     3  thriller
20         home  movie_Z     3   romance

Then use groupby and aggregate with size:

df1 = myDF.groupby(['genres','environment']).size().reset_index(name='count')
print(df1)
      genres  environment  count
0     action          car      1
1     action         home      3
2     action       office      1
3     comedy         home      1
4     comedy       office      1
5     horror        beach      1
6     horror         boat      1
7     horror         home      1
8    romance        beach      1
9    romance         boat      1
10   romance          car      1
11   romance         home      4
12   romance       office      1
13  thriller        beach      1
14  thriller         boat      1
15  thriller         home      1
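
On newer pandas versions (0.25 and later) the repeat-and-flatten step can probably be left to DataFrame.explode. The sketch below is not part of the answer above; it rebuilds the question's data so that it runs on its own, and the names long_df and counts are made up for illustration.

```python
import pandas as pd

data = [{'movie': 'movie_X', 'scene': '1', 'environment': 'home'},
        {'movie': 'movie_X', 'scene': '2', 'environment': 'car'},
        {'movie': 'movie_X', 'scene': '3', 'environment': 'home'},
        {'movie': 'movie_Y', 'scene': '1', 'environment': 'home'},
        {'movie': 'movie_Y', 'scene': '2', 'environment': 'office'},
        {'movie': 'movie_Z', 'scene': '1', 'environment': 'boat'},
        {'movie': 'movie_Z', 'scene': '2', 'environment': 'beach'},
        {'movie': 'movie_Z', 'scene': '3', 'environment': 'home'}]
genreDict = {'movie_X': ['romance', 'action'],
             'movie_Y': ['comedy', 'romance', 'action'],
             'movie_Z': ['horror', 'thriller', 'romance']}

myDF = pd.DataFrame(data)
myDF['genres'] = myDF['movie'].map(genreDict)

# explode() gives every list element its own row and repeats the other columns
long_df = myDF.explode('genres').reset_index(drop=True)

# count how often each environment occurs within each genre
counts = (long_df.groupby(['genres', 'environment'])
                 .size()
                 .reset_index(name='count'))
print(counts)
```

This trades the manual NumPy bookkeeping for a single call; whether it is faster on a large DataFrame would need to be measured.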
2

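Non-scalar objects generally cause problems in pandas. Beyond that, you need to tidy your data so that your subsequent steps become easier (the main operations on tabular structures are usually defined on tidy datasets). You need a dataset in which, instead of listing all genres in a single row, each genre has its own row.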
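
As a small illustration of why the list column fails: grouping needs to hash the group keys, and Python lists are mutable and therefore not hashable. The snippet is plain Python, independent of pandas, and its variable names are only for illustration.

```python
genres = ['romance', 'action']

try:
    hash(genres)            # lists are mutable, so Python refuses to hash them
    list_hashable = True
except TypeError as exc:
    list_hashable = False
    print(exc)              # unhashable type: 'list'

# a tuple with the same items is immutable and hashes without error
tuple_hashable = isinstance(hash(tuple(genres)), int)
print(list_hashable, tuple_hashable)
```

Converting the lists to tuples would make them hashable, but the groups would then be whole genre combinations rather than single genres, so one row per genre is still the more useful shape.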

Here is one of the possible ways to achieve that:

genre_df = pd.DataFrame(myDF['movie'].map(genreDict).tolist()) 

df = myDF.join(genre_df.stack().rename('genre').reset_index(level=1, drop=True)) 
df 
Out: 
   environment    movie scene     genre
0         home  movie_X     1   romance
0         home  movie_X     1    action
1          car  movie_X     2   romance
1          car  movie_X     2    action
2         home  movie_X     3   romance
2         home  movie_X     3    action
3         home  movie_Y     1    comedy
3         home  movie_Y     1   romance
3         home  movie_Y     1    action
4       office  movie_Y     2    comedy
4       office  movie_Y     2   romance
4       office  movie_Y     2    action
5         boat  movie_Z     1    horror
5         boat  movie_Z     1  thriller
5         boat  movie_Z     1   romance
6        beach  movie_Z     2    horror
6        beach  movie_Z     2  thriller
6        beach  movie_Z     2   romance
7         home  movie_Z     3    horror
7         home  movie_Z     3  thriller
7         home  movie_Z     3   romance

Once you have this structure, it is much easier to group or cross-tabulate your data:

df.groupby('genre').size() 
Out: 
genre
action      5
comedy      2
horror      3
romance     8
thriller    3
dtype: int64

pd.crosstab(df['genre'], df['environment']) 
Out: 
environment  beach  boat  car  home  office
genre
action           0     0    1     3       1
comedy           0     0    0     1       1
horror           1     1    0     1       0
romance          1     1    1     4       1
thriller         1     1    0     1       0
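
To get back to the example in the question (which environments appear under horror), a single row of the cross-tabulation can be pulled out with .loc. The sketch below is self-contained and builds the one-row-per-genre table with a merge against a lookup table instead of the stack/join shown above; the names genre_lookup, tidy, table and horror_counts are hypothetical.

```python
import pandas as pd

data = [{'movie': 'movie_X', 'scene': '1', 'environment': 'home'},
        {'movie': 'movie_X', 'scene': '2', 'environment': 'car'},
        {'movie': 'movie_X', 'scene': '3', 'environment': 'home'},
        {'movie': 'movie_Y', 'scene': '1', 'environment': 'home'},
        {'movie': 'movie_Y', 'scene': '2', 'environment': 'office'},
        {'movie': 'movie_Z', 'scene': '1', 'environment': 'boat'},
        {'movie': 'movie_Z', 'scene': '2', 'environment': 'beach'},
        {'movie': 'movie_Z', 'scene': '3', 'environment': 'home'}]
genreDict = {'movie_X': ['romance', 'action'],
             'movie_Y': ['comedy', 'romance', 'action'],
             'movie_Z': ['horror', 'thriller', 'romance']}

myDF = pd.DataFrame(data)

# one (movie, genre) pair per row, built directly from the dictionary
genre_lookup = pd.DataFrame(
    [(movie, genre) for movie, genres in genreDict.items() for genre in genres],
    columns=['movie', 'genre'])

# every scene is paired with every genre of its movie
tidy = myDF.merge(genre_lookup, on='movie')

table = pd.crosstab(tidy['genre'], tidy['environment'])

# keep only the environments that actually occur in horror movies
horror = table.loc['horror']
horror_counts = {env: int(n) for env, n in horror.items() if n > 0}
print(horror_counts)  # {'beach': 1, 'boat': 1, 'home': 1}
```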

Here is a great read by Hadley Wickham: Tidy Data