
I have a function that takes the data for a particular year and returns a DataFrame. How can I flatten each of these per-year pandas DataFrames into single rows and stack them to build a new DataFrame?

For example:

DF

year  fruit   license  grade
1946  apple   XYZ      1
1946  orange  XYZ      1
1946  apple   PQR      3
1946  orange  PQR      1
1946  grape   XYZ      2
1946  grape   PQR      1
..
2014  grape   LMN      1

Notes: 1) A given license value appears only once per fruit in a given year (e.g., XYZ appears only once each for apple, orange and grape in 1946). 2) The grade values are categorical.

I realize the function below is not a particularly efficient way to reach the goal, but it is what I currently have working.

from itertools import combinations
import numpy as np
import pandas as pd

def func(df, year):
    #1. Filter out only the data for the year needed
    df_year = df[df['year'] == year]
    '''
    2. Transform DataFrame to the form:
            XYZ  PQR  ..  LMN
    apple     1    3        1
    orange    1    1        3
    grape     2    1        1
    Note that 'LMN' is just used for representation purposes.
    It won't logically appear here because it can only appear for the year 2014.
    '''
    df_year = df_year.pivot(index='fruit', columns='license', values='grade')

    #3. Keep only licenses that have a grade for every fruit
    #   (drop license columns containing any NaN)
    df_year = df_year.dropna(axis=1, how="any")

    #4. Some additional filtering

    #5. Function to calculate similarity between fruits
    def similarity_score(fruit1, fruit2):
        agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                            ((fruit1 == 3) & (fruit2 == 3)))

        disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                               ((fruit1 == 3) & (fruit2 == 1)))

        return ((agreements - disagreements) / float(len(fruit1)) + 1) / 2

    #6. Create Network dataframe: one row per pair of fruits
    network_df = pd.DataFrame(columns=['Source', 'Target', 'Weight'])

    for i, (fruit1, fruit2) in enumerate(combinations(df_year.index, 2)):
        c1 = df_year.loc[fruit1].values
        c2 = df_year.loc[fruit2].values
        network_df.loc[i] = [fruit1, fruit2, similarity_score(c1, c2)]

    return network_df

Running the above gives:

df_1946=func(df,1946) 
df_1946.head() 

Source Target Weight 
Apple  Orange  0.6 
Apple  Grape  0.3 
Orange Grape  0.7 

I would like to flatten the above into a single row:

 (Apple,Orange) (Apple,Grape) (Orange,Grape) 
1946  0.6    0.3   0.7 

Note that the real result will not have just 3 columns as above, but roughly 5,000 columns.
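Roughly, the reshaping I have in mind for a single year is something like the sketch below (row_1946 is just an illustrative name; the exact column labels don't matter):

# One row keyed by year, one column per (Source, Target) pair
row_1946 = df_1946.set_index(['Source', 'Target'])['Weight'].to_frame().T
row_1946.index = [1946]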

Finally, I want to stack the transformed rows to get something like:

df_all_years

 (Apple,Orange) (Apple,Grape) (Orange,Grape) 
1946  0.6    0.3   0.7 
1947  0.7    0.25   0.8 
.. 
2015  0.75   0.3   0.65 
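A naive sketch of the stacking I have in mind (assuming func as defined above and an assumed year range):

rows = []
for year in range(1946, 2016):   # assumed year range
    row = func(df, year).set_index(['Source', 'Target'])['Weight']
    row.name = year
    rows.append(row)

df_all_years = pd.DataFrame(rows)   # index = year, columns = fruit pairs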

What is the best way to do this?

'(Apple, Orange)' - is that a string or a tuple? – MaxU

A tuple. But you could use anything you like, as long as there is a way to tell which combination a particular cell represents. – Melsauce

Answers


I would arrange the computation a bit differently. Instead of looping over the years like this:

for year in range(1946, 2015): 
    partial_result = func(df, year) 

and then concatenating the partial results, you can get better performance by doing as much of the work as possible on the whole DataFrame, df, before calling df.groupby(...). Also, if you can express the computation in terms of built-in aggregators such as sum and count, it can be done faster than with a custom function via groupby/apply.

import itertools as IT 
import numpy as np 
import pandas as pd 
np.random.seed(2017) 

def make_df(): 
    N = 10000 
    df = pd.DataFrame({'fruit': np.random.choice(['Apple', 'Orange', 'Grape'], size=N), 
         'grade': np.random.choice([1,2,3], p=[0.7,0.1,0.2], size=N), 
         'year': np.random.choice(range(1946,1950), size=N)}) 
    df['manufacturer'] = (df['year'].astype(str) + '-' 
          + df.groupby(['year', 'fruit'])['fruit'].cumcount().astype(str)) 
    df = df.sort_values(by=['year']) 
    return df 

def similarity_score(df): 
    """ 
    Compute the score between each pair of columns in df 
    """ 
    agreements = {} 
    disagreements = {} 
    for col in IT.combinations(df,2): 
     fruit1 = df[col[0]].values 
     fruit2 = df[col[1]].values 
     agreements[col] = (((fruit1 == 1) & (fruit2 == 1)) 
          | ((fruit1 == 3) & (fruit2 == 3))) 
     disagreements[col] = (((fruit1 == 1) & (fruit2 == 3)) 
           | ((fruit1 == 3) & (fruit2 == 1))) 
    agreements = pd.DataFrame(agreements, index=df.index) 
    disagreements = pd.DataFrame(disagreements, index=df.index) 
    numerator = agreements.astype(int)-disagreements.astype(int) 
    grouped = numerator.groupby(level='year') 
    total = grouped.sum() 
    count = grouped.count() 
    score = ((total/count) + 1)/2 
    return score 

df = make_df() 
df2 = df.set_index(['year','fruit','manufacturer'])['grade'].unstack(['fruit']) 
df2 = df2.dropna(axis=0, how="any") 

print(similarity_score(df2)) 

which yields

  Grape Orange   
     Apple  Apple  Grape 
year        
1946 0.629111 0.650426 0.641900 
1947 0.644388 0.639344 0.633039 
1948 0.613117 0.630566 0.616727 
1949 0.634176 0.635379 0.637786 
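The pair columns above form a two-level MultiIndex; if you prefer the "(fruit1, fruit2)" style labels from the question, an optional post-processing sketch (score is just the returned frame, not required by the rest of the answer):

score = similarity_score(df2)
# collapse the two-level fruit-pair columns into single "(fruit1, fruit2)" labels
score.columns = ['({}, {})'.format(a, b) for a, b in score.columns]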
I've edited the question and defined both df and func so you can get a better idea of what's going on. Happy to provide more information. – Melsauce


Here is one way to build the regular pandas pivoted table you are referring to. It handles the roughly 5,000 columns - combinations of two initially separate categorical columns - quickly enough (the bottleneck step took about 20 seconds on my quad-core MacBook), though there are certainly faster strategies for scaling much larger. The data in this example is very sparse (5K columns, with 5K random samples spread across 70 rows of years [1947-2016]), so the execution time may run a few seconds longer on a denser DataFrame.

from itertools import chain 
import pandas as pd 
import numpy as np 
import random # using python3 .choices() 
import re 

# Make bivariate data w/ 5000 total combinations (1000x5 categories) 
# Also choose 5,000 randomly; some combinations may have >1 values or NaN 
random_sample_data = np.array(
    [random.choices(['Apple', 'Orange', 'Lemon', 'Lime'] + 
        ['of Fruit' + str(i) for i in range(1000)], 
        k=5000), 
    random.choices(['Grapes', 'Are Purple', 'And Make Wine', 
        'From the Yeast', 'That Love Sugar'], 
        k=5000), 
    [random.random() for _ in range(5000)]] 
).T 
df = pd.DataFrame(random_sample_data, columns=[ 
        "Source", "Target", "Weight"]) 
df['Year'] = random.choices(range(1947, 2017), k=df.shape[0]) 

# Three views of resulting df in jupyter notebook: 
df 
df[df.Year == 1947] 
df.groupby(["Source", "Target"]).count().unstack() 


To flatten the grouped-by-year data, since groupby needs a function to apply, you can use an intermediate temp df:

  1. Collapse everything from data.groupby("Year") into single rows, keeping "Target" + "Source" (to be expanded later) and "Weight" as separate columns, each holding that year's full data.
  2. Use zip and pd.core.reshape.util.cartesian_product to create an empty, appropriately shaped pivot df; this will become the final table, populated from temp_df.

For example,

df_temp = df.groupby("Year").apply(
    lambda s: pd.DataFrame([(s.Target, s.Source, s.Weight)], 
          columns=["Target", "Source", "Weight"]) 
).sort_index() 
df_temp.index = df_temp.index.droplevel(1) # reduce MultiIndex to 1-d 

# Predetermine all possible pairwise column category combinations 
product_ts = [*zip(*(pd.core.reshape.util.cartesian_product(
    [df.Target.unique(), df.Source.unique()]) 
))] 

ts_combinations = [str(x + ' ' + y) for (x, y) in product_ts] 

ts_combinations 


Finally, use simple nested iteration (again, not the fastest approach, though pd.DataFrame.iterrows may help speed things up, as shown). Because the random sampling was done with replacement, I had to handle multiple values per cell, so you may want to simplify or remove the conditional handling inside the second for loop. That loop is the step where the three separate per-year frames are compressed into single rows, with every cell filled via the pivoted ("Weight") x ("Target" - "Source") relationship.

df_pivot = pd.DataFrame(np.zeros((70, 5000)), 
         columns=ts_combinations) 
df_pivot.index = df_temp.index 

for year, values in df_temp.iterrows(): 

    for (target, source, weight) in zip(*values): 

     bivar_pair = str(target + ' ' + source) 
     curr_weight = df_pivot.loc[year, bivar_pair] 

     if curr_weight == 0.0: 
      df_pivot.loc[year, bivar_pair] = [weight] 
     # append additional values if encountered 
     elif type(curr_weight) == list: 
      df_pivot.loc[year, bivar_pair] = str(curr_weight + 
               [weight]) 


# Spotcheck: 
# Verifies matching data in pivoted table vs. original for Target+Source 
# combination "And Make Wine of Fruit614" across all 70 years 1947-2016 
df 
df_pivot['And Make Wine of Fruit614'] 
df[(df.Year == 1947) & (df.Target == 'And Make Wine') & (df.Source == 'of Fruit614')]