How to flatten individual pandas DataFrames and stack them into a new DataFrame?

I have a function that takes in the data for a particular year and returns a DataFrame.

For example:

year  fruit   license  grade
1946  apple   XYZ      1
1946  orange  XYZ      1
1946  apple   PQR      3
1946  orange  PQR      1
1946  grape   XYZ      2
1946  grape   PQR      1
..
2014  grape   LMN      1

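Notes: 1) A particular license value exists only in one particular year, and only once for a particular fruit (e.g. XYZ appears only in 1946, and only once each for apple, orange and grape). 2) The grade values are categorical.

I realize the function below is not a very efficient way to reach the intended goal, but it is what I am currently working with.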

import numpy as np
import pandas as pd
from itertools import combinations

def func(df, year):
    # 1. Filter out only the data for the year needed
    df_year = df[df['year'] == year]
    '''
    2. Transform DataFrame to the form:
            XYZ  PQR  ..  LMN
    apple    1    3        1
    orange   1    1        3
    grape    2    1        1
    Note that 'LMN' is just used for representation purposes.
    It won't logically appear here because it can only appear for the year 2014.
    '''
    df_year = df_year.pivot(index='fruit', columns='license', values='grade')

    # 3. Drop every license (column) that has ANY NaN value,
    #    so each remaining license has a grade for every fruit
    df_year = df_year.dropna(axis=1, how="any")

    # 4. Some additional filtering

    # 5. Function to calculate similarity between fruits
    #    (fruit1 and fruit2 are NumPy arrays of grades, one entry per license)
    def similarity_score(fruit1, fruit2):
        agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                            ((fruit1 == 3) & (fruit2 == 3)))

        disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                               ((fruit1 == 3) & (fruit2 == 1)))

        return (((agreements - disagreements) / float(len(fruit1))) + 1) / 2

    # 6. Create Network dataframe, one row per pair of fruits
    rows = []
    for f1, f2 in combinations(df_year.index, 2):
        c1 = df_year.loc[f1].values
        c2 = df_year.loc[f2].values
        rows.append([f1, f2, similarity_score(c1, c2)])
    network_df = pd.DataFrame(rows, columns=['Source', 'Target', 'Weight'])

    return network_df

Running the above gives:

df_1946=func(df,1946) 
df_1946.head() 

Source Target Weight 
Apple  Orange  0.6 
Apple  Grape  0.3 
Orange Grape  0.7

I want to flatten the above into a single row:

 (Apple,Orange) (Apple,Grape) (Orange,Grape) 
1946  0.6    0.3   0.7

Note that the real result will not have 3 columns but around 5,000.

Eventually, I want to stack the flattened rows to get something like:

df_all_years

 (Apple,Orange) (Apple,Grape) (Orange,Grape) 
1946  0.6    0.3   0.7 
1947  0.7    0.25   0.8 
.. 
2015  0.75   0.3   0.65

What is the best way to do this?

2017-08-19 Melsauce

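'(Apple, Orange)' - is it a string or a tuple? – MaxU

A tuple. You can use anything you like, as long as there is a way to tell which combination a particular cell represents. – Melsauce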

I would arrange the computation a bit differently. Instead of looping over the years:

for year in range(1946, 2015): 
    partial_result = func(df, year)

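and then concatenating the partial results, you can get better performance by doing as much work as possible on the whole DataFrame, df, before calling df.groupby(...). Also, if you can express the computation in terms of builtin aggregators such as sum and count, it can be done faster than with custom functions passed to groupby/apply.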
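For comparison, the loop-and-concatenate route described above could be sketched as follows. This is only an illustration: flatten_network is a hypothetical helper name, and the two small frames stand in for the output of func(df, 1946) and func(df, 1947), using the weights from the question's example.

```python
import pandas as pd

def flatten_network(network_df, year):
    """Hypothetical helper: one Series per year, indexed by (Source, Target)."""
    return network_df.set_index(['Source', 'Target'])['Weight'].rename(year)

# Toy stand-ins for func(df, 1946) and func(df, 1947)
pairs = {'Source': ['Apple', 'Apple', 'Orange'],
         'Target': ['Orange', 'Grape', 'Grape']}
partial_results = {
    1946: pd.DataFrame({**pairs, 'Weight': [0.6, 0.3, 0.7]}),
    1947: pd.DataFrame({**pairs, 'Weight': [0.7, 0.25, 0.8]}),
}

rows = [flatten_network(net, year) for year, net in partial_results.items()]

# Each Series becomes one column; transpose so that the years are the rows
df_all_years = pd.concat(rows, axis=1).T
print(df_all_years)
```

The columns of df_all_years form a two-level (Source, Target) index, so a cell can be looked up with a tuple such as ('Apple', 'Grape'). The groupby-based code that follows avoids this per-year loop altogether.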

import itertools as IT 
import numpy as np 
import pandas as pd 
np.random.seed(2017) 

def make_df(): 
    N = 10000 
    df = pd.DataFrame({'fruit': np.random.choice(['Apple', 'Orange', 'Grape'], size=N), 
         'grade': np.random.choice([1,2,3], p=[0.7,0.1,0.2], size=N), 
         'year': np.random.choice(range(1946,1950), size=N)}) 
    df['manufacturer'] = (df['year'].astype(str) + '-' 
          + df.groupby(['year', 'fruit'])['fruit'].cumcount().astype(str)) 
    df = df.sort_values(by=['year']) 
    return df 

def similarity_score(df): 
    """ 
    Compute the score between each pair of columns in df 
    """ 
    agreements = {} 
    disagreements = {} 
    for col in IT.combinations(df, 2):
        fruit1 = df[col[0]].values
        fruit2 = df[col[1]].values
        agreements[col] = (((fruit1 == 1) & (fruit2 == 1))
                           | ((fruit1 == 3) & (fruit2 == 3)))
        disagreements[col] = (((fruit1 == 1) & (fruit2 == 3))
                              | ((fruit1 == 3) & (fruit2 == 1)))
    agreements = pd.DataFrame(agreements, index=df.index) 
    disagreements = pd.DataFrame(disagreements, index=df.index) 
    numerator = agreements.astype(int)-disagreements.astype(int) 
    grouped = numerator.groupby(level='year') 
    total = grouped.sum() 
    count = grouped.count() 
    score = ((total/count) + 1)/2 
    return score 

df = make_df() 
df2 = df.set_index(['year','fruit','manufacturer'])['grade'].unstack(['fruit']) 
df2 = df2.dropna(axis=0, how="any") 

print(similarity_score(df2))

yields

         Grape    Orange
         Apple     Apple     Grape
year
1946  0.629111  0.650426  0.641900
1947  0.644388  0.639344  0.633039
1948  0.613117  0.630566  0.616727
1949  0.634176  0.635379  0.637786
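The printed result has two-level column labels, whereas the question asks for columns labelled by tuples. A minimal sketch of one way to convert between the two, assuming a frame shaped like the output above (the name flat and the hand-typed sample values are only for illustration), uses MultiIndex.to_flat_index:

```python
import pandas as pd

# Hand-built frame mirroring the first two rows of the printed result
columns = pd.MultiIndex.from_tuples(
    [('Grape', 'Apple'), ('Orange', 'Apple'), ('Orange', 'Grape')])
score = pd.DataFrame([[0.629111, 0.650426, 0.641900],
                      [0.644388, 0.639344, 0.633039]],
                     index=pd.Index([1946, 1947], name='year'),
                     columns=columns)

# Replace the two-level columns with a flat Index whose labels are tuples
flat = score.copy()
flat.columns = score.columns.to_flat_index()
print(flat.columns.tolist())
```

Whether the flat form is preferable depends on what happens next; the two-level columns already allow selecting all pairs that involve a given fruit level.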

2017-08-19 21:23:45 unutbu

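I have edited the question and defined both df and func, so you can get a better idea of what is going on. Happy to provide more information. – Melsauce

Here is one way to build a conventional pandas pivot table like the one you refer to. It handles roughly 5,000 columns, formed by combining two initially separate categories, quickly enough (the bottleneck step took about 20 seconds on my quad-core MacBook), although for much larger scales there are definitely faster strategies. The data in this example is quite sparse (5K columns, with 5K random samples spread over the 70 years [1947-2016]), so execution time may be some seconds longer with a fuller dataframe.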

from itertools import chain 
import pandas as pd 
import numpy as np 
import random # using python3 .choices() 
import re 

# Make bivariate data w/ 5000 total combinations (1000x5 categories) 
# Also choose 5,000 randomly; some combinations may have >1 values or NaN 
random_sample_data = np.array(
    [random.choices(['Apple', 'Orange', 'Lemon', 'Lime'] + 
        ['of Fruit' + str(i) for i in range(1000)], 
        k=5000), 
    random.choices(['Grapes', 'Are Purple', 'And Make Wine', 
        'From the Yeast', 'That Love Sugar'], 
        k=5000), 
    [random.random() for _ in range(5000)]] 
).T 
df = pd.DataFrame(random_sample_data, columns=[ 
        "Source", "Target", "Weight"]) 
df['Year'] = random.choices(range(1947, 2017), k=df.shape[0]) 

# Three views of resulting df in jupyter notebook: 
df 
df[df.Year == 1947] 
df.groupby(["Source", "Target"]).count().unstack()

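To flatten the grouped-by-year data, since groupby needs a function to apply, you can use a temporary df as an intermediary:

1) Push all of the data from df.groupby("Year") into one row per year, with the "Target" and "Source" columns (to be combined later) and the "Weight" column each holding a separate Series.
2) Use zip and pd.core.reshape.util.cartesian_product to create an empty, properly shaped pivot df, which becomes the final table filled from df_temp.

For example,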

df_temp = df.groupby("Year").apply(
    lambda s: pd.DataFrame([(s.Target, s.Source, s.Weight)], 
          columns=["Target", "Source", "Weight"]) 
).sort_index() 
df_temp.index = df_temp.index.droplevel(1) # reduce MultiIndex to 1-d 

# Predetermine all possible pairwise column category combinations 
product_ts = [*zip(*(pd.core.reshape.util.cartesian_product(
    [df.Target.unique(), df.Source.unique()]) 
))] 

ts_combinations = [str(x + ' ' + y) for (x, y) in product_ts] 

ts_combinations

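Finally, use simple nested iteration (again, not the fastest approach, though pd.DataFrame.iterrows may help speed things up, as shown). Because the random sampling is done with replacement, I had to handle multiple values for the same cell, so you may want to remove the conditional under the second for loop. That loop is the step where the three separate Series stored for each year are zipped into a single row of cells through the pivoted ("Weight") x ("Target" - "Source") relation.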

# Size the empty pivot from the data instead of hard-coding (70, 5000):
# random sampling need not produce every year or every combination.
# dtype=object lets a cell hold either 0.0 or a list of weights.
df_pivot = pd.DataFrame(0.0, index=df_temp.index,
                        columns=ts_combinations, dtype=object)

for year, values in df_temp.iterrows():

    for (target, source, weight) in zip(*values):

        bivar_pair = str(target + ' ' + source)
        curr_weight = df_pivot.at[year, bivar_pair]

        if isinstance(curr_weight, list):
            # append additional values if encountered
            curr_weight.append(weight)
        else:
            # first value seen for this year and pair
            df_pivot.at[year, bivar_pair] = [weight]

# Spotcheck: 
# Verifies matching data in pivoted table vs. original for Target+Source 
# combination "And Make Wine of Fruit614" across all 70 years 1947-2016 
df 
df_pivot['And Make Wine of Fruit614'] 
df[(df.Year == 1947) & (df.Target == 'And Make Wine') & (df.Source == 'of Fruit614')]
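The answer notes that faster strategies exist for larger data. One candidate, sketched here on a small hypothetical table rather than the random data above, is to let DataFrame.pivot_table do the reshaping instead of the nested loops. The pair column and aggfunc='mean' are assumptions made for this illustration: the weights are taken to be numeric, and duplicate (year, pair) entries are averaged rather than collected.

```python
import pandas as pd

# Small hypothetical long-format table with the same column names as above
df = pd.DataFrame({
    'Source': ['Apple', 'Apple', 'Lime', 'Apple'],
    'Target': ['Grapes', 'Grapes', 'Are Purple', 'Are Purple'],
    'Weight': [0.2, 0.4, 0.9, 0.5],
    'Year':   [1947, 1947, 1947, 1948],
})

# Same "Target Source" label that bivar_pair builds in the loop above
df['pair'] = df['Target'] + ' ' + df['Source']

# One row per year, one column per pair; duplicates within a year are averaged
df_pivot = df.pivot_table(index='Year', columns='pair',
                          values='Weight', aggfunc='mean')
print(df_pivot)
```

Pairs that never occur in a given year come out as NaN here instead of 0.0, which keeps missing data distinguishable from a genuine zero weight.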

2017-08-20 05:51:21 johnxcollins
