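I have a function that takes in the data for a particular year and returns a dataframe. How can I flatten the individual pandas dataframes and stack them to build a new dataframe?
For example: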
DF
year fruit license grade
1946 apple XYZ 1
1946 orange XYZ 1
1946 apple PQR 3
1946 orange PQR 1
1946 grape XYZ 2
1946 grape PQR 1
..
2014 grape LMN 1
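Notes: 1) A specific license value exists only in one particular year, and only once per fruit (e.g. XYZ appears only in 1946, and only once each for apple, orange and grape). 2) The grade values are categorical.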
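For reproducibility, the six 1946 rows shown above can be rebuilt as follows. This is only a sketch of the sample data; the variable name `wide` is an assumption for illustration. Note 1 matters in practice because `DataFrame.pivot` raises an error when an (index, column) pair occurs more than once.

```python
import pandas as pd

# The six 1946 rows from the sample table above
df = pd.DataFrame(
    [
        (1946, "apple", "XYZ", 1),
        (1946, "orange", "XYZ", 1),
        (1946, "apple", "PQR", 3),
        (1946, "orange", "PQR", 1),
        (1946, "grape", "XYZ", 2),
        (1946, "grape", "PQR", 1),
    ],
    columns=["year", "fruit", "license", "grade"],
)

# One row per fruit, one column per license; this works because each
# (fruit, license) pair occurs exactly once within a year
wide = df[df["year"] == 1946].pivot(index="fruit", columns="license", values="grade")
```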
I realize the function below is not a very efficient way to reach the intended goal, but this is what I am currently working with.
import numpy as np
import pandas as pd
from itertools import combinations

def func(df, year):
    # 1. Filter out only the data for the year needed
    df_year = df[df['year'] == year]
    '''
    2. Transform DataFrame to the form:
             XYZ  PQR  ..  LMN
    apple     1    3        1
    orange    1    1        3
    grape     2    1        1
    Note that 'LMN' is just used for representation purposes.
    It won't logically appear here because it can only appear for the year 2014.
    '''
    df_year = df_year.pivot(index='fruit', columns='license', values='grade')
    # 3. Remove all fruits (rows) that have ANY NaN values
    df_year = df_year.dropna(axis=0, how='any')
    # 4. Some additional filtering
    # 5. Function to calculate similarity between fruits
    def similarity_score(fruit1, fruit2):
        agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                            ((fruit1 == 3) & (fruit2 == 3)))
        disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                               ((fruit1 == 3) & (fruit2 == 1)))
        return ((agreements - disagreements) / float(len(fruit1)) + 1) / 2
    # 6. Create Network dataframe, one row per pair of fruits
    rows = []
    for f1, f2 in combinations(df_year.index, 2):
        # Pass NumPy arrays so the comparisons above are element-wise
        g1, g2 = df_year.loc[f1].values, df_year.loc[f2].values
        rows.append([f1, f2, similarity_score(g1, g2)])
    return pd.DataFrame(rows, columns=['Source', 'Target', 'Weight'])
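Since the pairwise loop is the part described as inefficient, here is a hedged sketch of how the same score from step 5 might be computed for all fruit pairs at once with matrix products. The function name `pairwise_similarity` and the tiny `grades` frame are assumptions for illustration; it expects the pivoted form from step 2 (fruits as rows, licenses as columns, no NaN).

```python
import numpy as np
import pandas as pd

def pairwise_similarity(grades):
    """Hypothetical helper: fruit-by-fruit matrix of the step 5 score."""
    is1 = (grades.values == 1).astype(int)  # 1 where the grade is 1
    is3 = (grades.values == 3).astype(int)  # 1 where the grade is 3
    # Entry (i, j) counts the licenses where fruits i and j both have 1, or both 3
    agreements = is1 @ is1.T + is3 @ is3.T
    # Entry (i, j) counts the licenses where one fruit has 1 and the other 3
    disagreements = is1 @ is3.T + is3 @ is1.T
    score = ((agreements - disagreements) / grades.shape[1] + 1) / 2
    return pd.DataFrame(score, index=grades.index, columns=grades.index)

# Pivoted sample for 1946, built from the example rows in the question
grades = pd.DataFrame(
    {"XYZ": [1, 1, 2], "PQR": [3, 1, 1]},
    index=["apple", "orange", "grape"],
)
sim = pairwise_similarity(grades)
```

The result is a symmetric matrix, so only the entries above the diagonal are needed to fill the Source/Target/Weight table.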
Running the above gives:
df_1946=func(df,1946)
df_1946.head()
Source Target Weight
Apple Orange 0.6
Apple Grape 0.3
Orange Grape 0.7
I would like to flatten the above into a single row:
(Apple,Orange) (Apple,Grape) (Orange,Grape)
1946 0.6 0.3 0.7
Note that the real result will not have 3 columns as above, but around 5000 columns.
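To make the target shape concrete, one possible way to flatten a Source/Target/Weight frame into a single row is sketched below. The helper name `flatten_pairs` is an assumption, and the column labels come out as a two-level `MultiIndex`, which can be addressed with tuples such as `("Apple", "Orange")`.

```python
import pandas as pd

# The example output of func(df, 1946) shown above
network_df = pd.DataFrame({
    "Source": ["Apple", "Apple", "Orange"],
    "Target": ["Orange", "Grape", "Grape"],
    "Weight": [0.6, 0.3, 0.7],
})

def flatten_pairs(network_df, year):
    """Hypothetical helper: one row labelled by year, one column per pair."""
    # Index the weights by (Source, Target) and name the Series after the year
    row = network_df.set_index(["Source", "Target"])["Weight"].rename(year)
    # A one-column frame transposed gives a one-row frame
    return row.to_frame().T

flat_1946 = flatten_pairs(network_df, 1946)
```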
Finally, I would like to stack the flattened dataframe rows to get something like:
df_all_years
(Apple,Orange) (Apple,Grape) (Orange,Grape)
1946 0.6 0.3 0.7
1947 0.7 0.25 0.8
..
2015 0.75 0.3 0.65
What is the best way to do this?
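For illustration, the stacking step could look roughly like the sketch below, which uses `pd.concat` to align the per-year weights on their (Source, Target) pairs. The helper name `stack_years` and the dict-of-frames input are assumptions; the numbers are the 1946 and 1947 rows from the example above.

```python
import pandas as pd

def stack_years(network_by_year):
    """Hypothetical helper: dict of year -> Source/Target/Weight frame."""
    series_by_year = {
        year: net.set_index(["Source", "Target"])["Weight"]
        for year, net in network_by_year.items()
    }
    # Each year becomes a column aligned on the pairs, then a row after .T;
    # a pair missing in some year ends up as NaN in that year's row
    return pd.concat(series_by_year, axis=1).T

pairs = {"Source": ["Apple", "Apple", "Orange"],
         "Target": ["Orange", "Grape", "Grape"]}
network_by_year = {
    1946: pd.DataFrame({**pairs, "Weight": [0.6, 0.3, 0.7]}),
    1947: pd.DataFrame({**pairs, "Weight": [0.7, 0.25, 0.8]}),
}
df_all_years = stack_years(network_by_year)
```

With the function from the question, the input dict could presumably be built as `{year: func(df, year) for year in years}` for whatever range of years is needed.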
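'(Apple, Orange)' - is it a string or a tuple? – MaxU
A tuple. You can use anything you like, as long as there is a way to tell which combination a particular cell represents. – Melsauce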