I have a DataFrame with 262800 rows and 3 columns that currently looks like this:
Currency Maturity value
0 GBP 0.08333333 4.709456
1 GBP 0.08333333 4.713099
2 GBP 0.08333333 4.707237
3 GBP 0.08333333 4.705043
4 GBP 0.08333333 4.697150
5 GBP 0.08333333 4.710647
6 GBP 0.08333333 4.701150
7 GBP 0.08333333 4.694639
8 GBP 0.08333333 4.686111
9 GBP 0.08333333 4.714750
......
262770 GBP 25 2.432869
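I want the DataFrame to be in the format shown below. I have taken several steps, including using melt in the code below, but for some reason that got rid of my Date column and produced the DataFrame above. I am not sure how to get the Date column back and obtain the following DataFrame: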
Maturity Date Currency Yield_pct
0 0.08333333 2005-01-04 GBP 4.709456
1 0.08333333 2005-01-05 GBP 4.713099
2 0.08333333 2005-01-06 GBP 4.707237
....
9 25 2005-01-04 GBP 2.432869
My code is as follows:
from pandas.io.excel import read_excel
import pandas as pd
import numpy as np
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
# check the sheet number, spot: 9/9, short end 7/9
spot_curve = read_excel(url, sheetname=8)
short_end_spot_curve = read_excel(url, sheetname=6)
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'Maturity'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'Maturity'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
# merge these two; their time indexes are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
def filter_func(group):
    return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
idx = 0
values = ['GBP'] * len(combined_data.index)
combined_data.insert(idx, 'Currency', values)
# print(combined_data.columns.values)
# I had to do the melt. I melted arbitrarily on 'Currency' because, for some reason,
# print(combined_data.columns.values) shows that 'Currency' corresponds to 0.08333333, etc.
combined_data = pd.melt(combined_data, id_vars=['Currency'])
print(combined_data)
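A likely reason the dates disappear is that pd.melt ignores the index by default, and in the code above the dates appear to live in the index rather than in a column. The sketch below is one possible way to keep them, not a confirmed fix for the exact Bank of England file: it builds a small stand-in for combined_data (the name wide, the index name 'Date', and the last two 25-year yields are assumptions made up for illustration), moves the index into a column with reset_index, and then melts on both 'Date' and 'Currency'.

```python
import pandas as pd

# Stand-in for combined_data: dates in the index, maturities as columns.
# The 1-month yields and the first 25-year yield come from the question;
# the other two 25-year yields are placeholders, not real data.
dates = pd.to_datetime(['2005-01-04', '2005-01-05', '2005-01-06'])
wide = pd.DataFrame(
    {0.08333333: [4.709456, 4.713099, 4.707237],
     25.0: [2.432869, 2.40, 2.41]},
    index=dates,
)
wide.index.name = 'Date'          # assumed name for the date index
wide.columns.name = 'Maturity'
wide.insert(0, 'Currency', 'GBP')

# melt drops the index by default, so turn the dates into a regular column first
long_input = wide.reset_index()

# keep both Date and Currency as identifier columns; every maturity becomes a row
result = pd.melt(long_input, id_vars=['Date', 'Currency'], var_name='Maturity')

# reorder the columns to match the desired layout
result = result[['Maturity', 'Date', 'Currency', 'value']]
print(result)
```

The yield column is still called 'value' here, because that is the default name melt gives it.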
Great. May I ask one more question? Is there a way to change the column name 'value' to 'Yield_pct'? – Jojo
Sure. Personally I like to use a dictionary, because it is easy to see what the name was before: `result.rename(columns={'value': 'Yield_pct'}, inplace=True)` – bastewart
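As a minimal sketch of the dictionary-based rename on a one-row stand-in frame (the frame contents here are only illustrative): without inplace=True, rename returns a new DataFrame and leaves the original untouched, so either assign the result or pass inplace=True as in the comment above.

```python
import pandas as pd

# One-row stand-in for the melted frame; values taken from the question's sample
result = pd.DataFrame(
    {'Maturity': [0.08333333], 'Currency': ['GBP'], 'value': [4.709456]}
)

# The dictionary maps old column name -> new column name
renamed = result.rename(columns={'value': 'Yield_pct'})

print(renamed.columns.tolist())  # ['Maturity', 'Currency', 'Yield_pct']
print(result.columns.tolist())   # original is unchanged: ['Maturity', 'Currency', 'value']
```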