2015-11-05 36 views
0

I have a CSV in the following format and want to convert its flat column labels into multi-level (MultiIndex) columns

print(rfd.iloc[:5, :5])

          Sub-division January 2010 Actual January 2010 Normal January 2011 Actual February 2010 Actual 
0   Andaman and Nicobar Islands     98.2     53.7     222.5      5.8 
1      Arunachal Pradesh     0.4     50.1     37.6     10.0 
2      Assam and Meghalaya     0.2     16.4     9.0      3.4 
3 Nagaland,Manipur, Mizoram, and Tripura     0.9     13.7     7.9     10.9 
4  Sub-Himalayan,West Bengal & Sikkim     1.7     26.6     7.1      6.4 

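How can I convert this into multi-level columns? The first level should be the year, followed by the month and then the type (Actual or Normal).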

rfd.columns 
Out[89]: 
Index([u'Sub-division ', u'January 2010 Actual ', u'January 2010 Normal ', 
     u'January 2011 Actual ', u'February 2010 Actual ', 
    .... 
     u'December 2010 Normal ', u' December 2011 Actual '], 
     dtype='object') 

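I tried rfd.columns = rfd.columns.str.split(" "), but after that the DataFrame raises TypeError: unhashable type: 'list'. If it were a single file I could edit the CSV and reload it, but this is a recurring process, so I am looking for a solution I can apply while iterating over files.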
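The TypeError appears because str.split turns every label into a list, and column labels must be hashable. One possible way around it, sketched below on a small stand-in frame (the frame, the helper name to_levels and the level names are assumptions for illustration, not part of the original post), is to turn each label into a tuple and build the columns with pd.MultiIndex.from_tuples:

```python
import pandas as pd

# Tiny stand-in for rfd, using values from the sample above.
# The stray spaces in the labels mimic the real header.
rfd = pd.DataFrame({
    'Sub-division ': ['Andaman and Nicobar Islands', 'Arunachal Pradesh'],
    'January 2010 Actual ': [98.2, 0.4],
    'January 2010 Normal ': [53.7, 50.1],
    ' January 2011 Actual ': [222.5, 37.6],
})

# Strip whitespace from the labels, then move the key column into the index
# so that only the "Month Year Type" labels remain as columns.
rfd.columns = rfd.columns.str.strip()
wide = rfd.set_index('Sub-division')


def to_levels(label):
    """Hypothetical helper: 'January 2010 Actual' -> ('2010', 'January', 'Actual')."""
    month, year, kind = label.split()
    return (year, month, kind)


# Tuples are hashable, so they can serve as MultiIndex entries.
wide.columns = pd.MultiIndex.from_tuples(
    [to_levels(c) for c in wide.columns],
    names=['Year', 'Month', 'Type'],
)

print(wide['2010'])  # all columns for the year 2010
```

Because the whole transformation is driven by the labels themselves, the same lines can be wrapped in a function and applied to each file in a loop, provided every label really has exactly three whitespace-separated parts.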

Adding two rows of the data as a dict:

{'April 2010 Normal': {0: 81.5, 1: 278.80000000000001}, 
'April 2010 Actual': {0: 12.699999999999999, 1: 245.80000000000001}, 
'April 2011 Actual': {0: 83.700000000000003, 1: 114.7}, 
'August 2010 Actual': {0: 550.0, 1: 343.30000000000001}, 
'August 2010 Normal': {0: 403.80000000000001, 1: 359.89999999999998}, 
'August 2011 Actual': {0: 513.0, 1: 225.80000000000001}, 
'December 2010 Normal': {0: 145.5, 1: 38.399999999999999}, 
'December 2010 Actual': {0: 254.40000000000001, 1: 6.0}, 
'December 2011 Actual': {0: 246.30000000000001, 1: 10.300000000000001}, 
'February 2010 Actual': {0: 5.7999999999999998, 1: 10.0}, 
'February 2010 Normal': {0: 29.199999999999999, 1: 98.0}, 
'February 2011 Actual': {0: 81.900000000000006, 1: 36.799999999999997}, 
'January 2010 Normal': {0: 53.700000000000003, 1: 50.100000000000001}, 
'January 2010 Actual': {0: 98.200000000000003, 1: 0.40000000000000002}, 
'January 2011 Actual': {0: 222.5, 1: 37.600000000000001}, 
'July 2010 Normal': {0: 407.69999999999999, 1: 536.10000000000002}, 
'July 2010 Actual': {0: 522.10000000000002, 1: 426.0}, 
'July 2011 Actual': {0: 575.79999999999995, 1: 553.5}, 
'June 2010 Normal': {0: 438.60000000000002, 1: 500.39999999999998}, 
'June 2011 Actual': {0: 418.39999999999998, 1: 336.80000000000001}, 
'June 2010 Actual': {0: 435.0, 1: 397.30000000000001}, 
'March 2010 Normal': {0: 25.0, 1: 179.69999999999999}, 
'March 2010 Normal': {0: 20.5, 1: 164.40000000000001}, 
'March 2011 Actual': {0: 305.5, 1: 121.5}, 
'March 2010 Actual': {0: 0.40000000000000002, 1: 143.59999999999999}, 
'May 2010 Actual': {0: 310.69999999999999, 1: 273.80000000000001}, 
'May 2010 Normal': {0: 358.5, 1: 291.89999999999998}, 
'May 2011 Actual': {0: 305.69999999999999, 1: 157.80000000000001}, 
'November 2010 Normal': {0: 253.69999999999999, 1: 45.799999999999997}, 
'November 2010 Actual': {0: 281.39999999999998, 1: 59.700000000000003}, 
'November 2011 Actual': {0: 126.0, 1: 19.800000000000001}, 
'October 2010 Actual': {0: 415.19999999999999, 1: 84.400000000000006}, 
'October 2010 Normal': {0: 296.69999999999999, 1: 183.0}, 
'October 2011 Actual': {0: 183.80000000000001, 1: 46.799999999999997}, 
'September 2010 Normal': {0: 432.39999999999998, 1: 371.60000000000002}, 
'September 2010 Actual': {0: 261.30000000000001, 1: 407.39999999999998}, 
'September 2011 Actual': {0: 770.89999999999998, 1: 262.0}, 
'Sub-division': {0: 'Andaman and Nicobar Islands ', 1: 'Arunachal Pradesh'}, 
'october 2010 Normal': {0: 297.80000000000001, 1: 159.09999999999999}} 

Answers

1

I'm sure this is not the "best way" to do it, and it is probably far from optimal:

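import pandas as pd 

# Load the CSV; sep=';' matches the file used for this answer, adjust it to yours.
a = pd.read_csv('data.csv', sep=';') 
# Reshape to long form: one row per (column label, Sub-division) pair.
b = a.set_index('Sub-division').unstack().reset_index() 
c = b['level_0'] 

# Split each label such as 'January 2010 Actual' into three named parts.
d = c.str.extract(r'(?P<Month>[A-Za-z]*) +(?P<Year>[0-9][\w\d]*) +(?P<Level>[A-Za-z]*)') 

e = pd.concat([b[['Sub-division',0]], d], axis=1) 

f = e.set_index(['Sub-division', 'Year', 'Month', 'Level']) 

# Move the three parts back into the columns as a MultiIndex.
f = f.unstack(['Year','Month','Level']) 

f.columns = f.columns.droplevel(0) 

# Sort by the outermost column level (Year). The original used f.sortlevel(level=0, axis=1),
# which was later removed from pandas; the result must also be assigned to take effect.
f = f.sort_index(axis=1, level=0) 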

But it does work. The function you are looking for is probably str.extract (available on a Series through the .str accessor).
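To see in isolation what the extract step does, here is a minimal sketch on three labels taken from the question. The pattern is a slightly tightened variant of the one above (an assumption of this example, not the answer's exact regex), and expand=True requires a reasonably recent pandas:

```python
import pandas as pd

# Labels as they appear in the question, including stray spaces.
labels = pd.Series([
    'January 2010 Actual ',
    ' December 2011 Actual ',
    'February 2010 Normal ',
])

# One named group per target level; each group name becomes a column name.
pattern = r'(?P<Month>[A-Za-z]+) +(?P<Year>[0-9]{4}) +(?P<Level>[A-Za-z]+)'

# expand=True always returns a DataFrame, one column per group.
parts = labels.str.extract(pattern, expand=True)
print(parts)
```

The pattern is searched anywhere in the string, so leading and trailing spaces in the labels do not need to be stripped first. Labels that do not match produce a row of NaN, which is worth checking for before reshaping.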

It outputs this:

Year                                       2010                      2011
Month                                  February  January          January
Level                                    Actual   Actual  Normal   Actual
Sub-division
Andaman and Nicobar Islands                 5.8     98.2    53.7    222.5
Arunachal Pradesh                          10.0      0.4    50.1     37.6
Assam and Meghalaya                         3.4      0.2    16.4      9.0
Nagaland,Manipur, Mizoram and Tripura      10.9      0.9    13.7      7.9
Sub-Himalayan,West Bengal & Sikkim          6.4      1.7    26.6      7.1

pandas has dedicated tools for working with time series, so there may well be a better representation for this data.
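The answer does not say which time series tools it has in mind. As one possible reading (an assumption of this example), the month and year can be parsed into real dates so that the data sorts chronologically instead of alphabetically by month name. The column names label, value, kind and date are made up for illustration:

```python
import pandas as pd

# Small stand-in frame with three of the columns from the question.
wide = pd.DataFrame({
    'Sub-division': ['Andaman and Nicobar Islands', 'Arunachal Pradesh'],
    'January 2010 Actual': [98.2, 0.4],
    'February 2010 Actual': [5.8, 10.0],
    'January 2011 Actual': [222.5, 37.6],
})

# Long form: one row per (Sub-division, label) pair.
tidy = wide.melt(id_vars='Sub-division', var_name='label', value_name='value')

parts = tidy['label'].str.extract(
    r'(?P<month>[A-Za-z]+) +(?P<year>[0-9]{4}) +(?P<kind>[A-Za-z]+)',
    expand=True,
)

# Parse "January 2010" into a timestamp (first day of the month).
# %B expects full month names as spelled in the current locale (English assumed).
tidy['date'] = pd.to_datetime(parts['month'] + ' ' + parts['year'], format='%B %Y')
tidy['kind'] = parts['kind']

# Dates as the row index, (Sub-division, kind) as column levels.
ts = tidy.pivot_table(index='date', columns=['Sub-division', 'kind'], values='value')
print(ts)
```

With a DatetimeIndex in place, operations such as selecting a year with ts.loc['2010'] or resampling become available. Note that pivot_table averages any duplicate keys by default rather than raising an error.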

+0

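Thanks for the post. I'm not looking for an efficient approach. I get this error: ValueError: Index contains duplicate entries, cannot reshape, raised at 'f = f.unstack(['Year','Month','Level'])'. +1 for everything up to that point – WoodChopper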

+0

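It works for me. I'm using pandas 0.17.0, so maybe it is a version difference? My only other idea is that your initial DataFrame differs from mine. In any case, the last lines are purely cosmetic. – luismf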

+0

Mine is '0.16.2', so that could be it. I used the same data as in the dict: 'a = pandas.DataFrame(copypaste)' – WoodChopper
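The thread never establishes what caused the ValueError. The message itself means that, after the extract step, at least two rows share the same (Sub-division, Year, Month, Level) key. One possible source, offered here purely as an assumption, is a pair of headers that differ only in something the regex discards, such as the ".1" suffix that read_csv appends to a repeated column name. A sketch for locating such rows before calling unstack, using made-up labels:

```python
import pandas as pd

# Hypothetical long-form frame shaped like `b` in the answer. The second label
# mimics what read_csv produces when a header occurs twice.
b = pd.DataFrame({
    'Sub-division': ['Arunachal Pradesh'] * 3,
    'level_0': ['March 2010 Normal', 'March 2010 Normal.1', 'March 2011 Actual'],
    0: [179.7, 164.4, 121.5],
})

d = b['level_0'].str.extract(
    r'(?P<Month>[A-Za-z]+) +(?P<Year>[0-9]+) +(?P<Level>[A-Za-z]+)',
    expand=True,
)
e = pd.concat([b[['Sub-division', 0]], d], axis=1)

# keep=False flags every member of a duplicated group, not only the repeats.
keys = ['Sub-division', 'Year', 'Month', 'Level']
dupes = e[e.duplicated(subset=keys, keep=False)]
print(dupes)
```

If this prints any rows, the reshape cannot succeed until they are resolved, for example by correcting the headers or by dropping or aggregating the duplicates, whichever fits the data.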