2014-02-25 51 views
2

How to handle Excel file headers using pandas/Python

I want to read https://www.whatdotheyknow.com/request/193811/response/480664/attach/3/GCSE%20IGCSE%20results%20v3.xlsx using pandas.

Having saved the file locally, my script is:

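import sys
import pandas as pd
# Path to the workbook, passed as the first command-line argument
inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
# print(xl.sheet_names)
# Parse the first sheet; by default row 0 supplies the column names
df = xl.parse(xl.sheet_names[0])
print(df.head())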

However, this does not seem to handle the headers correctly, because it gives:

GCSE and IGCSE1 results2,3 in selected subjects4 of pupils at the end of key stage 4 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 
0        Year: 2010/11 (Final)           NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   NaN 
1         Coverage: England           NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   NaN 
2            NaN           NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   NaN 
3 1. Includes International GCSE, Cambridge Inte...           NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   NaN 
4 2. Includes attempts and achievements by these...           NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   NaN 

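All of this should be treated as comments.

If you load the spreadsheet into LibreOffice, for example, you can see that the column headers are parsed correctly and appear in row 15, each with a drop-down menu that lets you select the items you want.

How can I get pandas to detect the position of the column headers automatically, the way LibreOffice does?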

Answer

3

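pandas is (are?) processing the file correctly, and exactly the way you asked it (them?) to. You didn't specify a header value, which means it defaults to taking the column names from row 0. The first few rows of cells aren't comments in any fundamental sense; they are just cells you're not interested in.

Simply tell parse that you want to skip some rows: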

>>> xl = pd.ExcelFile("GCSE IGCSE results v3.xlsx") 
>>> df = xl.parse(xl.sheet_names[0], skiprows=14) 
>>> df.columns 
Index([u'Local Authority Number', u'Local Authority Name', u'Local Authority Establishment Number', u'Unique Reference Number', u'School Name', u'Town', u'Number of pupils at the end of key stage 4', u'Number of pupils attempting a GCSE or an IGCSE', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-G', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-A', u'Number of students achieving 5 A*-A grades or more at GCSE or IGCSE'], dtype='object') 
>>> df.head() 
    Local Authority Number Local Authority Name \ 
0      201  City of london 
1      201  City of london 
2      202    Camden 
3      202    Camden 
4      202    Camden 

    Local Authority Establishment Number Unique Reference Number \ 
0        2016005     100001 
1        2016007     100003 
2        2024104     100049 
3        2024166     100050 
4        2024196     100051 

         School Name Town \ 
0 City of London School for Girls London 
1   City of London School London 
2    Haverstock School London 
3   Parliament Hill School London 
4    Regent High School London 

    Number of pupils at the end of key stage 4 \ 
0          105 
1          140 
2          200 
3          172 
4          174 

    Number of pupils attempting a GCSE or an IGCSE \ 
0           104 
1           140 
2           194 
3           169 
4           171 

    Number of students achieving 8 or more GCSE or IGCSE passes at A*-G \ 
0            100      
1            108      
2            SUPP      
3             22      
4             0      

    Number of students achieving 8 or more GCSE or IGCSE passes at A*-A \ 
0             87      
1             75      
2             0      
3             7      
4             0      

    Number of students achieving 5 A*-A grades or more at GCSE or IGCSE 
0            100     
1            123     
2             0     
3             34     
4            SUPP      

[5 rows x 11 columns] 
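
If the number of rows to skip is not known in advance, one option is to guess it. The following is a minimal sketch of one possible heuristic, not what LibreOffice does and not a pandas feature: it treats the first row in which most cells are filled as the header row. The names `is_empty` and `find_header_row`, the `min_filled` threshold, and the sample rows are all hypothetical and chosen only for illustration.

```python
import math


def is_empty(cell):
    """Treat None, NaN, and blank strings as empty cells."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return str(cell).strip() == ""


def find_header_row(rows, min_filled=0.8):
    """Return the 0-based index of the first row whose share of
    non-empty cells is at least min_filled, or None if no row qualifies."""
    width = max((len(row) for row in rows), default=0)
    if width == 0:
        return None
    for index, row in enumerate(rows):
        filled = sum(1 for cell in row if not is_empty(cell))
        if filled / width >= min_filled:
            return index
    return None


# Hypothetical rows shaped like the top of the sheet: title and notes in
# the first column only, a blank row, then the real header and data.
nan = float("nan")
sample = [
    ["GCSE and IGCSE results ...", nan, nan, nan],
    ["Year: 2010/11 (Final)", nan, nan, nan],
    [nan, nan, nan, nan],
    ["Local Authority Number", "Local Authority Name", "School Name", "Town"],
    [201, "City of london", "City of London School for Girls", "London"],
]
header_index = find_header_row(sample)
print(header_index)
```

With pandas, one way to apply this (an assumption about usage, not tested against this workbook) would be to parse the sheet once with header=None, pass the resulting rows as a list of lists to find_header_row, and then parse again with header set to the returned index. The heuristic will misfire on sheets whose notes span many columns or that hold several tables, so the result is worth checking by eye.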
+0

Thanks. How does LibreOffice know to skip the first 14 rows automatically? That is why I thought there might be more to this question. – felix

+0

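@felix: FWIW, when I open it in LibreOffice, I see rows 1-14. I suppose in principle you could detect that a grouped table (or whatever they're called) has been defined and extract it, but you can have multiple tables on one sheet. – DSM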

+0

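I mean that you do see rows 1-14, but row 15 is clearly recognized by LibreOffice as the column headers. In my version, every field in row 15 has a drop-down menu. Do you get the same thing? This is what I see: http://postimg.org/image/fbgkgxelp/. – felix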