
Cleaning up an HTML table with pandas

I want to read a table on a website and parse its values. To do this, I did the following:

import pandas as pd

url = 'http://www.astro.keele.ac.uk/jkt/debcat/'  
df = pd.read_html(url, header=0) 

Even with header=0 I still had a leftover header table, which is df[0] (pd.read_html returns a list of every table it finds on the page), so I did the following:

df1 = df[1] 
df1.shape 
(161, 11) 
df1.columns 
Index([u' System ', u' Period (days) ', u' V B-V ', u' Spectral type ', u' Mass (Msun)', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H] (dex) ', u' References and notes '], dtype='object') 

However, I can't get

df1.Period 

'DataFrame' object has no attribute 'Period'

Nor can I do:

df1.to_csv('junk.csv') 

So, how do I access the columns and clean up the table? Thanks!

Answers


ISTM it's already in a fair enough format:

>>> url = 'http://www.astro.keele.ac.uk/jkt/debcat/' 
>>> df = pd.read_html(url, header=0) 
>>> df1 = df[1] 
>>> df1.head() 
    System Period (days) V B-V Spectral type \ 
0 V3903 Sgr   1.744  NaT    NaT 
1 V467 Vel   2.753  NaT    NaT 
2  EM Car   3.414  NaT    NaT 
3  Y Cyg   2.996  NaT    NaT 
4 V478 Cyg   2.881  NaT    NaT 

       Mass (Msun)    Radius (Rsun) \ 
0 27.27 ± 0.55 19.01 ± 0.44 8.088 ± 0.086 6.125 ± 0.060 
1  25.3 ± 0.7 8.25 ± 0.17  9.99 ± 0.09 3.49 ± 0.03 
2 22.89 ± 0.32 21.43 ± 0.33  9.35 ± 0.17 8.34 ± 0.14 
3 17.57 ± 0.27 17.04 ± 0.26  5.93 ± 0.07 5.78 ± 0.07 
4 16.67 ± 0.45 16.31 ± 0.35 7.423 ± 0.079 7.423 ± 0.079 

     Surface gravity (cgs)     log Teff (K) \ 
0 4.058 ± 0.016 4.143 ± 0.013 4.580 ± 0.021 4.531 ± 0.021 
1 3.842 ± 0.016 4.268 ± 0.017 4.559 ± 0.031 4.402 ± 0.046 
2 3.856 ± 0.017 3.926 ± 0.016 4.531 ± 0.026 4.531 ± 0.026 
3  4.16 ± 0.10 4.18 ± 0.10 4.545 ± 0.007 4.534 ± 0.007 
4 3.919 ± 0.015 3.909 ± 0.013 4.484 ± 0.015 4.485 ± 0.015 

       log (L/Lsun) [M/H] (dex) \ 
0 5.087 ± 0.029 4.658 ± 0.032   NaN 
1 5.187 ± 0.126 3.649 ± 0.110   NaN 
2  5.02 ± 0.10 4.92 ± 0.10   NaN 
3       NaN 0.00 ± 0.00 
4  4.63 ± 0.06 4.63 ± 0.06   NaN 

           References and notes 
0     Vaz et al. (1997A&A...327.1094V) 
1    Michalska et al. (2013MNRAS.429.1354M) 
2   Andersen & Clausen (1989A&A...213..183A) 
3  Simon, Sturm & Fiedler (1994A&A...292..507S) 
4 Popper & Hill (1991AJ....101..600P) Popper & E... 

[5 rows x 11 columns] 

And since you know how to look at the columns:

>>> df1.columns 
Index([u' System ', u' Period (days) ', u' V B-V ', u' Spectral type ', u' Mass (Msun)', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H] (dex) ', u' References and notes '], dtype='object') 

it shouldn't be surprising that df1.Period doesn't work: after all, no column is called Period, and pandas won't randomly guess which one looks closest. If you want to work with the column names, you can do

>>> df1.columns = [x.strip() for x in df1.columns] # get rid of the leading/trailing spaces 
>>> df1 = df1.rename(columns={"Period (days)": "Period"}) 

after which both df1["Period"] (preferred) and df1.Period (the shortcut) will work:

>>> df1["Period"].describe() 
count 161.000000 
mean  32.035019 
std  98.392634 
min  0.452000 
25%  2.293000 
50%  3.895000 
75%  9.945000 
max  771.781000 
Name: Period, dtype: float64 
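
The measurement columns, on the other hand, still hold strings like '27.27 ± 0.55 19.01 ± 0.44' (primary and secondary component side by side), so they won't behave as numbers. A minimal sketch of pulling one such cell apart, assuming that value/error layout (parse_pm is a made-up helper, not a pandas function):

>>> import re 
>>> def parse_pm(cell): 
...     # grab every number in the cell and pair them up as (value, error) 
...     nums = [float(x) for x in re.findall(r'[-+]?\d+(?:\.\d+)?', cell)] 
...     return zip(nums[::2], nums[1::2]) 
... 
>>> parse_pm(u'27.27 ± 0.55 19.01 ± 0.44') 
[(27.27, 0.55), (19.01, 0.44)] 

Mapped over a whole column, e.g. df1['Mass (Msun)'].dropna().map(parse_pm), that gives the pairs row by row.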

"I also can't do df1.to_csv('junk.csv')" isn't an error report, because you don't explain why you can't, or what happens when you try. I'm going to assume you get an encoding error:

>>> df1.to_csv("out.csv") 
Traceback (most recent call last): 
[...] 
File "lib.pyx", line 845, in pandas.lib.write_csv_rows (pandas/lib.c:14261) 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb1' in position 6: ordinal not in range(128) 

which can be avoided if you specify an appropriate encoding:

>>> df1.to_csv("out.csv", encoding="utf8") 
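
As a quick sanity check, the file reads back in with the same encoding (index_col=0 just restores the default integer index):

>>> pd.read_csv("out.csv", index_col=0, encoding="utf8").shape 
(161, 11) 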

Perfect, that's what I had in mind. – Rohit


The column name parses as u' Period (days) ' (with the padding spaces), so to access the column:

>>> df1[u' Period (days) '] 

That said, you may want to use an HTML parsing library for this type of job; BeautifulSoup, for example, can do it very neatly:

>>> from bs4 import BeautifulSoup 
>>> from urllib2 import urlopen 

>>> url = 'http://www.astro.keele.ac.uk/jkt/debcat/' 
>>> html = urlopen(url).read() 
>>> soup = BeautifulSoup(html) 

>>> # catch the target table by its attributes 
>>> table = soup.find('table', attrs={'frame':'BOX', 'rules':'ALL'}) 

>>> # parse the table as a list of lists; each row as a single list 
>>> tbl = [[td.getText() for td in tr.findAll(['td', 'th'])] for tr in table.findAll('tr')] 

tbl now holds the target table as a list of lists, i.e. each row is a list of that row's cell values; for example, tbl[0] is simply the header:

>>> tbl[0] 
[u' System ', u' Period (days) ', u' V B-V ', u' Spectral type ', u' Mass (Msun)', u' Radius (Rsun) ', u' Surface gravity (cgs) ', u' log Teff (K) ', u' log (L/Lsun) ', u' [M/H] (dex) ', u' References and notes '] 
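
If you then want this back in pandas, a one-line sketch (assuming every row has the same number of cells as the header, which colspan tricks can break):

>>> import pandas as pd 
>>> df2 = pd.DataFrame(tbl[1:], columns=[c.strip() for c in tbl[0]]) 

From there the same column cleaning as in the other answer applies.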

Thanks for the solution, it does work. However, I'm looking for something that can be loaded into a DataFrame, i.e. in pandas; manipulating the data in its current state is quite painful. Thanks anyway. – Rohit