2016-02-17 177 views
-9

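Getting HTML table contents with Python

I want to extract the contents of the table below and save them to a CSV file with pandas, but only the dates (e.g. Thu, 11/02) and the values given in €/MWh. Thank you very much in advance.

HTML source: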

<table cellspacing="0" cellpadding="0" border="0" class="list hours responsive" width="100%"> 
<tbody> 
    <tr> 
     <th class="title"></th> 
     <th class="units"></th> 
     <th>Thu, 11/02</th> 
     <th>Fri, 12/02</th> 
     <th>Sat, 13/02</th> 
     <th>Sun, 14/02</th> 
     <th>Mon, 15/02</th> 
     <th>Tue, 16/02</th> 
     <th>Wed, 17/02</th> 
    </tr> 
    <tr class="no-border"> 
     <td class="title"> 
      00 - 01 
     </td> 
     <td>€/MWh</td> 
     <td>23.82</td> 
     <td>22.81</td> 
     <td>22.23</td> 
     <td>13.06</td> 
     <td>16.57</td> 
     <td>25.99</td> 
     <td>32.45</td> 
    </tr> 
    <tr> 
     <td>&nbsp;</td> 
     <td>MWh</td> 
     <td>10,266.0</td> 
     <td>9,626.6</td> 
     <td>12,255.9</td> 
     <td>11,084.7</td> 
     <td>11,039.5</td> 
     <td>13,134.7</td> 
     <td>9,958.1</td> 
    </tr> 
    <tr class="no-border"> 
     <td class="title"> 
      01 - 02 
     </td> 
     <td>€/MWh</td> 
     <td>21.48</td> 
     <td>21.59</td> 
     <td>21.10</td> 
     <td>12.17</td> 
     <td>16.00</td> 
     <td>23.65</td> 
     <td>31.27</td> 
    </tr> 
    <tr> 
     <td>&nbsp;</td> 
     <td>MWh</td> 
     <td>9,843.3</td> 
     <td>9,494.4</td> 
     <td>11,823.3</td> 
     <td>10,531.9</td> 
     <td>9,970.5</td> 
     <td>12,875.6</td> 
     <td>9,958.8</td> 
    </tr> 
    <tr class="no-border"> 
     <td class="title"> 
      02 - 03 
     </td> 
     <td>€/MWh</td> 
     <td>21.00</td> 
     <td>21.30</td> 
     <td>20.21</td> 
     <td>8.81</td> 
     <td>14.55</td> 
     <td>22.91</td> 
     <td>29.72</td> 
    </tr> 
    <tr> 
     <td>&nbsp;</td> 
     <td>MWh</td> 
     <td>9,857.0</td> 
     <td>9,427.9</td> 
     <td>11,755.2</td> 
     <td>10,061.9</td> 
     <td>9,881.7</td> 
     <td>12,841.0</td> 
     <td>9,896.9</td> 
    </tr> 
    <tr class="no-border"> 
     <td class="title"> 
      03 - 04 
     </td> 
     <td>€/MWh</td> 
     <td>19.94</td> 
     <td>19.86</td> 
     <td>19.94</td> 
     <td>6.74</td> 
     <td>13.14</td> 
     <td>22.04</td> 
     <td>27.44</td> 
    </tr> 
    <tr> 
     <td>&nbsp;</td> 
     <td>MWh</td> 
     <td>9,486.2</td> 
     <td>10,492.7</td> 
     <td>12,609.1</td> 
     <td>11,216.6</td> 
     <td>10,199.9</td> 
     <td>11,209.7</td> 
     <td>9,698.5</td> 
    </tr> 
</tbody>
</table>

+2

Please [edit] your question to 1) improve the indentation of your HTML and 2) add the Python code you have already tried.

+0

http://stackoverflow.com/questions/11790535/extracting-data-from-html-table – CodeMonkey

Answers

0

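The code below prints the contents of your page's table row by row, one cell per line:

from bs4 import BeautifulSoup
import urllib.request

# Read the locally saved test page
response = urllib.request.urlopen('file:///F:/test.html')
html = response.read()
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'list hours responsive'})
rows = table.findAll('tr')
for tr in rows:
    # The header row has only <th> cells, so it yields no columns here
    cols = tr.findAll('td')
    for td in cols:
        try:
            # First text node inside the cell
            text = ''.join(td.find(text=True))
        except Exception:
            # Placeholder for cells that contain no text
            text = "000"
        print(text + ",")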

The HTML page I tested with was saved as test.html on the F: drive:

<html>
<body>
<table cellspacing="0" cellpadding="0" border="0" class="list hours responsive" width="100%">
<tbody>
    <tr>
     <th class="title"></th>
     <th class="units"></th>
     <th>Thu, 11/02</th>
     <th>Fri, 12/02</th>
     <th>Sat, 13/02</th>
     <th>Sun, 14/02</th>
     <th>Mon, 15/02</th>
     <th>Tue, 16/02</th>
     <th>Wed, 17/02</th>
    </tr>
    <tr class="no-border">
     <td class="title">
      00 - 01
     </td>
     <td>€/MWh</td>
     <td>23.82</td>
     <td>22.81</td>
     <td>22.23</td>
     <td>13.06</td>
     <td>16.57</td>
     <td>25.99</td>
     <td>32.45</td>
    </tr>
    <tr>
     <td>&nbsp;</td>
     <td>MWh</td>
     <td>10,266.0</td>
     <td>9,626.6</td>
     <td>12,255.9</td>
     <td>11,084.7</td>
     <td>11,039.5</td>
     <td>13,134.7</td>
     <td>9,958.1</td>
    </tr>
    <tr class="no-border">
     <td class="title">
      01 - 02
     </td>
     <td>€/MWh</td>
     <td>21.48</td>
     <td>21.59</td>
     <td>21.10</td>
     <td>12.17</td>
     <td>16.00</td>
     <td>23.65</td>
     <td>31.27</td>
    </tr>
    <tr>
     <td>&nbsp;</td>
     <td>MWh</td>
     <td>9,843.3</td>
     <td>9,494.4</td>
     <td>11,823.3</td>
     <td>10,531.9</td>
     <td>9,970.5</td>
     <td>12,875.6</td>
     <td>9,958.8</td>
    </tr>
    <tr class="no-border">
     <td class="title">
      02 - 03
     </td>
     <td>€/MWh</td>
     <td>21.00</td>
     <td>21.30</td>
     <td>20.21</td>
     <td>8.81</td>
     <td>14.55</td>
     <td>22.91</td>
     <td>29.72</td>
    </tr>
    <tr>
     <td>&nbsp;</td>
     <td>MWh</td>
     <td>9,857.0</td>
     <td>9,427.9</td>
     <td>11,755.2</td>
     <td>10,061.9</td>
     <td>9,881.7</td>
     <td>12,841.0</td>
     <td>9,896.9</td>
    </tr>
    <tr class="no-border">
     <td class="title">
      03 - 04
     </td>
     <td>€/MWh</td>
     <td>19.94</td>
     <td>19.86</td>
     <td>19.94</td>
     <td>6.74</td>
     <td>13.14</td>
     <td>22.04</td>
     <td>27.44</td>
    </tr>
    <tr>
     <td>&nbsp;</td>
     <td>MWh</td>
     <td>9,486.2</td>
     <td>10,492.7</td>
     <td>12,609.1</td>
     <td>11,216.6</td>
     <td>10,199.9</td>
     <td>11,209.7</td>
     <td>9,698.5</td>
    </tr>
</tbody>
</table>
</body>
</html>

The code produces the following output:

00 - 01, 
€/MWh, 
23.82, 
22.81, 
22.23, 
13.06, 
16.57, 
25.99, 
32.45, 
, 
MWh, 
10,266.0, 
9,626.6, 
12,255.9, 
11,084.7, 
11,039.5, 
13,134.7, 
9,958.1, 

01 - 02, 
€/MWh, 
21.48, 
21.59, 
21.10, 
12.17, 
16.00, 
23.65, 
31.27, 
, 
MWh, 
9,843.3, 
9,494.4, 
11,823.3, 
10,531.9, 
9,970.5, 
12,875.6, 
9,958.8, 

02 - 03, 
€/MWh, 
21.00, 
21.30, 
20.21, 
8.81, 
14.55, 
22.91, 
29.72, 
, 
MWh, 
9,857.0, 
9,427.9, 
11,755.2, 
10,061.9, 
9,881.7, 
12,841.0, 
9,896.9, 

03 - 04, 
€/MWh, 
19.94, 
19.86, 
19.94, 
6.74, 
13.14, 
22.04, 
27.44, 
, 
MWh, 
9,486.2, 
10,492.7, 
12,609.1, 
11,216.6, 
10,199.9, 
11,209.7, 
9,698.5, 
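
The output above still mixes the €/MWh and MWh rows and leaves out the dates, whereas the question asks for only the dates and the €/MWh values in a CSV file written with pandas. A minimal sketch of one way to get there is shown below. It assumes the table keeps the layout from the question (two empty header cells followed by the dates, and the unit in the second cell of every data row). The names `SAMPLE_HTML` and `prices_to_frame` are made up for this illustration, and the sample is a shortened copy of the table.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (two days, two hours).
SAMPLE_HTML = """
<table cellspacing="0" cellpadding="0" border="0" class="list hours responsive" width="100%">
<tbody>
    <tr>
     <th class="title"></th>
     <th class="units"></th>
     <th>Thu, 11/02</th>
     <th>Fri, 12/02</th>
    </tr>
    <tr class="no-border">
     <td class="title"> 00 - 01 </td>
     <td>€/MWh</td>
     <td>23.82</td>
     <td>22.81</td>
    </tr>
    <tr>
     <td>&nbsp;</td>
     <td>MWh</td>
     <td>10,266.0</td>
     <td>9,626.6</td>
    </tr>
    <tr class="no-border">
     <td class="title"> 01 - 02 </td>
     <td>€/MWh</td>
     <td>21.48</td>
     <td>21.59</td>
    </tr>
    <tr>
     <td>&nbsp;</td>
     <td>MWh</td>
     <td>9,843.3</td>
     <td>9,494.4</td>
    </tr>
</tbody>
</table>
"""


def prices_to_frame(html):
    """Build a DataFrame of €/MWh prices: one row per hour, one column per date."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="list" matches any table whose class attribute includes "list".
    table = soup.find("table", class_="list")
    rows = table.find_all("tr")

    # The first row holds the headers; its first two <th> cells are empty.
    dates = [th.get_text(strip=True) for th in rows[0].find_all("th")[2:]]

    hours, prices = [], []
    for tr in rows[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Keep only the price rows; the unit sits in the second cell.
        if len(cells) > 2 and cells[1] == "€/MWh":
            hours.append(cells[0])
            prices.append([float(value) for value in cells[2:]])

    return pd.DataFrame(prices, index=pd.Index(hours, name="hour"), columns=dates)


df = prices_to_frame(SAMPLE_HTML)
# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv()
print(csv_text)
```

To write a file instead of printing, pass a path, for example `df.to_csv("prices.csv")` (the file name is only a placeholder). Filtering on the unit cell rather than on the `no-border` class keeps the sketch independent of the page's styling, though either should work for the table shown.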
0

There is an encoding problem: you should encode your response before printing it.
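
One possible reading of this advice, sketched below: `response.read()` returns bytes, so decode them with the page's encoding before parsing or printing, and make sure the console can display characters such as `€`. UTF-8 is an assumption here, since the thread does not say which encoding the page uses, and the `raw` value is a stand-in for real response bytes.

```python
import sys

# Stand-in for the bytes that urllib's response.read() returns.
# UTF-8 is assumed; check the page's Content-Type header or <meta charset>.
raw = "<td>€/MWh</td>".encode("utf-8")

# Decode explicitly so the rest of the script works with str, not bytes.
text = raw.decode("utf-8")

# Some consoles cannot print "€" with their default encoding. On Python 3.7+
# the standard output stream can be switched to UTF-8 when it supports it.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

print(text)
```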

0

You can refer to this sample code:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests
from bs4 import BeautifulSoup

url = 'http://news.sina.com.cn/'
res = requests.get(url)
res.encoding = 'utf-8'  # This is the key line: decode the response as UTF-8
soup = BeautifulSoup(res.text, 'html.parser')
tags = soup.select('a')

for tag in tags:
    try:
        link = str(tag['href'])
        if link.startswith('http'):
            print(link)
        else:
            print(False)
    except KeyError:
        # This <a> tag has no href attribute
        print('null')