2014-01-09 53 views

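BeautifulSoup Python - HTML table data problem

I need to extract a value from an HTML table that is served by a web server as a txt file. The exact requirement is to extract the most recent reading (by time) into a variable.

I don't think the table is perfectly formatted.

Here is an example of the table's HTML...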

<table border="1" rules="all"> 
<col /> 
<col /> 
    <col align="char" char="." /> 
    <col align="char" char="." /> 
    <col /> 
    <col /> 
    <col align="char" char="m" /> 
    <col align="char" char="m" /> 
    <col align="char" char="." /> 
    <col align="char" char="," /> 
    <tr> 
    <th colspan="2" rowspan="2">Date &amp; time</th> 
    <th rowspan="2">Temp</th> 
    <th rowspan="2">Feels like</th> 
    <th rowspan="2">Humidity</th> 
    <th colspan="3">Wind</th> 
    <th rowspan="2">Rain</th> 
    <th rowspan="2">Pressure</th> 
    </tr> 
    <tr> 
    <th>dir</th> 
    <th>ave</th> 
    <th>gust</th> 
    </tr> 
    <tr> 
    <td>2014/01/08</td> 
    <td>1056 GMT</td> 
    <td>11.0 &deg;C</td> 
    <td>9.8 &deg;C</td> 
    <td>74%</td> 
    <td>NNW</td> 
    <td>1 mph</td> 
    <td>6 mph</td> 
    <td>0.3 mm</td> 
    <td>1032.4 hPa, rising</td> 
    </tr> 
    <tr> 
    <td></td> 
    <td>1159 GMT</td> 
    <td>10.8 &deg;C</td> 
    <td>9.7 &deg;C</td> 
    <td>74%</td> 
    <td>SSE</td> 
    <td>1 mph</td> 
    <td>4 mph</td> 
    <td>0.0 mm</td> 
    <td>1032.0 hPa, rising slowly</td> 
    </tr> 
    <tr> 
    <td></td> 
    <td>1258 GMT</td> 
    <td>11.0 &deg;C</td> 
    <td>9.9 &deg;C</td> 
    <td>73%</td> 
    <td>SSE</td> 
    <td>1 mph</td> 
    <td>4 mph</td> 
    <td>0.0 mm</td> 
    <td>1031.5 hPa, falling slowly</td> 
    </tr> 
    <tr> 
    <td></td> 
    <td>1357 GMT</td> 
    <td>10.8 &deg;C</td> 
    <td>9.7 &deg;C</td> 
    <td>75%</td> 
    <td>SSW</td> 
    <td>1 mph</td> 
    <td>4 mph</td> 
    <td>0.0 mm</td> 
    <td>1030.7 hPa, falling</td> 
    </tr> 
    <tr> 
    <td></td> 
    <td>1456 GMT</td> 
    <td>10.3 &deg;C</td> 
    <td>9.3 &deg;C</td> 
    <td>77%</td> 
    <td>ENE</td> 
    <td>1 mph</td> 
    <td>4 mph</td> 
    <td>0.0 mm</td> 
    <td>1030.0 hPa, falling</td> 
    </tr> 
    <tr> 
    <td></td> 
    <td>1600 GMT</td> 
    <td>9.7 &deg;C</td> 
    <td>8.7 &deg;C</td> 
    <td>81%</td> 
    <td>WNW</td> 
    <td>1 mph</td> 
    <td>3 mph</td> 
    <td>0.0 mm</td> 
    <td>1028.7 hPa, falling</td> 
    </tr> 
    <tr> 
    <td></td> 
    <td>1658 GMT</td> 
    <td>8.9 &deg;C</td> 
    <td>7.9 &deg;C</td> 
    <td>86%</td> 
    <td>NNE</td> 
    <td>1 mph</td> 
    <td>4 mph</td> 
    <td>0.0 mm</td> 
    <td>1026.9 hPa, falling quickly</td> 
    </tr> 
</table> 

I have the following Python code, which gets all the data into rows:

#!/usr/bin/python 
from BeautifulSoup import BeautifulSoup 
import urllib2 
data = "http://****************/weather_station/data/6hrs.txt" 
req = urllib2.Request(data) 
page = urllib2.urlopen(req) 
soup = BeautifulSoup(page) 

table = soup.find('table') 
for row in table.findAll('tr'):
    col = row.findAll('td')
    # time = col[0].string
    # temp = col[1].string

print col 

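This is where I am stuck. time = col[0].string raises a "list index out of range" error, which means the list is empty, yet if I print col it shows the data I want to extract.

Any suggestions?

The answer below works very well for that table. I would also like to get the same data from a table like this...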

<table border="1" rules="rows" cellspacing="0" cellpadding="5"> 
    <col /> 
    <col /> 
    <col align="char" char="." /> 
    <col align="char" char="." /> 
    <col /> 
    <col /> 
    <col align="char" char="m" /> 
    <col align="char" char="m" /> 
    <col align="char" char="." /> 
    <col align="char" char="," /> 
    <tr> 
    <th rowspan="2">Time</th> 
    <th rowspan="2">Temp</th> 
    <th rowspan="2">Feels like</th> 
    <th rowspan="2">Humidity</th> 
    <th colspan="3">Wind</th> 
    <th rowspan="2">Rain</th> 
    <th rowspan="2">Pressure</th> 
    </tr> 
    <tr> 
    <th>dir</th> 
    <th>ave</th> 
    <th>gust</th> 
    </tr> 
<tr> 
<td>12:45 <small>GMT:</small></td> 
<td>8.8<small>C</small></td> 
<td>7.1 <small>&deg;C</small></td> 
<td>66<small>%</small></td> 
<td>W </td> 
<td>1 <small>mph</small></td> 
<td>2 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>12:40 <small>GMT:</small></td> 
<td>8.9<small>C</small></td> 
<td>6.9 <small>&deg;C</small></td> 
<td>66<small>%</small></td> 
<td>SE </td> 
<td>2 <small>mph</small></td> 
<td>4 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>12:34 <small>GMT:</small></td> 
<td>8.8<small>C</small></td> 
<td>6.3 <small>&deg;C</small></td> 
<td>66<small>%</small></td> 
<td>NE </td> 
<td>3 <small>mph</small></td> 
<td>7 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>12:29 <small>GMT:</small></td> 
<td>9.0<small>C</small></td> 
<td>6.4 <small>&deg;C</small></td> 
<td>64<small>%</small></td> 
<td>NW </td> 
<td>3 <small>mph</small></td> 
<td>6 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>12:24 <small>GMT:</small></td> 
<td>9.6<small>C</small></td> 
<td>7.4 <small>&deg;C</small></td> 
<td>63<small>%</small></td> 
<td>S </td> 
<td>2 <small>mph</small></td> 
<td>5 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>12:19 <small>GMT:</small></td> 
<td>10.1<small>C</small></td> 
<td>7.4 <small>&deg;C</small></td> 
<td>61<small>%</small></td> 
<td>SW </td> 
<td>4 <small>mph</small></td> 
<td>6 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>12:14 <small>GMT:</small></td> 
<td>10.8<small>C</small></td> 
<td>8.9 <small>&deg;C</small></td> 
<td>61<small>%</small></td> 
<td>SE </td> 
<td>2 <small>mph</small></td> 
<td>2 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>12:09 <small>GMT:</small></td> 
<td>10.7<small>C</small></td> 
<td>8.8 <small>&deg;C</small></td> 
<td>61<small>%</small></td> 
<td>N </td> 
<td>2 <small>mph</small></td> 
<td>3 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>12:04 <small>GMT:</small></td> 
<td>10.3<small>C</small></td> 
<td>8.5 <small>&deg;C</small></td> 
<td>64<small>%</small></td> 
<td>NE </td> 
<td>2 <small>mph</small></td> 
<td>3 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>11:58 <small>GMT:</small></td> 
<td>9.3<small>C</small></td> 
<td>7.6 <small>&deg;C</small></td> 
<td>65<small>%</small></td> 
<td>N </td> 
<td>1 <small>mph</small></td> 
<td>2 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>11:53 <small>GMT:</small></td> 
<td>9.3<small>C</small></td> 
<td>7.8 <small>&deg;C</small></td> 
<td>65<small>%</small></td> 
<td>W </td> 
<td>0 <small>mph</small></td> 
<td>2 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1022 <small>hPa</small></td> 
</tr> 
<tr> 
<td>11:48 <small>GMT:</small></td> 
<td>8.8<small>C</small></td> 
<td>7.1 <small>&deg;C</small></td> 
<td>66<small>%</small></td> 
<td>W </td> 
<td>1 <small>mph</small></td> 
<td>2 <small>mph</small></td> 
<td>0.0 <small>mm</small></td> 
<td>1021 <small>hPa</small></td> 
</tr> 
</table> 

Using the same code, as follows:

table = soup.find('table') 
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')
    if len(col) >= 2:
        time = col[0].string
        temp = col[1].string
print time 
print temp 

time and temp both come back as "None".

If I print col, all the values are there. Why doesn't len(col) >= 2 work for this data?

Answer


Your code crashes because you are trying to get TDs from this TR, which contains only TH cells:

<tr> 
<th colspan="2" rowspan="2">Date &amp; time</th> 
<th rowspan="2">Temp</th> 
<th rowspan="2">Feels like</th> 
<th rowspan="2">Humidity</th> 
<th colspan="3">Wind</th> 
<th rowspan="2">Rain</th> 
<th rowspan="2">Pressure</th> 
</tr> 

Just add something like this:

col = row.findAll('td') 
if len(col) >= 2: 
    time = col[0].string 
    temp = col[1].string 
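
To make the check above concrete, here is a minimal sketch of the whole loop, assuming Python 3 and the `bs4` package (BeautifulSoup 4) rather than the BeautifulSoup 3 / `urllib2` setup in the question. The `HTML` sample is a trimmed copy of the first table, and `last_reading` is a hypothetical helper name, not something from the thread. Note that in the first table column 0 is the date, column 1 the time, and column 2 the temperature, so the indices differ from the snippet above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first table in the question: two header rows, then data.
HTML = """
<table border="1" rules="all">
  <tr>
    <th colspan="2" rowspan="2">Date &amp; time</th>
    <th rowspan="2">Temp</th>
  </tr>
  <tr><th>dir</th><th>ave</th></tr>
  <tr><td>2014/01/08</td><td>1056 GMT</td><td>11.0 &deg;C</td></tr>
  <tr><td></td><td>1658 GMT</td><td>8.9 &deg;C</td></tr>
</table>
"""


def last_reading(html):
    """Return (time, temp) from the last row that holds <td> cells, or None."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    latest = None
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            # Header rows contain only <th>, so find_all("td") is empty there.
            continue
        # Column 1 is the time and column 2 the temperature in this table.
        latest = (cells[1].get_text(strip=True), cells[2].get_text(strip=True))
    return latest


reading_time, reading_temp = last_reading(HTML)
print(reading_time)
print(reading_temp)
```

Because the rows are in chronological order, overwriting `latest` on every data row leaves the most recent reading in the variable once the loop ends.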

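Thanks very much, that works for the table above. I have added the following to show only the temperature value, without the unit: a = re.sub(r'[^0-9\-\d.]', '', temp) print time print a – user3176960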
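
On the second table, where time and temp come back as None: a likely cause is that each cell there holds a text node plus a `<small>` child, and BeautifulSoup's `.string` returns None when a tag has more than one child. The sketch below assumes Python 3 and `bs4`; `ROW` is a trimmed copy of one row from that table, and `cell_number` and `NUMBER` are hypothetical names used only for illustration. It uses `get_text()` to collect the text of a cell and its children, then a regular expression to pull out the numeric part, which is an alternative to the `re.sub` approach in the comment.

```python
import re

from bs4 import BeautifulSoup

# One data row from the second table in the question.
ROW = """
<table><tr>
<td>12:45 <small>GMT:</small></td>
<td>8.8<small>C</small></td>
</tr></table>
"""

# Optional minus sign, digits, optional decimal part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def cell_number(cell):
    """Return the first number found in a cell's text as a float, or None."""
    match = NUMBER.search(cell.get_text())
    return float(match.group()) if match else None


soup = BeautifulSoup(ROW, "html.parser")
cells = soup.find("tr").find_all("td")

print(cells[0].string)                      # None: text node plus a <small> child
print(cells[0].get_text(" ", strip=True))   # text of the cell and its children
print(cell_number(cells[1]))
```

The `len(col) >= 2` check itself still passes for these rows; it is the `.string` lookup on cells with nested tags that yields None.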