2017-08-02

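Extracting the text of an img alt tag using Beautiful Soup

I am able to successfully extract data from a website, except for one field whose value is held in an img alt attribute. Here is the code: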

#import pandas as pd 
import re 
from urllib2 import urlopen 
from bs4 import BeautifulSoup 

# gets a file-like object using urllib2.urlopen 
url = 'http://ecal.forexpros.com/e_cal.php?duration=daily' 
html = urlopen(url) 

soup = BeautifulSoup(html) 

# loops over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr' 
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}): 
    # finds desired data by looking up <td> elements with class names 
    event = tr.find('td', {'class': 'ec_td_event'}).text 
    currency = tr.find('td', {'class': 'ec_td_currency'}).text 
    actual = tr.find('td', {'class': 'ec_td_actual'}).text 
    forecast = tr.find('td', {'class': 'ec_td_forecast'}).text 
    previous = tr.find('td', {'class': 'ec_td_previous'}).text 
    time = tr.find('td', {'class': 'ec_td_time'}).text 
    importance = tr.find('td', {'class': 'ec_td_importance'}).text 

    # the returned strings are unicode, so to print them we need a unicode string 
    print u'{:3}\t{}\t{:5}\t{:8}\t{:8}\t{:8}\t{}'.format(currency, importance, time, actual, forecast, previous, event) 

The first few records of the output are as follows:

JPY  01:00 43.8  43.6  43.3  Household Confidence 
CHF  01:45 -3   -3   -8   SECO Consumer Climate 
RON  02:00 2.50%     3.30%  PPI (YoY) 
EUR  03:00 -26.9K  -66.5K  -98.3K  Spanish Unemployment Change 
CHF  03:15 1.5%  1.3%  -0.8%  Retail Sales (YoY) 
CHF  03:30 60.9  58.9  60.1  SVME PMI 
GBP  04:30 51.9  54.5  54.8  Construction PMI 

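The importance field does not appear in the output above (presumably because the data is contained in the alt attribute of an img element rather than in the cell's text).

Does anyone know how to solve this?

Thanks!

Edit:

The problem was solved by replacing: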

importance = tr.find('td', {'class': 'ec_td_importance'}).text 

with:

importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt') 

Answer

Replace your importance line with this:

importance = tr.find('td', {'class': 'ec_td_importance'}).img['alt']
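
For anyone who wants to try this without hitting the live page, here is a minimal sketch of the same idea. The HTML snippet, the alt value "Medium", and the helper name get_importance are made up for illustration; the real page's markup may differ. It shows that .text on a cell holding only an img is empty, and that img.get('alt') returns None for a missing attribute where img['alt'] would raise a KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample rows; the real page's markup may differ.
SAMPLE_HTML = """
<table>
  <tr class="ec_bg1_tr">
    <td class="ec_td_currency">JPY</td>
    <td class="ec_td_importance"><img src="bull2.gif" alt="Medium"></td>
  </tr>
  <tr class="ec_bg2_tr">
    <td class="ec_td_currency">CHF</td>
    <td class="ec_td_importance"></td>
  </tr>
</table>
"""


def get_importance(tr):
    """Return the alt text of the <img> in the importance cell, or '' if absent."""
    td = tr.find('td', {'class': 'ec_td_importance'})
    if td is None or td.img is None:
        return ''
    # .get('alt') returns None when the attribute is missing;
    # td.img['alt'] would raise KeyError in that case.
    return td.img.get('alt') or ''


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = soup.find_all('tr')

# The cell contains only an <img>, so its text content is empty.
first_cell_text = rows[0].find('td', {'class': 'ec_td_importance'}).text

importances = [get_importance(tr) for tr in rows]
print(repr(first_cell_text))
print(importances)
```

The None checks are optional if every row is guaranteed to have an image in that cell, in which case the one-liner above is enough.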