2017-08-06 143 views

How to extract data from tags using Beautiful Soup

I am trying to retrieve data from a website. My code is as follows:

import re 
from urllib2 import urlopen 
from bs4 import BeautifulSoup 

# gets a file-like object using urllib2.urlopen 
url = 'http://ecal.forexpros.com/e_cal.php?duration=weekly' 
html = urlopen(url) 

soup = BeautifulSoup(html) 

# loops over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr' 
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}): 
    # finds desired data by looking up <td> elements with class names 

    event = tr.find('td', {'class': 'ec_td_event'}).text 
    currency = tr.find('td', {'class': 'ec_td_currency'}).text 
    actual = tr.find('td', {'class': 'ec_td_actual'}).text 
    forecast = tr.find('td', {'class': 'ec_td_forecast'}).text 
    previous = tr.find('td', {'class': 'ec_td_previous'}).text 
    time = tr.find('td', {'class': 'ec_td_time'}).text 
    importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt') 

    # the returned strings are unicode, so to print them we need to use a unicode string 
    if importance == 'High': 
     print(u'\t{:5}\t{}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, importance, currency, event, actual, forecast, previous)) 

The first few records in the result set look like this:

05:00 High EUR CPI (YoY)         1.3%  1.3%  1.3%  
10:00 High USD Pending Home Sales (MoM)     1.5%  0.7%  -0.7% 
21:45 High CNY Caixin Manufacturing PMI     51.1  50.4  50.4  
00:30 High AUD RBA Interest Rate Decision     1.50%  1.50%  1.50% 
00:30 High AUD RBA Rate Statement               
03:55 High EUR German Manufacturing PMI     58.1  58.3  58.3  
03:55 High EUR German Unemployment Change     -9K   -5K   6K  

I would now like to retrieve similar data from the following website:

https://www.fxstreet.com/economic-calendar

To do so, I modified the code above as follows:

import re 
from urllib2 import urlopen 
from bs4 import BeautifulSoup 

# gets a file-like object using urllib2.urlopen 
url = 'https://www.fxstreet.com/economic-calendar' 
html = urlopen(url) 

soup = BeautifulSoup(html) 


for tr in soup.find_all('tr', {'class': re.compile('fxst-tr-event fxst-oddRow fxit-eventrow fxst-evenRow ')}): 
    # finds desired data by looking up <div> elements with class names 

    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text 
    currency = tr.find('div', {'class': 'fxit-event-name'}).text 
    actual = tr.find('div', {'class': ' fxit-actual'}).text 
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text 
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text 
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text 
# importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt') 

    # the returned strings are unicode, so to print them we need to use a unicode string 
    if importance == 'High': 
     print(u'\t{:5}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, currency, event, actual, forecast, previous)) 

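This code does not return any results (presumably because I am referencing incorrect tags and/or classes). Does anyone see where my mistake is?

Thanks!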


I took a look at the website, and there is no _class_ named 'fxst-tr-event fxst-oddRow fxit-eventrow fxst-evenRow' – ksai
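The point in this comment can be checked offline. The following is a minimal sketch, assuming a made-up HTML fragment (the `SAMPLE_HTML` string and its row contents are illustrative and not copied from fxstreet.com). Beautiful Soup treats `class` as a multi-valued attribute, so a pattern that spells out four class names in a row, plus a trailing space, finds nothing, while a list of class names matches any row that carries at least one of them.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr class="fxst-tr-event fxst-oddRow"><td>row A</td></tr>
  <tr class="fxst-tr-event fxst-evenRow"><td>row B</td></tr>
  <tr class="unrelated"><td>row C</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The pattern from the question: no row carries this exact sequence of classes.
pattern = re.compile('fxst-tr-event fxst-oddRow fxit-eventrow fxst-evenRow ')
regex_rows = soup.find_all('tr', {'class': pattern})

# A list matches a row if any one of the listed classes is present.
list_rows = soup.find_all('tr', {'class': ['fxst-oddRow', 'fxst-evenRow']})

print(len(regex_rows), len(list_rows))
```

A row cannot be both `fxst-oddRow` and `fxst-evenRow`, which is one reason the combined pattern can never match.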

Answers

1

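You should use selenium + Chromedriver/PhantomJS, because the content is created dynamically by JavaScript, which urllib2 does not handle. I don't think a regex makes much sense here; you can use the lxml parser and pass the classes as a list, which matches a row carrying any one of them. Here is an example using the tools mentioned: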

from bs4 import BeautifulSoup 
from selenium import webdriver 

url = 'https://www.fxstreet.com/economic-calendar' 

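driver = webdriver.Chrome() 
driver.get(url) 
html = driver.page_source 
driver.quit()  # close the browser once the rendered HTML has been captured 
soup = BeautifulSoup(html, 'lxml') 

# a list of classes matches any <tr> that carries at least one of them 
for tr in soup.find_all('tr', {'class': ['fxst-tr-event', 'fxst-oddRow', 'fxit-eventrow', 'fxst-evenRow', 'fxs_cal_nextEvent']}): 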
    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text 
    currency = tr.find('div', {'class': 'fxit-event-name'}).text 
    actual = tr.find('div', {'class': 'fxit-actual'}).text 
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text 
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text 
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text 

    print(time, currency, event, actual, forecast, previous) 

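lxml is a separate library; you can also handle multiple classes with the standard html.parser, but in my opinion it is not as intuitive. This code prints: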

14:00 
CAD          14:00 None 59.2 
61.6          
14:00 
CAD          14:00 52.9 
63.9          
17:00 
USD          17:00 765 
... 
... 

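Since I don't really know what you want them to be, I did not change any of the variables, so further adjustment and formatting of the output would be desirable.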
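One adjustment that may help: `tr.find(...)` returns `None` when a row has no matching `<div>`, and calling `.text` on `None` raises `AttributeError`. Below is a minimal sketch of a guard, assuming a hypothetical helper named `cell_text` and a made-up sample row (neither comes from the original answer or from the real page).

```python
from bs4 import BeautifulSoup


def cell_text(row, class_name, default=''):
    """Return the stripped text of the first <div> with class_name, or default."""
    div = row.find('div', class_=class_name)
    return div.get_text(strip=True) if div is not None else default


# Hypothetical row for illustration; it deliberately lacks a consensus cell.
SAMPLE_ROW = """
<table><tr class="fxst-tr-event">
  <td><div class="fxit-event-name"> CPI (YoY) </div></td>
  <td><div class="fxit-actual">1.3%</div></td>
</tr></table>
"""

row = BeautifulSoup(SAMPLE_ROW, 'html.parser').find('tr')
name = cell_text(row, 'fxit-event-name')
actual = cell_text(row, 'fxit-actual')
consensus = cell_text(row, 'fxit-consensus', default='n/a')

print(name, actual, consensus)
```

Passing a single class name to `class_` matches any element that carries that class, so it keeps working if the site adds further classes to the same `<div>`.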


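Thanks. I tried to modify your code to include 'expected volatility' by inserting `volatility = tr.find('div', {'class': 'fxit-eventInfo-vol-c fxit-event-info-desktop'}).text` as the last variable in the for loop. It doesn't seem to work. Any idea why? – equanimity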


It works for me: I get a bunch of 1s and 2s. What would the expected output be? –


The expected output is: 1 = "Low expected volatility", 2 = "Moderate expected volatility", and 3 = "High expected volatility" – equanimity
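The mapping described here could be applied after scraping. This is a minimal sketch, assuming the scraped `.text` is a digit string such as '1' or '2' (as the reply above reports); the names `VOLATILITY_LABELS` and `describe_volatility` are hypothetical.

```python
# Labels taken from the comment above; all names here are illustrative.
VOLATILITY_LABELS = {
    '1': 'Low expected volatility',
    '2': 'Moderate expected volatility',
    '3': 'High expected volatility',
}


def describe_volatility(raw_text):
    """Map the scraped digit (possibly padded with whitespace) to a label."""
    return VOLATILITY_LABELS.get(raw_text.strip(), 'Unknown volatility')


labels = [describe_volatility(value) for value in ['1', ' 2\n', '3', '']]
print(labels)
```

Inside the loop this would be called as `describe_volatility(volatility)`; unexpected values fall back to 'Unknown volatility' rather than raising `KeyError`.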