2014-02-15 42 views

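Python web scraper and parent name problem

I am trying to retrieve the contents of the div elements with class="ipo-cell-height": the dates, such as 2/21/2014, and the company names, such as Sundance Energy Australia. Here is the link to the website: http://www.nasdaq.com/markets/ipos/ and here is the HTML. This block of code is contained in the second div with class="genTable thin floatL" style="width:315px".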

<div class="genTable thin floatL" style="width:315px"> 
       <h3 class="table-headtag">Upcoming IPOs</h3> 
       <table><tbody> 
        <tr> 
         <td><div class="ipo-cell-height">2/21/2014</div></td> 
         <td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_0" href="http://www.nasdaq.com/markets/ipos/company/sundance-energy-australia-ltd-672724-74237">SUNDANCE ENERGY AUSTRALIA LTD</a></div></td> 
        </tr> 

        <tr> 
         <td><div class="ipo-cell-height">2/14/2014</div></td> 
         <td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_1" href="http://www.nasdaq.com/markets/ipos/company/inogen-inc-639597-74090">INOGEN INC</a></div></td> 
        </tr> 

        <tr> 
         <td><div class="ipo-cell-height">2/14/2014</div></td> 
         <td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_2" href="http://www.nasdaq.com/markets/ipos/company/semler-scientific-inc-920476-73980">SEMLER SCIENTIFIC, INC.</a></div></td> 
        </tr> 

        <tr> 
         <td><div class="ipo-cell-height">10/9/2013</div></td> 
         <td><div class="ipo-cell-height"><a id="two_column_main_content_rpt_expected_company_3" href="http://www.nasdaq.com/markets/ipos/company/sfx-entertainment-inc-885264-73081">SFX ENTERTAINMENT, INC</a></div></td> 
        </tr> 
       </tbody></table> 

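The code I am using relies on BeautifulSoup, and I think it needs something involving parent.name or .contents. As written, it only prints the contents of the first 10 td elements. I thought I could get something that uses the div class as parent.name, but the "tbody" line doesn't work.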

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.nasdaq.com/markets/ipos/")
soup = BeautifulSoup(html)
for data in soup.find_all('td')[0:10]:
    if data.parent.name == "tr":
    # if data.parent.name == "tbody":  # This line makes it not print anything
        print(data.text)
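
A note on the commented-out check: in the HTML above every td sits directly inside a tr, so data.parent.name is always "tr" and can never be "tbody"; the tbody is one level further up. Here is a minimal sketch of that on an inline copy of one row. The `sample` string is an assumption standing in for the live page, and "html.parser" is passed explicitly so the example does not depend on which parser is installed.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the live page (assumption: same nesting as the HTML above).
sample = """
<table><tbody>
 <tr>
  <td><div class="ipo-cell-height">2/21/2014</div></td>
  <td><div class="ipo-cell-height"><a href="#">SUNDANCE ENERGY AUSTRALIA LTD</a></div></td>
 </tr>
</tbody></table>
"""

soup = BeautifulSoup(sample, "html.parser")
td = soup.find("td")

parent_name = td.parent.name              # direct parent of a td
grandparent_name = td.parent.parent.name  # one level further up
print(parent_name, grandparent_name)

# To filter on the tbody, look through the ancestors instead of
# testing only the direct parent.
cells = [cell.get_text(strip=True)
         for cell in soup.find_all("td")
         if cell.find_parent("tbody") is not None]
print(cells)
```

Whether a tbody shows up at all depends on the markup the server sends; browsers add one to the DOM even when the source has none, so a check on "tr" or on the div class is probably the more robust filter.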

Answers


You can build a list of the divs based on their CSS class. Note that this uses requests and BeautifulSoup 3, though:

import requests 
from BeautifulSoup import BeautifulSoup 

req = requests.get('http://nasdaq.com/markets/ipos') 
soup = BeautifulSoup(req.content) 

ipo_divs = soup.findAll('div', {'class':'genTable thin floatL'})[0] 
c = ipo_divs.findAll('div', {'class':'ipo-cell-height'}) 

ipos = {c[i].text:c[i + 1].text for i in xrange(0, len(c) - 1, 2)} 
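
One caveat with keying the dictionary by date: two rows in the question's HTML share the date 2/14/2014, so the later one would replace the earlier one. Below is a small Python 3 sketch of the difference. The `texts` list is hypothetical and stands in for `[d.text for d in c]`, and pairing with `zip` is a suggested alternative, not part of the answer above.

```python
# Hypothetical cell texts, in document order, standing in for [d.text for d in c].
texts = [
    "2/21/2014", "SUNDANCE ENERGY AUSTRALIA LTD",
    "2/14/2014", "INOGEN INC",
    "2/14/2014", "SEMLER SCIENTIFIC, INC.",
    "10/9/2013", "SFX ENTERTAINMENT, INC",
]

# Keyed by date: the second 2/14/2014 row replaces the first.
by_date = {texts[i]: texts[i + 1] for i in range(0, len(texts) - 1, 2)}

# As (date, company) pairs: every row survives.
pairs = list(zip(texts[0::2], texts[1::2]))

print(len(by_date), len(pairs))
```

If a mapping is still wanted, keying by company name (or mapping each date to a list of companies) avoids the collision.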

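One approach is to iterate over all <div> elements whose class attribute is ipo-cell-height, check with a regular expression whether the text looks like a date, and then find the next <div> element and print the text of both.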

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all('div', attrs={'class': 'ipo-cell-height'}):
    s = div.string
    # div.string is None when the div has more than one child, so guard before matching
    if s and re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        # the company name sits in the next div after the date
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))
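
The pattern used in the loop can be checked on its own, without hitting the site. This is a small sketch with made-up sample strings (only the first three come from the page): re.match anchors at the start of the string, and the trailing $ is what rejects anything after the year.

```python
import re

DATE_RE = re.compile(r'\d{1,2}/\d{1,2}/\d{4}$')

# The last two strings are invented to show what the pattern rejects.
samples = ["2/21/2014", "10/9/2013", "INOGEN INC", "2/21/2014 (expected)", "2/21/14"]

matches = [s for s in samples if DATE_RE.match(s)]
print(matches)
```

Note that this only checks the shape of the text, not whether it is a real calendar date; something like 99/99/2014 would pass as well.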

Run it like:

python3 script.py 

That yields:

2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD 
2/14/2014 - INOGEN INC 
2/14/2014 - SEMLER SCIENTIFIC, INC. 
10/9/2013 - SFX ENTERTAINMENT, INC 
2/13/2014 - IIM GLOBAL CORP 
2/12/2014 - Q2 HOLDINGS, INC. 
2/12/2014 - RIMINI STREET, INC. 
2/12/2014 - MARY FEED & SUPPLIES, INC. 
2/11/2014 - 21ST CENTURY ONCOLOGY HOLDINGS, INC. 
2/3/2014 - GRASSMERE ACQUISITION CORP 
1/31/2014 - APTALIS HOLDINGS INC. 
1/27/2014 - UNITED STATES CURRENCY FUNDS TRUST 
1/22/2014 - CHRYSLER GROUP LLC 
1/10/2014 - GCT SEMICONDUCTOR INC
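
The output above has more rows than the four in the "Upcoming IPOs" block from the question, presumably because other tables on the page use the same ipo-cell-height class. If only that block is wanted, one option is to find its heading first and search inside the enclosing div. The following is a sketch under assumptions: `sample` is an inline stand-in for the page, its second block and heading are invented, and `upcoming_ipos` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the live page; the second block is made up for illustration.
sample = """
<div class="genTable thin floatL">
 <h3 class="table-headtag">Upcoming IPOs</h3>
 <table><tbody>
  <tr><td><div class="ipo-cell-height">2/21/2014</div></td>
      <td><div class="ipo-cell-height"><a href="#">SUNDANCE ENERGY AUSTRALIA LTD</a></div></td></tr>
  <tr><td><div class="ipo-cell-height">2/14/2014</div></td>
      <td><div class="ipo-cell-height"><a href="#">INOGEN INC</a></div></td></tr>
 </tbody></table>
</div>
<div class="genTable thin floatL">
 <h3 class="table-headtag">Some other table</h3>
 <table><tbody>
  <tr><td><div class="ipo-cell-height">2/13/2014</div></td>
      <td><div class="ipo-cell-height"><a href="#">IIM GLOBAL CORP</a></div></td></tr>
 </tbody></table>
</div>
"""

def upcoming_ipos(html):
    """Return (date, company) pairs from the block headed 'Upcoming IPOs'."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the h3 whose text is exactly the heading we are after.
    heading = soup.find(
        lambda tag: tag.name == "h3"
        and tag.get_text(strip=True) == "Upcoming IPOs")
    if heading is None:
        return []
    block = heading.find_parent("div")  # the enclosing genTable div
    rows = []
    for tr in block.find_all("tr"):
        cells = [div.get_text(strip=True)
                 for div in tr.find_all("div", class_="ipo-cell-height")]
        if len(cells) == 2:  # skip rows that are not date/company pairs
            rows.append(tuple(cells))
    return rows

rows = upcoming_ipos(sample)
for date, company in rows:
    print("{} - {}".format(date, company))
```

Working row by row also keeps each date tied to the company in the same tr, rather than relying on the order of the divs.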