0
我使用Python 3.3的工作同級車中第一個div標籤的內容獲取與本網站: http://www.nasdaq.com/markets/ipos/的Python webscraping和
我的目標是隻讀是在即將到來的IPO的公司。它在div標籤中,div class =「genTable thin floatL」這個類有兩個,目標數據是第一個。
這裏是我的代碼
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'}) [0]: # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
s = div.string
if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
div_next = div.find_next('div')
print('{} - {}'.format(s, div_next.string))
我想它僅返回
3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.
但它打印所有的div類與re.match規範和多次爲好。我嘗試在divparent循環中插入[0]來只檢索第一個,但是這會導致重複問題。
編輯:這是根據warunsl解決方案更新的代碼。這工作。
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table= divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
s = div.string
if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
div_next = div.find_next('div')
print('{} - {}'.format(s, div_next.string))
這工作得很好。謝謝! – user2859603