Scraping HTML table data with BeautifulSoup4

I am trying to scrape the "Market Cap" value from the summary page on Yahoo Finance.

The HTML, as shown in Chrome's inspector, looks like this:

My code is:

from urllib.request import urlopen
from bs4 import BeautifulSoup

sp500short = ['a', 'aa', 'aapl', 'abbv', 'abc', 'abt', 'aci', 'acn', 'act', 'adbe', 'adi', 'adm', 'adp']
dowJones = ['mmm', 'axp', 'aapl', 'ba', 'cat', 'cvx', 'csco', 'ko', 'dd', 'xom', 'ge', 'gs', 'hd', 'intc', 'ibm', 'jpm', 'jnj', 'mcd', 'mrk', 'msft', 'nke', 'pfe', 'pg', 'trv', 'utx', 'unh', 'vz', 'v', 'wmt', 'dis']


def stockScreener():

    for ticker in sp500short:
        searchSummary = "http://finance.yahoo.com/q?s="+ticker
        summary = urlopen(searchSummary)
        summaryHtml = summary.read()
        summarySoup = BeautifulSoup(summaryHtml, "html.parser")

        try:
            marketCap = summarySoup.find("th scope", text="Market Cap:").find_next_sibling("td").text

        except:
            marketCap = "There is no data for this company"

        if marketCap == "There is no data for this company":
            print(ticker+" "+marketCap)
        else:
            output = marketCap[:-1]
            print(ticker + str(output))

stockScreener()

What is wrong with my .find() call?

Source

2016-02-23 Philip McQuitty

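You are very close. You just need to remove scope from this line:

marketCap = summarySoup.find("th scope", text="Market Cap:").find_next_sibling("td").text

It should look like this:

marketCap = summarySoup.find("th", text="Market Cap:").find_next_sibling("td").text

scope is an attribute of the <th> tag you are trying to find, not part of the tag name itself. The first argument to find() is compared against tag names only, so "th scope" never matches anything, find() returns None, and the chained call raises an exception that the bare except then hides.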
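The difference can be checked without hitting the network. The following is a minimal sketch: SAMPLE_HTML is hypothetical markup standing in for the real page (which is not reproduced here), and extract_market_cap is an illustrative helper name, not something from the original code. It uses string=, the newer name for the text= argument in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption): the real Yahoo Finance page is not shown here.
SAMPLE_HTML = """
<table>
  <tr><th scope="row">Market Cap:</th><td>150.5B</td></tr>
  <tr><th scope="row">P/E Ratio:</th><td>12.3</td></tr>
</table>
"""


def extract_market_cap(html):
    """Return the text of the cell next to the 'Market Cap:' header, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on the tag name only; the scope attribute is not part of the name.
    header = soup.find("th", string="Market Cap:")
    if header is None:
        return None
    cell = header.find_next_sibling("td")
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# No tag is named "th scope", so this lookup finds nothing.
print(soup.find("th scope", string="Market Cap:"))  # None

# To filter on the attribute as well, pass it separately from the tag name.
print(soup.find("th", attrs={"scope": "row"}, string="Market Cap:"))

print(extract_market_cap(SAMPLE_HTML))  # 150.5B
```

Checking for None explicitly, rather than wrapping the whole chain in a bare except, keeps a wrong selector from being mistaken for missing data.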

Source

2016-02-23 21:12:33 wpercy
