Scraping Yahoo Finance Income Statement with Python

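I am trying to scrape data from Yahoo Finance income statements using Python. Specifically, let's say I want the most recent figure of Net Income of Apple.

The data is structured in a bunch of nested HTML tables. I am using the requests module to access the page and retrieve the HTML.

I am using BeautifulSoup 4 to sift through the HTML structure, but I can't figure out how to get the figure.

Here is a screenshot of the analysis in Firefox.

My code so far: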

from bs4 import BeautifulSoup
import requests

myurl = "https://finance.yahoo.com/q/is?s=AAPL&annual"
html = requests.get(myurl).content
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

I tried using

all_strong = soup.find_all("strong")

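and then taking the 17th element, which happens to be the one containing the figure I want, but this seems far from elegant. Something like this: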

all_strong[16].parent.next_sibling 
...

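Of course, the goal is to use BeautifulSoup to search for the name of the figure I need (in this case "Net Income") and then grab the figures themselves from the same row of the HTML table.

I would really appreciate any ideas on how to solve this, keeping in mind that I would like to apply the solution to retrieve a bunch of other data from other Yahoo Finance pages.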

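SOLUTION / EXPANSION:

The solution by @wilbur below worked, and I expanded on it so that it can get the values for any figure available on any of the financials pages (i.e. Income Statement, Balance Sheet, and Cash Flow Statement) of any listed company. My function is as follows: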

import re
import sys

def periodic_figure_values(soup, yahoo_figure):
    """Return the values in the row labelled yahoo_figure, as integers."""
    values = []
    pattern = re.compile(yahoo_figure)

    title = soup.find("strong", text=pattern)  # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)  # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")

    cells = row.find_all("td")[1:]  # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:  # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":  # treat a lone dash as zero
                str_value = "0"
            value = int(str_value) * 1000
            values.append(value)

    return values

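The yahoo_figure variable is a string. Obviously, this has to be exactly the same as the figure name used on Yahoo Finance. To pass the soup variable, I first use the following function: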

def financials_soup(ticker_symbol, statement="is", quarterly=False):
    """Fetch and parse a statement page; statement is "is", "bs" or "cf"."""
    if statement in ("is", "bs", "cf"):
        url = "https://finance.yahoo.com/q/" + statement + "?s=" + ticker_symbol
        if not quarterly:
            url += "&annual"
        return BeautifulSoup(requests.get(url).text, "html.parser")

    # sys.exit() raises SystemExit, so there is nothing to return here
    sys.exit("Invalid financial statement code '" + statement + "' passed.")
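One detail worth noting about the figure name: Beautiful Soup applies a compiled regular expression with its search() method, so a pattern built from a plain name also matches longer labels that contain it. The sketch below shows one possible way to require the whole cell text to match. The helper name exact_label_pattern and the sample markup are hypothetical, made up for illustration; the real page layout may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr><td>Net Income From Continuing Ops</td><td>2,000</td></tr>
  <tr><td>Net Income</td><td>1,000</td></tr>
</table>
"""

def exact_label_pattern(label):
    """Build a regex that matches label as the entire cell text (outer whitespace ignored)."""
    return re.compile(r"^\s*" + re.escape(label) + r"\s*$")

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A plain pattern is searched for anywhere in the text, so the first hit is the longer label.
loose = soup.find("td", string=re.compile("Net Income"))
# The anchored pattern only accepts a cell whose text is exactly the label.
strict = soup.find("td", string=exact_label_pattern("Net Income"))

print(loose.get_text())
print(strict.get_text())
```

re.escape also keeps characters such as parentheses or dots in a figure name from being read as regex syntax.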

Usage example: I want to get Apple Inc.'s income tax expense from the last available income statements:

print(periodic_figure_values(financials_soup("AAPL", "is"), "Income Tax Expense"))

Output: [19121000000, 13973000000, 13118000000]

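You could also get the period end dates from the soup and create a dictionary where the dates are the keys and the figures are the values, but that would make this post too long. So far this seems to work for me, but I am always thankful for constructive criticism.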
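The date-keyed dictionary mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the header row label "Period Ending", the sample dates, and the helper names row_cells and figures_by_period are hypothetical, and the inline table stands in for the real page, whose layout may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the "Period Ending" label and the dates are assumptions.
SAMPLE_HTML = """
<table>
  <tr><td>Period Ending</td><td>Sep 26, 2015</td><td>Sep 27, 2014</td><td>Sep 28, 2013</td></tr>
  <tr><td>Income Tax Expense</td><td>19,121,000</td><td>13,973,000</td><td>13,118,000</td></tr>
</table>
"""

def row_cells(soup, label):
    """Return the stripped text of every cell after the label cell, or None if the label is missing."""
    title = soup.find("td", string=re.compile(re.escape(label)))
    if title is None:
        return None
    return [td.get_text(strip=True) for td in title.parent.find_all("td")[1:]]

def figures_by_period(soup, label, header_label="Period Ending"):
    """Map each period end date to the figure in the same column of the labelled row."""
    dates = row_cells(soup, header_label)
    raw_values = row_cells(soup, label)
    if dates is None or raw_values is None:
        return {}
    # Same scaling as periodic_figure_values: strip thousands separators, multiply by 1000.
    values = [int(v.replace(",", "")) * 1000 for v in raw_values]
    return dict(zip(dates, values))

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
result = figures_by_period(soup, "Income Tax Expense")
print(result)
```

zip pairs the cells by position, so this relies on the date row and the figure row having the same number of columns.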

Source

2016-02-16 JohnGalt

This is made a little more difficult because "Net Income" is enclosed in a <strong> tag, so bear with me, but I think this works:

import re, requests 
from bs4 import BeautifulSoup 

url = 'https://finance.yahoo.com/q/is?s=AAPL&annual' 
r = requests.get(url) 
soup = BeautifulSoup(r.text, 'html.parser') 
pattern = re.compile('Net Income') 

title = soup.find('strong', text=pattern) 
row = title.parent.parent # yes, yes, I know it's not the prettiest 
cells = row.find_all('td')[1:] #exclude the <td> with 'Net Income' 

values = [ c.text.strip() for c in cells ]

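values, in this case, will contain the three table cells in the "Net Income" row (and, I might add, they can easily be converted to ints; I just liked keeping them as strings with the ',' separators):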
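For the conversion to ints, a minimal sketch could look like this. It follows the conventions used by the asker's own function in this thread (parentheses for negative numbers, a lone dash for an empty cell, and scaling by 1000); the helper name to_int is hypothetical.

```python
def to_int(cell_text, multiplier=1000):
    """Convert a cell string such as '53,394,000' or '(1,234)' to an integer."""
    cleaned = cell_text.strip().replace(",", "")
    if cleaned in ("", "-"):
        # Assumption taken from the question's function: a lone dash means zero.
        return 0
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]  # drop the surrounding parentheses
    number = int(cleaned) * multiplier
    return -number if negative else number

values = ["53,394,000", "39,510,000", "37,037,000"]
numbers = [to_int(v) for v in values]
print(numbers)  # [53394000000, 39510000000, 37037000000]
```

Pass multiplier=1 to keep the numbers as printed in the table.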

In [10]: values 
Out[10]: [u'53,394,000', u'39,510,000', u'37,037,000']

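When I tested it on Alphabet (GOOG), it didn't work, because they don't display an Income Statement, I believe (https://finance.yahoo.com/q/is?s=GOOG&annual), but when I checked Facebook (FB), the values were returned correctly (https://finance.yahoo.com/q/is?s=FB&annual).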
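When the row is missing, soup.find returns None, and the later .parent access then raises an AttributeError. One possible way to guard against that is sketched below. The helper name find_row_values and the inline markup are hypothetical and only stand in for a page that lacks the requested row.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with a placeholder value and no "Net Income" row.
SAMPLE_HTML = """
<table>
  <tr><td><strong>Total Revenue</strong></td><td>1,000</td></tr>
</table>
"""

def find_row_values(soup, label):
    """Return the cell texts of the row whose bold label matches, or None if there is no such row."""
    title = soup.find("strong", string=re.compile(re.escape(label)))
    if title is None:
        return None  # label not on the page, e.g. no statement is shown
    row = title.find_parent("tr")  # nearest enclosing table row
    if row is None:
        return None
    return [cell.get_text(strip=True) for cell in row.find_all("td")[1:]]

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(find_row_values(soup, "Total Revenue"))
print(find_row_values(soup, "Net Income"))
```

Returning None leaves it to the caller to decide whether a missing figure should be skipped or reported.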

If you want to create a more dynamic script, you can use string formatting to build the URL with whatever ticker symbol you want, like this:

ticker_symbol = 'AAPL'  # or 'FB' or any other ticker symbol
url = 'https://finance.yahoo.com/q/is?s={}&annual'.format(ticker_symbol)

Source

2016-02-16 20:23:02 wpercy

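Thank you very much. It works well so far. Now I just need to make it more dynamic, not just with regard to the stock, but also for other financial figures of the same stock, checking for the most recent figures, and so on. But this is a great start. – JohnGalt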
