Python/lxml: How do I capture a row in an HTML table?

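For my stock screening tool, I have to switch from BeautifulSoup to lxml in my script. After my Python script downloads the web pages I need to process, BeautifulSoup parses them correctly, but the process is too slow. Parsing the balance sheet, income statement, and cash flow statement for a single stock takes about 10 seconds, and since my script has more than 5000 stocks to analyze, that is unacceptable.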

According to some benchmarks (http://www.crummy.com/2012/1/22/0), lxml is nearly 100 times faster than BeautifulSoup. So lxml should be able to finish in about 10 minutes a job that takes BeautifulSoup 14 hours.
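The 14-hour and 10-minute figures follow from the numbers quoted above. A quick check, assuming the roughly 100x benchmark speedup carries over to the whole job (it may not, if the time is spent outside the parser):

```python
# Back-of-the-envelope check of the figures quoted in the question.
SECONDS_PER_STOCK = 10   # observed with BeautifulSoup (from the question)
STOCK_COUNT = 5000       # "more than 5000" stocks; 5000 is used as a floor
ASSUMED_SPEEDUP = 100    # assumption: the benchmark ratio applies end to end

total_seconds = SECONDS_PER_STOCK * STOCK_COUNT
bs_hours = total_seconds / 3600
lxml_minutes = total_seconds / ASSUMED_SPEEDUP / 60

print(f"BeautifulSoup: about {bs_hours:.1f} hours")
print(f"lxml at {ASSUMED_SPEEDUP}x: about {lxml_minutes:.1f} minutes")
```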

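How do I use lxml to capture the contents of a row in an HTML table? An example of an HTML page that my script has downloaded and needs to parse is at http://www.smartmoney.com/quote/FAST/?story=financials&opt=YB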

The source code that parses this HTML table with BeautifulSoup is:

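# Python 2 fragment from inside a function. local_balancesheet() is the
# script's own helper; it presumably returns the path of the saved page.
url_local = local_balancesheet(symbol_input)
url_local = "file://" + url_local
page = urllib2.urlopen(url_local)
soup = BeautifulSoup(page)
# Find the text node that matches the title, then climb three levels up.
soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
list_output = soup_line_item.findAll('td')  # list of <td> elements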

If I am looking for cash and short-term investments, title_input = "Cash & Short Term Investments".

How do I do the same thing in lxml?

2012-11-28 jhsu802701

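From other Google searches, it looks like the way to go is lxml, etree, HTMLParser, and xpath. The xpath expression specifies what to look for. How do I get xpath to find the row in an HTML table that contains specific text? – jhsu802701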
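One possible shape for such a lookup, as a minimal sketch rather than a tested solution for the page in question: the XPath below selects the first `tr` that has a text node equal to the title, much like `soup.findAll(text=title_input)` followed by climbing to the row. The `SAMPLE` markup, its values, and the helper name `find_row` are made up for illustration; the real page layout may differ.

```python
from lxml import html

# Made-up markup and values, only to exercise the lookup.
SAMPLE = """
<table>
  <tr><th>Item</th><th>2011</th><th>2010</th></tr>
  <tr><th><div>Cash &amp; Short Term Investments</div></th><td>1,000</td><td>2,000</td></tr>
  <tr><th><div>Total Liabilities</div></th><td>3,000</td><td>4,000</td></tr>
</table>
"""

def find_row(tree, title):
    """Return the <td> texts of the first row containing `title`, or None."""
    # $title is an XPath variable, so the title needs no manual quoting.
    rows = tree.xpath(
        "//tr[.//text()[normalize-space(.) = $title]]", title=title
    )
    if not rows:
        return None
    return [td.text_content().strip() for td in rows[0].xpath("./td")]

tree = html.fromstring(SAMPLE)
print(find_row(tree, "Cash & Short Term Investments"))
```

For a page saved on disk, `lxml.html.parse(path)` returns a tree whose `xpath` method accepts the same expression.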

You can use the lxml parser from within BeautifulSoup, so I am not sure why you need to switch.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

soup = BeautifulSoup(markup, "lxml")

Edit: here is some code to play with. It takes about six seconds for me.

import urllib2  # Python 2; on Python 3 the equivalent is urllib.request

from bs4 import BeautifulSoup

WANTED = ('Cash & Short Term Investments', 'Property, Plant & Equipment - Gross',
          'Total Liabilities', 'Preferred Stock (Carrying Value)')

def get_page_data(url):
    f = urllib2.urlopen(url)
    soup = BeautifulSoup(f, 'lxml')
    f.close()
    data = {}
    for tr in soup.findAll('tr'):
        try:
            label = tr.div.text.strip()
            if label in WANTED:
                # Drop thousands separators before converting each cell.
                data[label] = [int(e.text.strip().replace(',', '')) for e in tr.findAll('td')]
        except (AttributeError, ValueError):
            # Rows without a <div> (e.g. headers) raise AttributeError;
            # 'Fiscal Year Ending in 2011' raises ValueError.
            pass
    return data

2012-11-28 23:07:03 kreativitea

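I replaced BeautifulSoup 3 with BeautifulSoup4 and used the command above to call lxml from within BeautifulSoup. Unfortunately, I do not see a significant speed improvement. How do I get a 100x (or even 10x) speedup over what I was using before? – jhsu802701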

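Then parsing is not the problem; your analysis and searching are. You only parse each page *once*, and parsing a page with lxml should take less than a second. Which numbers do you need to get off that page? Maybe I can come up with a way to do it quickly. – kreativitea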
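One way to check where the time actually goes is to time the parse and the lookups separately. A small sketch using only the standard library; `timed`, `parse_page`, and `find_lines` are hypothetical names, and the two stand-in functions only take the place of the real parse and search steps:

```python
import time

def timed(label, func, *args, **kwargs):
    """Call func, print the elapsed wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result, elapsed

# Stand-ins for the real steps: swap in the actual parse and search calls.
def parse_page(text):
    return text.splitlines()

def find_lines(lines, title):
    return [line for line in lines if title in line]

page_text = "Total Liabilities 3,000\nOther 1\n"
parsed, parse_seconds = timed("parse", parse_page, page_text)
found, search_seconds = timed("search", find_lines, parsed, "Total Liabilities")
```

If the search step dominates, a faster parser will not help much; parsing once and reusing the parsed tree for every lookup is the first thing to verify.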

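The data I need to capture from the balance sheet page are: Cash & Short Term Investments; Property, Plant & Equipment - Gross; Total Liabilities; and Preferred Stock (Carrying Value). – jhsu802701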
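A sketch of collecting all four line items in a single pass with lxml, parsing the page once. It assumes the label sits in the first cell of each row and that the values are comma-grouped integers; both are assumptions about the page, and `SAMPLE`, `to_int`, and `extract_rows` are illustrative names with made-up data.

```python
from lxml import html

WANTED = {
    "Cash & Short Term Investments",
    "Property, Plant & Equipment - Gross",
    "Total Liabilities",
    "Preferred Stock (Carrying Value)",
}

# Made-up markup and values, only to exercise the function.
SAMPLE = """
<table>
  <tr><th>Item</th><th>2011</th><th>2010</th></tr>
  <tr><th><div>Cash &amp; Short Term Investments</div></th><td>1,000</td><td>2,000</td></tr>
  <tr><th><div>Total Liabilities</div></th><td>3,000</td><td>4,000</td></tr>
</table>
"""

def to_int(text):
    """Convert '1,000' to 1000; raises ValueError on anything else."""
    return int(text.strip().replace(",", ""))

def extract_rows(tree, wanted=WANTED):
    """Map each wanted label to the list of integers in its row."""
    data = {}
    for tr in tree.iter("tr"):
        cells = tr.xpath("./th | ./td")  # direct cells, in document order
        if len(cells) < 2:
            continue
        label = cells[0].text_content().strip()
        if label in wanted:
            data[label] = [to_int(c.text_content()) for c in cells[1:]]
    return data

tree = html.fromstring(SAMPLE)
print(extract_rows(tree))
```

Cells that hold dashes, blanks, or parenthesized negatives would make `to_int` raise `ValueError`, so real pages may need a more forgiving converter.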
