Web scraping: getting an empty list

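I am having a hard time finding the correct path for my web scraping code.

I want to scrape various pieces of information from http://financials.morningstar.com/company-profile/c.action?t=AAPL. I have tried several paths; some seem to work and some do not. I am interested in the CIK under Operation Details.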

import requests
from lxml import html

page = requests.get('http://financials.morningstar.com/company-profile/c.action?t=AAPL')
tree = html.fromstring(page.text)

#desc = tree.xpath('//div[@class="r_title"]/span[@class="gry"]/text()') #works

#desc = tree.xpath('//div[@class="wrapper"]//div[@class="headerwrap"]//div[@class="h_Logo"]//div[@class="h_Logo_row1"]//div[@class="greeter"]/text()') #works

#desc = tree.xpath('//div[@id="OAS_TopLeft"]//script[@type="text/javascript"]/text()') #works

# this is the path that returns an empty list
desc = tree.xpath('//div[@class="col2"]//div[@id="OperationDetails"]//table[@class="r_table1 r_txt2"]//tbody//tr//th[@class="row_lbl"]/text()')

I cannot figure out the last path. It looks to me as if I am following the structure correctly, but I get an empty list.

Source: 2015-10-14, AK9309

The last element is th, which is a table header in HTML, so you may need to change it to td, which is used for table data. – postelrich

http://stackoverflow.com/questions/24163745/beginner-to-scraping-keep-on-getting-empty-lists This may be a similar problem; take a look.

http://stackoverflow.com/questions/33110734/xpath-not-working-for-screen-scraping/33111061?noredirect=1#comment54037557_33111061 Here is a similar case where an error in the HTML leads to an empty parse result. – rebeling

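The problem is that the Operation Details are loaded separately, with an additional GET request. Simulate that request in your code while maintaining a web-scraping session: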

import requests
from lxml import html


with requests.Session() as session:
    # load the main page first so the session picks up any cookies
    page = session.get('http://financials.morningstar.com/company-profile/c.action?t=AAPL')
    tree = html.fromstring(page.text)

    # get the operational details
    response = session.get("http://financials.morningstar.com/company-profile/component.action", params={
        "component": "OperationDetails",
        "t": "XNAS:AAPL",
        "region": "usa",
        "culture": "en-US",
        "cur": "",
        "_": "1444848178406"
    })

    tree_details = html.fromstring(response.content)
    print(tree_details.xpath('.//th[@class="row_lbl"]//text()'))

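Old answer:

You just need to remove tbody from the expression:

//div[@class="col2"]//div[@id="OperationDetails"]//table[@class="r_table1 r_txt2"]//tr//th[@class="row_lbl"]/text()

tbody is an element that the browser inserts into a table to group its data rows. It is often absent from the HTML the server actually sends, which is what lxml parses, so a path copied from the browser's inspector may not match.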
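The tbody effect can be reproduced offline. The sketch below parses a small, made-up table (the markup and the labels "Label A" and "Label B" are placeholders, not the real Morningstar page) in which the rows sit directly inside the table, as a server might send them. It assumes lxml is installed.

```python
from lxml import html

# Placeholder markup: <tr> directly under <table>, with no <tbody>.
RAW = """
<div id="OperationDetails">
  <table class="r_table1 r_txt2">
    <tr><th class="row_lbl">Label A</th><td>value A</td></tr>
    <tr><th class="row_lbl">Label B</th><td>value B</td></tr>
  </table>
</div>
"""

tree = html.fromstring(RAW)

# Path copied from a browser inspector: it includes the tbody step.
with_tbody = tree.xpath(
    '//table[@class="r_table1 r_txt2"]//tbody//tr//th[@class="row_lbl"]/text()'
)

# Same path with the tbody step removed.
without_tbody = tree.xpath(
    '//table[@class="r_table1 r_txt2"]//tr//th[@class="row_lbl"]/text()'
)

print(with_tbody)     # empty: the parsed tree has no tbody element
print(without_tbody)  # the two header labels
```

The first query finds nothing because lxml builds the tree from the markup as given and does not add a tbody, while the second matches both header cells.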

Source: 2015-10-14 18:32:04, alecxe

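I still get an empty list. I believe my problem is that there are several 'tr' elements in the table, so I should give it a 'tr' index, like '//table[@class="r_table1 r_txt2"]//tr[3]//th[@class="row_lbl"]/text()'. But I still get an empty list. – AK9309

@AK9309 The problem is that the Operation Details are loaded dynamically through an additional GET request to http://financials.morningstar.com/company-profile/component.action. – alecxe

@AK9309 Updated. Check it out. – alecxe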
