Scraping a web page with the Python requests library

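I want to get some information from a web page (link below) using requests in Python. However, the HTML I see in my browser does not seem to be there when I connect through Python's requests library, and none of my XPath queries return anything. I can use requests on other sites, such as Amazon (the site below is actually owned by Amazon, yet I can't seem to get any information from it).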

import requests
from lxml import html

url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'
user_agent = {'User-agent': 'Mozilla/5.0'}
page = requests.get(url, headers=user_agent)
tree = html.fromstring(page.text)
# The id value must be quoted inside the XPath predicate.
query = tree.xpath('//span[@id="ourPrice"]/text()')


2015-04-17 gtomg

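Your 'url' isn't inside quotes, so it isn't a string. – MattDMo

It seems to load the product description using JavaScript and AJAX. – user3557327

In fact, almost all of the site's content is built by JavaScript XHR calls. – felipsmartins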
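One detail that fits these comments: everything after the `#` in the question's URL is a fragment, and a fragment is handled by the browser rather than sent to the server in the HTTP request. The sketch below uses only the standard library to show how the URL splits; the shortened URL and the variable names are mine, for illustration only.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened version of the URL from the question (illustrative).
url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&sindex=0'

parts = urlsplit(url)

# What the server is asked for: only the path (the query string is empty here).
print(repr(parts.path), repr(parts.query))

# What stays on the client: the fragment, which the page's JavaScript can read.
params = parse_qs(parts.fragment)
print(params['asin'])
```

So a plain HTTP fetch of this URL asks only for the site's root page, and the product details have to come from whatever the page's scripts load afterwards.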

The elements are generated using JavaScript, so you can use selenium to get the page source, combined with phantomjs for headless browsing:

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'

# PhantomJS must be installed and on the PATH for this driver to start.
browser = webdriver.PhantomJS()
browser.get(url)
_html = browser.page_source
browser.quit()

# Name the parser explicitly so BeautifulSoup does not have to guess one.
print(BeautifulSoup(_html, "html.parser").find("span", {"id": "ourPrice"}).text)
$50


2015-04-17 20:42:16

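This is great. I used it exactly as you suggested, except that I added an executable path to phantomjs.exe: browser = webdriver.PhantomJS(executable_path=path). This seems to work well in most cases, but sometimes it returns null and other times $50. What could be causing the inconsistency? – gtomg

You may need to add a wait. There are some good examples in the docs: http://selenium-python.readthedocs.org/en/latest/waits.html –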
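To illustrate what a wait does, here is a minimal, library-free sketch of the idea: keep retrying the lookup until it yields a value or a deadline passes. The helper `wait_for` and the stand-in `fake_lookup` are hypothetical names invented for this example; they are not part of Selenium.

```python
import time

def wait_for(fetch, timeout=10.0, interval=0.5):
    """Call fetch() until it returns a non-None value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = fetch()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError('value did not appear within %s seconds' % timeout)
        time.sleep(interval)

# Stand-in for a page whose price element only shows up on the third look.
attempts = {'count': 0}

def fake_lookup():
    attempts['count'] += 1
    return '$50' if attempts['count'] >= 3 else None

price = wait_for(fake_lookup, timeout=2.0, interval=0.01)
print(price)
```

With Selenium itself you would not normally write this loop by hand: the waits page linked above describes `WebDriverWait(browser, 10).until(...)` used with an expected condition such as `presence_of_element_located((By.ID, "ourPrice"))`, which polls in a similar way and raises a timeout exception if the element never appears.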

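Here is the code I used to scrape a table from a website. On that site the table has no id or class defined, so you don't need to put anything in the query. If an id or class is present, just use html.xpath('//table[@id="id_val"]/tr') instead of html.xpath('//table/tr').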

from lxml import etree
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen

web = urlopen("http://www.yourpage.com/")
html = etree.HTML(web.read())
tr_nodes = html.xpath('//table/tr')

# Keep rows whose third cell (index 2) is 'Chennai' or 'Across India',
# or lists 'Chennai' among '/'-separated values.
td_content = []
for tr in tr_nodes:
    tds = tr.xpath('td')
    location = tds[2].text
    if location in ('Chennai', 'Across India') or 'Chennai' in location.split('/'):
        td_content.append(tds)

main_list = []
for i in td_content:
    # Filter again on the sixth cell (index 5).
    if i[5].text == 'Freshers' or 'Freshers' in i[5].text.split('/') or '0' in i[5].text.split(' '):
        sub_list = [td.text for td in i]
        # Insert the absolute link built from the anchor in the seventh cell.
        sub_list.insert(6, 'http://yourpage.com/%s' % i[6].xpath('a')[0].get('href'))
        main_list.append(sub_list)
print('main_list', main_list)
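For the case where the table does carry an id, here is a small self-contained sketch of the id-based query. It uses the standard library's xml.etree.ElementTree on made-up, well-formed markup so that it runs without a network connection; the id "jobs" and the cell values are invented for illustration. lxml's html.xpath('//table[@id="jobs"]/tr') takes the same predicate form.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup; the ids and cell values are illustrative only.
doc = ET.fromstring(
    '<html><body>'
    '<table id="other"><tr><td>skip me</td></tr></table>'
    '<table id="jobs">'
    '<tr><td>Acme</td><td>Chennai</td></tr>'
    '<tr><td>Globex</td><td>Mumbai</td></tr>'
    '</table>'
    '</body></html>'
)

# Note the quotes around the id value inside the predicate.
rows = doc.findall(".//table[@id='jobs']/tr")
cells = [[td.text for td in tr.findall('td')] for tr in rows]
print(cells)
```

Only the rows of the table whose id matches are returned; the first table is skipped entirely.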


2016-02-11 12:20:18
