Amazon web scraping

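I'm trying to scrape Amazon prices with PhantomJS and Python. I want to parse the page with Beautiful Soup to get the new and used prices for books. The problem is that when I request the page source through PhantomJS, the prices are all 0,00. The code is this simple test.

I'm new to web scraping, and I can't tell whether Amazon has measures in place to prevent price scraping or whether I'm doing something wrong, because when I tried other, simpler pages I could get the data I wanted.

PS: I'm in a country where the Amazon API isn't supported, which is why a scraper is necessary.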

import re 
import urlparse 

from selenium import webdriver 
from bs4 import BeautifulSoup 
from time import sleep 

link = 'http://www.amazon.com/gp/offer-listing/1119998956/ref=dp_olp_new?ie=UTF8&condition=new'#'http://www.amazon.com/gp/product/1119998956' 

class AmzonScraper(object):
    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)

    def scrape_prices(self):
        # Load the offer listing and parse the rendered page source.
        self.driver.get(link)
        s = BeautifulSoup(self.driver.page_source)
        return s

    def scrape(self):
        source = self.scrape_prices()
        print(source)
        self.driver.quit()

if __name__ == '__main__':
    scraper = AmzonScraper()
    scraper.scrape()


2015-03-31 mch505

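Just FYI, you shouldn't say that you're doing this; it's against Amazon's ToS and you could get into a lot of trouble. – 2015-03-31 22:11:39

Where are you scraping anything? – 2015-03-31 22:17:40

@PadraicCunningham Yeah, apparently this has nothing to do with web scraping at all. And the class name is 'AmzonScraper', so it's about the 'Amzon' store, a completely different online shop. – alecxe 2015-03-31 23:36:34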

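First, follow @Nick Bailey's comment: study the Terms of Use and make sure you are not violating them.

To solve the problem, you need to tweak PhantomJS's desired capabilities: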

caps = webdriver.DesiredCapabilities.PHANTOMJS 
caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87" 

self.driver = webdriver.PhantomJS(desired_capabilities=caps) 
self.driver.maximize_window()

And, to make it bulletproof, you can write a custom Expected Condition and wait for the price to become non-zero:

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

class wait_for_price(object):
    """Expected condition: true once the located element no longer shows "0,00"."""
    def __init__(self, locator):
        self.locator = locator

    def __call__(self, driver):
        try:
            # find_element raises NoSuchElementException if the element is missing;
            # WebDriverWait ignores that exception by default and retries.
            element_text = driver.find_element(*self.locator).text.strip()
            return element_text != "0,00"
        except StaleElementReferenceException:
            return False

Usage:

def scrape_prices(self): 
    self.driver.get(link) 

    WebDriverWait(self.driver, 200).until(wait_for_price((By.CLASS_NAME, "olpOfferPrice"))) 
    s = BeautifulSoup(self.driver.page_source) 

    return s
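
The condition above only recognises the exact string "0,00". As a rough sketch, assuming the placeholder might also appear as "0.00" or with a currency symbol (an assumption, not something confirmed for Amazon's pages), the check could be based on the parsed number instead. `parse_price` and `has_real_price` are hypothetical helper names, and treating a separator followed by exactly two digits as the decimal point is a simplification.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first price-like number in `text` as a Decimal, or None."""
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    number = match.group(0).rstrip(".,")
    # Position of the last '.' or ',' in the number (-1 if there is none).
    last_sep = max(number.rfind("."), number.rfind(","))
    if last_sep != -1 and len(number) - last_sep - 1 == 2:
        # Exactly two digits after the last separator: treat it as the decimal point.
        whole = re.sub(r"[.,]", "", number[:last_sep])
        digits = whole + "." + number[last_sep + 1:]
    else:
        # Otherwise assume every separator is a thousands separator.
        digits = re.sub(r"[.,]", "", number)
    return Decimal(digits)


def has_real_price(text):
    """True if `text` contains a price greater than zero."""
    price = parse_price(text)
    return price is not None and price > 0
```

Inside `wait_for_price.__call__`, `return element_text != "0,00"` could then become `return has_real_price(element_text)`, and the same helper could convert the text of each `olpOfferPrice` element found in the soup.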


2015-03-31 22:19:54 alecxe

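The answer above about setting the PhantomJS user agent to that of a normal browser is a good one. Since you said your country is blocked by Amazon, I'd guess you also need to set up a proxy.

Here is an example of how to start PhantomJS in Python with a Firefox user agent and a proxy.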

from selenium.webdriver import PhantomJS
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Replace 1.1.1.1:port and username:pass with your own proxy details.
service_args = ['--proxy=1.1.1.1:port', '--proxy-auth=username:pass']
# Copy the defaults so the shared PHANTOMJS dict is not modified.
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0"
driver = PhantomJS(desired_capabilities=dcap, service_args=service_args)

Here 1.1.1.1 is your proxy IP and port is the proxy port. The username and password are only needed if your proxy requires authentication.
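
Since the credentials are optional, the argument list can be built conditionally. The following is only a sketch: `build_service_args` is a hypothetical helper name, and the host and port in the usage note are placeholders. It relies on PhantomJS's `--proxy`, `--proxy-type` and `--proxy-auth` command-line options.

```python
def build_service_args(host, port, username=None, password=None, proxy_type=None):
    """Build the PhantomJS command-line arguments for a proxy.

    `--proxy-auth` is added only when both a username and a password are given.
    """
    args = ["--proxy={0}:{1}".format(host, port)]
    if proxy_type:
        # PhantomJS accepts http, socks5 or none here.
        args.append("--proxy-type={0}".format(proxy_type))
    if username and password:
        args.append("--proxy-auth={0}:{1}".format(username, password))
    return args
```

The result can be passed as `service_args`, for example `PhantomJS(desired_capabilities=dcap, service_args=build_service_args("1.1.1.1", 8080))`.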


2016-03-06 23:29:59
