2017-03-17

Parsing an Amazon page with Beautiful Soup

Hi, I'm trying to parse the book details from an Amazon page using Beautiful Soup.

Link: https://www.amazon.com/Dogs-Purpose-Novel-Humans/dp/0765326264/ref=sr_1_1?s=electronics&ie=UTF8&qid=1489776209&sr=1-1&keywords=books

from bs4 import BeautifulSoup 
import requests 

url = raw_input("Enter a website to extract the URL's from: ") 
r = requests.get(url) 

data = r.text 

soup = BeautifulSoup(data, "lxml") 

#Grab book details 
print soup.find("table", {"id": "productDetailsTable" }) 

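But when I run this code, I get None. I'm sure the id productDetailsTable exists, and the same code works when I run it against dummy HTML. Why does it fail only when I fetch the page from the URL?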

Any reason you wouldn't just use the Amazon API? – Cfreak

I'm trying to get specific product details for other products that aren't really accessible through their API but do appear on their HTML pages :( –

Answers

1

I didn't see it with https://www.amazon.com

I had to use https://www.amazon.com/ in order to receive the HTML data for productDetailsTable.

Here is my slightly modified Python 3 code.

from bs4 import BeautifulSoup 
import requests 

url = input("Enter a website to extract the URL's from: ") 
r = requests.get(url) 

data = r.text 

soup = BeautifulSoup(data, "lxml") 

print(soup.text) 

It prints the text of the page that was returned (soup.text includes the contents of script tags).

You'll notice that Amazon is smart. The response includes a robot check:

if (true === true) { 
var ue_t0 = (+ new Date()), 
    ue_csm = window, 
    ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } }, 
    ue_furl = "fls-na.amazon.com", 
    ue_mid = "ATVPDKIKX0DER", 
    ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1], 
    ue_sn = "opfcaptcha.amazon.com", 
    ue_id = 'R8D7EEN5FVS7RWC2M549'; 
} 
Enter the characters you see below 
Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies. 
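
To tell a robot-check page apart from a real product page before searching for productDetailsTable, the response body can be checked for the phrases seen in the output above. This is only a rough sketch: the function name `looks_like_robot_check` and the marker list are made up for illustration, and Amazon may change this wording at any time.

```python
# Marker strings copied from the robot-check output shown above.
# They are an assumption about what the block page contains, not a stable API.
ROBOT_CHECK_MARKERS = (
    "Enter the characters you see below",
    "make sure you're not a robot",
    "opfcaptcha",
)


def looks_like_robot_check(html):
    """Return True if the response body contains any known robot-check marker."""
    lowered = html.lower()
    return any(marker.lower() in lowered for marker in ROBOT_CHECK_MARKERS)
```

If this returns True for r.text, then soup.find("table", {"id": "productDetailsTable"}) returning None is expected, because the table was never sent.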

It is stopping you from reading Amazon's page. You will have to do more, probably still with requests, and include headers and cookie information.
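
One way to send headers and keep cookies with requests is a Session, roughly as sketched below. The helper names `make_session` and `fetch_html` and the header values are assumptions chosen for illustration, and nothing here guarantees that Amazon will return the product page instead of the robot check.

```python
import requests

# Illustrative browser-like User-Agent string; any realistic value could be used.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def make_session(user_agent=None):
    """Create a Session that sends browser-like headers on every request."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": user_agent or DEFAULT_USER_AGENT,
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page through the session; cookies set by the server are reused."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

The returned text can then be passed to BeautifulSoup(data, "lxml") as before. A Session is used rather than a bare requests.get so that cookies from earlier responses are sent automatically on later requests.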

Oh rip, okay, thanks, that makes more sense to me now –