Web scraping not retrieving the entire document with urllib or requests

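I'm trying to scrape product information from lowes.com. My test case is this product: AirStone 8-sq ft Autumn Mountain Faux Stone Veneer. When I visit the page with JavaScript disabled (to make sure I'm not seeing content that urllib / requests couldn't fetch), I clearly get a price for this item, but when I try either of those packages, several parts of the page are missing.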

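It so happens that the missing parts are the ones I need to scrape (some of the price information; magically, everything else is still there). I'd rather not use Selenium, for the sake of speed. My current requests and urllib code is as follows.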

Common setup

from urllib.request import Request, urlopen 
import requests # switch as needed with urlopen 
import gzip # manual decompression required with Request object + urlopen, or so I've found 

url = "https://www.lowes.com/pd/AirStone-8-sq-ft-Autumn-Mountain-Faux-Stone-Veneer/50247201" 
headers = { 
     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8", 
     "Accept-Encoding": "gzip, deflate, br", 
     "Accept-Language": "en-US,en;q=0.8", 
     "Cache-Control": "no-cache", 
     "Connection": "keep-alive", 
     "DNT": "1", 
     # "Host": "www.lowes.com", Tried, no difference 
     "Pragma": "no-cache", 
     # "Referer": "https://www.lowes.com/", Tried, no difference 
     "Upgrade-Insecure-Requests": "1", 
     "User-Agent": "Mozilla/5.0 (Windows NT 6.1 Win64 x64) AppleWebKit/537.36 (KHTML," 
     " like Gecko) Chrome/59.0.3071.115 Safari/537.36" # <=- Tried placing all on one line, didn't make a difference 
    }

urlopen

req = Request(url, None, headers) 
page = gzip.decompress(urlopen(req).read()).decode('utf-8') 
with open("content.txt", "w") as f: 
    f.write(page) # <=- missing the 59.97 price tag anywhere in the document :(
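
The urlopen snippet above always calls gzip.decompress, yet the request headers also advertise deflate and br. The following is a minimal sketch, not part of the original question, of choosing the decompression step from the response's Content-Encoding header instead. The helper name decode_body is hypothetical, and Brotli (br) is deliberately rejected because the standard library has no decoder for it.

```python
import gzip
import zlib


def decode_body(raw, content_encoding, charset="utf-8"):
    """Decompress a response body according to its Content-Encoding header.

    decode_body is a hypothetical helper; it handles gzip, deflate and
    uncompressed bodies, and raises for anything else (for example "br").
    """
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            raw = zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    elif encoding != "identity":
        raise ValueError(f"unsupported Content-Encoding: {encoding}")
    return raw.decode(charset)


# Intended use with the question's code (needs network access):
#   resp = urlopen(req)
#   page = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

If the server might answer with Brotli, removing "br" from the Accept-Encoding header is one way to keep this helper sufficient.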

requests

sessions = requests.Session() 
page = sessions.get(url, headers=headers) # url is the module-level variable defined above 

with open("content.txt", "w") as f: 
    f.write(page.text) # <=- Also missing the 59.97 price tag anywhere in the document :'(

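So the question is: what am I missing? Is there a reason it would be absent? This isn't JavaScript-related, because I deliberately disabled JavaScript before trying to scrape the data, having found that it is the problem much of the time.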

Any help would be greatly appreciated.

Source

2017-08-03 Akidi

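The page you're getting back says "Enter your location for pricing and availability". In an actual browser, you presumably have a cookie from a previous visit that supplies your location to the site. I'm sure it's possible to include cookies with either request method, but I don't know the details. – jasonharper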

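That's an amazing find; I thought I had deleted the cookies. Apparently I hadn't. Much appreciated, good internet denizen :) Tally-ho, off to see what I can figure out. – Akidi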

As per jasonharper's comment, cookies ended up being the answer. Finding the right one allowed me to pull all of the necessary data.

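In short, before attempting to scrape a website, be sure to disable/delete cookies, if for no other reason than to ensure that you are seeing what your script sees.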
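
One way to confirm that a script is seeing the same page as the browser is to check the fetched HTML for the expected price and for the location prompt. This is a rough sketch under stated assumptions: the function name diagnose is hypothetical, the price "59.97" comes from the question, and the prompt wording is taken from the comment above and may not match the site's markup exactly.

```python
def diagnose(html,
             expected_price="59.97",
             location_prompt="Enter your location for pricing and availability"):
    """Classify a fetched page by plain substring checks (hypothetical helper)."""
    if expected_price in html:
        return "price present"
    # Case-insensitive match, since the exact casing on the page is an assumption.
    if location_prompt.lower() in html.lower():
        return "location prompt"
    return "neither found"


# Intended use: pass in the text written to content.txt, e.g. diagnose(page)
```

A "location prompt" result suggests the server is treating the request as coming from a visitor without a selected store, which is the situation a missing cookie would produce.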

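For those curious, the specific cookie is {"sn": "####"} (the store number). With JavaScript enabled, you can simply pick a store and hover over it to see the URL it links to, which reveals the store number. Change it to suit.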
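
A minimal sketch of attaching that cookie with urllib follows. The cookie name "sn" is the one reported above; the helper name build_request and its parameters are hypothetical, and the store number is left as an argument because it depends on the store you choose.

```python
from urllib.request import Request


def build_request(url, headers, store_number):
    """Return a urllib Request that carries the store-number cookie.

    build_request is a hypothetical helper; "sn" is the cookie name
    reported in the answer, and store_number is whatever store you picked.
    """
    req = Request(url, None, dict(headers))
    req.add_header("Cookie", f"sn={store_number}")
    return req


# Intended use with the question's variables (needs network access):
#   req = build_request(url, headers, "####")  # "####" stands in for a real store number
#   page = gzip.decompress(urlopen(req).read()).decode("utf-8")
```

With requests, the same cookie can be passed through the cookies parameter, for example sessions.get(url, headers=headers, cookies={"sn": store_number}).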

Source

2017-08-03 17:08:20 Akidi
