Scraping multiple pages from a website with Python

I would like to know how to use Beautiful Soup to scrape multiple pages of one website for a single city (e.g. London) without having to repeat my code over and over.

Ideally, my goal is to first scrape all the pages related to one city. My code is below:

session = requests.Session() 
session.cookies.get_dict() 
url = 'http://www.citydis.com' 
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'} 
response = session.get(url, headers=headers) 

soup = BeautifulSoup(response.content, "html.parser") 
metaConfig = soup.find("meta", property="configuration") 


jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris& customerSearch=1&page=0" 
response = session.get(jsonUrl, headers=headers) 
js_dict = (json.loads(response.content.decode('utf-8'))) 

for item in js_dict: 
    headers = js_dict['searchResults']["tours"] 
    prices = js_dict['searchResults']["tours"] 

for title, price in zip(headers, prices): 
    title_final = title.get("title") 
    price_final = price.get("price")["original"] 

print("Header: " + title_final + " | " + "Price: " + price_final)

The output is the following:

Header: London Travelcard: 1 Tag lang unbegrenzt reisen | Price: 19,44 € 
Header: 105 Minuten London bei Nacht im verdecklosen Bus | Price: 21,21 € 
Header: Ivory House London: 4 Stunden mittelalterliches Bankett| Price: 58,92 € 
Header: London: Themse Dinner Cruise | Price: 96,62 €

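It only returns the results of the first page (4 results), but I would like to get all the results for London (there must be more than 200).

Could you give me any advice? I guess I have to count up the page number in the jsonUrl, but I do not know how to do that.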

UPDATE

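Thanks to the help, I got one step further.

In this case I can only scrape one page (page=0), but I would like to scrape the first 10 pages. My approach is therefore the following.

Relevant snippet from the code: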

soup = bs4.BeautifulSoup(response.content, "html.parser") 
metaConfig = soup.find("meta", property="configuration") 

page = 0 
while page <= 11: 
    page += 1 

    jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris& customerSearch=1&page=" + str(page) 
    response = session.get(jsonUrl, headers=headers) 
    js_dict = (json.loads(response.content.decode('utf-8'))) 


    for item in js_dict: 
     headers = js_dict['searchResults']["tours"] 
     prices = js_dict['searchResults']["tours"] 

     for title, price in zip(headers, prices): 
      title_final = title.get("title") 
      price_final = price.get("price")["original"] 

      print("Header: " + title_final + " | " + "Price: " + price_final)

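I get the results back for one particular page, but not for all of them. In addition, I receive an error message. Is this related to why I do not get all the results back?

Output: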

Traceback (most recent call last): 
File "C:/Users/Scripts/new.py", line 19, in <module> 
AttributeError: 'list' object has no attribute 'update'

Thanks for your help.

Source

2017-04-16 Serious Ruffy

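If you want to scrape web pages the right way, look into 'xpaths'. It will shrink your code a lot, perhaps to 5 lines at most for what you are doing above. It is the standard way of doing anything related to scraping and crawling. – anekix

Thanks for the information. I will give it a try. Nevertheless, could you give me some feedback on how to solve the problem above with the approach above? –

You really should make sure your code samples are complete (you were missing some imports) and syntactically correct (the code contains indentation problems). In trying to produce a working example, I came up with the following.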

import json

import bs4
import requests

session = requests.Session()
url = 'http://www.getyourguide.de'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = session.get(url, headers=headers)

# The landing page carries a JSON configuration in a <meta> tag;
# its "pageToken" entry is sent back as the CSRF token.
soup = bs4.BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta", property="configuration")
metaConfigTxt = metaConfig["content"]
csrf = json.loads(metaConfigTxt)["pageToken"]

jsonUrl = "https://www.getyourguide.de/s/results.json?&q=London& customerSearch=1&page=0"
headers.update({'X-Csrf-Token': csrf})
response = session.get(jsonUrl, headers=headers)
js_dict = json.loads(response.content.decode('utf-8'))
print(js_dict.keys())

# Keep the results under their own name so that the request
# headers dict above is not overwritten by a list.
tours = js_dict['searchResults']["tours"]

for tour in tours:
    title_final = tour.get("title")
    price_final = tour.get("price")["original"]
    print("Header: " + title_final + " | " + "Price: " + price_final)

This gives me more than four results.
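For the first ten pages mentioned in the update to the question, one possible approach is to loop over the `page` parameter. The `AttributeError` shown in the question most likely arises because `headers` is rebound from the request-header dict to the list of tours, so a later `headers.update(...)` is called on a list. The sketch below keeps the two under separate names. `collect_tours`, `make_fetcher` and `fetch_page` are hypothetical helper names, and stopping at the first empty page is an assumption about how the site signals the end of the results.

```python
import json


def make_fetcher(session, headers, base_url):
    """Build a function that downloads one page of results.

    `session` is expected to behave like requests.Session, and `base_url`
    is the JSON URL up to and including "page=" (hypothetical convention).
    """
    def fetch_page(page):
        response = session.get(base_url + str(page), headers=headers)
        return json.loads(response.content.decode("utf-8"))
    return fetch_page


def collect_tours(fetch_page, first_page=0, page_count=10):
    """Gather the tours from `page_count` consecutive result pages."""
    tours = []
    for page in range(first_page, first_page + page_count):
        js_dict = fetch_page(page)
        # A separate name, so the request headers dict is never overwritten.
        page_tours = js_dict.get("searchResults", {}).get("tours", [])
        if not page_tours:
            break  # assumption: an empty page means no further results
        tours.extend(page_tours)
    return tours
```

With the session and headers from the code above, this could be called as `collect_tours(make_fetcher(session, headers, base_url))`, where `base_url` is the `jsonUrl` string without its trailing page number. Each collected tour can then be printed in the same way as before.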

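In general, you will find that many sites that return JSON paginate their responses, providing a fixed number of results per page. In those cases, every page except the last will usually contain a key whose value gives you the URL of the next page. It is then simple to loop over the pages and break out of the loop when you detect that the key is absent.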
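The general pattern described here can be sketched as follows. This is only an illustration under stated assumptions: the key name `"next"` and the helper names `iter_pages` and `fetch_json` are hypothetical, and the site in the question may not expose such a key at all.

```python
def iter_pages(fetch_json, first_url, next_key="next"):
    """Yield each page's decoded JSON, following `next_key` links.

    `fetch_json` is a hypothetical callable that takes a URL and returns
    the decoded JSON dict for that page.
    """
    url = first_url
    while True:
        page = fetch_json(url)
        yield page
        if next_key not in page:
            break  # last page: it carries no link to a following page
        url = page[next_key]
```

With requests, `fetch_json` could be as small as `lambda url: session.get(url, headers=headers).json()`, and the generator is consumed with an ordinary `for page in iter_pages(...)` loop.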

Source

2017-04-17 09:26:22 holdenweb

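Thank you very much. I will take your feedback into account. In this case I can only scrape one page (page=0), but I would like to scrape the first 10 pages. I have posted my approach in my original post. Hopefully you can guide me to the right solution. And thanks for your patience :) –

You're welcome. I think any further progress will depend on the specifics of the site, and so may fall outside the scope of Stack Overflow. – holdenweb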
