2017-02-15

Python - displaying results from all pages, not just the first (Beautiful Soup)

I have been building a simple scraper that uses Beautiful Soup to fetch food hygiene ratings based on a postcode entered by the user. The code works and correctly pulls the results from the URL.

What I need help with is how to get all of the results to display, not just the results from the first page.

My code is below:

import requests
from bs4 import BeautifulSoup

pc = input("Please enter postcode")

url = "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode="+pc+"&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"
r = requests.get(url)

soup = BeautifulSoup(r.content, "lxml")
g_data = soup.findAll("div", {"class": "search-result"})

for item in g_data:
    print(item.find_all("a", {"class": "name"})[0].text)
    try:
        print(item.find_all("span", {"class": "address"})[0].text)
    except IndexError:
        pass
    try:
        print(item.find_all("div", {"class": "rating-image"})[0].text)
    except IndexError:
        pass

By looking at the URL, I have found that which page is displayed depends on a URL variable called page:

https://www.scoresonthedoors.org.uk/search.php?award_sort=ALPHA&name=&address=BT147AL&x=0&y=0&page=2#results 

The HTML for the pagination "Next Page" button is:

<a style="float: right" href="?award_sort=ALPHA&amp;name=&amp;address=BT147AL&amp;x=0&amp;y=0&amp;page=3#results" rel="next " title="Go forward one page">Next <i class="fa fa-arrow-right fa-3"></i></a> 
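For reference, the page number embedded in that href can be pulled out with the standard library alone (a sketch using the exact href shown above, not yet part of my scraper):

```python
from urllib.parse import urlparse, parse_qs

# The href copied from the "Next" button above (it points at page 3).
href = "?award_sort=ALPHA&name=&address=BT147AL&x=0&y=0&page=3#results"

# urlparse splits off the fragment; parse_qs maps each key to a list of values.
query = urlparse(href).query
next_page = int(parse_qs(query)["page"][0])
print(next_page)  # → 3
```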

Is there a way to make my code find out how many pages of results there are, and then fetch the results from each of those pages?

Would the best solution be to have the code change the URL string to increment "page=" each time (e.g. with a for loop), or is there a way to find a solution using the information in the pagination link HTML?
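The first idea would look roughly like this (max_page is hard-coded as a placeholder here, since finding it is exactly what I do not yet know how to do):

```python
# Sketch of the for-loop idea: build one URL per page number.
# max_page is a placeholder; in a real scraper it would be read from the site.
base_url = ("https://www.scoresonthedoors.org.uk/search.php"
            "?name=&address=&postcode=BT147AL&distance=1"
            "&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt")
max_page = 3  # placeholder value
urls = ["{}&page={}".format(base_url, page) for page in range(1, max_page + 1)]
for u in urls:
    print(u)
```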

Many thanks to anyone who helps or takes a look at this question.

Answer


You are actually going about this the right way. Generating the paginated URLs up front and then scraping them is a good approach.

I have written almost the entire code for you. The part to look at first is the find_max_page() function, which extracts the maximum page number from the pagination string. With that number, you can generate all the URLs that need to be scraped and then scrape them one by one.

See the code below; it is almost all there.

import requests
from bs4 import BeautifulSoup


class RestaurantScraper(object):

    def __init__(self, pc):
        self.pc = pc  # the input postcode
        self.max_page = self.find_max_page()  # the number of pages available
        self.restaurants = list()  # the final list of restaurants filled in by the scrape

    def run(self):
        for url in self.generate_pages_to_scrape():
            restaurants_from_url = self.scrape_page(url)
            self.restaurants += restaurants_from_url  # add this page's restaurants to the global list

    def create_url(self):
        """
        Create the core url to scrape.
        :return: A url without pagination (= page 1)
        """
        return "https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + self.pc + \
               "&distance=1&search.x=8&search.y=6&gbt_id=0&award_score=&award_range=gt"

    def create_paginated_url(self, page_number):
        """
        Create a paginated url.
        :param page_number: pagination (integer)
        :return: A paginated url
        """
        return self.create_url() + "&page={}".format(page_number)

    def find_max_page(self):
        """
        Find the number of pages for a specific search.
        :return: The number of pages (integer)
        """
        r = requests.get(self.create_url())
        soup = BeautifulSoup(r.content, "lxml")
        pagination_soup = soup.findAll("div", {"id": "paginator"})
        pagination = pagination_soup[0]
        page_text = pagination("p")[0].text
        return int(page_text.replace('Page 1 of ', ''))

    def generate_pages_to_scrape(self):
        """
        Generate all the paginated urls using the max_page attribute scraped earlier.
        :return: List of urls
        """
        return [self.create_paginated_url(page_number) for page_number in range(1, self.max_page + 1)]

    def scrape_page(self, url):
        """
        This comes from your original code snippet. It probably needs a bit of work, but you get the idea.
        :param url: Url to scrape and get data from.
        :return: List of restaurant names found on the page
        """
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "lxml")
        g_data = soup.findAll("div", {"class": "search-result"})

        restaurants = list()
        for item in g_data:
            name = item.find_all("a", {"class": "name"})[0].text
            restaurants.append(name)
            try:
                print(item.find_all("span", {"class": "address"})[0].text)
            except IndexError:
                pass
            try:
                print(item.find_all("div", {"class": "rating-image"})[0].text)
            except IndexError:
                pass
        return restaurants


if __name__ == '__main__':
    pc = input('Give your post code')
    scraper = RestaurantScraper(pc)
    scraper.run()
    print("{} restaurants scraped".format(len(scraper.restaurants)))

The scrape_page function is your original code. It could use some work; just make sure that function is solid. Everything else is ready to go. Let me know if you have any questions about this code.


Thanks Philippe, this code works perfectly.