
Outputting the wrong img alt value (Python 3, Beautiful Soup 4)

I have been working on a scraper for restaurant food hygiene ratings. I can get it to scrape each restaurant's name, address and hygiene rating based on a postcode. Because the food hygiene rating is shown on the site as an image, I set the scraper up to read the "alt=" attribute, which contains the numeric value of the food hygiene rating.

The div containing the img alt tag that I target for the food hygiene rating looks like this:

<div class="rating-image" style="clear: right;">
    <a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
        <img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
    </a>
</div>
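
On its own, the alt attribute can be pulled out of that snippet directly. This is just a minimal standalone sketch using only the sample HTML above (not the full search page) to show the value I am trying to read:

from bs4 import BeautifulSoup

html = '''
<div class="rating-image" style="clear: right;">
    <a href="/business/abbey-community-college-newtownabbey-antrim-992915.html" title="View Details">
        <img src="https://images.scoresonthedoors.org.uk//schemes/735/on_small.png" alt="5 (Very Good)">
    </a>
</div>
'''

soup = BeautifulSoup(html, "lxml")
img = soup.select_one("div.rating-image img[alt]")
print(img["alt"])             # 5 (Very Good)
print(img["alt"].split()[0])  # 5 (just the numeric part)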

I have been able to get the food hygiene score output next to each restaurant.

My problem, though, is that I have noticed some restaurants have an incorrect reading shown next to them, e.g. a 3 instead of the 4 food hygiene rating that is stored in the img alt tag.

The link that the scraper above connects to and scrapes initially is

https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=BT367NG&distance=1&search.x=16&search.y=21&gbt_id=0

I think it may have something to do with where the rating is scraped inside the "for item in g_data" for loop.

If I move the line

appendhygiene(scrape=[name,address,bleh]) 

so that it sits outside the loop below

for rating in ratings:
    bleh = rating['alt']

the data is scraped with the correct hygiene ratings. The only problem I have found is that not all the records are scraped; in this case it only outputs the first 9 restaurants.

I would appreciate it if anyone could take a look at my code and help me work out how to fix this.

PS, I am using the postcode BT367NG to scrape the restaurants (if you test the script you can use it to find restaurants that do not show the correct hygiene value, e.g. Lins Garden is a 4 on the site, but the scraped data shows a 3).

My full code is below:

import requests
import time
import csv
import sys
from bs4 import BeautifulSoup

hygiene = []


def deletelist():
    hygiene.clear()


def savefile():
    filename = input("Please input name of file to be saved")
    with open(filename + '.csv', 'w') as file:
        writer = csv.writer(file)
        writer.writerow(['Address', 'Town', 'Price', 'Period'])
        for row in hygiene:
            writer.writerow(row)
    print("File Saved Successfully")


def appendhygiene(scrape):
    hygiene.append(scrape)


def makesoup(url):
    page = requests.get(url)
    print(url + " scraped successfully")
    return BeautifulSoup(page.text, "lxml")


def hygienescrape(g_data, ratings):
    for item in g_data:
        try:
            name = item.find_all("a", {"class": "name"})[0].text
        except:
            pass
        try:
            address = item.find_all("span", {"class": "address"})[0].text
        except:
            pass
        try:
            for rating in ratings:
                bleh = rating['alt']
        except:
            pass

        appendhygiene(scrape=[name, address, bleh])


def hygieneratings():
    search = input("Please enter postcode")
    soup = makesoup(url="https://www.scoresonthedoors.org.uk/search.php?name=&address=&postcode=" + search + "&distance=1&search.x=16&search.y=21&gbt_id=0")
    hygienescrape(g_data=soup.findAll("div", {"class": "search-result"}), ratings=soup.select('div.rating-image img[alt]'))

    button_next = soup.find("a", {"rel": "next"}, href=True)
    while button_next:
        time.sleep(2)  # delay time requests are sent so we don't get kicked by server
        soup = makesoup(url="https://www.scoresonthedoors.org.uk/search.php{0}".format(button_next["href"]))
        hygienescrape(g_data=soup.findAll("div", {"class": "search-result"}), ratings=soup.select('div.rating-image img[alt]'))

        button_next = soup.find("a", {"rel": "next"}, href=True)


def menu():
    strs = ('Enter 1 to search Food Hygiene ratings \n'
            'Enter 2 to Exit\n')
    choice = input(strs)
    return int(choice)


while True:  # use while True
    choice = menu()
    if choice == 1:
        hygieneratings()
        savefile()
        deletelist()
    elif choice == 2:
        break
    elif choice == 3:
        break

Answer


It looks like your problem is here:

try:
    for rating in ratings:
        bleh = rating['alt']

except:
    pass

appendhygiene(scrape=[name, address, bleh])

What this ends up doing is appending the last value on each page. That is why, if the last value on the page is "Exempt", all the values become Exempt; if the last rating is a 3, all the values on that page become 3, and so on.
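
To see why, here is a stripped-down illustration of the same pattern, with made-up rating strings just to show the effect:

ratings = ["5 (Very Good)", "3 (Generally Satisfactory)", "Exempt"]

for rating in ratings:
    bleh = rating  # bleh is overwritten on every pass through the loop

print(bleh)  # prints "Exempt": only the last value survives the loop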

What you want to write instead is something like this:

try: 
    bleh = item.find_all('img', {'alt': True})[0]['alt'] 
    appendhygiene(scrape=[name,address,bleh]) 

except: 
    pass 

so that each rating is appended individually instead of only the last one being appended. I just tested it and it seems to work :)
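
For completeness, the loop in hygienescrape would then look roughly like this. It is only a sketch based on the code you posted (one try block per result, for brevity); the ratings parameter is kept so your existing calls still work, but it is no longer used, because each rating is now read from its own search-result div:

def hygienescrape(g_data, ratings):
    # ratings is no longer used; every field comes from the same search-result div
    for item in g_data:
        try:
            name = item.find_all("a", {"class": "name"})[0].text
            address = item.find_all("span", {"class": "address"})[0].text
            bleh = item.find_all("img", {"alt": True})[0]["alt"]
            appendhygiene(scrape=[name, address, bleh])
        except IndexError:
            # skip a result that is missing one of the fields
            pass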


This works perfectly, thanks for the explanation as well. :)
