2013-08-06 40 views

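Copying URLs whose pages contain specific terms to a file

I'm trying to collect every URL in a range whose page contains the phrase "Recipe adapted from" or "Recipe from". The script copies matching links to the file up to about ID 7496, then fails with HTTPError 404. What am I doing wrong? I also tried BeautifulSoup and requests, but I still can't get it to work.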

import urllib2
with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        page_content = urllib2.urlopen(url).read()
        if "Recipe adapted from" in page_content:
            print url
            f.write(url + '\n')
        elif "Recipe from" in page_content:
            print url
            f.write(url + '\n')
        else:
            pass

Answer

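Some of the URLs you are trying to scrape don't exist, so urllib2.urlopen raises HTTPError for them and the loop stops at the first one. You could catch the exception and skip those pages: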

import urllib2
with open('recipes.txt', 'w+') as f:
    for i in range(14477):
        url = "http://www.tastingtable.com/entry_detail/{}".format(i)
        try:
            page_content = urllib2.urlopen(url).read()
        except urllib2.HTTPError as error:
            if 400 <= error.code < 500:
                continue  # client errors: bad request, not found, unauthorized, etc.
            raise  # other errors we want to know about
        # write the URL once if either phrase appears in the page
        if "Recipe adapted from" in page_content or "Recipe from" in page_content:
            print url
            f.write(url + '\n')
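
The code above is Python 2 (urllib2, the print statement). For anyone on Python 3, here is a rough sketch of the same skip-on-4xx idea using urllib.request and urllib.error. The names fetch_page, matching_urls, URL_TEMPLATE and TERMS are made up for this example, and decoding the body as UTF-8 is an assumption about the site. The fetch function is passed in as a parameter only so the loop can be tried without hitting the network.

```python
import urllib.error
import urllib.request

# Hypothetical names for illustration; the URL pattern and phrases come from the question.
URL_TEMPLATE = "http://www.tastingtable.com/entry_detail/{}"
TERMS = ("Recipe adapted from", "Recipe from")


def fetch_page(url):
    """Return the page body as text. urlopen raises HTTPError on 4xx/5xx responses."""
    with urllib.request.urlopen(url) as response:
        # Assumption: the pages are UTF-8; undecodable bytes are replaced, not fatal.
        return response.read().decode("utf-8", errors="replace")


def matching_urls(ids, fetch=fetch_page, terms=TERMS):
    """Yield each URL whose page contains any of the terms, skipping 4xx pages."""
    for i in ids:
        url = URL_TEMPLATE.format(i)
        try:
            page_content = fetch(url)
        except urllib.error.HTTPError as error:
            if 400 <= error.code < 500:
                continue  # missing or forbidden page: skip it
            raise  # server-side errors are still worth seeing
        if any(term in page_content for term in terms):
            yield url
```

To reproduce the original script, iterate over matching_urls(range(14477)) and write each yielded URL to recipes.txt followed by a newline. The substring check is case-sensitive, just like the original.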