Python - Passing a URL vs. an HTTPResponse object

I have a list of URLs that I want to scrape attributes from. I'm new to Python, so please bear with me. Windows 7, 64-bit. Python 3.2.

The following code works. pblist is a list of dictionaries, each containing the key 'short_url'.

for j in pblist[0:10]:
    base_url = j['short_url']
    if hasattr(BeautifulSoup(urllib.request.urlopen(base_url)), 'head') and \
            hasattr(BeautifulSoup(urllib.request.urlopen(base_url)).head, 'title'):
        print("Has head, title attributes.")
        try:
            j['title'] = BeautifulSoup(urllib.request.urlopen(base_url)).head.title.string.encode('utf-8')
        except AttributeError:
            print("Encountered attribute error on page, ", base_url)
            j['title'] = "Attribute error."
            pass

The following code does not: it claims that the BeautifulSoup object has no head and title attributes.

for j in pblist[0:10]:
    base_url = j['short_url']
    page = urllib.request.urlopen(base_url)
    if hasattr(BeautifulSoup(page), 'head') and \
            hasattr(BeautifulSoup(page).head, 'title'):
        print("Has head, title attributes.")
        try:
            j['title'] = BeautifulSoup(urllib.request.urlopen(base_url)).head.title.string.encode('utf-8')
        except AttributeError:
            print("Encountered attribute error on page, ", base_url)
            j['title'] = "Attribute error."
            pass

Why? What is the difference between passing the URL to urllib.request.urlopen inside the BeautifulSoup call and passing the HTTPResponse object that urllib.request.urlopen returned?


2012-03-27 Zack

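The response returned by urlopen() is a file-like object, which means that by default it behaves like an iterator: once it has been read through, there is no more data to get from it (unless you explicitly reset it).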

So in the second version, the first call to BeautifulSoup(page) reads all the data from page, and the subsequent calls have nothing left to read.
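To see the read-once behaviour in isolation, here is a minimal sketch. It assumes io.BytesIO as an offline stand-in for the HTTPResponse (the HTML bytes are made up for illustration); both are file-like objects whose read() consumes the stream.

```python
import io

# io.BytesIO stands in for the HTTPResponse so this runs without a network.
# The HTML content is a made-up example.
page = io.BytesIO(b"<html><head><title>Example</title></head></html>")

first = page.read()   # consumes the whole stream
second = page.read()  # the stream is exhausted, so nothing comes back

print(first)   # the full document
print(second)  # b''
```

A parser handed the same exhausted object a second time sees only the empty result, which would explain why it reports no head or title.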

What you can do instead is read the data once and reuse it, something like this:

page = urllib.request.urlopen(base_url) 
page_content = page.read() 
# ... 
BeautifulSoup(page_content) 
# ... 
BeautifulSoup(page_content)

But even that is somewhat inefficient. Instead, why not create a single BeautifulSoup object and pass that around?

page = urllib.request.urlopen(base_url) 
soup = BeautifulSoup(page) 
# ... 
# do something with soup 
# ... 
# do something with soup

Your code, modified to use a single soup object:

for j in pblist[0:10]:
    base_url = j['short_url']
    page = urllib.request.urlopen(base_url)
    soup = BeautifulSoup(page)
    if hasattr(soup, 'head') and \
            hasattr(soup.head, 'title'):
        print("Has head, title attributes.")
        try:
            j['title'] = soup.head.title.string.encode('utf-8')
        except AttributeError:
            print("Encountered attribute error on page, ", base_url)
            j['title'] = "Attribute error."
            pass
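As a side note, and only as far as I know: BeautifulSoup returns None when you access a tag that is not in the document, so hasattr(soup, 'head') can be true even for a page with no head element, and the try/except is what actually catches the missing pieces. A minimal sketch of that lookup as a helper could look like the following. safe_title and the SimpleNamespace stand-ins are hypothetical names for illustration only; real code would pass in the soup object, and this version returns the string without encoding it.

```python
from types import SimpleNamespace


def safe_title(soup, default="Attribute error."):
    """Return soup.head.title.string, or default if any step is missing or None."""
    node = soup
    for name in ("head", "title", "string"):
        node = getattr(node, name, None)
        if node is None:
            return default
    return node


# Hypothetical stand-ins for parsed pages, so the sketch runs without any
# third-party package installed.
with_title = SimpleNamespace(
    head=SimpleNamespace(title=SimpleNamespace(string="Example"))
)
without_head = SimpleNamespace(head=None)

print(safe_title(with_title))
print(safe_title(without_head))
```

Inside the loop this would reduce the body to something like j['title'] = safe_title(soup), at the cost of no longer printing which page failed.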


2012-03-27 22:14:52 Amber

Got it. Thanks, Amber. – Zack 2012-03-27 22:16:17
