2012-03-27 37 views
1

Python - Passing a URL vs. an HTTPResponse object

I have a list of URLs that I want to scrape attributes from. I'm new to Python, so please bear with me. Windows 7, 64-bit. Python 3.2.

The following code works. pblist is a list of dictionaries, each of which contains the key 'short_url'.

for j in pblist[0:10]:
    base_url = j['short_url']
    if hasattr(BeautifulSoup(urllib.request.urlopen(base_url)), 'head') and \
            hasattr(BeautifulSoup(urllib.request.urlopen(base_url)).head, 'title'):
        print("Has head, title attributes.")
        try:
            j['title'] = BeautifulSoup(urllib.request.urlopen(base_url)).head.title.string.encode('utf-8')
        except AttributeError:
            print("Encountered attribute error on page, ", base_url)
            j['title'] = "Attribute error."

The following code does not work: it claims that the BeautifulSoup object has no head and title attributes.

for j in pblist[0:10]:
    base_url = j['short_url']
    page = urllib.request.urlopen(base_url)
    if hasattr(BeautifulSoup(page), 'head') and \
            hasattr(BeautifulSoup(page).head, 'title'):
        print("Has head, title attributes.")
        try:
            j['title'] = BeautifulSoup(urllib.request.urlopen(base_url)).head.title.string.encode('utf-8')
        except AttributeError:
            print("Encountered attribute error on page, ", base_url)
            j['title'] = "Attribute error."

Why? What is the difference between passing BeautifulSoup a fresh urllib.request.urlopen(url) call each time and passing it the HTTPResponse object that urllib.request.urlopen already returned?

Answer

0

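The response returned by urlopen() is a file-like object, which means that by default it behaves somewhat like an iterator: once you have read through it, you will not get any more data out of it (unless you explicitly reset it).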

So in your second version, the first call to BeautifulSoup(page) reads all of the data out of page, and the subsequent calls have no data left to read.
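A minimal sketch of that read-once behavior, with no network access. It assumes io.BytesIO as a stand-in for the HTTPResponse that urlopen() returns; the HTML bytes are made up for illustration.

```python
import io

# Assumption: io.BytesIO stands in for the HTTPResponse returned by urlopen().
# Both are file-like objects that keep track of a read position.
page = io.BytesIO(b"<html><head><title>Example</title></head></html>")

first = page.read()   # consumes the whole stream
second = page.read()  # the read position is already at the end

print(first)
print(second)  # b''
```

Anything that parses page a second time sees the same empty result as second does here.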

What you can do instead is something like this:

page = urllib.request.urlopen(base_url) 
page_content = page.read() 
# ... 
BeautifulSoup(page_content) 
# ... 
BeautifulSoup(page_content) 

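But even that is somewhat inefficient, because the same content gets parsed more than once. Instead, why not create a single BeautifulSoup object and pass that around?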

page = urllib.request.urlopen(base_url) 
soup = BeautifulSoup(page) 
# ... 
# do something with soup 
# ... 
# do something with soup 
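The parse-once idea can be sketched without the network or BeautifulSoup. This is only an illustration under stated assumptions: io.BytesIO stands in for the urlopen() response, and TitleParser is a hypothetical helper built on the standard library's html.parser, standing in for the soup object.

```python
import io
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical stand-in for a soup object: collects the <title> text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Assumption: io.BytesIO stands in for urllib.request.urlopen(base_url).
page = io.BytesIO(b"<html><head><title>Example</title></head></html>")

parsed = TitleParser()
parsed.feed(page.read().decode("utf-8"))  # the only read of the stream
parsed.close()

# Every later check reuses the already-parsed object instead of the stream.
has_title = bool(parsed.title)
title_bytes = parsed.title.encode("utf-8")
print(has_title, title_bytes)
```

The stream is consumed exactly once, and all later checks go through parsed, which is the same role soup plays.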

Your code, modified to use a single soup object:

for j in pblist[0:10]:
    base_url = j['short_url']
    page = urllib.request.urlopen(base_url)
    soup = BeautifulSoup(page)
    if hasattr(soup, 'head') and \
            hasattr(soup.head, 'title'):
        print("Has head, title attributes.")
        try:
            j['title'] = soup.head.title.string.encode('utf-8')
        except AttributeError:
            print("Encountered attribute error on page, ", base_url)
            j['title'] = "Attribute error."
+0

Got it. Thanks, Amber. – Zack 2012-03-27 22:16:17