2017-09-14 71 views

BeautifulSoup does not scrape all of the data

I want to scrape a website, but when I run this code it prints only half of the data (including the review data). Here is my script:

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

inputfile = "Chicago.csv" 
f = open(inputfile, "w") 
Headers = "Name, Link\n" 
f.write(Headers) 

url = "https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228" 
html = urlopen(url) 
soup = BeautifulSoup(html, "html.parser") 

page_details = soup.find("dl", {"class":"boccat"}) 
Readers = page_details.find_all("a") 

for i in Readers: 
    poll = i.contents[0] 
    link = i['href'] 
    print(poll) 
    print(link) 
    f.write("{},https://www.chicagoreader.com{}\n".format(poll, link)) 
f.close() 
  1. Is the style of my script wrong?
  2. How can I shorten the code?
  3. When should I use find_all versus find so that I don't get an attribute error? I read the documentation, but I don't understand it.

Answer


To shorten the code, you can switch to the requests library. It is easy to use and precise. If you want to make it even shorter, you can use CSS selectors.

find selects the container, and find_all, inside a for loop, selects the individual items of that container. Here is the full code:

from bs4 import BeautifulSoup 
import csv 
import requests 

outfile = open("chicagoreader.csv", "w", newline='') 
writer = csv.writer(outfile) 
writer.writerow(["Name", "Link"]) 

base = "https://www.chicagoreader.com" 

response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228") 
soup = BeautifulSoup(response.text, "lxml") 
for item in soup.select(".boccat dd a"): 
    writer.writerow([item.text, base + item.get('href')]) 
    print(item.text, base + item.get('href')) 
outfile.close()  # close the file so the CSV is flushed to disk 

Or, using find and find_all:

from bs4 import BeautifulSoup 
import requests 

base = "https://www.chicagoreader.com" 

response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228") 
soup = BeautifulSoup(response.text, "lxml") 
for items in soup.find("dl",{"class":"boccat"}).find_all("dd"): 
    item = items.find_all("a")[0] 
    print(item.text, base + item.get("href")) 
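To answer question 3 directly: find returns the first matching tag, or None when nothing matches, while find_all always returns a list (empty when nothing matches). The attribute error appears when you call a method on the None that a failed find returned. A minimal sketch, using a made-up HTML snippet just for illustration:

```python
from bs4 import BeautifulSoup

html = "<dl class='boccat'><dd><a href='/a'>First</a></dd></dl>"
soup = BeautifulSoup(html, "html.parser")

# find: first matching tag, or None when there is no match
container = soup.find("dl", {"class": "boccat"})        # a Tag
missing = soup.find("dl", {"class": "no-such-class"})   # None

# find_all: always a list, possibly empty
links = container.find_all("a")
print(links[0].text)  # First

# This is the line that would raise the attribute error, because
# `missing` is None, not a Tag:
# missing.find_all("a")  -> AttributeError: 'NoneType' object has no attribute 'find_all'
```

So use find to grab a single container you know exists (and check it for None if you are not sure), and find_all when you want to loop over every match.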

Hi Shahin, could you please give a short example of find_all and find? –


@Mr.Bones, I have given an example of find and find_all. See above. – SIM