I'm writing a web scraper in Python using requests and BeautifulSoup, and I want to add multithreading. I could have used Scrapy, but I decided to write it from scratch so I could practice.
I've built a tool that works with requests and BeautifulSoup. It crawls roughly 135 index pages with 12 entries each, grabs the links, then fetches the details from each link's destination, and finally writes everything to a CSV file. It only scrapes strings; it doesn't download images or anything like that... for now.
The problem? It's slow. It takes about 5 seconds to fetch and scrape everything linked from a single index page, so 135 pages works out to roughly 11 minutes.
So my question is: how do I implement threading in my code so the data gets fetched faster? (A rough sketch of what I was considering follows the code.)
Here's the code:
import requests
from bs4 import BeautifulSoup
import re
import csv
def get_actor_dict_from_html(url, html):
    soup = BeautifulSoup(html, "html.parser")
    # There must be a better way to handle this, but let's default all upcoming fields to 'NULL'.
    profileName = profileImage = profileHeight = profileWeight = 'NULL'
    # Let's get the name and image..
    profileName = soup.find('h1').get_text().strip()
    profileImage = "http://images.host.com/actors/" + re.findall(r'\d+', url)[0] + "/actor-large.jpg"
    # Now the rest of the stuff..
    try:
        profileHeight = soup.find('a', {"title": "Height"}).get_text()
    except AttributeError:  # element may be missing on some pages
        pass
    try:
        profileWeight = soup.find('a', {"title": "Weight"}).get_text()
    except AttributeError:
        pass
    return {
        'Name': profileName,
        'ImageUrl': profileImage,
        'Height': profileHeight,
        'Weight': profileWeight,
    }
def lotta_downloads():
    with open("/tmp/export.csv", 'w', newline='') as output:
        wr = csv.DictWriter(output, ['Name', 'ImageUrl', 'Height', 'Weight'], delimiter=',')
        wr.writeheader()
        for i in range(135):
            url = "http://www.host.com/actors/all-actors/name/{}/".format(i)
            response = requests.get(url)
            html = response.content
            soup = BeautifulSoup(html, "html.parser")
            cards = soup.find_all("div", {"class": "card-image"})
            for card in cards:
                for link in card.find_all('a'):
                    url = "http://www.host.com" + link['href']
                    print(url)
                    response = requests.get(url)
                    html = response.content
                    actor_dict = get_actor_dict_from_html(url, html)
                    wr.writerow(actor_dict)
    print('All Done!')

if __name__ == "__main__":
    lotta_downloads()
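For what it's worth, here is the direction I was considering: collect all the actor links first, then fan the per-actor downloads out over a thread pool with concurrent.futures.ThreadPoolExecutor. This is an untested sketch building on the functions above, and the max_workers value is just a guess:

from concurrent.futures import ThreadPoolExecutor

def download_actor(url):
    # Fetch one actor page and parse it into a dict (reuses get_actor_dict_from_html above).
    response = requests.get(url)
    return get_actor_dict_from_html(url, response.content)

def lotta_downloads_threaded():
    # Phase 1: walk the index pages sequentially and collect every actor URL.
    actor_urls = []
    for i in range(135):
        index_url = "http://www.host.com/actors/all-actors/name/{}/".format(i)
        soup = BeautifulSoup(requests.get(index_url).content, "html.parser")
        for card in soup.find_all("div", {"class": "card-image"}):
            for link in card.find_all('a'):
                actor_urls.append("http://www.host.com" + link['href'])
    # Phase 2: download the actor pages in parallel; pool.map() yields results in order,
    # and all CSV writes happen in the main thread, so the writer isn't shared across threads.
    with open("/tmp/export.csv", 'w', newline='') as output:
        wr = csv.DictWriter(output, ['Name', 'ImageUrl', 'Height', 'Weight'])
        wr.writeheader()
        with ThreadPoolExecutor(max_workers=16) as pool:
            for actor_dict in pool.map(download_actor, actor_urls):
                wr.writerow(actor_dict)

Is something along these lines reasonable, or is there a better pattern for this?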
Thanks!
Generally, you're better off not reinventing the wheel and using a web framework like Scrapy. – alecxe
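(For context, a minimal spider along the lines the comment suggests might look roughly like the untested sketch below, using the same hypothetical host.com markup as above; Scrapy schedules its requests concurrently out of the box.)

import scrapy

class ActorSpider(scrapy.Spider):
    name = "actors"
    start_urls = [
        "http://www.host.com/actors/all-actors/name/{}/".format(i)
        for i in range(135)
    ]

    def parse(self, response):
        # Follow every actor link inside the index page's card-image divs.
        for href in response.css("div.card-image a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_actor)

    def parse_actor(self, response):
        yield {
            'Name': response.css('h1::text').get(default='NULL').strip(),
            'Height': response.css('a[title="Height"]::text').get(default='NULL'),
            'Weight': response.css('a[title="Weight"]::text').get(default='NULL'),
        }

Run with something like: scrapy runspider actor_spider.py -o export.csv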