0

I'm writing a web scraper in Python, using multithreading with requests and BeautifulSoup. I could have used Scrapy, but decided to write it from scratch so I can practice.

I've built a tool that works with requests and BeautifulSoup. It walks through about 135 index pages with 12 entries each, grabs the links, and then fetches information from each link's destination page. At the end, it writes everything to a CSV file. It only scrapes strings; it doesn't download any images or anything like that... for now.

The problem? It's slow. Grabbing everything from a single page's contents takes about 5 seconds, so 135 pages take roughly 11 minutes.

So my question is: how do I implement threading in my code so the data is fetched faster?

Here's the code:

import requests 
from bs4 import BeautifulSoup 
import re 
import csv 


def get_actor_dict_from_html(url, html):
    soup = BeautifulSoup(html, "html.parser")

    # There must be a better way to handle this, but let's assign a NULL value to all upcoming variables.
    profileName = profileImage = profileHeight = profileWeight = 'NULL'

    # Let's get the name and image..
    profileName = str.strip(soup.find('h1').get_text())
    profileImage = "http://images.host.com/actors/" + re.findall(r'\d+', url)[0] + "/actor-large.jpg"

    # Now the rest of the stuff..
    try:
        profileHeight = soup.find('a', {"title": "Height"}).get_text()
    except:
        pass
    try:
        profileWeight = soup.find('a', {"title": "Weight"}).get_text()
    except:
        pass

    return {
        'Name': profileName,
        'ImageUrl': profileImage,
        'Height': profileHeight,
        'Weight': profileWeight,
    }


def lotta_downloads():
    output = open("/tmp/export.csv", 'w', newline='')
    wr = csv.DictWriter(output, ['Name', 'ImageUrl', 'Height', 'Weight'], delimiter=',')
    wr.writeheader()

    for i in range(135):
        url = "http://www.host.com/actors/all-actors/name/{}/".format(i)
        response = requests.get(url)
        html = response.content
        soup = BeautifulSoup(html, "html.parser")
        links = soup.find_all("div", {"class": "card-image"})

        for a in links:
            for url in a.find_all('a'):
                url = "http://www.host.com" + url['href']
                print(url)
                response = requests.get(url)
                html = response.content
                actor_dict = get_actor_dict_from_html(url, html)
                wr.writerow(actor_dict)
    print('All Done!')


if __name__ == "__main__":
    lotta_downloads()

Thanks!

+0

Generally, you are better off not reinventing the wheel and using a web-scraping framework like 'Scrapy'. – alecxe

Answer

0

Why not try using the gevent library?

The gevent library has a monkey patch that turns blocking functions into non-blocking ones.

Maybe there is too much wait time in your requests, and that is what makes it so slow.

So I think making the requests non-blocking will make your program faster.
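As a minimal sketch of what the monkey patch buys you (reusing the question's placeholder URLs), patched socket I/O lets greenlets overlap their downloads:

import gevent
from gevent import monkey; monkey.patch_all()  # patch blocking stdlib calls before requests is imported
import requests

# Fetch three of the question's index pages concurrently: each greenlet
# yields while it waits on network I/O, so the downloads overlap.
urls = ["http://www.host.com/actors/all-actors/name/{}/".format(i) for i in range(3)]
jobs = [gevent.spawn(requests.get, url) for url in urls]
gevent.joinall(jobs)
print([job.value.status_code for job in jobs])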

Example (Python 2.7.10):

import gevent
from gevent import monkey; monkey.patch_all()  # patch blocking stdlib calls before anything else is imported
import requests
import csv
from bs4 import BeautifulSoup

# get_actor_dict_from_html is the function from the question's code
actor_dict_list = []

def worker(url):
    content = requests.get(url).content
    soup = BeautifulSoup(content, "html.parser")
    links = soup.find_all('div', {'class': 'card-image'})

    for a in links:
        for link in a.find_all('a'):
            actor_url = "http://www.host.com" + link['href']
            html = requests.get(actor_url).content  # You can also use gevent's spawn function on this line
            # Collect results instead of writing the CSV here, preventing a race condition
            actor_dict_list.append(get_actor_dict_from_html(actor_url, html))

output = open("/tmp/export.csv", "w", newline='')
wr = csv.DictWriter(output, ['Name', 'ImageUrl', 'Height', 'Weight'], delimiter=',')
wr.writeheader()

urls = ["http://www.host.com/actors/all-actors/name/{}/".format(i) for i in range(135)]
jobs = [gevent.spawn(worker, url) for url in urls]
gevent.joinall(jobs)
for actor_dict in actor_dict_list:
    wr.writerow(actor_dict)
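If you would rather use real threads, as the question asks, the same fan-out can be sketched with the standard library's concurrent.futures thread pool. This is only an outline under the question's assumptions: it reuses get_actor_dict_from_html and the placeholder host URLs from the question.

from concurrent.futures import ThreadPoolExecutor
import csv
import requests
from bs4 import BeautifulSoup

def scrape_index_page(i):
    # Fetch one index page and return the actor dicts found on it,
    # reusing get_actor_dict_from_html from the question.
    url = "http://www.host.com/actors/all-actors/name/{}/".format(i)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    actors = []
    for card in soup.find_all("div", {"class": "card-image"}):
        for link in card.find_all("a"):
            actor_url = "http://www.host.com" + link["href"]
            html = requests.get(actor_url).content
            actors.append(get_actor_dict_from_html(actor_url, html))
    return actors

with open("/tmp/export.csv", "w", newline='') as output:
    wr = csv.DictWriter(output, ['Name', 'ImageUrl', 'Height', 'Weight'], delimiter=',')
    wr.writeheader()
    # Each worker thread handles a whole index page; only this main thread
    # touches the CSV writer, so the row writes never race.
    with ThreadPoolExecutor(max_workers=20) as pool:
        for actors in pool.map(scrape_index_page, range(135)):
            wr.writerows(actors)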

Official gevent documentation: doc

P.S.:

You have to install python-gevent if you are running Ubuntu:

sudo apt-get install python-gevent
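gevent is also published on PyPI, so on other platforms (or inside a virtualenv) installing with pip should work as well:

pip install gevent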

+1

I'm having trouble figuring out where to put this in my code. Does it have to be a function within a function (lotta_downloads)? – kokozz

+0

Oh, sorry. My mistake. Look at the code again. I've fixed it. – yumere