How can I scrape multiple HTML pages in parallel with BeautifulSoup in Python?

I'm building a web scraping application in Python with the Django web framework. I need to scrape multiple queries using the BeautifulSoup library. Here is a snapshot of the code I have written:

for url in websites: 
    r = requests.get(url) 
    soup = BeautifulSoup(r.content) 
    links = soup.find_all("a", {"class":"dev-link"})

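Here the pages are scraped sequentially, one after another, and I want to run this in parallel. I don't know much about threading in Python. Can someone tell me how I can do the scraping in parallel? Any help would be appreciated.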

2017-05-29 Amit

How many pages do you want to scrape at the same time? – Exprator

You can use Hadoop (http://hadoop.apache.org/) to run your jobs in parallel. It is a very good tool for running parallel tasks.

2017-05-29 15:02:55

Try this solution.

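import threading
import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup.find_all("a", {"class": "dev-link"})

results = {}  # url -> links found on that page

def worker(url):
    # A Thread discards its target's return value, so store it instead.
    results[url] = fetch_links(url)

threads = [threading.Thread(target=worker, args=(url,))
           for url in websites]

for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every page has been fetched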

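Downloading page content with requests.get() is a blocking I/O operation, so Python threads can actually improve performance here: while one thread waits on the network, the others keep running.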
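The same idea can be written with a thread pool from the standard library, which caps the number of simultaneous downloads and hands back each function's return value. This is only a minimal sketch: `scrape_all` and its `max_workers` default are illustrative names and values, not part of the answer above, and it assumes you pass in a function like `fetch_links(url)` that returns the links it found.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_all(urls, fetch_links, max_workers=8):
    """Run fetch_links(url) for every URL in a thread pool.

    Returns (results, errors): two dicts keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(fetch_links, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed page should not stop the others.
                errors[url] = exc
    return results, errors

# Usage (assumes fetch_links and websites are defined as in the question):
# results, errors = scrape_all(websites, fetch_links)
```

Collecting errors per URL rather than letting the first exception propagate is a design choice for scraping, where a few pages timing out is common.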

2017-05-29 15:08:54

If you want to use multithreading:

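import threading
import requests
from bs4 import BeautifulSoup

class Scrapper(threading.Thread):
    def __init__(self, threadId, name, url):
        threading.Thread.__init__(self)
        self.name = name
        self.id = threadId
        self.url = url
        self.links = []  # filled in by run()

    def run(self):
        # Executed in the new thread once start() is called.
        r = requests.get(self.url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # run()'s return value is discarded, so keep the result on the instance.
        self.links = soup.find_all("a")

# List the websites to scrape here.
websites = []
threads = []
for i, url in enumerate(websites, start=1):
    thread = Scrapper(i, "thread" + str(i), url)
    # start() runs run() in a new thread; calling run() directly would block.
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()  # wait for this page to finish downloading
    # print(thread.links)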

This may be helpful when it comes to Python and scraping.

2017-05-29 15:09:40

Scrapy is probably the way to go.

Scrapy uses the Twisted (Twisted Matrix) library for parallelism, so you don't have to worry about threading or the Python GIL.

If you must use BeautifulSoup, check out this library.

2017-05-29 15:29:53
