Speeding up HTTP requests in Python and 500 errors

I have some code that uses a query and a time frame (which can be up to a year long) to retrieve news results from this newspaper.

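The results are paginated with at most 10 articles per page. Since I couldn't find a way to increase that limit, I issue one request per page and then extract the title, URL, and date of each article. Each cycle (HTTP request plus parsing) takes 30 seconds to a minute, which is extremely slow, and eventually it stops with response code 500. Is there a way to speed this up, or perhaps to make several requests at once? I simply want to retrieve the article details from all pages. Here is the code: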

import csv
import re

import requests
from bs4 import BeautifulSoup

URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'


def run(**params):
    # Append to the CSV; newline="" lets the csv module control line endings.
    with open("EgyptDaybyDay.csv", "a", newline="", encoding="utf-8") as countryFile:
        # Create the writer once instead of once per article.
        w = csv.writer(countryFile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        i = 1
        while True:
            params["index"] = str(i)
            response = requests.get(URL.format(**params))
            print(response.status_code)
            htmlFile = BeautifulSoup(response.content, "html.parser")
            articles = htmlFile.find_all("div", {"class": "newslist"})

            # A page without articles ends the loop. This includes error pages
            # such as a 500 response, so check the printed status code.
            if not articles:
                break

            for article in articles:
                url = article.a['href']
                title = article.img['alt']
                dateline = article.find("div", {"class": "floatright"})
                # Dates appear as DD-MM-YYYY or MM-DD-YYYY; capture the raw string.
                m = re.search(r"([0-9]{2}-[0-9]{2}-[0-9]{4})", dateline.get_text())
                date = m.group(1)
                w.writerow((date, title, url))
            i += 1


run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")

2013-03-22 Jiyda Moussa

This is a good opportunity to try gevent.

Put the requests.get part in a separate routine so that your application doesn't have to block while waiting on I/O.
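A minimal sketch of what such a separate routine might look like. The name `fetch_page` and its parameters are hypothetical, and retrying on a 5xx status is an assumption added here because the question mentions 500 errors; it is not part of the original suggestion. The HTTP call is passed in as `get` (for example `requests.get`), so the routine can be exercised without a network connection.

```python
import time


def fetch_page(get, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Return the body for `url`, retrying when the server answers 5xx.

    `get` is any callable that takes a URL and returns an object with
    `status_code` and `content` attributes (requests.get fits this shape).
    `retries`, `backoff` and `sleep` are illustrative knobs, not a real API.
    """
    for attempt in range(retries):
        response = get(url)
        if response.status_code < 500:
            # Anything that is not a server error is handed back to the caller.
            return response.content
        if attempt < retries - 1:
            # Server error: wait 1x, 2x, 4x ... the backoff before retrying.
            sleep(backoff * 2 ** attempt)
    raise RuntimeError("giving up on %s after %d attempts" % (url, retries))
```

With the real library this would be called as `fetch_page(requests.get, some_url)`. Keeping the fetch in one function also makes it easy to hand to a worker later.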

Then you can spawn multiple workers and use queues to pass the requests and the articles around. Perhaps something like this:

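import gevent.monkey
gevent.monkey.patch_all()  # patch the standard library before other imports

import gevent
from gevent import sleep
from gevent.queue import Queue

MAX_REQUESTS = 10

# Note: this queue shadows the name of the `requests` library;
# rename one of the two before plugging in real HTTP calls.
requests = Queue(MAX_REQUESTS)  # page numbers waiting to be fetched
articles = Queue()              # results handed back to the main greenlet

# Fake responses standing in for real HTTP calls; pop() yields 0, 1, 2, ...
mock_responses = list(range(100))
mock_responses.reverse()


def request():
    print("worker started")
    while True:
        print("request %s" % requests.get())
        sleep(1)  # simulate network latency

        try:
            articles.put('article response %s' % mock_responses.pop())
        except IndexError:
            # No responses left: tell the consumer to stop iterating.
            articles.put(StopIteration)
            break


def run():
    print("run")

    i = 1
    while True:
        requests.put(i)  # blocks while the bounded queue is full
        i += 1


if __name__ == '__main__':
    for worker in range(MAX_REQUESTS):
        gevent.spawn(request)

    gevent.spawn(run)
    # Iterating a gevent Queue ends when StopIteration is taken from it.
    for article in articles:
        print("Got article: %s" % article)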
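In the example above, the first worker that runs out of responses puts StopIteration on the queue, so the consumer can stop while other workers are still in the middle of a request. One way around this is to have every worker announce its own exit and let the consumer count those announcements. The sketch below shows that idea with the standard library's `threading` and `queue` modules swapped in for gevent, so that it runs without extra packages. `fake_fetch`, `crawl`, and `DONE` are hypothetical names, and it assumes that an empty page means there are no later pages.

```python
import itertools
import queue
import threading

DONE = object()  # sentinel: each worker puts exactly one when it exits


def fake_fetch(page):
    """Hypothetical stand-in for the HTTP request and parsing step."""
    # Pretend pages 1..5 hold two articles each and later pages are empty.
    if page > 5:
        return []
    return ["article %d.%d" % (page, n) for n in (1, 2)]


def worker(next_page, stop, articles, fetch):
    try:
        while not stop.is_set():
            found = fetch(next_page())
            if not found:
                stop.set()  # empty page: tell the other workers to wind down
                break
            for article in found:
                articles.put(article)
    finally:
        articles.put(DONE)  # always announce the exit, even after an error


def crawl(fetch, num_workers=4):
    counter = itertools.count(1)
    lock = threading.Lock()
    stop = threading.Event()
    articles = queue.Queue()

    def next_page():
        with lock:  # hand out each page number exactly once
            return next(counter)

    threads = [
        threading.Thread(target=worker, args=(next_page, stop, articles, fetch))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()

    results, finished = [], 0
    while finished < num_workers:  # stop only after every worker has exited
        item = articles.get()
        if item is DONE:
            finished += 1
        else:
            results.append(item)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(len(crawl(fake_fetch)))
```

Articles arrive in whatever order the workers finish, so sort the results if the order matters. A few pages past the last real one may also be requested before the workers notice the stop flag.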

2013-03-24 19:04:30 baloo

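You could also do this with Twisted Python and a list of deferred events – 2013-03-24 19:23:03

I now realize that the iteration may stop before an actual article has been found. But you get the idea – baloo 2013-03-24 19:55:28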