
I'd like to use urllib3 to fetch several wiki pages from a few simple threads (the script is meant as an example of urllib3 and threading in Python).

It creates one connection per thread (I don't understand why) and hangs forever. Any tips, advice, or a simple example of urllib3 with threading?

import threadpool 
from urllib3 import connection_from_url 

HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True) 

def fetch(url, fields): 
    kwargs = {'retries': 6} 
    return HTTP_POOL.get_url(url, fields, **kwargs) 

pool = threadpool.ThreadPool(5) 
requests = threadpool.makeRequests(fetch, iterable) 
[pool.putRequest(req) for req in requests] 

@Lennart's script gives this error:

http://en.wikipedia.org/wiki/2010-11_Premier_League 
http://en.wikipedia.org/wiki/List_of_MythBusters_episodes 
http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes 
http://en.wikipedia.org/wiki/List_of_Unicode_characters 
Traceback (most recent call last): 
    File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run 
    result = request.callable(*request.args, **request.kwds) 
    File "crawler.py", line 9, in fetch 
    print url, conn.get_url(url) 
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url' 

(the same traceback is printed once per worker thread; in the raw output the printed URLs were interleaved with the tracebacks)

After adding import threadpool, import urllib3 and tpool = threadpool.ThreadPool(4) to @user318904's code, I get this error:

Traceback (most recent call last): 
    File "crawler.py", line 21, in <module> 
    tpool.map_async(fetch, urls) 
AttributeError: ThreadPool instance has no attribute 'map_async' 

Answers

Obviously it will create one connection per thread; how else would each thread be able to fetch a page? And you are trying to use the same connection, made from one URL, for all the URLs. That can hardly be what you meant.

This code works just fine:

import threadpool 
from urllib3 import connection_from_url 

def fetch(url): 
    kwargs = {'retries': 6} 
    conn = connection_from_url(url, timeout=10.0, maxsize=10, block=True) 
    print url, conn.get_url(url, **kwargs)  # retries is passed through to the fetch 
    print "Done!" 

pool = threadpool.ThreadPool(4) 
urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League', 
     'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes', 
     'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes', 
     'http://en.wikipedia.org/wiki/List_of_Unicode_characters', 
     ] 
requests = threadpool.makeRequests(fetch, urls) 

[pool.putRequest(req) for req in requests] 
pool.wait() 
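
For reference, here is a minimal sketch of the same fetch loop using only the standard library's threading module and a single pool shared by all threads. It assumes a urllib3 version where HTTPConnectionPool exposes request(); in the early releases used above the equivalent helper was get_url:

import threading 
from urllib3 import connection_from_url 

# one pool, shared by every thread; block=True caps live connections at maxsize 
shared_pool = connection_from_url('http://en.wikipedia.org/', maxsize=4, block=True) 

urls = ['/wiki/2010-11_Premier_League', 
     '/wiki/List_of_MythBusters_episodes', 
     '/wiki/List_of_Top_Gear_episodes', 
     '/wiki/List_of_Unicode_characters', 
     ] 

def fetch(path): 
    r = shared_pool.request('GET', path) 
    print "%s: %d bytes" % (path, len(r.data)) 

threads = [threading.Thread(target=fetch, args=(u,)) for u in urls] 
for t in threads: 
    t.start() 
for t in threads: 
    t.join() 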

I use something like this (ThreadPool here is multiprocessing.pool.ThreadPool, which is what provides map_async; the third-party threadpool module's ThreadPool has no such method, which is what the AttributeError above is complaining about):

import urllib3 
from multiprocessing.pool import ThreadPool  # this ThreadPool provides map_async 

upool = urllib3.HTTPConnectionPool('en.wikipedia.org', block=True) 

urls = ['/wiki/2010-11_Premier_League', 
     '/wiki/List_of_MythBusters_episodes', 
     '/wiki/List_of_Top_Gear_episodes', 
     '/wiki/List_of_Unicode_characters', 
     ] 

def fetch(path): 
    # add error checking 
    return upool.get_url(path).data 

tpool = ThreadPool() 

tpool.map_async(fetch, urls) 

# either wait on the result object or give map_async a callback function for the results 
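
For instance, a minimal sketch of those two options, assuming multiprocessing.pool.ThreadPool as above (whose map_async returns an AsyncResult); the callback name done is illustrative:

def done(pages): 
    # called once, with the list of all results, when every job has finished 
    print "fetched %d pages" % len(pages) 

result = tpool.map_async(fetch, urls, callback=done) 
result.wait()         # block until all jobs complete 
pages = result.get()  # the same list that was handed to the callback 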

Thread programming is hard, so I wrote workerpool to make exactly what you're doing easier.

More specifically, see the Mass Downloader example.

To do the same thing with urllib3, it would look something like this:

import urllib3 
import workerpool 

# a distinct name, so the WorkerPool below doesn't shadow this urllib3 pool 
http_pool = urllib3.connection_from_url("foo", maxsize=3) 

def download(url): 
    r = http_pool.get_url(url) 
    # TODO: Do something with r.data 
    print "Downloaded %s" % url 

# Initialize a pool, 5 threads in this case 
pool = workerpool.WorkerPool(size=5) 

# The ``download`` method will be called with a line from the second 
# parameter for each job. 
pool.map(download, [line.strip() for line in open("urls.txt")]) 

# Send shutdown jobs to all threads, and wait until all the jobs have been completed 
pool.shutdown() 
pool.wait() 

For more sophisticated code, have a look at workerpool.EquippedWorker (and the tests here for example usage). You can make the pool be the toolbox you pass around.
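
As a rough sketch of that "toolbox" idea using plain functools.partial rather than the actual EquippedWorker API (download_with is an illustrative name; http_pool and pool are from the example above):

import functools 

def download_with(http_pool, url): 
    # the shared urllib3 pool is passed in explicitly instead of read from a global 
    return http_pool.get_url(url).data 

job = functools.partial(download_with, http_pool) 
pool.map(job, [line.strip() for line in open("urls.txt")]) 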