Python threading module: looping over arguments?

I'm trying to build a crawler that scrapes the first 100 pages of a website.

My code looks like this:

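# Python 2 (urllib2, print statement)
import urllib2
# Assuming BeautifulSoup 4; with version 3 use `from BeautifulSoup import BeautifulSoup`
from bs4 import BeautifulSoup

def extractproducts(pagenumber):
    contenturl = "http://websiteurl/page/" + str(pagenumber)

    # Download the page and parse it
    content = BeautifulSoup(urllib2.urlopen(contenturl).read())
    print content


pagenumberlist = range(1, 101)

for pagenumber in pagenumberlist:
    extractproducts(pagenumber)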

How would I go about using the threading module here, so that urllib uses multiple threads to fetch X URLs at once?


Asked 2012-06-14 by user1271067

Most likely you want to use multiprocessing. It provides a Pool that you can use to run multiple things in parallel:

from multiprocessing import Pool

# Note: this many worker processes may make your system unresponsive for a while
p = Pool(100)

# First argument is the function to call,
# second argument is a list of arguments
# (the function is called on each item in the list)
p.map(extractproducts, pagenumberlist)

If your function returns anything, Pool.map will return a list of the return values:

def f(x): 
    return x + 1 

results = Pool().map(f, [1, 4, 5]) 
print(results) # [2, 5, 6]
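
Since the question asks about threads specifically, and fetching pages is mostly waiting on the network, a thread-backed pool may be a closer fit than separate processes. Here is a minimal sketch, assuming `multiprocessing.pool.ThreadPool`, which offers the same `map` interface as `Pool` but runs the workers as threads. The names `fetch_page`, `crawl` and `workers` are hypothetical, and `fetch_page` only sleeps to simulate a download, so swap in the real `extractproducts` body:

```python
import time
from multiprocessing.pool import ThreadPool


def fetch_page(pagenumber):
    # Hypothetical stand-in for extractproducts: it sleeps instead of
    # downloading, so the sketch runs without network access.
    time.sleep(0.05)
    return "page %d" % pagenumber


def crawl(pagenumbers, workers=10):
    # `workers` is the X from the question: how many pages are in flight at once.
    pool = ThreadPool(workers)
    try:
        # map blocks until every page is done and returns results in input order.
        return pool.map(fetch_page, pagenumbers)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    results = crawl(range(1, 101), workers=10)
    print(len(results))
```

A modest `workers` value is probably kinder to both your machine and the target site than one worker per page.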

Answered 2012-06-14 16:51:58

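Wow, that was fast, and exactly what I was looking for. Internet points for you, fine sir. – user1271067

@user1271067 If this helped you, it would be nice if you clicked the check mark next to my answer to mark it as "accepted" (and the up arrow, if you like). –

I don't have enough rep yet. I'll give you an upvote for now. :/ By the way, multiprocessing doesn't seem to make urllib fetch the pages any faster; it still seems to fetch only one page at a time. It might just be my slow connection, so I'll rent a server with faster internet and see whether that helps. Still, pointing me to multiprocessing was good enough. I'm sure I'll figure it out after going through the official docs and other questions on StackOverflow. – user1271067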
