Optimizing a Python script with multithreading

Hello! I wrote a small web crawler function, but I'm new to multithreading and haven't been able to optimize it. My code is:

import hashlib

alreadySeenURLs = dict() # fingerprints of already seen URLs, mapped to the URL
candidates = set() # the set of URL candidates to crawl

def initializeCandidates(url): 

    #gets page with urllib2 
    page = getPage(url) 

    #parses page with BeautifulSoup 
    parsedPage = getParsedPage(page) 

    # function which returns all links from the parsed page as a set
    initialURLsFromRoot = getLinksFromParsedPage(parsedPage) 

    return initialURLsFromRoot 

def updateCandidates(oldCandidates, newCandidates): 
    return oldCandidates.union(newCandidates) 

candidates = initializeCandidates(rootURL) 

while candidates:

    print len(candidates)

    # take one URL from the candidate set
    url = candidates.pop()

    # fingerprint of the URL
    fp = hashlib.sha1(url).hexdigest()

    # skip URLs that have already been crawled
    if fp in alreadySeenURLs:
        continue

    alreadySeenURLs[fp] = url

    # do some processing
    print url

    page = getPage(url)
    parsedPage = getParsedPage(page, fix=True)
    newCandidates = getLinksFromParsedPage(parsedPage)

    candidates = updateCandidates(candidates, newCandidates)

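As you can see, it takes only one URL from candidates at a time. I would like to make this script multithreaded so that it can take at least N URLs from candidates at once and do the work. Can anyone guide me? Any links or suggestions?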

Source

2012-05-23 torayeff

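There are plenty of tutorials on threading; just Google "python threading tutorial". Tutorial on Threads Programming with Python (https://users.info.unicaen.fr/~fmaurel/documents/envrac/python/PyThreads.pdf) is a good one for absolute beginners. – taskinoor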

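You can start with these two links:

Basic reference on threading in Python: http://docs.python.org/library/threading.html
A tutorial in which a multithreaded URL crawler is actually implemented in Python: http://www.ibm.com/developerworks/aix/library/au-threadingpython/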
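
To make the idea concrete, here is a minimal sketch of the worker-threads-plus-queue pattern that those links cover. It is not taken from either link. It assumes Python 3 (where the module is `queue`, not Python 2's `Queue`), and the names `crawl`, `get_links` and `num_workers` are hypothetical. `get_links(url)` stands in for the question's `getPage` + `getParsedPage` + `getLinksFromParsedPage` chain, and a plain set of URLs replaces the SHA-1 fingerprint dictionary for brevity.

```python
import queue
import threading


def crawl(root_url, get_links, num_workers=4):
    """Crawl from root_url with num_workers threads; return the set of seen URLs.

    get_links(url) is a hypothetical stand-in for the question's
    getPage + getParsedPage + getLinksFromParsedPage chain.
    """
    seen = {root_url}              # URLs already queued for crawling
    seen_lock = threading.Lock()   # protects `seen` across threads
    todo = queue.Queue()           # thread-safe queue of URLs to crawl
    todo.put(root_url)

    def worker():
        while True:
            url = todo.get()
            if url is None:        # sentinel: no more work for this thread
                todo.task_done()
                return
            try:
                for link in get_links(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    todo.put(link)
            except Exception as exc:  # one bad page should not kill the worker
                print("failed to crawl %s: %s" % (url, exc))
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    todo.join()                    # blocks until every queued URL is processed
    for _ in threads:
        todo.put(None)             # one sentinel per worker so each can exit
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    # In-memory "site" used instead of real HTTP requests (illustration only).
    fake_site = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
    print(sorted(crawl("a", lambda url: fake_site.get(url, []))))
```

Each worker puts newly found links on the queue before marking its current URL as done, so `todo.join()` only returns once no work is left. Threads help here because the crawler spends most of its time waiting on the network, which is also why `num_workers` plays the role of the N in the question.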

Also, a crawler for Python already exists: http://scrapy.org/

Source

2012-05-23 14:59:39 betabandido
