
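Optimizing a Python script with multithreading

Hello! I wrote a small web crawler function, but I am new to multithreading and could not optimize it. My code is: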

import hashlib

alreadySeenURLs = dict() #the dictionary of already seen URLs, keyed by fingerprint
candidates = set() #the set of URL candidates to crawl

def initializeCandidates(url): 

    #gets page with urllib2 
    page = getPage(url) 

    #parses page with BeautifulSoup 
    parsedPage = getParsedPage(page) 

    #function which return all links from parsed page as set 
    initialURLsFromRoot = getLinksFromParsedPage(parsedPage) 

    return initialURLsFromRoot 

def updateCandidates(oldCandidates, newCandidates): 
    return oldCandidates.union(newCandidates) 

candidates = initializeCandidates(rootURL) 

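#pop URLs until the frontier is empty; a plain "for url in candidates"
#would only ever visit the initial set, because candidates is rebound below
while candidates:

    print len(candidates)

    url = candidates.pop()

    #fingerprint of URL
    fp = hashlib.sha1(url).hexdigest()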

    #checking whether url is in alreadySeenURLs 
    if fp in alreadySeenURLs:
        continue

    alreadySeenURLs[fp] = url 

    #do some processing 
    print url 

    page = getPage(url) 
    parsedPage = getParsedPage(page, fix=True) 
    newCandidates = getLinksFromParsedPage(parsedPage) 

    candidates = updateCandidates(candidates, newCandidates) 

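As you can see, it takes one URL from candidates at a time. I want to make this script multithreaded so that it can take at least N URLs from candidates at once and process them. Can anyone guide me? Any links or suggestions?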
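One possible shape for "N URLs at a time" is a fixed pool of worker threads sharing a thread-safe queue. The sketch below is only an illustration under several assumptions: it is written for Python 3 (the code above uses Python 2 syntax), `crawl` and `fetch_links` are hypothetical names, and `fetch_links(url)` stands in for the `getPage` / `getParsedPage` / `getLinksFromParsedPage` chain. It also deduplicates on the URL string itself rather than on a SHA-1 fingerprint.

```python
import queue
import threading


def crawl(root_url, fetch_links, num_workers=4):
    """Crawl from root_url using num_workers threads.

    fetch_links is a hypothetical callable: it takes a URL and returns
    an iterable of the links found on that page.
    Returns the set of every URL that was queued for crawling.
    """
    seen = {root_url}              # URLs already queued or processed
    seen_lock = threading.Lock()   # guards `seen` across threads
    frontier = queue.Queue()       # thread-safe queue of URLs to crawl
    frontier.put(root_url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:        # sentinel: no more work for this thread
                frontier.task_done()
                return
            try:
                for link in fetch_links(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    # Queue the new link before task_done() is called for
                    # the current URL, so join() cannot return too early.
                    frontier.put(link)
            except Exception as exc:
                print("failed to fetch %s: %s" % (url, exc))
            finally:
                frontier.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    frontier.join()                # wait until every queued URL is processed
    for _ in threads:
        frontier.put(None)         # one sentinel per worker
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    # A tiny in-memory "web" so the sketch runs without network access.
    fake_web = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
    visited = crawl("a", lambda url: fake_web.get(url, []), num_workers=3)
    print(sorted(visited))
```

Threads suit this workload because the workers spend most of their time waiting on network I/O. A real crawler would probably also want request timeouts, a page limit, and some politeness delay per host, none of which are shown here.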


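There are plenty of tutorials on threading; just Google "python threading tutorial". Tutorial on Threads Programming with Python (https://users.info.unicaen.fr/~fmaurel/documents/envrac/python/PyThreads.pdf) is a very good tutorial for absolute beginners. – taskinoor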

Answers