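Optimizing a Python script with multithreading
Hello! I wrote a small web crawler function, but I am new to multithreading and cannot work out how to optimize it. My code is: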
import hashlib

alreadySeenURLs = dict()  # the dictionary of already seen URLs
candidates = set()  # the set of URL candidates to crawl

def initializeCandidates(url):
    # gets page with urllib2
    page = getPage(url)
    # parses page with BeautifulSoup
    parsedPage = getParsedPage(page)
    # function which returns all links from parsed page as a set
    initialURLsFromRoot = getLinksFromParsedPage(parsedPage)
    return initialURLsFromRoot

def updateCandidates(oldCandidates, newCandidates):
    return oldCandidates.union(newCandidates)

candidates = initializeCandidates(rootURL)
# pop URLs one at a time so that newly discovered links are crawled too;
# a for loop over the set would only ever visit the initial links
while candidates:
    url = candidates.pop()
    print(len(candidates))
    # fingerprint of URL
    fp = hashlib.sha1(url).hexdigest()
    # checking whether url is in alreadySeenURLs
    if fp in alreadySeenURLs:
        continue
    alreadySeenURLs[fp] = url
    # do some processing
    print(url)
    page = getPage(url)
    parsedPage = getParsedPage(page, fix=True)
    newCandidates = getLinksFromParsedPage(parsedPage)
    candidates = updateCandidates(candidates, newCandidates)
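As one can see, it currently takes one URL from candidates at a time. I would like to make this script multithreaded, so that it can take at least N URLs from candidates at once and process them. Can anyone guide me? Any links or suggestions?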
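There are lots of tutorials on threading, just Google "python threading tutorial". Tutorial on Threads Programming with Python (https://users.info.unicaen.fr/~fmaurel/documents/envrac/python/PyThreads.pdf) is a good one for absolute beginners. – taskinoor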
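
For what it's worth, here is a minimal sketch of one common way to process N candidate URLs at once: a pool of worker threads fed from a shared queue. It assumes Python 3 and the standard `threading` and `queue` modules. The names `crawl`, `get_links` and `num_workers` are hypothetical and not from the question; `get_links(url)` stands in for the `getPage` / `getParsedPage` / `getLinksFromParsedPage` chain, and a plain set guarded by a lock replaces the SHA-1 fingerprint dictionary for brevity.

```python
import queue
import threading


def crawl(root_url, get_links, num_workers=4):
    """Crawl from root_url using num_workers threads; return the set of seen URLs.

    get_links is a hypothetical callable: it takes a URL and returns an
    iterable of the URLs linked from that page.
    """
    seen = {root_url}              # URLs already queued or processed
    seen_lock = threading.Lock()   # protects `seen` across threads
    todo = queue.Queue()           # thread-safe queue of URLs to fetch
    todo.put(root_url)

    def worker():
        while True:
            url = todo.get()
            if url is None:        # sentinel: no more work, shut down
                todo.task_done()
                return
            try:
                links = set(get_links(url))
            except Exception:
                links = set()      # a failed fetch contributes no links
            with seen_lock:
                new_links = links - seen
                seen.update(new_links)
            # Queue the new URLs before marking this one done, so the
            # unfinished-task count never reaches zero too early.
            for link in new_links:
                todo.put(link)
            todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    todo.join()                    # wait until every queued URL is processed
    for _ in threads:
        todo.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    return seen


# Demo with an in-memory link graph instead of real HTTP requests.
demo_graph = {"a": {"b", "c"}, "b": {"c", "d"}, "c": {"a"}, "d": set()}
print(sorted(crawl("a", lambda url: demo_graph.get(url, ()), num_workers=3)))
```

Threads suit this job because the workers spend most of their time waiting on the network, so the interpreter lock is not the bottleneck. In a real crawler `get_links` would do the download and parsing, and you would probably want a limit on the number of pages or on the crawl depth, since the sketch keeps going until no new links turn up.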