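I have been reading up on threading in order to build a simple web crawler, but it runs very slowly. Threading in Python: daemon.start() eats 20% of my crawler's time. Why?
Here is a code snippet based on one I found in the IBM library: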
    # Python 2 (the Queue module is named queue in Python 3)
    import Queue
    import threading
    import time

    urls = []  # huge list of urls
    in_queue = Queue.Queue()
    out_queue = Queue.Queue()
    pool = ActivePool()
    s = threading.Semaphore(semaphore)

    # Start one daemon thread per URL for the first slice
    for url in urls[:slice_size]:
        in_queue.put(url)
        t = ThreadUrl(pool, s, url, in_queue, out_queue)
        t.setDaemon(True)
        t.start()

    counter = slice_size
    while not in_queue.empty() or not out_queue.empty():
        # Time spent creating and starting a new thread
        speed_new_daemon = time.time()
        url = urls[counter]
        in_queue.put(url)
        t = ThreadUrl(pool, s, url, in_queue, out_queue)
        t.setDaemon(True)
        t.start()  # <------ why 20% of all time I lose here?
        counter += 1
        speed_new_daemon = time.time() - speed_new_daemon

        # Time spent waiting for a result and parsing it
        speed_parser = time.time()
        result = out_queue.get()
        my_parser(result)
        speed_parser = time.time() - speed_parser
        # speed_parser only 80%, when speed_new_daemon takes 20%...

    in_queue.join()
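Multithreaded code is hard to profile. How did you measure this?
This is not the code from the linked document. – abarnert
Also, the line your title refers to is not in your source code or in IBM's. – abarnert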
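For comparison, here is a minimal sketch of the fixed worker-pool pattern, in which a small number of threads are started once and then pull URLs from a shared queue, instead of one thread being started per URL as in the question. It is written for Python 3 (queue rather than Queue), and the names fetch, worker, crawl and NUM_WORKERS are hypothetical, introduced only for illustration. fetch is a stand-in that does no real network I/O. The sketch also times thread start-up separately, which is one possible way to check how much of the run it actually accounts for.

```python
import queue
import threading
import time

NUM_WORKERS = 4  # hypothetical pool size, chosen only for this sketch


def fetch(url):
    # Hypothetical stand-in for the real download step (no network I/O).
    return "content of " + url


def worker(in_queue, out_queue):
    # Each worker loops until it receives the None sentinel.
    while True:
        url = in_queue.get()
        try:
            if url is None:
                break
            out_queue.put((url, fetch(url)))
        finally:
            # Mark every item (including the sentinel) as processed.
            in_queue.task_done()


def crawl(urls, num_workers=NUM_WORKERS):
    in_queue = queue.Queue()
    out_queue = queue.Queue()

    # Start the whole pool once and time only the start-up cost.
    started = time.perf_counter()
    threads = [
        threading.Thread(target=worker, args=(in_queue, out_queue), daemon=True)
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    startup_seconds = time.perf_counter() - started

    for url in urls:
        in_queue.put(url)
    for _ in threads:
        in_queue.put(None)  # one sentinel per worker

    in_queue.join()  # wait until every queued item has been processed
    for t in threads:
        t.join()

    # All workers have exited, so draining with empty() is safe here.
    results = {}
    while not out_queue.empty():
        url, body = out_queue.get()
        results[url] = body
    return results, startup_seconds


if __name__ == "__main__":
    pages, startup = crawl(["http://example.com/%d" % i for i in range(10)])
    print(len(pages), "pages fetched; thread start-up took", startup, "seconds")
```

With this layout the number of t.start() calls equals the pool size rather than the number of URLs, so the start-up cost is paid once. Whether that accounts for the 20% reported above would still need to be measured on the real crawler.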