2017-10-20 71 views

Background: sharing objects between threads produces NoneType

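I am working on a web crawler that spawns 7 threads, each of which queries a unique URL for an XML file. When a query receives its response, it turns that response into an XML tree, like so: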

import http.client
import xml.etree.ElementTree

conn = http.client.HTTPSConnection(host=uHost, port=uPort)
conn.request('GET', url='/some/url/file.xml')
resp = conn.getresponse()
tree = xml.etree.ElementTree.parse(resp)

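When each thread is started, it is given a queue.Queue() as an argument so that it can put the tree into it. That way __main__ is the only thread that writes to files. Continuing from above: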

__main__

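def receive(q):
    while True:
        try:
            # q.get() blocks until an item is available, so queue.Empty
            # is only raised if a timeout or block=False is passed.
            uTree = q.get()
            uTree.write('/some/path/file.xml')
        except queue.Empty:
            pass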

Spawned thread

conn = http.client.HTTPSConnection(host=uHost, port=uPort)
conn.request('GET', url='/some/url/file.xml')
resp = conn.getresponse()
tree = xml.etree.ElementTree.parse(resp)
q.put_nowait(tree)

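However, I started getting AttributeError: 'NoneType' object has no attribute 'write' when calling uTree.write(). Quickly swapping uTree.write() for print(type(uTree)) showed that the objects sometimes arrive intact and at other times have become NoneType: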

<class 'xml.etree.ElementTree.ElementTree'> 
<class 'xml.etree.ElementTree.ElementTree'> 
<class 'xml.etree.ElementTree.ElementTree'> 
<class 'xml.etree.ElementTree.ElementTree'> 
<class 'NoneType'> 
<class 'NoneType'> 
<class 'xml.etree.ElementTree.ElementTree'> 
<class 'xml.etree.ElementTree.ElementTree'> 

Question:

Why do objects passed from a threading.Thread() into a queue.Queue() [which resides in __main__] intermittently turn into NoneType?

How can I fix this?

Full code (if needed):

main.py

import queue
import threading

import crawl  # custom module (crawl.py below)

def crawler(query):
    # Retry until the connection succeeds.
    while True:
        try:
            query.connect()
            break
        except Exception:
            pass

def receive(q):
    while True:
        try:
            uQuery = q.get()  # blocks until an item is available
            uTree = uQuery.tree
            uTree.write('/some/path/file.xml')
        except queue.Empty:
            pass

urls = ['/url1.xml', '/url2.xml', ...]

q = queue.Queue()

queries = [crawl.Query(url, q) for url in urls]
threads = [threading.Thread(target=crawler, args=(query,)) for query in queries]

for t in threads:
    t.start()

receive(q)

crawl.py

import http.client
import xml.etree.ElementTree as ET

class Query:
    def __init__(self, url, q):
        self.url = url
        self.queue = q
        self.tree = None  # set by connect() once the response is parsed

    def connect(self):
        conn = http.client.HTTPConnection(host='something.com', port=80)
        conn.request('GET', url=self.url)
        resp = conn.getresponse()
        self.tree = ET.parse(resp)
        self.queue.put_nowait(self)
        conn.close()

Answer

(I would post this as a comment, but I don't seem to have the reputation.)

This is not a solution to your problem, but it may give you some pointers.

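I know that threading problems are harder to debug, but I would suggest simplifying your example. You are using ElementTree to parse XML and an HTTP connection to fetch it, and neither seems related to the problem.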
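As a rough sketch of what a simplified example could look like, the version below replaces the HTTP request and the XML parsing with a plain string. FakeQuery and run are hypothetical names introduced here for illustration, not part of the original code, and the sketch assumes the tree is always assigned before the object is queued.

```python
import queue
import threading

class FakeQuery:
    """Hypothetical stand-in for Query: no network, no XML."""
    def __init__(self, name, q):
        self.name = name
        self.queue = q
        self.tree = None

    def connect(self):
        # Stand-in for the parsed tree; assigned before the object is queued.
        self.tree = "tree for " + self.name
        self.queue.put_nowait(self)

def run(count=7):
    q = queue.Queue()
    queries = [FakeQuery("url%d" % i, q) for i in range(count)]
    threads = [threading.Thread(target=fq.connect) for fq in queries]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Drain the queue and record the type name of every tree received.
    return [type(q.get_nowait().tree).__name__ for _ in range(count)]
```

If a reduced version like this never shows NoneType, the cause probably sits in one of the parts that were removed, which narrows the search.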

You may also gain some insight into your problem by logging what you put into the queue.
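One possible way to do that logging, as a minimal sketch: wrap the put call in a small helper that records the type of the item's tree together with the thread name. The helper put_logged and the logger name "crawler" are assumptions made for this illustration.

```python
import logging
import queue
import threading

logging.basicConfig(level=logging.DEBUG, format="%(threadName)s %(message)s")
log = logging.getLogger("crawler")  # hypothetical logger name

def put_logged(q, item):
    """Hypothetical helper: log the type of item.tree, then enqueue item."""
    tree = getattr(item, "tree", None)
    log.debug("putting %r with tree of type %s", item, type(tree).__name__)
    if tree is None:
        log.warning("item %r has no tree yet", item)
    q.put_nowait(item)
    # Report whether the item carried a tree when it was queued.
    return tree is not None
```

Calling put_logged(self.queue, self) in place of self.queue.put_nowait(self) would show whether the tree is already None when the object enters the queue or only becomes None afterwards.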

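I would advise being extra careful when putting complex objects, such as a parsed tree, into a queue. You then need to make sure that such objects are themselves thread-safe.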
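One way to sidestep the thread-safety question, sketched under assumptions: hand the consumer an immutable bytes snapshot of the XML instead of the mutable tree or the object that owns it. The names produce, consume, run and the DONE sentinel are hypothetical, and the in-memory XML text stands in for the HTTP response. This is not a confirmed fix for the problem above, only an illustration of the idea.

```python
import queue
import threading
import xml.etree.ElementTree as ET

DONE = object()  # sentinel: tells the consumer that one producer has finished

def produce(xml_text, q):
    # Stand-in for the HTTP fetch: parse text that is already in memory.
    root = ET.fromstring(xml_text)
    # Hand over an immutable bytes snapshot, not the mutable tree.
    q.put(ET.tostring(root))
    q.put(DONE)

def consume(q, producers):
    results = []
    finished = 0
    while finished < producers:
        item = q.get()  # blocks until an item arrives
        if item is DONE:
            finished += 1
        else:
            results.append(item)
    return results

def run(documents):
    q = queue.Queue()
    threads = [threading.Thread(target=produce, args=(doc, q))
               for doc in documents]
    for t in threads:
        t.start()
    results = consume(q, len(documents))
    for t in threads:
        t.join()
    return results
```

Because bytes cannot be modified after they are queued, nothing a producer does later can change what the consumer receives. The sentinel also gives the consumer loop a clean way to stop.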

In case you are not aware of it, I would suggest using https://scrapy.org/, which makes implementing a crawler much easier.