I am scraping a series of URLs. The code works, but Scrapy does not parse the URLs in order. For example, although I try to parse url1, url2, ..., url100, Scrapy parses url2, url10, url1, and so on. My loop does not run in order inside Scrapy.
It parses all of the URLs, but when a particular URL does not exist (e.g. example.com/unit.aspx?b_id=10), Firefox shows me the result of my previous request. Since I want to make sure I have no duplicate content, I need the loop to parse the URLs in order rather than "at random".
I tried `for n in range(1,101)` and also `while bID < 100`; the result is the same. (See below.)
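For what it's worth, the two loop variants really do generate the same URL sequence, so the reordering comes from Scrapy's scheduling, not from the loop itself. A quick standalone check (URL pattern taken from the code below):

```python
def urls_for():
    # for-loop variant: n runs 1..100
    return ['https://www.example.com/units.aspx?b_id=%d' % n for n in range(1, 101)]

def urls_while():
    # while-loop variant: bID runs 1..100
    out, bID = [], 0
    while bID < 100:
        bID += 1
        out.append('https://www.example.com/units.aspx?b_id=%d' % bID)
    return out

print(urls_for() == urls_while())  # True
```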
Thanks in advance!
    # requires: from scrapy.http import Request
    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Welcome!" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            print "Successfully logged in. Let's start crawling!"
            # Now the crawling can begin..
            self.initialized()
            bID = 0
            # for n in range(1, 100, 1):
            while bID < 100:
                bID = bID + 1
                startURL = 'https://www.example.com/units.aspx?b_id=%d' % bID
                request = Request(url=startURL, dont_filter=True,
                                  callback=self.parse_add_tables,
                                  meta={'bID': bID, 'metaItems': []})
                # print self.metabID
                yield request  # Request(url=startURL, dont_filter=True, callback=self.parse2)
        else:
            self.log("Something went wrong, we couldn't log in....Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.
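Scrapy downloads asynchronously, so the order in which requests are yielded is not the order in which they are fetched. One common remedy is to pass a decreasing `priority=` value to each `Request` (Scrapy's scheduler dequeues higher priorities first), optionally combined with `CONCURRENT_REQUESTS = 1` in the settings for stricter sequencing. A minimal stdlib sketch of the idea (a toy stand-in for the scheduler, not Scrapy itself):

```python
import heapq

def schedule(urls):
    """Toy scheduler: like Scrapy, dequeue the highest-priority request first.
    heapq is a min-heap, so priorities are pushed negated."""
    heap = []
    for i, url in enumerate(urls):
        priority = len(urls) - i              # earlier URL -> higher priority
        heapq.heappush(heap, (-priority, i, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

urls = ['https://www.example.com/units.aspx?b_id=%d' % b for b in range(1, 6)]
print(schedule(urls))  # b_id=1 comes out first, b_id=5 last
```

In the spider above this would look like `Request(..., priority=100 - bID)`. Note that with concurrency greater than 1 the ordering is still only approximate; for truly strict sequencing, yield the request for `bID + 1` from the callback that handled `bID` instead of looping up front.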
Thanks! This worked – Jmm