
I am crawling a series of URLs. The code works, but Scrapy does not parse the URLs in order: although I try to parse url1, url2, ..., url100, it parses url2, url10, url1, and so on. My loop does not run sequentially in Scrapy.

It parses all the URLs, but when a particular URL does not exist (e.g. example.com/unit.aspx?b_id=10), Firefox shows me the result of my previous request. Since I want to make sure I don't end up with duplicate content, I need the loop to parse the URLs in order rather than "at random".

I tried "for n in range(1, 101)" and also "while bID < 100"; the result is the same. (See below.)

Thanks in advance!

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        bID = 0
        # for n in range(1, 100, 1):
        while bID < 100:
            bID = bID + 1
            startURL = 'https://www.example.com/units.aspx?b_id=%d' % bID
            request = Request(url=startURL, dont_filter=True,
                              callback=self.parse_add_tables,
                              meta={'bID': bID, 'metaItems': []})
            # print self.metabID
            yield request  # Request(url=startURL, dont_filter=True, callback=self.parse2)
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

Answers


You could try something like this. I'm not sure whether it fits the purpose, since I haven't seen the rest of the spider code, but here you go:

# class-level attribute: a list of urls to be parsed, in reverse order
# (so we can easily pop items off the end); covers b_id 2..100, since
# b_id=1 is requested first in check_login_response
crawl_urls = ['https://www.example.com/units.aspx?b_id=%s' % n for n in xrange(100, 1, -1)]

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        return Request(url='https://www.example.com/units.aspx?b_id=1',
                       dont_filter=True, callback=self.parse_add_tables,
                       meta={'bID': 1, 'metaItems': []})
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

def parse_add_tables(self, response):
    # parsing code here (build the items list from the response)
    if self.crawl_urls:
        # only request the next url once this one has been parsed,
        # so the urls are visited strictly in order
        next_url = self.crawl_urls.pop()
        return Request(url=next_url, dont_filter=True,
                       callback=self.parse_add_tables,
                       meta={'bID': int(next_url.split('=')[-1]), 'metaItems': []})

    return items

Thanks! This works. – Jmm


You can use the priority attribute on the Request object. Scrapy guarantees that URLs are crawled in DFO order by default, but it does not guarantee that the URLs are visited in that order inside your parse callback.
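
For reference, here is a minimal sketch (not part of the original answer) of how that priority argument could be passed when the requests are built in check_login_response. It assumes the same spider as in the question, with Request imported from scrapy.http. Higher priority values are scheduled earlier, so giving earlier b_id values a higher priority nudges Scrapy toward fetching them first, but because requests are downloaded concurrently the responses can still come back out of order:

def check_login_response(self, response):
    """Log-in check as in the question, but yielding prioritised requests."""
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        self.initialized()
        for bID in range(1, 101):
            yield Request(
                url='https://www.example.com/units.aspx?b_id=%d' % bID,
                dont_filter=True,
                callback=self.parse_add_tables,
                meta={'bID': bID, 'metaItems': []},
                # higher priority = scheduled earlier; negative values are allowed,
                # so -bID makes b_id=1 the highest-priority request
                priority=-bID,
            )
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")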

Instead of yielding the Request objects, you want to return an array of requests and pop objects off it until it is empty.

For more information, you can look here:

Scrapy Crawl URLs in Order


Thank you for your answer! I searched but didn't find that post. I'm new to Python and Scrapy, so I need to learn how to change the default attribute. – Jmm