
I am crawling a series of URLs. The code works, but Scrapy does not parse the URLs in order: although I try to parse url1, url2, ..., url100, it parses url2, url10, url1, and so on. My loop does not run sequentially in Scrapy.

It parses all the URLs, but when a particular URL does not exist (e.g. example.com/unit.aspx?b_id=10), Firefox shows me the result of my previous request. Since I want to make sure I don't end up with duplicate content, I need the loop to parse the URLs in order rather than "at random".

I tried "for n in range(1, 101)" and also "while bID < 100"; the result is the same. (See below.)

Thanks in advance!

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        bID = 0
        # for n in range(1, 100, 1):
        while bID < 100:
            bID = bID + 1
            startURL = 'https://www.example.com/units.aspx?b_id=%d' % bID
            request = Request(url=startURL, dont_filter=True,
                              callback=self.parse_add_tables,
                              meta={'bID': bID, 'metaItems': []})
            # print self.metabID
            yield request  # Request(url=startURL, dont_filter=True, callback=self.parse2)
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

Answers


You could try something like this. I'm not sure whether it fits the purpose, since I haven't seen the rest of the spider code, but here you go:

# class-level attribute: a list of urls to be parsed, in reverse order
# (so we can easily pop items off the end); covers b_id 2..100, since
# b_id=1 is requested first in check_login_response
crawl_urls = ['https://www.example.com/units.aspx?b_id=%s' % n for n in xrange(100, 1, -1)]

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        return Request(url='https://www.example.com/units.aspx?b_id=1',
                       dont_filter=True, callback=self.parse_add_tables,
                       meta={'bID': 1, 'metaItems': []})
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

def parse_add_tables(self, response):
    # parsing code here (build the items list from the response)
    if self.crawl_urls:
        # only request the next url once this one has been parsed,
        # so the urls are visited strictly in order
        next_url = self.crawl_urls.pop()
        return Request(url=next_url, dont_filter=True,
                       callback=self.parse_add_tables,
                       meta={'bID': int(next_url.split('=')[-1]), 'metaItems': []})

    return items

Thanks! This works. – Jmm


You can use the priority attribute on the Request object. Scrapy guarantees that URLs are crawled in DFO order by default, but it does not guarantee that the URLs are visited in that order inside your parse callback.
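
For reference, here is a minimal sketch (not part of the original answer) of how that priority argument could be passed when the requests are built in check_login_response. It assumes the same spider as in the question, with Request imported from scrapy.http. Higher priority values are scheduled earlier, so giving earlier b_id values a higher priority nudges Scrapy toward fetching them first, but because requests are downloaded concurrently the responses can still come back out of order:

def check_login_response(self, response):
    """Log-in check as in the question, but yielding prioritised requests."""
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        self.initialized()
        for bID in range(1, 101):
            yield Request(
                url='https://www.example.com/units.aspx?b_id=%d' % bID,
                dont_filter=True,
                callback=self.parse_add_tables,
                meta={'bID': bID, 'metaItems': []},
                # higher priority = scheduled earlier; negative values are allowed,
                # so -bID makes b_id=1 the highest-priority request
                priority=-bID,
            )
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")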

Instead of yielding the Request objects, you want to return an array of requests and pop objects off it until it is empty.

For more information, you can look here:

Scrapy Crawl URLs in Order


Thank you for your answer! I searched but didn't find that post. I'm new to Python and Scrapy, so I need to learn how to change the default attribute. – Jmm