Can't get Scrapy CrawlSpider to follow links

I'm trying to get the 'rules' part of a Scrapy CrawlSpider to work properly.

I've found the XPath that returns the links I want to follow. It is

//*[@class="course_detail"]//td[4]/a/@href

and it returns about 2,700 URLs in total.

Basically, I want to tell the spider to follow everything that matches that XPath, but I can't get the following code to work:

rules = (
    Rule(SgmlLinkExtractor(allow=[r'.*'], 
          restrict_xpaths='//*[@class="course_detail"]//td[4]/a/@href' 
          ),    
     callback='parse_item' 
     ), 
)

I don't get any errors, but the spider doesn't seem to get past the pages I defined in start_urls.

Edit: Figured it out! I just needed to remove the @href. Hayden's code also helped, so I've accepted his answer.
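
A rough illustration of why dropping @href matters: restrict_xpaths is meant to select the regions (elements) in which the extractor then looks for links, so the working value is '//*[@class="course_detail"]//td[4]/a', while an XPath ending in /@href selects attribute strings with nothing inside them to scan. The sketch below is a standard-library stand-in for that behaviour, not Scrapy's implementation; the sample markup, PAGE and links_in_regions are hypothetical names made up for the example, and the path is written in the relative form that ElementTree accepts.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the real site's markup is not shown in the question.
PAGE = """
<html><body>
  <table class="course_detail">
    <tr>
      <td>CS101</td><td>Intro</td><td>Fall</td>
      <td><a href="/courses/cs101">details</a></td>
    </tr>
    <tr>
      <td>CS102</td><td>Data</td><td>Spring</td>
      <td><a href="/courses/cs102">details</a></td>
    </tr>
  </table>
  <p><a href="/about">about</a></p>
</body></html>
""".strip()


def links_in_regions(root, region_path):
    """Collect href values from <a> elements inside the regions matched by region_path."""
    hrefs = []
    for region in root.findall(region_path):
        # iter() yields the region itself plus its descendants,
        # so a region that is already an <a> element is covered too.
        for anchor in region.iter("a"):
            href = anchor.get("href")
            if href is not None:
                hrefs.append(href)
    return hrefs


root = ET.fromstring(PAGE)
# Same XPath as in the question, minus the trailing /@href:
# it selects the <a> elements, and the hrefs are read from them afterwards.
course_links = links_in_regions(root, ".//*[@class='course_detail']//td[4]/a")
print(course_links)  # ['/courses/cs101', '/courses/cs102']
```

The /about link is left out because it sits outside the selected regions, which is the restriction the rule is supposed to express.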


2012-09-29 Jonathan

I think the allow and restrict_xpaths arguments passed to SgmlLinkExtractor should be of the same type (i.e. both lists or both strings). Most examples use tuples:

rules = (
    Rule(SgmlLinkExtractor(allow = (r'.*',), 
          restrict_xpaths = ('//*[@class="course_detail"]//td[4]/a/@href',) 
          ),    
     callback='parse_item' 
     ), 
)

As an aside, I prefer to use Egyptian brackets to try to keep track of where I am in my arguments.


2012-09-29 21:29:50

Thanks for the reply, Hayden! Unfortunately, I still have the same problem :( – Jonathan
