Does Scrapy crawl all links matched by the rules?

Code source: http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#rules

I am new to Python and Scrapy. I searched for recursive spiders and found this.

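I have a few questions:

How does the code below work? Does it just take the href links from a page and add them to the request queue?

Which part of the web page does Scrapy crawl?

Does the code below scrape all links on a web page?

Let's say I want to crawl this site, http://downloads.trendnet.com/, and download every file from it.

The way I would probably do this is to scrape every link on the site, check the content header of each URL, and download it if it is a file. Is this feasible?

Sorry if this is a bad question...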

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    rules = (
        Rule(
            SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath('//span[@class="pl"]')
        items = []
        for title in titles:
            item = CraigslistSampleItem()
            item["title"] = title.xpath("a/text()").extract()
            item["link"] = title.xpath("a/@href").extract()
            items.append(item)
        return items

Source

2016-03-24 Kevin Pook

I think RTFM really applies here, but to give you a short answer:

As for the example given:

rules = (
    Rule(
        SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
        callback="parse_items",
        follow=True,
    ),
)

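You asked what it crawls. It only crawls what you set under rules. Here that means your bot only ever follows the "next" button to the next page. For every page it finds this way, it calls the callback, callback = parse_items.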
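To make the effect of restrict_xpaths concrete, here is a minimal standard-library sketch. It is not Scrapy's link extractor; the LinkCollector class, the collect_links helper, and the sample HTML are invented for illustration. It collects hrefs either from every `<a>` tag or only from those whose class attribute is exactly "button next".

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, optionally filtered by class attribute."""

    def __init__(self, required_class=None):
        super().__init__()
        self.required_class = required_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # Mirrors @class="button next": the whole attribute value must match.
        if self.required_class is None or attributes.get("class") == self.required_class:
            self.links.append(href)


def collect_links(html, required_class=None):
    """Hypothetical helper: return the hrefs that a rule would hand to the crawler."""
    parser = LinkCollector(required_class)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page resembling a listing with a "next" button.
SAMPLE_HTML = """
<span class="pl"><a href="/npo/1.html">Job one</a></span>
<span class="pl"><a href="/npo/2.html">Job two</a></span>
<a class="button next" href="/search/npo?s=100">next</a>
"""

all_links = collect_links(SAMPLE_HTML)
next_only = collect_links(SAMPLE_HTML, required_class="button next")
print(all_links)
print(next_only)
```

With the restriction in place only the pagination link is followed; the listing links are left for the callback to read.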

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.xpath('//span[@class="pl"]')
    items = []
    for title in titles:
        item = CraigslistSampleItem()
        item["title"] = title.xpath("a/text()").extract()
        item["link"] = title.xpath("a/@href").extract()
        items.append(item)
    return items

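What parse_items does in this case is go through the entries in a list. You define the list with an XPath (titles = hxs.xpath('//span[@class="pl"]') above). For each entry in the list (the for loop over titles), it copies the text and the link into an item. Then it returns the items, and that page is done.

parse_items is run for every page the crawler finds by following the next button.

In the settings, you can include DEPTH_LIMIT=3. In that case your crawler will go at most 3 levels deep from the start URL.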
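As a rough illustration of what a depth limit means, the sketch below walks an in-memory link graph instead of real pages. The crawl function and the PAGES graph are made up, and Scrapy's actual scheduling order is not modeled; the only assumptions carried over are that the start page counts as depth 0 and that a limit of 0 means no limit.

```python
from collections import deque


def crawl(start, links, depth_limit=0):
    """Breadth-first walk over a link graph; depth_limit=0 means unlimited."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth_limit and depth >= depth_limit:
            continue  # at the limit: do not follow links from this page
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical chain of pages, each linking to the next via a "next" button.
PAGES = {
    "page1": ["page2"],
    "page2": ["page3"],
    "page3": ["page4"],
    "page4": ["page5"],
    "page5": [],
}

print(crawl("page1", PAGES, depth_limit=3))
print(crawl("page1", PAGES))
```

With a limit of 3 the walk stops after following the next button three times, so the fifth page is never requested.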

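As for the site you posted:

No, you don't need a CrawlSpider, because there are no multiple pages. A plain BaseSpider is enough. A CrawlSpider would work too, though, and I'll show some bits below. Set the rule to restrict_xpaths=('//a',) and it will follow all links on the page.

Make sure your items.py contains all the required fields. For example, the code below refers to item["link"]. In items.py, make sure you include a field named link (it is case-sensitive), i.e. make sure the line link = Field() is there.

In parse_items, do something like this: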

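def parse_items(self, response):
    anchors = response.xpath('//a')  # every <a> element on the page
    items = []
    for anchor in anchors:
        item = YourItem()  # placeholder: insert whatever you called your item
        # Relative XPaths, so no leading slash; extract_first() returns a string or None.
        item["title"] = anchor.xpath("text()").extract_first()
        item["link"] = anchor.xpath("@href").extract_first()
        if item["link"] and ".pdf" in item["link"]:
            # SEE COMMENT BELOW
            items.append(item)
    return items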

The last bit you need to work out is how the item pipeline works. It uses file_urls etc. in your item.
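Before filling file_urls, you need some way to decide which links point at files. The sketch below is one possible approach using only the standard library: it goes by the Content-Type header when one is known and falls back to the URL's extension otherwise. The function names, the extension list, and the example hrefs are all assumptions for illustration, not part of Scrapy or of the site in question.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# Assumed lists; adjust them to what the target site actually serves.
FILE_EXTENSIONS = {".pdf", ".zip", ".exe", ".bin"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def looks_like_file(url, content_type=None):
    """Guess whether a URL is a downloadable file rather than an HTML page."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        media_type = content_type.split(";", 1)[0].strip().lower()
        return media_type not in HTML_TYPES
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension in FILE_EXTENSIONS


def file_urls(base_url, hrefs):
    """Resolve hrefs against the page URL and keep those that look like files."""
    absolute = [urljoin(base_url, href) for href in hrefs]
    return [url for url in absolute if looks_like_file(url)]


# Invented hrefs, as they might appear in a directory listing.
found = file_urls(
    "http://downloads.trendnet.com/",
    ["manual.pdf", "firmware/", "tools/setup.zip?lang=en"],
)
print(found)
```

A list built this way is the kind of value that could be stored in the item's file_urls field for the pipeline to download.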

Source

2016-03-24 15:01:23 eadebruijn
