Recursively scraping web pages with Scrapy

URL: http://www.example.com/listing.php?num=2&

Here is my spider code, which prints the list of links found on one page:

from scrapy.log import *
# Scrapy 0.x-era import paths for the base spider class and the selector
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from crawler_bhinneka.settings import *
from crawler_bhinneka.items import *
import pprint
from MySQLdb import escape_string
import urlparse


def complete_url(string):
    """Return complete url"""
    return "http://www.example.com" + string


class BhinnekaSpider(CrawlSpider):

    name = 'bhinneka_spider'
    start_urls = [
        'http://www.example.com/listing.php?'
    ]

    # Note: CrawlSpider uses parse() internally to apply its rules, so this
    # callback needs a different name once rules are added.
    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # HXS to find url that goes to detail page
        items = hxs.select('//td[@class="lcbrand"]/a/@href')
        for item in items:
            link = item.extract()
            print("my Url Link : ", complete_url(link))

Right now I can get all the links on the first page.
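A side note on the complete_url helper above: plain string concatenation only works while every href is a root-relative path. A minimal sketch, assuming Python 3 and the standard-library urljoin (the sample hrefs are invented for illustration, not taken from the real site), resolves each href against the page URL instead:

```python
from urllib.parse import urljoin

# Start URL from the spider above, used as the base for relative links.
BASE_URL = "http://www.example.com/listing.php?"


def complete_url(href, base=BASE_URL):
    """Resolve an href found on a page against that page's URL."""
    return urljoin(base, href)


# Hypothetical hrefs covering root-relative, page-relative and absolute forms.
for href in ("/products/123", "detail.php?id=5", "http://www.example.com/other"):
    print(complete_url(href))
```

With this version, absolute hrefs pass through unchanged and page-relative ones are resolved against the directory of the base URL.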

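I want this spider to follow the links to the next pages recursively, using rules. Do you know how I can set up rules in my spider to pick up the next-page links?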

Edit

@Toan, thanks for your reply. I tried following the tutorial you linked, but I only get the item values for one page (the first page).

I looked at the page source of http://sfbay.craigslist.org/npo/ and I do not see anything that matches the XPath value given in restrict_xpaths (class="nextpage" does not exist in the page source).

Here is the rule from your link, for example:

rules = (
    Rule(SgmlLinkExtractor(allow=(r"index\d00\.html",),
                           restrict_xpaths=('//p[@class="nextpage"]',)),
         callback="parse_items", follow=True),
)
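To see which URLs that allow pattern would accept, here is a small sketch using Python's re module. It assumes the link extractor applies the pattern as a regular-expression search against each absolute URL; the is_allowed helper and the sample URLs are hypothetical and only serve to illustrate the pattern.

```python
import re

# Same pattern as in the rule: "index", one digit, "00", then ".html".
ALLOW = re.compile(r"index\d00\.html")


def is_allowed(url):
    """Return True if the pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None


# Hypothetical URLs used only to exercise the pattern.
samples = [
    "http://sfbay.craigslist.org/npo/index100.html",
    "http://sfbay.craigslist.org/npo/index200.html",
    "http://sfbay.craigslist.org/npo/index150.html",
    "http://sfbay.craigslist.org/npo/",
]
for url in samples:
    print(url, is_allowed(url))
```

Even when the pattern matches, the rule still extracts nothing if restrict_xpaths selects no element on the page, because both conditions have to hold for a link to be followed.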

Source

2014-07-24 pi-2r

Scrapy link extractors are used to extract the links from a web page.

Here is an example: http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.U9Dl8h_FsUQ
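The general idea behind a rule with follow=True is that links extracted from each response are requested in turn, and the same extraction is applied to the new responses until no unseen links remain. The sketch below imitates that loop in plain Python so it can be run without Scrapy. It is not Scrapy's implementation; PAGES, NextPageLinks and crawl are hypothetical names, and the HTML snippets are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory site (URL -> HTML); a real spider would download these.
PAGES = {
    "http://www.example.com/listing.php?num=1":
        '<p class="nextpage"><a href="listing.php?num=2">next</a></p>',
    "http://www.example.com/listing.php?num=2":
        '<p class="prevpage"><a href="listing.php?num=1">prev</a></p>'
        '<p class="nextpage"><a href="listing.php?num=3">next</a></p>',
    "http://www.example.com/listing.php?num=3":
        '<p class="prevpage"><a href="listing.php?num=2">prev</a></p>',
}


class NextPageLinks(HTMLParser):
    """Collect hrefs of <a> tags nested inside <p class="nextpage">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            # Only anchors under the "nextpage" paragraph are of interest.
            self.inside = attrs.get("class") == "nextpage"
        elif tag == "a" and self.inside and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False


def crawl(start_url, pages):
    """Visit start_url, then keep following next-page links until none are new."""
    seen, queue, order = set(), [start_url], []
    while queue:
        url = queue.pop(0)
        if url in seen or url not in pages:
            continue
        seen.add(url)
        order.append(url)
        parser = NextPageLinks()
        parser.feed(pages[url])
        # Resolve relative hrefs against the page they were found on.
        queue.extend(urljoin(url, href) for href in parser.links)
    return order


print(crawl("http://www.example.com/listing.php?num=1", PAGES))
```

The seen set plays the role of duplicate filtering, and restricting extraction to one element keeps the crawl from walking back through the "prev" links.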

Source

2014-07-24 10:57:45

The link may go stale. It is better to include the essential parts here for reference. – Kasisnu
