The URL pattern is:
http://www.khmer24.com/ad/change-petrol-to-gas-use-injector-special-price/67-204320.html
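I want to keep the domain, the "ad" path segment, and the number 67 fixed in the URL. Here is an example URL:
http://www.khmer24.com/ad/ANY-STRING/67-123456789.html
How do I write a recursive Scrapy rule for this with a regular expression?
Here is my spider code: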
from scrapy.item import Item, Field

class Khmer24(Item):
    title = Field()
    price = Field()

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class MySpider(CrawlSpider):
    name = "khmer24"
    allowed_domains = ["www.khmer24.com"]
    start_urls = ["http://www.khmer24.com/"]

    #HERE IS WHERE I GET STUCK
    rules = (Rule (SgmlLinkExtractor(allow=("index/ad\d\s\67-\d00.html",),restrict_xpaths=('//p[@class="nextpage"]',))
    , callback="parse_items", follow= True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div[@class='innerbox']")
        items = []
        for title in titles:
            item = Khmer24()
            item["title"] = title.select("h1/text()").extract()
            item["price"] = title.select("table/tr/td/p[@class='description']/span[@class='price']/strong/text()").extract()
            items.append(item)
        return(items)
Give it a try! ;-) – 2013-02-27 22:38:15
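For reference, here is a minimal sketch of a regular expression that matches the ad URLs described in the question. It uses only Python's `re` module so it can be checked without Scrapy. The name `AD_URL_PATTERN` and the two non-matching test URLs are made up for illustration, and the pattern assumes the variable parts are exactly one slug segment after `/ad/` and a run of digits after `67-`.

```python
import re

# Hypothetical pattern (an assumption, not taken from the site itself):
#   /ad/      literal "ad" path segment
#   [^/]+     any slug, i.e. one or more characters that are not "/"
#   /67-      the fixed number 67 followed by a hyphen
#   \d+       the numeric ad id, one or more digits
#   \.html$   literal ".html" at the end of the URL
AD_URL_PATTERN = r"/ad/[^/]+/67-\d+\.html$"

urls = [
    # The two URLs given in the question.
    "http://www.khmer24.com/ad/change-petrol-to-gas-use-injector-special-price/67-204320.html",
    "http://www.khmer24.com/ad/ANY-STRING/67-123456789.html",
    # Made-up counterexamples: wrong fixed number, and no slug segment.
    "http://www.khmer24.com/ad/ANY-STRING/68-123456789.html",
    "http://www.khmer24.com/index/ad/67-100.html",
]

# re.search looks for the pattern anywhere in the string,
# so the scheme and host do not need to be spelled out.
matches = [bool(re.search(AD_URL_PATTERN, url)) for url in urls]

for url, matched in zip(urls, matches):
    print(matched, url)
```

Scrapy's link extractors apply the `allow` patterns as regular-expression searches against absolute URLs, so a pattern like this could presumably be passed as `allow=(AD_URL_PATTERN,)`. Note also that `restrict_xpaths` limits extraction to links inside the matched region, so with `//p[@class="nextpage"]` only links in that paragraph would be considered.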