Scrapy - Follow RSS links

I was wondering if anyone has tried to extract/follow RSS item links using SgmlLinkExtractor/CrawlSpider. I can't get it to work...

I'm using the following rule:

 

    rules = (
        Rule(SgmlLinkExtractor(tags=('link',), attrs=False),
             follow=True,
             callback='parse_article'),
    )

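(Keep in mind that in RSS the links are located in the link tag.)

I'm not sure how to tell SgmlLinkExtractor to extract the text() of the link element rather than search its attributes...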
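To make the text-versus-attribute problem concrete, here is a minimal sketch using only the standard library (not Scrapy). The feed content is a made-up example; it only shows that an RSS `<link>` element has no attributes for an attribute-based extractor to read, while the URL sits in the element's text.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal RSS feed, used only for illustration.
RSS = """<rss version="2.0"><channel>
<title>Example</title>
<link>http://example.com/</link>
<item><title>First</title><link>http://example.com/first</link></item>
</channel></rss>"""

root = ET.fromstring(RSS)
links = root.findall(".//link")

# What an attribute-based extractor would see: nothing to read.
attr_values = [el.attrib for el in links]
# What is actually needed: the text content of each <link>.
text_values = [el.text for el in links]

print(attr_values)
print(text_values)
```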

Any help is welcome, thanks in advance.

Source

2010-05-30 kal3v

CrawlSpider rules don't work that way. You'll probably need to subclass BaseSpider and implement your own link extraction in your spider callback. For example:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector

class MySpider(BaseSpider):
    name = 'myspider'

    def parse(self, response):
        # Parse the feed as XML and take the text of every <link> element.
        xxs = XmlXPathSelector(response)
        links = xxs.select("//link/text()").extract()
        # Request each extracted URL; parse_link is the callback for the linked pages.
        return [Request(x, callback=self.parse_link) for x in links]

You can also try the XPath in the shell, for example by running:

scrapy shell http://blog.scrapy.org/rss.xml

and then typing in the shell:

>>> xxs.select("//link/text()").extract() 
[u'http://blog.scrapy.org', 
u'http://blog.scrapy.org/new-bugfix-release-0101', 
u'http://blog.scrapy.org/new-scrapy-blog-and-scrapy-010-release']

Source

2010-09-19 20:29:13

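Could you please explain the difference between using CrawlSpider rules and implementing custom link extraction in the callback? I've been struggling to find the difference for a while, and after several readings of the docs... still nothing. I'll go with your method because of past bad experiences with rules, but I just want to know why. T.I.A – romeroqj 2011-07-06 03:23:19

XMLFeedSpider can be used now (https://scrapy.readthedocs.org/en/latest/topics/spiders.html?highlight=rule#xmlfeedspider-example). – opyate 2013-04-19 12:15:52

I've got it working using CrawlSpider: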

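# Import paths as in the Scrapy releases of that era.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector, XmlXPathSelector

class MySpider(CrawlSpider):
    domain_name = "xml.example.com"

    def parse(self, response):
        # Note: overriding parse() bypasses CrawlSpider's rule handling.
        xxs = XmlXPathSelector(response)
        items = xxs.select('//channel/item')
        for i in items:
            # Take the text of the item's <link> and request that page.
            urli = i.select('link/text()').extract()
            request = Request(url=urli[0], callback=self.parse1)
            yield request

    def parse1(self, response):
        hxs = HtmlXPathSelector(response)
        # ...
        # MyItem is the project's own Item class (definition not shown).
        yield MyItem()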

But I'm not sure this is a very proper solution...
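One fragile spot in the spider above is `urli[0]`, which raises an IndexError for any item that has no `<link>`. The per-item extraction can be made defensive. Below is a standalone sketch using only the standard library; the function name `item_links` and the sample feed are hypothetical, and the same guard would carry over to a Scrapy callback.

```python
import xml.etree.ElementTree as ET


def item_links(feed_xml):
    """Return the link URL of every <item> that actually has one."""
    root = ET.fromstring(feed_xml)
    urls = []
    for item in root.iterfind("./channel/item"):
        # findtext() returns None when the item has no <link> child.
        text = item.findtext("link")
        if text and text.strip():
            urls.append(text.strip())
    return urls


# Hypothetical feed: one normal item, one without a link, one with padding.
FEED = """<rss version="2.0"><channel>
<item><title>A</title><link>http://example.com/a</link></item>
<item><title>No link here</title></item>
<item><title>B</title><link>
  http://example.com/b
</link></item>
</channel></rss>"""

print(item_links(FEED))
```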

Source

2010-10-01 21:55:33 kal3v

There's an XMLFeedSpider that can be used now.

Source

2013-04-19 12:17:02 opyate

Today this is the better solution. +1 – Jon 2013-12-09 11:03:36

-1

An XMLFeedSpider example from the Scrapy docs:

from scrapy.spiders import XMLFeedSpider
# from myproject.items import TestItem  # only needed if TestItem is used below

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        # item = TestItem()
        item = {}  # changed to a dict to avoid the class-not-found error
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item

Source

2016-08-15 03:09:49 NGloom
