Getting links and their text with Scrapy

I want to find the URLs on a web page that match a specific regex. I am using the Scrapy package in Python. My code looks like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # replaces the deprecated SgmlLinkExtractor

class TestSpider(CrawlSpider):
    name = 'testingcode'
    start_urls = ['http://dinoopnair.blogspot.in/']  # URLs from which the spider will start crawling
    rules = [
        # r'page/\d+': regular expression for pagination URLs of the form /page/X
        Rule(LinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'\d{4}/\d{2}/\w+': regular expression for post URLs of the form /YYYY/MM/title
        Rule(LinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_blogpost', follow=True),
    ]

    def parse_blogpost(self, response):
        print(response.url)

It works fine. Now I want to get the text of the link as well. For example:

<a href="http://dinoopnair.blogspot.in/2014/07/facebook-search-and-elastic-search.html">facebook search and elastic search</a>

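This is one of the article links that matches our regex. I want to get the text between the a tags, "facebook search and elastic search". How can I get that text from the response argument of the callback function?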

Source

2015-05-07 Dinoop Nair

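You can extract them with the help of XPath. – Jithin

The response has the attribute response.url. Is there a similar way to get the link text? –

Actually, I am trying to find a generic solution rather than supplying tag names or class names. If a link on the parent URL matches the regex, I want to print that link and the text between its tags, instead of reading the content of the matched URL. –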

I think this will meet your needs:

from scrapy import Spider

class TestSpider(Spider):  # inherit from Spider instead of CrawlSpider
    name = 'testingcode'
    start_urls = ['http://dinoopnair.blogspot.in/']

    def parse(self, response):
        # each post title on the blog index is an <h3> wrapping the post link
        base_selector = response.xpath('//h3[@class="post-title entry-title"]')
        for sel in base_selector:
            link = sel.xpath('./a/@href').extract()
            link_text = sel.xpath('./a/text()').extract()
            # clean the data: fall back to 'n/a' when nothing was extracted
            link = link[0] if link else 'n/a'
            link_text = link_text[0].strip() if link_text else 'n/a'
            print(link, link_text)
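
The clean-up step above (take the first extracted value, strip it, fall back to 'n/a') could be factored into a small helper. This is only a sketch: first_or_default is a hypothetical name, not part of Scrapy, and unlike the code above it skips whitespace-only values instead of returning them. Recent Scrapy versions also let you write something like sel.xpath('./a/text()').get(default='n/a'), which covers the fallback but not the stripping.

```python
def first_or_default(values, default='n/a'):
    """Return the first non-empty string in values, stripped; else default.

    `values` is assumed to be the list of strings returned by
    Selector.xpath(...).extract().
    """
    for value in values:
        stripped = value.strip()
        if stripped:
            return stripped
    return default


# Example with values shaped like the output of .extract()
print(first_or_default(['  facebook search and elastic search \n']))
print(first_or_default([]))
```

In the spider, the two clean-up lines would then become link = first_or_default(link) and link_text = first_or_default(link_text).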

Edit

Generic code, since the user has several start URLs:

from scrapy.selector import Selector
# other code here

def parse(self, response):
    # change the regex accordingly
    links = response.xpath('//a').re(r'href=".*\d{4}/\d{2}/.*')
    for link in links:
        # rebuild an <a> element from the matched fragment so it can be parsed again
        sel = Selector(text='<a ' + link)
        link_text = sel.xpath('//a//text()').extract()
        url = sel.xpath('//a/@href').extract()
        # fall back to 'n/a' when the fragment yields no text or no href
        link_text = ' '.join(link_text).strip() if link_text else 'n/a'
        url = url[0] if url else 'n/a'
        print(link_text, url)
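
If you want to check the regex and the link/text pairing without running a crawl, the same idea can be sketched with only the Python standard library. This is not Scrapy code: LinkTextCollector and matching_links are hypothetical names, and the "about" link in the sample is made up for illustration. It parses the anchors first and applies the regex to the href only, rather than to the serialized markup. Inside a spider, the Link objects returned by LinkExtractor(allow=...).extract_links(response) should give you the same pairs through their url and text attributes.

```python
import re
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        # only keep text that appears inside an open <a>
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # join the pieces and collapse runs of whitespace
            text = ' '.join(''.join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


def matching_links(html, pattern):
    """Return (href, text) pairs whose href matches the regex pattern."""
    collector = LinkTextCollector()
    collector.feed(html)
    collector.close()
    regex = re.compile(pattern)
    return [(href, text) for href, text in collector.links if regex.search(href)]


# Sample markup: the first link is from the question, the second is hypothetical.
sample = (
    '<a href="http://dinoopnair.blogspot.in/2014/07/'
    'facebook-search-and-elastic-search.html">facebook search and elastic search</a>'
    '<a href="http://dinoopnair.blogspot.in/about">about</a>'
)
for href, text in matching_links(sample, r'\d{4}/\d{2}/\w+'):
    print(href, text)
```

Only the first sample link is printed, because the second href does not contain a YYYY/MM/title segment.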

Source

2015-05-07 07:51:57 Jithin
