Scrapy and Scrapy spiders


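I want to scrape a website. What I want to extract is the list of documents, the author names, and the dates. I watched some videos on Scrapy spiders and was able to work out three commands that return the required data from the site when run in the Scrapy shell. First, open the shell: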

scrapy shell https://www.cato.org/research/34/commentary

Dates:

response.css('span.date-display-single::text').extract()

Authors:

response.css('p.text-sans::text').extract()

Document links on the page:

response.css('p.text-large.experts-more-h > a::text').extract()

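I tried to do the same thing from a Python spider, but without success, because each selector returns multiple values.

Here is the Python code: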

import scrapy


class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        pass

Source

2017-09-06 Shad

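Don't use 'css' for this; 'xpath' is better. – AndMar

I am trying to build a module whose task is to click each article link and extract the date, author, and article title, and to do this for every article link on the page (cato.org/research/34/commentary). Please help. – Shad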

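This should work. All you need to do is run this command: scrapy runspider cato.py -o out.json. Note one problem with the links, though: as written, you will only get the link text, not the href.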

import scrapy


class CatoItem(scrapy.Item):
    date = scrapy.Field()
    author = scrapy.Field()
    links = scrapy.Field()


class CatoSpider(scrapy.Spider):
    name = 'cato'
    allowed_domains = ['cato.org']
    start_urls = ['https://www.cato.org/research/34/commentary']

    def parse(self, response):
        # Each selector returns a list with one entry per article on the page.
        date = response.css('span.date-display-single::text').extract()
        author = response.css('p.text-sans::text').extract()
        # '::text' yields the link text only, not the href.
        links = response.css('p.text-large.experts-more-h > a::text').extract()
        # Pair the three lists up position by position, one item per article.
        for d, a, l in zip(date, author, links):
            item = CatoItem()
            item['date'] = d
            item['author'] = a
            item['links'] = l
            yield item

Source

2017-09-06 06:02:18 AndMar

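Thank you. I can't say how grateful I am. Just one question: does the 'CatoItem' class go in a separate file, or does it have to sit together with the spider part? – Shad

You can put 'CatoItem' in the same module as your spider, but that is bad practice. It is better to put it in 'items.py', because you may have many spiders in the future, and it is then easy to import it from that one module. – AndMar

You are a lifesaver. – Shad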
