How to extract all the content of a web page with Scrapy


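I am using Scrapy 1.4 to scrape content from web pages by specifying a set of URLs. How can I extract the various pieces of information from each page, such as its URL, title, and body text?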

Currently I am using the following URL:

https://healthlibrary.epnet.com/GetContent.aspx?token=3bb6e77f-7239-4082-81fb-4aeb0064ca19&chunkiid=32905

And my code is:

import scrapy
from scrapy.selector import Selector
# GsapocItem is assumed to be defined in the project's items module (import not shown in the question).

class gsapocSpider(scrapy.Spider):
    name = "gsapoc"
    start_urls = ["https://healthlibrary.epnet.com/GetContent.aspx?token=3bb6e77f-7239-4082-81fb-4aeb0064ca19&chunkiid=32905"]

    def parse(self, response):
        responseSelector = Selector(response)
        # One item per <li> found under a <ul>
        for sel in responseSelector.xpath('//ul/li'):
            item = GsapocItem()
            item['title'] = sel.xpath('//ul/li/a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['body'] = sel.xpath('//body//p//text()').extract()
            #item['text'] = sel.xpath('//text()').extract()
            #body = response.xpath('//body//p//text()').extract()
            #print(body)
            yield item

Source

2017-09-26 Shankar Rao

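I don't understand why you wrote the XPath expressions this way. The page doesn't even contain a ul element.

Since your goal is just to get the URL, title, and body, here are some suggestions: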

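URL. You can get the URL from the response: response.url
Title. Depending on which kind of title you are looking for, there are two options: the title tag, or a specific element on the page.
Body. Do you want the whole page or only the text? If the former, response.body is fine; if the latter, you need to specify how to extract all of the content.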
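
The three suggestions above might be sketched roughly as follows. This is a minimal illustration, not a tested solution for the page in the question: extract_page is a hypothetical helper name, the '//h1' expression is only a guess at which element might hold a heading, and the paragraph XPath is the one from the question.

```python
def extract_page(response):
    """Collect URL, title, and paragraph text from a Scrapy response."""
    # URL: available directly on the response object.
    url = response.url

    # Title, option 1: the text of the <title> tag (None if absent).
    title = response.xpath('//title/text()').extract_first()

    # Title, option 2: a specific element. '//h1' is an assumption here;
    # inspect the real page to find which element holds the heading.
    heading = response.xpath('//h1//text()').extract_first()

    # Body as text only: every non-blank text node inside <p> elements.
    # (response.body would instead give the whole page as raw bytes.)
    chunks = response.xpath('//body//p//text()').extract()
    text = ' '.join(c.strip() for c in chunks if c.strip())

    return {'url': url, 'title': title, 'heading': heading, 'body': text}
```

Inside a spider, the parse callback could then simply do `yield extract_page(response)`. Returning a plain dict avoids the need for an Item class, although the same fields could be copied into GsapocItem if the project already defines it.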

In any case, I think you need to learn some HTML and XPath.

Thanks.

Source

2017-09-28 21:45:16 rojeeer
