I am using Scrapy to scrape URLs from a website. At the moment it returns all URLs, but I want it to return only the URLs that contain the word "download". How can I do that? Only return specific URLs in Scrapy.

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy

DOMAIN = 'somedomain.com'
URL = 'http://' + str(DOMAIN)

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            print url
            yield Request(url, callback=self.parse)

EDIT:

I implemented the suggestions below. It still throws some errors, but at least it now only returns the links that contain "download".

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy
from scrapy.linkextractors import LinkExtractor


DOMAIN = 'somedomain.com'
URL = 'http://' + str(DOMAIN)

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    # First parse returns all the links of the website and feeds them to parse2
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            yield Request(url, callback=self.parse2)

    # Second parse selects only the links that contain "download"
    def parse2(self, response):
        le = LinkExtractor(allow=("download"))
        for link in le.extract_links(response):
            yield Request(url=link.url, callback=self.parse2)
            print link.url

Answers


A more Pythonic and cleaner solution is to use LinkExtractor:

from scrapy.linkextractors import LinkExtractor 

... 

le = LinkExtractor(deny="download") 
for link in le.extract_links(response): 
    yield Request(url=link.url, callback=self.parse) 

Thanks, I got it to work, but this way the code rejects every link containing 'download', so that's at least halfway there. How can I do the opposite? – LuukS


Check the [LinkExtractor documentation](https://doc.scrapy.org/en/latest/topics/link-extractors.html); it also offers an 'allow' attribute, so you can create another LinkExtractor instance. – eLRuLL
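As a rough sketch of what that comment suggests, assuming the spider method from the question and that the 'allow' pattern (a regular expression) should match "download":

le = LinkExtractor(allow="download")
for link in le.extract_links(response):
    yield Request(url=link.url, callback=self.parse)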


Just tried it, and I get the same warning as with dot.Py's answer: ScrapyDeprecationWarning: Module scrapy.spider is deprecated, use scrapy.spiders instead: from scrapy.spider import BaseSpider – LuukS
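For reference, a minimal sketch of the import block with that warning in mind, assuming Scrapy 1.x, where scrapy.spider was renamed to scrapy.spiders and the unused BaseSpider import can simply be dropped:

import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
# scrapy.Spider (available via "import scrapy") is the current base class,
# so the deprecated "from scrapy.spider import BaseSpider" line is not needed.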


You are trying to check whether a substring exists within a string.

Example:

string = 'this is a simple string' 

'simple' in string 
True 

'zimple' in string 
False 

So you only have to add an if statement, like:

if 'download' in url:

right after the for-loop line,

that is:

for url in hxs.select('//a/@href').extract():
    if 'download' in url:
        if not (url.startswith('http://') or url.startswith('https://')):
            url = URL + url
        print url
        yield Request(url, callback=self.parse)

This way, the code will only check whether the link starts with http:// if the condition 'download' in url returns True.


Thank you, I was thinking of something like this myself. When I do this, Scrapy throws an error and tells me to use one of its selectors, css or xpath, instead... – LuukS
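That error most likely refers to HtmlXPathSelector being deprecated; a minimal sketch of the same loop using the response object's built-in xpath selector instead, assuming Scrapy 1.x and the URL constant from the question:

def parse(self, response):
    # response.xpath replaces the deprecated HtmlXPathSelector / hxs.select
    for url in response.xpath('//a/@href').extract():
        if 'download' in url:
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            yield Request(url, callback=self.parse)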


Solved it; it does work, but my crawler was blocking its own crawling path. – LuukS
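One way to keep the spider from re-crawling its own results, sketched under the assumption that the download links only need to be collected rather than followed: keep crawling ordinary pages with parse and yield matching URLs as plain items (the 'file_url' key is just an illustrative name):

def parse(self, response):
    for url in response.xpath('//a/@href').extract():
        if not (url.startswith('http://') or url.startswith('https://')):
            url = URL + url
        if 'download' in url:
            # collect download links as items instead of requesting them again
            yield {'file_url': url}
        else:
            # keep following ordinary pages
            yield Request(url, callback=self.parse)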