I am using Scrapy to scrape URLs from a website. At the moment it returns all URLs, but I want it to return only the URLs that contain the word "download". How can I do that? Only return specific URLs in Scrapy.

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy

DOMAIN = 'somedomain.com'
URL = 'http://' + str(DOMAIN)

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            print url
            yield Request(url, callback=self.parse)

EDIT:

I implemented the suggestions below. It still throws some errors, but at least it now only returns the links that contain "download".

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy
from scrapy.linkextractors import LinkExtractor


DOMAIN = 'somedomain.com'
URL = 'http://' + str(DOMAIN)

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    # First parse returns all the links of the website and feeds them to parse2
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            yield Request(url, callback=self.parse2)

    # Second parse selects only the links that contain "download"
    def parse2(self, response):
        le = LinkExtractor(allow=("download"))
        for link in le.extract_links(response):
            yield Request(url=link.url, callback=self.parse2)
            print link.url

Answers


A more Pythonic and cleaner solution is to use LinkExtractor:

from scrapy.linkextractors import LinkExtractor 

... 

le = LinkExtractor(deny="download") 
for link in le.extract_links(response): 
    yield Request(url=link.url, callback=self.parse) 

Thanks, I got it to work, but this way the code rejects every link containing 'download', so that's at least halfway there. How can I do the opposite? – LuukS


Check the [LinkExtractor documentation](https://doc.scrapy.org/en/latest/topics/link-extractors.html); it also offers an 'allow' attribute, so you can create another LinkExtractor instance. – eLRuLL
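As a rough sketch of what that comment suggests, assuming the spider method from the question and that the 'allow' pattern (a regular expression) should match "download":

le = LinkExtractor(allow="download")
for link in le.extract_links(response):
    yield Request(url=link.url, callback=self.parse)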


Just tried it, and I get the same warning as with dot.Py's answer: ScrapyDeprecationWarning: Module scrapy.spider is deprecated, use scrapy.spiders instead: from scrapy.spider import BaseSpider – LuukS
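For reference, a minimal sketch of the import block with that warning in mind, assuming Scrapy 1.x, where scrapy.spider was renamed to scrapy.spiders and the unused BaseSpider import can simply be dropped:

import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
# scrapy.Spider (available via "import scrapy") is the current base class,
# so the deprecated "from scrapy.spider import BaseSpider" line is not needed.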


You are trying to check whether a substring exists within a string.

Example:

string = 'this is a simple string' 

'simple' in string 
True 

'zimple' in string 
False 

So you only have to add an if statement, like:

if 'download' in url:

right after the for-loop line,

that is:

for url in hxs.select('//a/@href').extract():
    if 'download' in url:
        if not (url.startswith('http://') or url.startswith('https://')):
            url = URL + url
        print url
        yield Request(url, callback=self.parse)

This way, the code will only check whether the link starts with http:// if the condition 'download' in url returns True.


Thank you, I was thinking of something like this myself. When I do this, Scrapy throws an error and tells me to use one of its selectors, css or xpath, instead... – LuukS
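That error most likely refers to HtmlXPathSelector being deprecated; a minimal sketch of the same loop using the response object's built-in xpath selector instead, assuming Scrapy 1.x and the URL constant from the question:

def parse(self, response):
    # response.xpath replaces the deprecated HtmlXPathSelector / hxs.select
    for url in response.xpath('//a/@href').extract():
        if 'download' in url:
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            yield Request(url, callback=self.parse)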


Solved it; it does work, but my crawler was blocking its own crawling path. – LuukS
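One way to keep the spider from re-crawling its own results, sketched under the assumption that the download links only need to be collected rather than followed: keep crawling ordinary pages with parse and yield matching URLs as plain items (the 'file_url' key is just an illustrative name):

def parse(self, response):
    for url in response.xpath('//a/@href').extract():
        if not (url.startswith('http://') or url.startswith('https://')):
            url = URL + url
        if 'download' in url:
            # collect download links as items instead of requesting them again
            yield {'file_url': url}
        else:
            # keep following ordinary pages
            yield Request(url, callback=self.parse)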