Return only specific URLs in scrapy

I am using scrapy to scrape URLs from a website. At the moment it returns all URLs, but I want it to return only the URLs that contain the word "download". How can I do that?
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy

DOMAIN = 'somedomain.com'
URL = 'http://' + str(DOMAIN)

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            print url
            yield Request(url, callback=self.parse)
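One minimal way to get there, assuming a plain substring test on each href is enough, is to filter before yielding. This is a sketch, not the asker's code; it reuses URL and Request from the snippet above and swaps the deprecated HtmlXPathSelector for response.xpath:

def parse(self, response):
    # response.xpath replaces the deprecated HtmlXPathSelector
    for url in response.xpath('//a/@href').extract():
        if not (url.startswith('http://') or url.startswith('https://')):
            url = URL + url
        if 'download' in url:  # yield only links containing "download"
            yield Request(url, callback=self.parse)

Note that filtering here also stops the spider from following non-matching pages, so the crawl never reaches anything not linked from a "download" URL; splitting "follow everything" from "collect download links", as the edit below does, avoids that.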
EDIT:

I implemented the suggestion below. It still throws some errors, but at least it now returns only the links that contain "download".
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
import scrapy
from scrapy.linkextractors import LinkExtractor

DOMAIN = 'somedomain.com'
URL = 'http://' + str(DOMAIN)

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    # First parse returns all the links of the website and feeds them to parse2
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            yield Request(url, callback=self.parse2)

    # Second parse selects only the links that contain "download"
    def parse2(self, response):
        le = LinkExtractor(allow=("download"))
        for link in le.extract_links(response):
            yield Request(url=link.url, callback=self.parse2)
            print link.url
Thanks, I got it working, but this way the code rejects every link containing 'download', so that is at least half of it. How do I get it the other way around? – LuukS
Check the [LinkExtractor documentation](https://doc.scrapy.org/en/latest/topics/link-extractors.html); it also provides an 'allow' parameter, so you can create another LinkExtractor instance. – eLRuLL
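A short sketch of what that comment suggests: one unrestricted LinkExtractor to keep crawling, and a second one limited by allow to collect the matches. The spider name and domain here are placeholders:

import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor

class DownloadSpider(scrapy.Spider):
    name = 'downloads'
    allowed_domains = ['somedomain.com']
    start_urls = ['http://somedomain.com']

    follow_all = LinkExtractor()                          # no filter: crawl every page
    only_downloads = LinkExtractor(allow=(r'download',))  # keep only matching URLs

    def parse(self, response):
        # Report the links we actually want.
        for link in self.only_downloads.extract_links(response):
            print link.url
        # Keep following everything else so the crawl continues.
        for link in self.follow_all.extract_links(response):
            yield Request(link.url, callback=self.parse)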
Just tried it, and I get the same warning as with dot.Py's answer: ScrapyDeprecationWarning: Module scrapy.spider is deprecated, use scrapy.spiders instead, for the line from scrapy.spider import BaseSpider. – LuukS
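For what it's worth, that warning points at the unused BaseSpider import: the scrapy.spider module was renamed to scrapy.spiders in Scrapy 1.0, and since the class already subclasses scrapy.Spider, the deprecated imports can simply be dropped. The import block would then look something like:

import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
# dropped: "from scrapy.spider import BaseSpider" (renamed module, import was unused)
# dropped: "from scrapy.selector import HtmlXPathSelector" (use response.xpath instead)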