Scrapy: LinkExtractor rules not working
I have tried three different LinkExtractor variants, but all three ignore the 'deny' rules and keep crawling the sub-domains, which I want to exclude from the crawl.
Tried with an 'allow' rule only, to allow only the main domain, i.e. example.edu.uk:
rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # Not working
Tried with a 'deny' rule only, to deny all sub-domains, i.e. sub.example.edu.uk:
rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working
Tried with both 'allow & deny' rules:
rules = [Rule(LinkExtractor(allow=(r'^http:\/\/example\.edu\.uk(\/.*)?$'),deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working
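A quick way to sanity-check these patterns outside of Scrapy is to run them against the absolute URLs the crawler actually sees, since LinkExtractor applies the allow/deny regexes to the full URL, scheme included. A minimal sketch, assuming http:// versions of the example URLs listed further down (the scheme is my assumption):

import re

allow_v1 = r'^example\.edu.uk(\/.*)?$'            # variant 1 (allow only)
deny_v2  = r'(?<=\.)[a-z0-9-]*\.edu\.uk'          # variant 2 (deny only)
allow_v3 = r'^http:\/\/example\.edu\.uk(\/.*)?$'  # variant 3 (allow part)

urls = [
    'http://example.edu.uk/fsdfs.htm',             # should be followed
    'http://sub-domain.example.edu.uk/fsdfs.htm',  # should be discarded
]

for url in urls:
    print(url)
    print('  allow v1:', bool(re.search(allow_v1, url)))  # False for both: the URL starts with the scheme, not 'example'
    print('  deny  v2:', bool(re.search(deny_v2, url)))   # True only for the sub-domain URL
    print('  allow v3:', bool(re.search(allow_v3, url)))  # True only for the main-domain URL

If the patterns behave as intended in this isolated test, the problem is more likely in how the rules are wired into the spider than in the regexes themselves.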
Example:
Follow these links:
- example.edu.uk/fsdfs.htm
- example.edu.uk/nkln.htm
- example.edu.uk/vefr.htm
- example.edu.uk/opji.htm
Discard sub-domain links like these (a standalone LinkExtractor check over both kinds of links is sketched after the list):
- sub-domain.example.edu.uk/fsdfs.htm
- sub-domain.example.edu.uk/nkln.htm
- sub-domain.example.edu.uk/vefr.htm
- sub-domain.example.edu.uk/opji.htm
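Another isolated check, independent of the spider (this snippet and the listing URL in it are my own sketch, not part of the original code): build a throwaway HtmlResponse containing one link of each kind and see which links a LinkExtractor configured with the attempted allow/deny patterns actually extracts.

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# a hand-built page with one main-domain link and one sub-domain link
html = b"""
<html><body>
  <a href="http://example.edu.uk/fsdfs.htm">main domain</a>
  <a href="http://sub-domain.example.edu.uk/fsdfs.htm">sub-domain</a>
</body></html>
"""
response = HtmlResponse(url='http://example.edu.uk/listing',
                        body=html, encoding='utf-8')

extractor = LinkExtractor(
    allow=(r'^http:\/\/example\.edu\.uk(\/.*)?$',),
    deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',),
)
for link in extractor.extract_links(response):
    print(link.url)  # expected: only http://example.edu.uk/fsdfs.htm

If only the main-domain link comes out here, the patterns are fine in isolation and the issue lies in how the rule is applied during the crawl.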
Below is the full code:
from scrapy import Request
from scrapy.item import Item, Field
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup


class NewsFields(Item):
    pagetype = Field()
    pagetitle = Field()
    pageurl = Field()
    pagedate = Field()
    pagedescription = Field()
    bodytext = Field()


class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']
    allowed_domains = ['example.edu.uk']
    rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*',))),)

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # <meta> tags carrying the page metadata
        ptype = soup.find_all(attrs={"name": "nkdpagetype"})
        ptitle = soup.find_all(attrs={"name": "nkdpagetitle"})
        pturl = soup.find_all(attrs={"name": "nkdpageurl"})
        ptdate = soup.find_all(attrs={"name": "nkdpagedate"})
        ptdesc = soup.find_all(attrs={"name": "nkdpagedescription"})
        # collect the visible text of the main content area and yield one item
        for node in soup.find_all("div", id="main-content__wrapper"):
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
        # follow in-page links manually as well
        for url in hxs.xpath('//p/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)
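One thing in the code above that may be worth double-checking (a generic point from the Scrapy docs, not a claim that it is the root cause here): CrawlSpider implements its rule handling in its own parse method, so rule callbacks are normally given a different name rather than overriding parse. A minimal sketch of that wiring, with parse_item as a hypothetical callback name and the same allow/deny patterns as above:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ProfilesSketchSpider(CrawlSpider):
    name = 'profiles_sketch'
    start_urls = ['http://www.example.edu.uk/listing']
    allowed_domains = ['example.edu.uk']

    rules = (
        Rule(
            LinkExtractor(
                allow=(r'^https?://example.edu.uk/.*',),
                deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',),
            ),
            callback='parse_item',  # named callback; 'parse' is reserved for CrawlSpider itself
            follow=True,            # keep following links from matched pages
        ),
    )

    def parse_item(self, response):
        # page-specific extraction would go here
        yield {'pageurl': response.url}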
Can anyone help? Thanks.
Please post some sample links that you want processed and some that you don't want processed. –
Also, when you say it is not working, please tell us what is actually happening. Post the logs if possible. –
Hi @TarunLalwani, what is unclear in my question? All links under the main domain must be crawled, and all links under the sub-domains must be discarded. Anyway, I have updated the question. See above. – Slyper