Scrapy callback returns the same results multiple times

I'm new to Scrapy and I can't get the callback function to work properly. I manage to collect all the URLs, and I manage to follow them in the callback, but when I look at the results, some of them appear several times and many others are missing. What could be the problem?
import scrapy

from kexcrawler.items import KexcrawlerItem


class KexSpider(scrapy.Spider):
    name = 'kex'
    allowed_domains = ["kth.diva-portal.org"]
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?dswid=-855&language=en&searchType=RESEARCH&query=&af=%5B%5D&aq=%5B%5B%5D%5D&aq2=%5B%5B%7B%22dateIssued%22%3A%7B%22from%22%3A%222015%22%2C%22to%22%3A%222015%22%7D%7D%2C%7B%22organisationId%22%3A%225956%22%2C%22organisationId-Xtra%22%3Atrue%7D%2C%7B%22publicationTypeCode%22%3A%5B%22article%22%5D%7D%2C%7B%22contentTypeCode%22%3A%5B%22refereed%22%5D%7D%5D%5D&aqe=%5B%5D&noOfRows=250&sortOrder=author_sort_asc&onlyFullText=false&sf=all']

    def parse(self, response):
        for href in response.xpath('//li[@class="ui-datalist-item"]/div[@class="searchItem borderColor"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        item = KexcrawlerItem()
        item['report'] = response.xpath('//div[@class="toSplitVertically"]/div[@id="innerEastCenter"]/span[@class="displayFields"]/span[@class="subTitle"]/text()').extract()
        yield item
The first lines of the output:
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Comparing Vocal Fold Contact Criteria Derived From Audio and Electroglottographic Signals"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]},
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]},
{"report": ["Dynamic message-passing approach for kinetic spin models with reversible dynamics"]},
{"report": ["RNA editing of non-coding RNA and its role in gene regulation"]},
{"report": ["Security monitor inlining and certification for multithreaded Java"]},
{"report": ["Security monitor inlining and certification for multithreaded Java"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]},
I'm fairly certain you should never override the 'parse' method in Scrapy – that is where much of its implementation lives. – gtlambert
@gtlambert That is not correct: you have to override the 'parse' method, because it is Scrapy's entry point. You are thinking of the case where a LinkExtractor is used: there you must not override 'parse', because it has a required default implementation (or you can write your own, but then you have to handle the link extraction yourself). – GHajba
@Agnes Have you looked at the URLs your 'parse' method passes to the new requests? Scrapy filters on the URLs it loads, not on the results. If some session parameter ends up in the URLs, you can get duplicate results. If you want to filter the results instead, create a custom item exporter that marks already-exported elements and filters them out. – GHajba
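Following up on the comment above: the DiVA portal appends a session parameter (`dswid`, visible in the start URL) to its links, so two requests for the same record can look like different URLs to Scrapy's built-in duplicate filter, while genuinely different records can collide after a redirect. A minimal sketch, assuming `dswid` is the only session parameter involved, is to strip it before yielding the request (stdlib only; the function name is my own, not part of Scrapy):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def strip_session_params(url, params=("dswid",)):
    """Remove session-tracking query parameters so that Scrapy's
    duplicate filter compares the URLs that actually identify a record."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in params]
    return urlunparse(parts._replace(query=urlencode(query)))
```

In the spider's `parse` this would be used as `url = strip_session_params(response.urljoin(href.extract()))` before `yield scrapy.Request(...)`. The alternative the comment mentions, deduplicating the scraped items themselves, can be done in an item pipeline that keeps a `set` of seen `report` values and raises `scrapy.exceptions.DropItem` for repeats, but cleaning the URLs attacks the cause rather than the symptom.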