
Scrapy callback function returns the same results multiple times

I am new to Scrapy and I cannot get the callback function to work properly. I manage to collect all the URLs and to follow them in the callback, but when I look at the results, some of them appear several times and many others are missing. What could the problem be?

import scrapy 

from kexcrawler.items import KexcrawlerItem 

class KexSpider(scrapy.Spider): 
    name = 'kex' 
    allowed_domains = ["kth.diva-portal.org"] 
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?dswid=-855&language=en&searchType=RESEARCH&query=&af=%5B%5D&aq=%5B%5B%5D%5D&aq2=%5B%5B%7B%22dateIssued%22%3A%7B%22from%22%3A%222015%22%2C%22to%22%3A%222015%22%7D%7D%2C%7B%22organisationId%22%3A%225956%22%2C%22organisationId-Xtra%22%3Atrue%7D%2C%7B%22publicationTypeCode%22%3A%5B%22article%22%5D%7D%2C%7B%22contentTypeCode%22%3A%5B%22refereed%22%5D%7D%5D%5D&aqe=%5B%5D&noOfRows=250&sortOrder=author_sort_asc&onlyFullText=false&sf=all'] 

    def parse(self, response):
        # Follow every search-result link on the listing page.
        for href in response.xpath('//li[@class="ui-datalist-item"]/div[@class="searchItem borderColor"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Scrape the title of each report from its detail page.
        item = KexcrawlerItem()
        item['report'] = response.xpath('//div[@class="toSplitVertically"]/div[@id="innerEastCenter"]/span[@class="displayFields"]/span[@class="subTitle"]/text()').extract()
        yield item

The first lines of the results:

{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]}, 
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]}, 
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]}, 
{"report": ["Comparing Vocal Fold Contact Criteria Derived From Audio and Electroglottographic Signals"]}, 
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]}, 
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]}, 
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]}, 
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]}, 
{"report": ["Dynamic message-passing approach for kinetic spin models with reversible dynamics"]}, 
{"report": ["RNA editing of non-coding RNA and its role in gene regulation"]}, 
{"report": ["Security monitor inlining and certification for multithreaded Java"]}, 
{"report": ["Security monitor inlining and certification for multithreaded Java"]}, 
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]}, 
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]}, 
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]}, 
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]}, 

I'm fairly sure you should definitely not be overriding the `parse` method in Scrapy - that is where the bulk of its implementation lives. – gtlambert


@gtlambert That is not correct: you do have to override the `parse` method, because it is Scrapy's entry point. You are probably thinking of the case where a LinkExtractor is used: there you must not override `parse`, because it carries a required default implementation (you could implement it yourself, but then you would be redoing the extraction machinery on your own). – GHajba
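
A minimal sketch of the LinkExtractor case GHajba refers to, assuming a CrawlSpider is used: CrawlSpider's built-in `parse` drives the rule machinery, so extracted links must go to a differently named callback. The spider name, the restrict_xpaths expression, and the shortened start URL below are illustrative assumptions, not taken from the post.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DivaCrawlSpider(CrawlSpider):
    name = 'kex_crawl'  # hypothetical name
    allowed_domains = ["kth.diva-portal.org"]
    # Shortened here for readability; the full query string is in the question.
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?...']

    # CrawlSpider's own parse() applies these rules; do not override it.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="searchItem borderColor"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Each followed link ends up here instead of in parse().
        yield {'url': response.url}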


@Agnes Have you looked at the URLs your `parse` method hands to the new requests? Scrapy filters duplicate URLs, not duplicate results. If some session parameter ends up in the URLs, you can get the same result several times. If you want to filter the results themselves, create a custom item exporter that marks already-exported elements and filters them out. – GHajba
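
A minimal sketch of the filtering idea GHajba describes, implemented here as an item pipeline rather than an item exporter (a pipeline is the more usual place to drop duplicates; the class name is an assumption):

from scrapy.exceptions import DropItem

class DuplicateReportPipeline(object):
    def __init__(self):
        self.seen = set()  # keys of items already passed through

    def process_item(self, item, spider):
        # 'report' is a list in this spider, so make it hashable first.
        key = tuple(item['report'])
        if key in self.seen:
            raise DropItem('duplicate report: %r' % (key,))
        self.seen.add(key)
        return item

The pipeline would then be enabled in settings.py, for example with ITEM_PIPELINES = {'kexcrawler.pipelines.DuplicateReportPipeline': 300} (the module path is assumed).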

Answer


I tried to reproduce your error and could not. All of the URLs were different. I logged every item at INFO level, suppressed everything below that, and found that every report was unique as well. I did remove your import, because it gave me an error, and defined your item class with a single field myself. If you copied and pasted straight from the terminal, then I assume the output came from a print rather than from the log, which makes me think there may be multiple print calls firing at different times. Try writing to a file somewhere and see whether the duplicates show up there too. To test whether the URLs are unique, I extracted the elements from the xpath into a list called elem:

print len(elem)
b = set()
for e in elem:
    b.add(e)
print len(b)

You could try keeping a global list of items and adding a spider_closed function, which is called automatically when the spider closes, and then running the same check on that list (a sketch of this follows below). Sets only hold unique elements, so if the two lengths differ, you really are creating duplicates.
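
A minimal sketch of that closing suggestion, assuming the standard Scrapy signal hookup via from_crawler; the `collected` attribute name is an assumption, and the start URL is shortened here (the full query string is in the question):

import scrapy
from scrapy import signals

from kexcrawler.items import KexcrawlerItem

class KexSpider(scrapy.Spider):
    name = 'kex'
    allowed_domains = ["kth.diva-portal.org"]
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?...']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Standard Scrapy wiring: call spider_closed when the crawl ends.
        spider = super(KexSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        spider.collected = []  # every 'report' value the spider yields
        return spider

    def parse(self, response):
        for href in response.xpath('//li[@class="ui-datalist-item"]/div[@class="searchItem borderColor"]/a/@href'):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        item = KexcrawlerItem()
        item['report'] = response.xpath('//div[@class="toSplitVertically"]/div[@id="innerEastCenter"]/span[@class="displayFields"]/span[@class="subTitle"]/text()').extract()
        self.collected.append(tuple(item['report']))  # tuples are hashable
        yield item

    def spider_closed(self, spider):
        # If these two counts differ, duplicate items really were yielded.
        unique = set(self.collected)
        spider.logger.info('%d items scraped, %d unique', len(self.collected), len(unique))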