
Scrapy callback function returns the same results multiple times

I am new to Scrapy and I cannot get the callback function to work properly. I manage to collect all the URLs and to follow them in the callback, but when I look at the results, some of them appear several times and many others are missing. What could the problem be?

import scrapy 

from kexcrawler.items import KexcrawlerItem 

class KexSpider(scrapy.Spider): 
    name = 'kex' 
    allowed_domains = ["kth.diva-portal.org"] 
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?dswid=-855&language=en&searchType=RESEARCH&query=&af=%5B%5D&aq=%5B%5B%5D%5D&aq2=%5B%5B%7B%22dateIssued%22%3A%7B%22from%22%3A%222015%22%2C%22to%22%3A%222015%22%7D%7D%2C%7B%22organisationId%22%3A%225956%22%2C%22organisationId-Xtra%22%3Atrue%7D%2C%7B%22publicationTypeCode%22%3A%5B%22article%22%5D%7D%2C%7B%22contentTypeCode%22%3A%5B%22refereed%22%5D%7D%5D%5D&aqe=%5B%5D&noOfRows=250&sortOrder=author_sort_asc&onlyFullText=false&sf=all'] 

    def parse(self, response):
        # Follow every search-result link on the listing page.
        for href in response.xpath('//li[@class="ui-datalist-item"]/div[@class="searchItem borderColor"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Scrape the title of each report from its detail page.
        item = KexcrawlerItem()
        item['report'] = response.xpath('//div[@class="toSplitVertically"]/div[@id="innerEastCenter"]/span[@class="displayFields"]/span[@class="subTitle"]/text()').extract()
        yield item

The first lines of the results:

{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]}, 
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]}, 
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]}, 
{"report": ["Comparing Vocal Fold Contact Criteria Derived From Audio and Electroglottographic Signals"]}, 
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]}, 
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]}, 
{"report": ["On Multiple Reconnection ", "-Lines and Tripolar Perturbations of Strong Guide Magnetic Fields"]}, 
{"report": ["Four-Component Relativistic Calculations in Solution with the Polarizable Continuum Model of Solvation: Theory, Implementation, and Application to the Group 16 Dihydrides H2X (X = O, S, Se, Te, Po)"]}, 
{"report": ["Dynamic message-passing approach for kinetic spin models with reversible dynamics"]}, 
{"report": ["RNA editing of non-coding RNA and its role in gene regulation"]}, 
{"report": ["Security monitor inlining and certification for multithreaded Java"]}, 
{"report": ["Security monitor inlining and certification for multithreaded Java"]}, 
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]}, 
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]}, 
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]}, 
{"report": ["On the electron dynamics during island coalescence in asymmetric magnetic reconnection"]}, 

I'm fairly sure you should definitely not be overriding the `parse` method in Scrapy - that is where the bulk of its implementation lives. – gtlambert


@gtlambert That is not correct: you do have to override the `parse` method, because it is Scrapy's entry point. You are probably thinking of the case where a LinkExtractor is used: there you must not override `parse`, because it carries a required default implementation (you could implement it yourself, but then you would be redoing the extraction machinery on your own). – GHajba
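
A minimal sketch of the LinkExtractor case GHajba refers to, assuming a CrawlSpider is used: CrawlSpider's built-in `parse` drives the rule machinery, so extracted links must go to a differently named callback. The spider name, the restrict_xpaths expression, and the shortened start URL below are illustrative assumptions, not taken from the post.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DivaCrawlSpider(CrawlSpider):
    name = 'kex_crawl'  # hypothetical name
    allowed_domains = ["kth.diva-portal.org"]
    # Shortened here for readability; the full query string is in the question.
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?...']

    # CrawlSpider's own parse() applies these rules; do not override it.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="searchItem borderColor"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Each followed link ends up here instead of in parse().
        yield {'url': response.url}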


@Agnes Have you looked at the URLs your `parse` method hands to the new requests? Scrapy filters duplicate URLs, not duplicate results. If some session parameter ends up in the URLs, you can get the same result several times. If you want to filter the results themselves, create a custom item exporter that marks already-exported elements and filters them out. – GHajba
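
A minimal sketch of the filtering idea GHajba describes, implemented here as an item pipeline rather than an item exporter (a pipeline is the more usual place to drop duplicates; the class name is an assumption):

from scrapy.exceptions import DropItem

class DuplicateReportPipeline(object):
    def __init__(self):
        self.seen = set()  # keys of items already passed through

    def process_item(self, item, spider):
        # 'report' is a list in this spider, so make it hashable first.
        key = tuple(item['report'])
        if key in self.seen:
            raise DropItem('duplicate report: %r' % (key,))
        self.seen.add(key)
        return item

The pipeline would then be enabled in settings.py, for example with ITEM_PIPELINES = {'kexcrawler.pipelines.DuplicateReportPipeline': 300} (the module path is assumed).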

Answer


I tried to reproduce your error and could not. All of the URLs were different. I logged every item at INFO level, suppressed everything below that, and found that every report was unique as well. I did remove your import, because it gave me an error, and defined your item class with a single field myself. If you copied and pasted straight from the terminal, then I assume the output came from a print rather than from the log, which makes me think there may be multiple print calls firing at different times. Try writing to a file somewhere and see whether the duplicates show up there too. To test whether the URLs are unique, I extracted the elements from the xpath into a list called elem:

print len(elem)
b = set()
for e in elem:
    b.add(e)
print len(b)

You could try keeping a global list of items and adding a spider_closed function, which is called automatically when the spider closes, and then running the same check on that list (a sketch of this follows below). Sets only hold unique elements, so if the two lengths differ, you really are creating duplicates.
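
A minimal sketch of that closing suggestion, assuming the standard Scrapy signal hookup via from_crawler; the `collected` attribute name is an assumption, and the start URL is shortened here (the full query string is in the question):

import scrapy
from scrapy import signals

from kexcrawler.items import KexcrawlerItem

class KexSpider(scrapy.Spider):
    name = 'kex'
    allowed_domains = ["kth.diva-portal.org"]
    start_urls = ['http://kth.diva-portal.org/smash/resultList.jsf?...']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Standard Scrapy wiring: call spider_closed when the crawl ends.
        spider = super(KexSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        spider.collected = []  # every 'report' value the spider yields
        return spider

    def parse(self, response):
        for href in response.xpath('//li[@class="ui-datalist-item"]/div[@class="searchItem borderColor"]/a/@href'):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        item = KexcrawlerItem()
        item['report'] = response.xpath('//div[@class="toSplitVertically"]/div[@id="innerEastCenter"]/span[@class="displayFields"]/span[@class="subTitle"]/text()').extract()
        self.collected.append(tuple(item['report']))  # tuples are hashable
        yield item

    def spider_closed(self, spider):
        # If these two counts differ, duplicate items really were yielded.
        unique = set(self.collected)
        spider.logger.info('%d items scraped, %d unique', len(self.collected), len(unique))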