2017-05-28 41 views
0

我寫了一個腳本,並使用Scrapy在第一階段查找鏈接,並在第二階段中按照鏈接和頁面提取內容。 Scrapy它,但它遵循一個無序的方式鏈接,即我期望的輸出如下:使Scrapy按照鏈接順序

link1 | data_extracted_from_link1_destination_page 
link2 | data_extracted_from_link2_destination_page 
link3 | data_extracted_from_link3_destination_page 
. 
. 
. 

,但我得到

link1 | data_extracted_from_link2_destination_page 
link2 | data_extracted_from_link3_destination_page 
link3 | data_extracted_from_link1_destination_page 
. 
. 
. 

這裏是我的代碼:

import scrapy 


class firstSpider(scrapy.Spider): 
    name = "ipatranscription" 
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html'] 

    def parse(self, response): 
     body = response.xpath('./body/div[3]/div[1]/div/a') 
     LinkTextSelector = './text()' 
     LinkDestSelector = './@href' 

     for link in body: 
      LinkText = link.xpath(LinkTextSelector).extract_first() 
      LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first()) 

      yield {"LinkText": LinkText} 
      yield scrapy.Request(url=LinkDest, callback=self.parse_contents) 

    def parse_contents(self, response): 

     lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract() 
     sContent = "" 
     for i in lContent: 
      sContent += i 
     sContent = sContent.replace("\n", "").replace("\t", "") 
     yield {"LinkContent": sContent} 

我的代碼有什麼問題?

回答

1

產量不同步,你應該使用meta來實現這一點。 文件:https://doc.scrapy.org/en/latest/topics/request-response.html
代碼:

import scrapy 
class firstSpider(scrapy.Spider): 
    name = "ipatranscription" 
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html'] 
    def parse(self, response): 
     body = response.xpath('./body/div[3]/div[1]/div/a') 
     LinkTextSelector = './text()' 
     LinkDestSelector = './@href' 
     for link in body: 
      LinkText = link.xpath(LinkTextSelector).extract_first() 
      LinkDest = 
       response.urljoin(link.xpath(LinkDestSelector).extract_first()) 
      yield scrapy.Request(url=LinkDest, callback=self.parse_contents, meta={"LinkText": LinkText}) 

    def parse_contents(self, response): 
     lContent = 
response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract() 
     sContent = "" 
     for i in lContent: 
      sContent += i 
     sContent = sContent.replace("\n", "").replace("\t", "") 
     linkText = response.meta['LinkText'] 
     yield {"LinkContent": sContent,"LinkText": linkText}