Make Scrapy follow links in order

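I wrote a script that uses Scrapy to find links in a first stage, then follows those links and extracts content from the destination pages in a second stage. Scrapy does this, but it follows the links in no particular order. I expect output like this: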

link1 | data_extracted_from_link1_destination_page 
link2 | data_extracted_from_link2_destination_page 
link3 | data_extracted_from_link3_destination_page 
. 
. 
.

but I get

link1 | data_extracted_from_link2_destination_page 
link2 | data_extracted_from_link3_destination_page 
link3 | data_extracted_from_link1_destination_page 
. 
. 
.

Here is my code:

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'

        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())

            yield {"LinkText": LinkText}
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents)

    def parse_contents(self, response):

        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        yield {"LinkContent": sContent}

What is wrong with my code?

Source

2017-05-28 Gmosy Gnaq

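The two yields are not synchronized. Scrapy schedules requests asynchronously, so responses can arrive in a different order from the links, and items yielded separately from parse and parse_contents cannot be matched up afterwards. Use meta to pass the link text along with the request, then yield both fields together in the callback. Documentation: https://doc.scrapy.org/en/latest/topics/request-response.html
Code: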

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'
        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())
            # Attach the link text to the request so it travels with the response.
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents, meta={"LinkText": LinkText})

    def parse_contents(self, response):
        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        # Read back the link text that parse() attached to this request.
        linkText = response.meta['LinkText']
        yield {"LinkContent": sContent, "LinkText": linkText}
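The effect can be reproduced without Scrapy. The following is only a rough sketch using asyncio from the standard library: the PAGES table, its delays, and the "index" key are made up for illustration and are not part of the spider above. It shows that when each simulated request carries its own link text (as Request.meta does), every item stays correctly paired even though responses finish out of order, and that carrying a position index as well would let you restore the original link order afterwards.

```python
import asyncio

# Hypothetical pages: link text -> (page content, simulated response delay).
PAGES = {
    "link1": ("content1", 0.15),
    "link2": ("content2", 0.05),
    "link3": ("content3", 0.10),
}


async def fetch(index, link_text):
    """Simulate a request that carries its own metadata, like Request.meta."""
    content, delay = PAGES[link_text]
    await asyncio.sleep(delay)  # slower pages finish later
    # Link text and content are emitted together, so they cannot get mixed up.
    return {"index": index, "LinkText": link_text, "LinkContent": content}


async def crawl():
    tasks = [asyncio.create_task(fetch(i, text)) for i, text in enumerate(PAGES)]
    items = []
    for finished in asyncio.as_completed(tasks):
        items.append(await finished)  # collected in arrival order
    return items


items = asyncio.run(crawl())

# Sorting by the carried index restores the order in which links were found.
ordered = sorted(items, key=lambda item: item["index"])
for item in ordered:
    print(item["LinkText"], "|", item["LinkContent"])
```

In the spider, the same idea would mean adding a position to meta next to LinkText and sorting the exported items on it after the crawl, since the order in which responses arrive is not guaranteed.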

Source

2017-05-29 01:12:45
