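Make Scrapy follow links in order

I wrote a script that uses Scrapy to find links in a first phase and then, in a second phase, follows each link and extracts content from the destination page. Scrapy does this, but it follows the links in no particular order. I expect output like this: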
link1 | data_extracted_from_link1_destination_page
link2 | data_extracted_from_link2_destination_page
link3 | data_extracted_from_link3_destination_page
.
.
.
but I get:
link1 | data_extracted_from_link2_destination_page
link2 | data_extracted_from_link3_destination_page
link3 | data_extracted_from_link1_destination_page
.
.
.
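To make the expected pairing concrete, here is a small stdlib-only sketch with no Scrapy involved. The `fetch` coroutine, its delays, and the `run_demo` helper are made up for illustration and are not part of my spider. The point is only that each link's text travels together with the content of its own page, so every row stays matched even when the pages finish downloading in a different order than they were requested.

```python
import asyncio


async def fetch(link_text, delay):
    # Hypothetical stand-in for downloading a page; the delay is invented
    # so that pages finish in a different order than they were requested.
    await asyncio.sleep(delay)
    content = "data_extracted_from_{}_destination_page".format(link_text)
    # Return the link text together with the content of its own page.
    return link_text, content


async def crawl(links):
    tasks = [asyncio.create_task(fetch(text, delay)) for text, delay in links]
    rows = []
    # as_completed yields results in completion order, not request order.
    for finished in asyncio.as_completed(tasks):
        link_text, content = await finished
        rows.append("{} | {}".format(link_text, content))
    return rows


def run_demo():
    # Made-up delays: link1 is requested first but is the slowest.
    links = [("link1", 0.03), ("link2", 0.01), ("link3", 0.02)]
    return asyncio.run(crawl(links))


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

The rows may still be printed in a different order than the links were listed, which I can live with. What I need is that each row pairs a link with the data from that link's own page.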
Here is my code:
import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'

        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())

            yield {"LinkText": LinkText}
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents)

    def parse_contents(self, response):
        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        yield {"LinkContent": sContent}
What is wrong with my code?