Scrapy spider returns only the last element when given a list of selectors

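I have run into a problem with a spider I put together. I am trying to extract each line of text, along with its timestamp, from the transcript on this site. I found what I believe are the right selectors, but when I run the spider, the output is only the last line and its timestamp. I have seen other people with similar problems, but have not found an answer that solves mine.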

Here is the spider:

# -*- coding: utf-8 -*- 
import scrapy 
from this_american_life.items import TalTranscriptItem 

class CrawlSpider(scrapy.Spider): 
    name = "transcript2" 
    allowed_domains = ["https://www.thisamericanlife.org/radio-archives/episode/1/transcript"] 
    start_urls = (
        'https://www.thisamericanlife.org/radio-archives/episode/1/transcript',
    )

    def parse(self, response):
        item = TalTranscriptItem()
        for line in response.xpath('//p'):
            item['begin_timestamp'] = line.xpath('//@begin').extract()
            item['line_text'] = line.xpath('//text()').extract()
        yield item

Here is the code for TalTranscriptItem() in items.py:

# -*- coding: utf-8 -*- 

# Define here the models for your scraped items 
# 
# See documentation in: 
# http://doc.scrapy.org/en/latest/topics/items.html 

import scrapy 


class TalTranscriptItem(scrapy.Item): 
    # define the fields for your item here like: 
    # name = scrapy.Field() 
    episode_id = scrapy.Field() 
    episode_num_text = scrapy.Field() 
    year = scrapy.Field() 
    radio_date_text = scrapy.Field() 
    radio_date_datetime = scrapy.Field() 
    episode_title = scrapy.Field() 
    episode_hosts = scrapy.Field() 
    act_id = scrapy.Field() 
    line_id = scrapy.Field() 
    begin_timestamp = scrapy.Field() 
    speaker_class = scrapy.Field() 
    speaker_name = scrapy.Field() 
    line_text = scrapy.Field() 
    full_audio_link = scrapy.Field() 
    transcript_url = scrapy.Field()

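When I run the selectors in the Scrapy shell, they seem to work correctly (they pull the text of every line), but for some reason I have not been able to get them to work in the spider.

I am happy to clarify any of this, and would greatly appreciate any help anyone can offer!

Source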

2017-10-19 Chris Jewell

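What type is 'TalTranscriptItem'? – Hackerman

@Hackerman I will add the code for TalTranscriptItem to the question. It is a class in the items.py file in the Scrapy project directory. –

If I remember correctly, 'scrapy.Field()' is a plain old Python dict, not a list – Hackerman

I don't know what the item is, but you could do this: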

item = [] 

for line in response.xpath('//p'): 
    dictItem = {'begin_timestamp':line.xpath('//@begin').extract(),'line_text':line.xpath('//text()').extract()} 
    item.append(dictItem) 

print(item)

Source

2017-10-19 20:38:21 Wandrille

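Thanks, this works in the Scrapy shell, but for some reason it still pulls only the last element when run in the spider. –

If you want one item per line, which I think is what you are after, this should do it (note the indentation of the final yield line):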

for line in response.css('p'): 
    item = TalTranscriptItem() 
    item['begin_timestamp'] = line.xpath('./@begin').extract_first() 
    item['line_text'] = line.xpath('./text()').extract_first() 
    yield item
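
To see why the position of `yield` matters, here is a minimal sketch in plain Python with no Scrapy involved. Plain dicts stand in for `TalTranscriptItem` and `(timestamp, text)` tuples stand in for the selected `<p>` elements; the function names and sample data are hypothetical.

```python
def parse_shared_item(lines):
    """Mimics the original spider: one item, yielded after the loop."""
    item = {}
    for timestamp, text in lines:
        # Each pass overwrites the fields set by the previous pass.
        item['begin_timestamp'] = timestamp
        item['line_text'] = text
    yield item  # runs once, after the loop has finished


def parse_item_per_line(lines):
    """Creates a fresh item per line and yields inside the loop."""
    for timestamp, text in lines:
        item = {'begin_timestamp': timestamp, 'line_text': text}
        yield item


lines = [("00:00:01", "First line"), ("00:00:05", "Second line")]

shared = list(parse_shared_item(lines))
per_line = list(parse_item_per_line(lines))

print(len(shared), shared[0]['line_text'])   # 1 Second line
print(len(per_line))                         # 2
```

The first generator produces a single item holding only the last line, which matches the symptom described in the question; the second produces one item per line.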

Source

2017-10-23 07:50:42 Wilfredo

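Thanks! That seems to make sense, but for some reason it still returns only the last item, even in the Scrapy shell. Any idea why that might be? Thanks again –

Can you tell me how you are testing it? It works fine in my shell – Wilfredo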
