Scrapy xpath aria-select = false

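I want to use Scrapy to get the transcript from some Khan Academy videos. For example: https://www.khanacademy.org/math/algebra-basics/basic-alg-foundations/alg-basics-negative-numbers/v/opposite-of-a-number

When I try to select the Transcript button with the XPath response.xpath('//div[contains(@role, "tablist")]/a').extract(), I only get information about the tab that has aria-selected="true", which is the About section. I need to use Scrapy to change aria-selected from false to true on the Transcript button and then retrieve the necessary information.

Can anyone please clarify how I could do this?

Thanks a lot!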

Source

2016-08-14 abarbosa

Do you mean the text of the transcript? –

Yes! – abarbosa

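If you look at the network tab of your browser's inspector, you can see an AJAX request being made to retrieve the transcript text once the page loads:

In this case it is https://www.khanacademy.org/api/internal/videos/2Zk6u7Uk5ow/transcript?casing=camel&locale=en&lang=en. The API URL appears to be built from the YouTube video ID, so you can easily recreate it: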

import json

import scrapy


class MySpider(scrapy.Spider):
    # ...
    transcript_url_template = 'https://www.khanacademy.org/api/internal/videos/{}/transcript?locale=en&lang=en'

    def parse(self, response):
        # find the YouTube ID in the og:video meta tag
        youtube_id = response.xpath("//meta[@property='og:video']/@content").re_first('v/(.+)')
        # create the transcript API URL using the YouTube ID
        url = self.transcript_url_template.format(youtube_id)
        # download the data and parse it
        yield scrapy.Request(url, self.parse_transcript)

    def parse_transcript(self, response):
        # convert the JSON data to a Python dictionary
        data = json.loads(response.text)
        # parse your data!
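
The URL-building step can also be tried outside Scrapy. The following is only a minimal sketch using the standard library: the helper name build_transcript_url and the sample og:video value are assumptions for illustration, and the real meta tag content on the page may look different.

```python
import re

# Template taken from the answer; the `casing=camel` parameter is left out.
TRANSCRIPT_URL_TEMPLATE = (
    "https://www.khanacademy.org/api/internal/videos/{}/transcript"
    "?locale=en&lang=en"
)


def build_transcript_url(og_video_content):
    """Return the transcript API URL for an og:video value, or None if no ID is found.

    Hypothetical helper, not part of Scrapy or of the Khan Academy API.
    """
    # Same pattern as re_first('v/(.+)'): capture everything after the first "v/".
    match = re.search(r"v/(.+)", og_video_content)
    if match is None:
        return None
    return TRANSCRIPT_URL_TEMPLATE.format(match.group(1))


# Assumed sample value; the actual og:video content may differ.
sample = "https://www.youtube.com/v/2Zk6u7Uk5ow"
print(build_transcript_url(sample))
```

Note that the pattern captures everything after "v/", so if the og:video value carried extra query parameters they would end up in the ID and would need to be stripped first. Returning None when nothing matches lets the spider skip pages without a video instead of requesting a malformed URL.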

Source

2016-08-15 03:00:07 Granitosaurus

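I wonder what the "casing=camel" part of the URL is for. It returns exactly the same data with or without it... –

@JannieGerber Yes, it seems so. It may be a deprecated parameter, or it may only affect fields that are not present in this particular example. Either way, it does not seem to be needed here – Granitosaurus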
