javascript
  • python
  • html
  • web-scraping
  • scrapy
  • 2015-04-02 38 views 1 likes 

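    I am using Scrapy in Python and want to retrieve the contents of an element that sits behind another "expanding" element. When I inspect the DOM tree, the div tag and its text are not loaded until the parent element is clicked for the first time. Once the parent has been clicked, the text can be hidden again, but it at least remains in the DOM. In Python, how do I make Scrapy return the contents of the hidden element?

    An example website is here. I am looking for the abstract text (which is not loaded until the "Abstract" link is clicked).

    The Scrapy command is: response.xpath("//div[@class='previewBox abstract hidden']").extract() but it returns a bunch of empty divs like this: u'<div id="abs_S0740002015000179" class="previewBox abstract hidden"></div>'

    If I use this instead: response.xpath("//div[@class='previewBox abstract']").extract() then it returns nothing at all.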

    Answer


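    You need to simulate the additional HTTP GET request that is sent when the "Abstract" link is clicked.

    The idea is to extract the value of the "Abstract" link's data-url attribute and request that URL.

    Demo from the Scrapy shell: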

    $ scrapy shell "http://www.sciencedirect.com/science?_ob=ArticleListURL&_method=list&_ArticleListID=-764831607&_sort=r&_st=13&view=c&md5=a41e9f25739feae932862575251c1e0d&searchtype=a" 
    In [1]: url = response.xpath("//a[@data-type='abstract']/@data-url").extract()[0] 
    In [2]: fetch(url) 
    In [3]: print("".join(response.xpath("//div[@class='articleText']//text()").extract()))
    AbstractThe aim of the present study was to investigate the effect of lactic acid against Shiga toxin producing Escherichia coli (O157:H7 and non-O157 serogroups including O103, O111, O145 and O26) at different conditions. Soybean sprouts and spinach leaves inoculated with each serogroup of E. coli (∼7.00 + 1.00 log10 cfu/g) were treated with the lactic acid solutions at different concentrations (0% (control), 1.5%, 2.0%, or 2.5%) and at different temperatures (20, 40, or 50 °C) for 3 min. Results indicated that regardless of the treatment temperature, no significant reduction in the numbers of any serogroup occurred in the control group (0%) (p > 0.05). However, lactic acid at concentration of 1.5%, 2% and 2.5% was found to be effective against all organisms tested. There was no significant difference (p > 0.05) between E. coli O157:H7 and non-O157 STEC serogroups at any treatment group. The highest reductions (ca. 4.00 log10 cfu/g) of all serotypes in both produces were observed after immersing into 2.5% lactic acid at 50 °C. The results of this study showed that decontamination of fresh produces such as spinach and soybean sprout with lactic acid solutions prepared at mild temperatures (40 °C and 50 °C) might be an effective safety measure in preventing public health risks associated with these products contaminated with STEC. 
    

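    Note that the fetch() call is a special way of making additional requests from within the shell. In your Scrapy spider, you need to yield (or return) a scrapy.http.Request() instance and parse the result in its callback.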


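    I see what you are doing, and it makes sense... but it seems the IDs have to be looped through with the fetch command. Do I need to do that in the spider as well? If I put a loop here (so that url is a list of URLs), the print command only returns the last abstract on the page... What am I doing wrong here? – claudiaann1 2015-04-02 22:49:29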


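    @claudiaann1 In the spider, you need to loop over the links, yield a request for each data-url, and read the content of each response in the callback. Let me know if you need sample code. In the meantime, I think we can consider this question resolved. Thanks. – alecxe 2015-04-02 22:59:59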
