2017-03-04 26 views

How do I select data in Scrapy from an HTML file using a class and an id?

<div class="section-body" id="section-2"><p>Most people with aortic stenosis do not develop symptoms until the disease is advanced. The diagnosis may have been made when the health care provider heard a heart murmur and performed tests.</p><p>Symptoms of aortic stenosis include:</p><ul><li>Chest discomfort: The chest pain may get worse with activity and reach into the arm, neck, or jaw. The chest may also feel tight or squeezed.</li><li>Cough, possibly bloody.</li><li>Breathing problems when exercising.</li><li>Becoming easily tired.</li><li>Feeling the heartbeat (palpitations).</li><li>Fainting, weakness, or dizziness with activity.</li></ul><p>In infants and children, symptoms include:</p><ul><li>Becoming easily tired with exertion (in mild cases)</li><li>Failure to gain weight</li><li>Poor feeding</li><li>Serious breathing problems that develop within days or weeks of birth (in severe cases)</li></ul><p>Children with mild or moderate aortic stenosis may get worse as they get older. They are also at risk for a heart infection called bacterial endocarditis.</p></div></div></section>

Given the HTML above, I want to scrape the data in the lists. I have tried the following commands in Scrapy, but they do not work: they return '[]' as output.

response.css("article div.section-body p").extract()  # this returns everything under section-body, but I only want what is under section-2
response.css("article div.section-body.section-2 p::text").extract()
response.xpath("//article/*[contains(@id, 'setion-2')]").extract()

Please help me extract it. Thanks.

Answers

Try

response.css("article div.section-body#section-2 p::text").extract() 

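div.section-body#section-2 means a div that has both the class section-body and the id section-2.

Note that an id is selected with # and a class is selected with ., so the CSS selector you posted in your question for selecting the div is wrong.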

import scrapy class QuotesSpider(scrapy.Spider): name = "medical" start_urls = ['https://medlineplus.gov/ency/article/000178.htm'] def parse(self, response): yeild { topic: 'response.css('title::text').extract_first text").extract() } When I run this -> scrapy crawl medical -o medical.json it does not write any output to the json file. –

Is the scraped data shown in the Scrapy log in the CLI/terminal? – Umair

No, it does not show the scraped data; it shows an error in the terminal: Traceback (most recent call last): File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks current.result = callback(current.result, *args, **kw) File "F:\tutorial\tutorial\spiders\quotes_spider.py", line 11, in parse yeild NameError: global name 'yeild' is not defined I tried correcting the indentation, but that did not help –
