2015-10-06

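How can I scrape an untagged paragraph with Scrapy?

I want to scrape the title from the HTML below:

<FONT COLOR=#5FA505><B>Claim:</B></FONT> &nbsp; Coed makes unintentionally risqu&eacute; remark about professor's "little quizzies."
<BR><BR>
<CENTER><IMG SRC="/images/content-divider.gif"></CENTER>

I have tried using

def parse_article(self, response):
    for href in response.xpath('//font[@color="#5FA505"]/'):

but the title ("Coed makes unintentionally...") is not actually embedded in any tag, so I have not been able to get that content. Is there a way to get the content when it is not wrapped in <p> or any other tag?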

Edit: //font[b = "Claim:"]/following-sibling::text() works, but it also grabs and displays this piece of HTML from further down the page:

<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> &nbsp; Print references to the "little quizzies" tale date to 1962, but the tale itself has been around since the early 1950s. It continues to surface among college students to this day. Similar to a number of other college legends 

Answer


Assuming you know beforehand that the Claim: text is present, locate the font tag by the text of its b child and get the following text siblings:

//font[b = 'Claim:']/following-sibling::text()

Demo from the Scrapy shell:

In [1]: "".join(map(unicode.strip, response.xpath("//font[b = 'Claim:']/following-sibling::text()").extract())) 
Out[1]: u'Coed makes unintentionally risqu\xe9 remark about professor\'s "little quizzies."' 

Note that these join and strip calls should ideally be replaced by the corresponding input or output processors used inside Item Loaders.


It works and I have accepted the answer, but please take a look at my edit – Rafa