2015-10-06

Is there a following-sibling count in Scrapy?

I am trying to scrape the claim text from the following HTML:

<FONT COLOR=#5FA505><B>Claim:</B></FONT> &nbsp; Coed makes unintentionally risqu&eacute; remark about professor's "little quizzies." 
<BR><BR> 
<CENTER><IMG SRC="/images/content-divider.gif"></CENTER> 

I am using this code:

def parse_article(self, response):
    for href in response.xpath('//font[b = "Claim:"]/following-sibling::text()'):
        print(href.extract())

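This successfully pulls the correct Claim: value from the HTML shown above, but it also pulls the HTML below (the same page contains other markup with a similar structure). My xpath() is defined to match only the font tag whose b child is Claim:, so why does it also pull the Origins text below? How can I fix it? I tried to get only the next following-sibling instead of all of them, but that did not work.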

<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> &nbsp; Print references to the "little quizzies" tale date to 1962, but the tale itself has been around since the early 1950s. It continues to surface among college students to this day. Similar to a number of other college legends 

'.extract()[0]' –

@JohnDene My output changes, but it is just a lot of empty whitespace with an occasional ',' every so often – Rafa

I think this is because you are using a for loop. If I understand correctly, you only want to extract one value? –

Answers

0

I think your XPath is missing the text() qualifier (explained here). It should be:

'//font[b/text()="Claim:"]/following-sibling::text()'

That still gives me the same output. It pulls the 'Origins' text as well. – Rafa

0

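The following-sibling axis returns all of the siblings that come after an element, so following-sibling::text() matches every text node after the Claim: font tag that shares its parent, including the whitespace between tags and the Origins text. If you only want the first sibling, try this XPath expression: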

//font[b = "Claim:"]/following-sibling::text()[1] 

or, depending on your exact use case:

(//font[b = "Claim:"]/following-sibling::text())[1]
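
To see the difference between these expressions, here is a minimal sketch that runs them with lxml (the library Scrapy's selectors are built on) instead of a live spider. The PAGE string is a simplified, hypothetical copy of the markup from the question, and the texts() helper is only there for illustration. In a spider the same expressions could be passed to response.xpath().

```python
from lxml import html

# Simplified, hypothetical copy of the page structure from the question:
# both <font> labels and their texts are children of the same parent.
PAGE = """
<html><body><div>
<font color="#5FA505"><b>Claim:</b></font> Coed makes an unintentionally risque remark.
<br><br>
<center><img src="/images/content-divider.gif"></center>
<font color="#5FA505"><b>Origins:</b></font> Print references to the tale date to 1962.
</div></body></html>
"""

tree = html.fromstring(PAGE)


def texts(expr):
    """Run an XPath expression and return its non-blank text results, stripped."""
    return [t.strip() for t in tree.xpath(expr) if t.strip()]


# Every text node after the Claim: label under the same parent,
# so the Origins text is included as well.
all_siblings = texts('//font[b = "Claim:"]/following-sibling::text()')

# Only the first text node after the Claim: label.
first_sibling = texts('//font[b = "Claim:"]/following-sibling::text()[1]')

# Without parentheses, [1] applies per matched <font>:
# one result for each label on the page.
per_font = texts('//font/following-sibling::text()[1]')

# With parentheses, [1] applies to the combined result:
# a single node for the whole page.
overall_first = texts('(//font/following-sibling::text())[1]')

print(all_siblings)
print(first_sibling)
print(per_font)
print(overall_first)
```

Because the Claim: predicate matches a single font tag here, both forms from the answer give the same result. The parenthesized form only matters when the part before following-sibling can match more than one element.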