2014-03-25 65 views
2
def parse_linkpage(self, response): 
    hxs = HtmlXPathSelector(response) 
    item = QualificationItem() 
    xpath = """ 
      //h2[normalize-space(.)="Entry requirements for undergraduate courses"] 
      /following-sibling::p 
      """ 
    item['Qualification'] = hxs.select(xpath).extract()[1:] 
    item['Country'] = response.meta['a_of_the_link'] 
    return item 

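So I would like to know whether I can make my code stop scraping when the section under the <h2> ends. Is it possible to scrape only the content that follows a specific heading?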

Here is the web page:

<h2>Entry requirements for undergraduate courses</h2> 
<p>Example1</p> 
<p>Example2</p> 
<h2>Postgraduate Courses</h2> 
<p>Example3</p> 
<p>Example4</p> 

I want these results:

Example1 
Example2 

but I get:

Example1 
Example2 
Example3 
Example4 

I know I can change this line,

item['Qualification'] = hxs.select(xpath).extract() 

to,

item['Qualification'] = hxs.select(xpath).extract()[0:2] 

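but this scrape runs over many different pages, and some of them may have more than 2 paragraphs under the first heading, which means a fixed slice would leave that information out.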

Is there a way to tell it to extract exactly the data that follows the heading I want, rather than everything after it?

Answers

2

It is not very pretty or easy to read, but you can use the EXSLT extensions to XPath and the set:difference() operation:

>>> selector.xpath(""" 
    set:difference(//h2[normalize-space(.)="Entry requirements for undergraduate courses"] 
        /following-sibling::p, 
        //h2[normalize-space(.)="Entry requirements for undergraduate courses"] 
        /following-sibling::h2[1] 
        /following-sibling::p)""").extract() 
[u'<p>Example1</p>', u'<p>Example2</p>'] 

The idea is to select all the p elements that follow the target h2, and exclude those p elements that follow the next h2.

Here is a somewhat easier-to-read version:

>>> for h2 in selector.xpath('//h2[normalize-space(.)="Entry requirements for undergraduate courses"]'): 
...  paragraphs = h2.xpath("""set:difference(./following-sibling::p, 
...            ./following-sibling::h2[1]/following-sibling::p)""").extract() 
...  print paragraphs 
... 
[u'<p>Example1</p>', u'<p>Example2</p>'] 
>>> 
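
The set-difference idea from this answer can also be spelled out step by step in plain Python. The following is only a rough sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy selectors; the wrapping <div> and the helper names normalized_text and paragraphs_under are assumptions made for illustration, not part of the original answer or of any Scrapy API.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a hypothetical <div> so that it
# parses as a single well-formed XML tree.
HTML = """<div>
<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
<p>Example4</p>
</div>"""

TARGET = "Entry requirements for undergraduate courses"


def normalized_text(element):
    # Rough stand-in for XPath normalize-space(.): join text, collapse whitespace.
    return " ".join("".join(element.itertext()).split())


def paragraphs_under(heading_text, parent):
    siblings = list(parent)
    # Index of the target h2 (raises StopIteration if the heading is missing).
    start = next(i for i, el in enumerate(siblings)
                 if el.tag == "h2" and normalized_text(el) == heading_text)
    # Index of the next h2 after it, or the end of the sibling list.
    stop = next((i for i in range(start + 1, len(siblings))
                 if siblings[i].tag == "h2"), len(siblings))
    # Set A: every p after the target h2.
    after_target = [el for el in siblings[start + 1:] if el.tag == "p"]
    # Set B: every p after the next h2.
    after_next_h2 = [el for el in siblings[stop + 1:] if el.tag == "p"]
    # A minus B, kept in document order.
    return [el.text for el in after_target if el not in after_next_h2]


result = paragraphs_under(TARGET, ET.fromstring(HTML))
print(result)  # ['Example1', 'Example2']
```

Real-world HTML is often not well-formed XML, so in a spider the XPath version above is likely the more practical choice; this sketch is only meant to make the two sets and their difference visible.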
0

Maybe you can use this XPath:

//h2[normalize-space(.)="Entry requirements for undergraduate courses"] 
     /following-sibling::p[not(preceding-sibling::h2[normalize-space(.)!="Entry requirements for undergraduate courses"])] 

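You can add another predicate to following-sibling::p so that it excludes any p that has a preceding-sibling h2 whose text is not equal to "Entry requirements for undergraduate courses".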
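
To see how that predicate behaves, here is a hedged plain-Python imitation of it using xml.etree.ElementTree from the standard library. The function name paragraphs_by_predicate, the wrapping <div>, and the second sample page (with an extra "Overview" heading) are hypothetical and exist only for illustration. The sketch suggests one caveat: because the predicate looks at all preceding h2 siblings, a page where a different h2 comes before the target heading would yield no paragraphs at all.

```python
import xml.etree.ElementTree as ET

TARGET = "Entry requirements for undergraduate courses"


def paragraphs_by_predicate(parent, heading_text=TARGET):
    """Keep a p only if it follows the target h2 and no preceding-sibling
    h2 has different text (mirrors the XPath predicate above)."""
    kept = []
    seen_target = False
    seen_other_h2 = False
    for el in parent:
        if el.tag == "h2":
            text = " ".join("".join(el.itertext()).split())
            if text == heading_text:
                seen_target = True
            else:
                seen_other_h2 = True
        elif el.tag == "p" and seen_target and not seen_other_h2:
            kept.append(el.text)
    return kept


# The page from the question, wrapped in a hypothetical <div>.
page = ET.fromstring(
    "<div>"
    "<h2>Entry requirements for undergraduate courses</h2>"
    "<p>Example1</p><p>Example2</p>"
    "<h2>Postgraduate Courses</h2>"
    "<p>Example3</p><p>Example4</p>"
    "</div>"
)

# Hypothetical variant: a different h2 appears before the target heading.
page_with_earlier_h2 = ET.fromstring(
    "<div>"
    "<h2>Overview</h2><p>Intro</p>"
    "<h2>Entry requirements for undergraduate courses</h2>"
    "<p>Example1</p><p>Example2</p>"
    "</div>"
)

print(paragraphs_by_predicate(page))                  # ['Example1', 'Example2']
print(paragraphs_by_predicate(page_with_earlier_h2))  # []
```

If the target heading is not guaranteed to be the first h2 among its siblings, the set:difference() approach from the other answer may be the safer option.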