2016-10-10 24 views

How do I group XPath results?

I have an HTML element that looks like this:

enter image description here

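I want to group h1, div.article-meta, and div.article-content so that I can loop over the data one record at a time in my Scrapy project.

I was thinking of grouping each of them into a variable and then looping over that variable, but I don't know how to do it.

Please advise. Thanks.

This is what I have tried so far: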

def parse(self, response):
    now = time.strftime('%Y-%m-%d %H:%M:%S')
    hxs = scrapy.Selector(response)

    titles = hxs.xpath('//div[@class="list-article"]/h1')
    images = hxs.xpath('//div[@class="list-article"]/feature-image')
    contents = hxs.xpath('//div[@class="list-article"]/article-content')

    for i, title in titles:
        item = DapnewsItem()
        item['categoryId'] = '1'

        name = titles[i].xpath('a/text()')
        if not name:
            print('DAP => [' + now + '] No title')
        else:
            item['name'] = name.extract()[0]

        description = contents[i].xpath('p/text()')
        if not description:
            print('DAP => [' + now + '] No description')
        else:
            item['description'] = description[1].extract()

        url = titles[i].xpath("a/@href")
        if not url:
            print('DAP => [' + now + '] No url')
        else:
            item['url'] = url.extract()[0]

        imageUrl = images[i].xpath('img/@src')
        if not imageUrl:
            print('DAP => [' + now + '] No imageUrl')
        else:
            item['imageUrl'] = imageUrl.extract()[0]

        yield item

This is the error I get.

enter image description here


Hi, I have updated the question with what I have tried so far. – Vicheanak

Answers


Let's use this HTML snippet to illustrate:

<div class="list-article">

    <h1><a href="http://www.example.com/article1.html">Title 1</a></h1>
    <div class="article-meta">Something for 1</div>
    <div class="feature-image"><img src="http://www.example.com/image1.jpg"></div>
    <div class="article-content"><p>Content 1</p></div>

    <h1><a href="http://www.example.com/article2.html">Title 2</a></h1>
    <div class="article-meta">Something for 2</div>
    <div class="feature-image"><img src="http://www.example.com/image2.jpg"></div>
    <div class="article-content"><p>Content 2</p></div>

    <h1><a href="http://www.example.com/article3.html">Title 3</a></h1>
    <div class="article-meta">Something for 3</div>
    <div class="feature-image"><img src="http://www.example.com/image3.jpg"></div>
    <div class="article-content"><p>Content 3</p></div>

</div>

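You can loop over each <h1> and use XPath's following-sibling axis to select the elements that come after it at the same level of the tree, then keep only the first match. For example, following-sibling::div[@class="feature-image"][1] selects the first <div class="feature-image"> that follows the current <h1>: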

>>> import scrapy
>>> selector = scrapy.Selector(text='''<div class="list-article">
...
...     <h1><a href="http://www.example.com/article1.html">Title 1</a></h1>
...     <div class="article-meta">Something for 1</div>
...     <div class="feature-image"><img src="http://www.example.com/image1.jpg"></div>
...     <div class="article-content"><p>Content 1</p></div>
...
...     <h1><a href="http://www.example.com/article2.html">Title 2</a></h1>
...     <div class="article-meta">Something for 2</div>
...     <div class="feature-image"><img src="http://www.example.com/image2.jpg"></div>
...     <div class="article-content"><p>Content 2</p></div>
...
...     <h1><a href="http://www.example.com/article3.html">Title 3</a></h1>
...     <div class="article-meta">Something for 3</div>
...     <div class="feature-image"><img src="http://www.example.com/image3.jpg"></div>
...     <div class="article-content"><p>Content 3</p></div>
...
... </div>''')

>>> for h in selector.css('div.list-article > h1'):
...     item = {
...         'title': h.xpath('a/text()').extract_first(),
...         'image': h.xpath('''
...             following-sibling::div[@class="feature-image"][1]
...                 /img/@src''').extract_first(),
...         'content': h.xpath('''
...             following-sibling::div[@class="article-content"][1]
...                 /p/text()''').extract_first(),
...     }
...     print(item)
...
{'content': u'Content 1', 'image': u'http://www.example.com/image1.jpg', 'title': u'Title 1'} 
{'content': u'Content 2', 'image': u'http://www.example.com/image2.jpg', 'title': u'Title 2'} 
{'content': u'Content 3', 'image': u'http://www.example.com/image3.jpg', 'title': u'Title 3'} 
>>> 
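
One caveat with following-sibling::...[1]: the axis looks at every later sibling, not only those before the next <h1>, so if an article has no feature-image div, the expression may pick up the image of the next article. If that matters for your pages, a rough alternative is to walk the children in document order and start a new record at each <h1>. The sketch below shows the idea using only the standard library's xml.etree.ElementTree instead of Scrapy, so it assumes well-formed XML input (note the self-closed <img/>). The function name group_by_heading, the record keys, and the sample markup (the second article deliberately lacks an image) are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed XML variant of the snippet above (hypothetical sample data).
# Article 2 has no feature-image div, to show how a missing field is handled.
SNIPPET = '''<div class="list-article">
    <h1><a href="http://www.example.com/article1.html">Title 1</a></h1>
    <div class="article-meta">Something for 1</div>
    <div class="feature-image"><img src="http://www.example.com/image1.jpg"/></div>
    <div class="article-content"><p>Content 1</p></div>
    <h1><a href="http://www.example.com/article2.html">Title 2</a></h1>
    <div class="article-meta">Something for 2</div>
    <div class="article-content"><p>Content 2</p></div>
    <h1><a href="http://www.example.com/article3.html">Title 3</a></h1>
    <div class="feature-image"><img src="http://www.example.com/image3.jpg"/></div>
    <div class="article-content"><p>Content 3</p></div>
</div>'''


def group_by_heading(container):
    """Yield one dict per <h1>, filled from the siblings that follow it
    up to (but not including) the next <h1>."""
    record = None
    for child in container:  # direct children, in document order
        if child.tag == 'h1':
            if record is not None:
                yield record  # the previous article is complete
            link = child.find('a')
            record = {
                'title': link.text if link is not None else None,
                'url': link.get('href') if link is not None else None,
                'image': None,
                'content': None,
            }
        elif record is not None and child.tag == 'div':
            css_class = child.get('class')
            if css_class == 'feature-image' and record['image'] is None:
                img = child.find('img')
                record['image'] = img.get('src') if img is not None else None
            elif css_class == 'article-content' and record['content'] is None:
                record['content'] = child.findtext('p')
    if record is not None:
        yield record  # last article in the container


records = list(group_by_heading(ET.fromstring(SNIPPET)))
for r in records:
    print(r)
```

Here the second record keeps 'image' as None rather than borrowing the third article's image. The same walk can be written against Scrapy selectors by iterating over the container's direct children, but the following-sibling version above is shorter when every article is guaranteed to have all of its parts.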

Great answer, it works! Thank you very much. – Vicheanak