2016-05-05 71 views

I want to drill down into a div and get the src and href inside it using beautifulsoup4. I've read the documentation, followed tutorials, and searched posts, but haven't found an answer. Here's the HTML:

<div class="thumbnail thumb"> 
    <h6 id="date">May 2, 2016</h6> 
     <img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post"> 

       <div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div> 

       <div class="caption" id="cap"> 
        <a href="/blog/just-filler/"> 
         <h5 class="post-title" id="title">just filler</h5> 
        </a> 

        <p> 
         <a href="/blog/36/delete/" class="btn" role="button">delete</a> 
         <a href="/blog/just-filler/edit/" class="btn" role="button">edit</a> 
        </p> 

       </div> 
</div> 

I have tried this:

entries = [{'text': div.text, 
      'href': div.get('div', {'class', 'thumbnail'}).a, 
      'src': div.get('src') 
      } for div in divs] 

but it doesn't work.

I'm using this in my Django app. What is the correct syntax to scrape the href and src? The text works, but the src and href don't.

Answers

2

BeautifulSoup may have a smarter, built-in way of doing this, but this seems to work:

from bs4 import BeautifulSoup as soup 

html = """ 
<div class="thumbnail thumb"> 
    <h6 id="date">May 2, 2016</h6> 
     <img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post"> 

       <div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div> 

       <div class="caption" id="cap"> 
        <a href="/blog/just-filler/"> 
         <h5 class="post-title" id="title">just filler</h5> 
        </a> 

        <p> 
         <a href="/blog/36/delete/" class="btn" role="button">delete</a> 
         <a href="/blog/just-filler/edit/" class="btn" role="button">edit</a> 
        </p> 

       </div> 
</div> 
""" 

parsed = soup(html, "html.parser") 

divs = parsed.find_all("div") 

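# One entry per thumbnail div, collecting every img src and every a href inside it.
# List comprehensions (rather than map) give real lists on both Python 2 and 3.
entries = [{'text': div.text,
      'src' : [img.get("src") for img in div.find_all('img')],
      'href': [a.get("href") for a in div.find_all('a')]
      } for div in divs if "thumbnail" in div.get("class", [])]

print(entries)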

Output (from Python 2, hence the u'' prefixes):

[{'text': u'\nMay 2, 2016\n\n\n\n\njust filler\n\n\ndelete\nedit\n\n\n', 'href': [u'/blog/just-filler/', u'/blog/36/delete/', u'/blog/just-filler/edit/'], 'src': [u'http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg']}] 
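
As for a "smarter, built-in way", CSS selectors via `select` are one candidate. The following is only a sketch under assumptions: it uses a trimmed copy of the question's markup, targets Python 3, and relies on a BeautifulSoup 4 install where `select` supports attribute selectors such as `img[src]`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (the date heading is omitted).
html = """
<div class="thumbnail thumb">
  <img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post">
  <div class="caption" id="cap">
    <a href="/blog/just-filler/"><h5 class="post-title" id="title">just filler</h5></a>
    <p>
      <a href="/blog/36/delete/" class="btn" role="button">delete</a>
      <a href="/blog/just-filler/edit/" class="btn" role="button">edit</a>
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

entries = []
# "div.thumbnail" matches any div whose class list contains "thumbnail".
for div in soup.select("div.thumbnail"):
    entries.append({
        # Join the non-empty text fragments with single spaces.
        "text": div.get_text(" ", strip=True),
        # Only tags that actually carry the attribute are selected,
        # so indexing with ["src"] / ["href"] cannot raise KeyError.
        "src": [img["src"] for img in div.select("img[src]")],
        "href": [a["href"] for a in div.select("a[href]")],
    })

print(entries)
```

The selector does the class filtering that the `if "thumbnail" in div.get("class", [])` condition does above, so the empty divider div and the caption div are never visited as separate entries.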
0

This worked in my view:

entries = [{'text': div.text, 
      'href': div.find('a').get('href'), 
      'src': div.find('img').get('src') 
      } for div in divs] 
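
One caveat with the snippet above: `find` returns `None` when a div contains no matching tag, so calling `.get` on the result raises `AttributeError` for something like the empty divider div in the question's markup. A minimal sketch of a guard, assuming `divs` comes from `find_all("div")` and using a trimmed copy of that markup; `first_attr` is a hypothetical helper, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, including the empty divider div.
html = """
<div class="thumbnail thumb">
  <img src="http://www.viveca.net/wp-content/uploads/2012/02/End_of_the_Line025.jpg" class="img-responsive post">
  <div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div>
  <div class="caption" id="cap">
    <a href="/blog/just-filler/">just filler</a>
  </div>
</div>
"""


def first_attr(tag, name, attr):
    """Return `attr` of the first <name> descendant of `tag`, or None if there is none."""
    found = tag.find(name)
    return found.get(attr) if found is not None else None


soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div")  # outer div, divider div, caption div

entries = [{"text": div.get_text(strip=True),
            "href": first_attr(div, "a", "href"),
            "src": first_attr(div, "img", "src")}
           for div in divs]

for entry in entries:
    print(entry)
```

Entries for divs without an image or link simply carry `None`, which a Django template can skip with an `{% if %}` check.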

and in my template:

{% for e in entries %} 
    <a href="{{url}}{{ e.href }}" class="thumbnail">{{ e.text }}</a><br> 
{{e.href}}<br>