2013-10-30 81 views

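Using BeautifulSoup to extract the img inside the description tag of an XML file

I am parsing an XML feed and want to get the image that sits inside the description tag. I am using urllib and BeautifulSoup. I can get images that live in their own tag, but I cannot get an image that is embedded in the description tag in escaped (entity-encoded) form.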

XML code

<item> 
     <title>Kidnapped NDC member and political activist tells his story</title> 
     <link>http://www.yementimes.com/en/1724/news/3065</link> 
     <description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt; 
‘I kept telling them that they would never break me and that the change we demanded in 2011 would come whether they wanted it or not’ 
&lt;br clear="all"&gt;</description> 

views.py

for q in b.findAll('item'):
    d = {}
    d['desc'] = strip_tags(q.description.string).strip('&nbsp')
    if q.guid:
        d['link'] = q.guid.string
    else:
        d['link'] = strip_tags(q.comments)
    d['title'] = q.title.string
    for r in q.findAll('enclosure'):
        d['image'] = r['url']
    arr.append(d)

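Can anyone please give me an idea of how to do this?
The code above is what I wrote to parse images that have their own tag. I tried to get the image when it is inside the description, but I could not.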

Answer


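You can try extracting the whole content of <description>, creating a new BeautifulSoup object from it, and searching that object for the src attribute of the first <img> element: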

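from bs4 import BeautifulSoup
import html
import sys

# Parse the feed file given on the command line with the built-in parser.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for i in soup.find_all('item'):
    # The description holds escaped HTML, so unescape it and parse it again.
    # html.unescape replaces HTMLParser.unescape, which was removed in Python 3.9.
    d = BeautifulSoup(html.unescape(i.description.string), 'html.parser')
    if d.img:  # skip descriptions that contain no image
        print(d.img['src'])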

Run it like this:

python3 script.py xmlfile 

That yields:

http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg 
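
The same two-stage idea (parse the feed, then parse the decoded description as HTML) can also be sketched with the standard library only, in case BeautifulSoup is not available. This is a swapped-in variant, not the approach used above: `xml.etree.ElementTree` reads the feed and a small `HTMLParser` subclass collects the `src` values. The names `ImgSrcCollector`, `first_image_per_item` and `SAMPLE` are hypothetical, and the `<rss><channel>` wrapper and closing `</item>` in the sample are assumptions, since the question only shows a fragment of the feed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are also routed here by default.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def first_image_per_item(xml_text):
    """Return the first image URL in each <item>'s description, or None."""
    images = []
    for item in ET.fromstring(xml_text).iter('item'):
        # ElementTree decodes &lt; and &gt;, so this text is ordinary HTML markup.
        markup = item.findtext('description', default='')
        collector = ImgSrcCollector()
        collector.feed(markup)
        collector.close()
        images.append(collector.sources[0] if collector.sources else None)
    return images


# Sample reduced from the question; the wrapper elements are assumed.
SAMPLE = """<rss><channel><item>
<title>Kidnapped NDC member and political activist tells his story</title>
<link>http://www.yementimes.com/en/1724/news/3065</link>
<description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt;
&lt;br clear="all"&gt;</description>
</item></channel></rss>"""

print(first_image_per_item(SAMPLE))
```

Returning None for items without an image keeps the result aligned with the items, so a loop like the one in the question could fall back to an <enclosure> URL when the description has no <img>.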