2015-01-07 83 views

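How to find a specific tag using BeautifulSoup

I have the source code here: http://pastebin.com/rxK0mnVj. I want to check whether the img tags in that source include the data-blzsrc attribute, check that src does not contain a data URI, and return True or False accordingly.

For example,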

<img src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a> 

should return False, because the data-blzsrc attribute is present but the src attribute contains data:

But this,

<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a> 

should return True, because it has the data-blzsrc attribute and src does not contain data:

How can I achieve this in BeautifulSoup?

Answers


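Try finding all the images, then check that the attribute exists and inspect the content of the src attribute. Have a look at this script: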

from bs4 import BeautifulSoup 
html = """ 
<img src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a> 
<img src="http://images.akam.net/img1.jpg" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" width="324" height="60" alt="StrawberryNET" /></a> 
""" 

soup = BeautifulSoup(html, 'html.parser') 
for img in soup.find_all('img'): 
    # keep tags that have data-blzsrc and whose src does not start with 'data:' 
    if img.has_attr('data-blzsrc') and not img.get('src', '').startswith('data:'): 
        print(img) 

It prints the desired img node (attribute order may vary with the parser and library version):

<img alt="StrawberryNET" data-blzsrc="http://1.resources.newtest.strawberrynet.com.edgesuite.net/4/C/lyiTlubX4.webp" height="60" src="http://images.akam.net/img1.jpg" width="324"/> 
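Since the question asks for True or False rather than the tag itself, the same condition can be wrapped in a small function. This is only a sketch: the name `passes_check` is hypothetical, and the image URLs are shortened placeholders rather than the ones from the question.

```python
from bs4 import BeautifulSoup

# Shortened stand-ins for the two tags in the question (placeholder URLs).
HTML = """
<img src="data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==" data-blzsrc="http://example.com/a.webp" alt="first">
<img src="http://example.com/img1.jpg" data-blzsrc="http://example.com/a.webp" alt="second">
"""


def passes_check(img):
    """Return True if the tag has data-blzsrc and its src is not a data: URI."""
    return img.has_attr('data-blzsrc') and not img.get('src', '').startswith('data:')


soup = BeautifulSoup(HTML, 'html.parser')
results = [passes_check(img) for img in soup.find_all('img')]
print(results)  # [False, True]
```

To get a single answer for a whole page, `all(results)` or `any(results)` could be used, depending on whether every image or just one has to pass.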

`src` starts with `data:` – alecxe

Thanks @alecxe! – xecgr

Actually `src` should not contain `data` – station


If you want to find all img tags and test each one, use find_all() and check the attributes, for example:

from bs4 import BeautifulSoup 

with open('index.html') as f: 
    soup = BeautifulSoup(f, 'html.parser') 

def check_img(img): 
    # True when data-blzsrc is present and src does not contain 'data' 
    return 'data-blzsrc' in img.attrs and 'data' not in img.get('src', '') 

for img in soup.find_all('img'): 
    print(img, check_img(img)) 

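If you want to select only the images that meet the criteria, you can pass the attrs argument to find_all() as a dictionary. Set data-blzsrc to True to require that the attribute exists, and use a function to check that the src value does not contain data: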

for img in soup.find_all('img', attrs={'data-blzsrc': True, 'src': lambda x: x and 'data' not in x}): 
    print(img) 
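One thing to watch with the substring test: `'data' not in src` also rejects ordinary URLs that merely contain the word "data" in their path. Assuming the real intent is to reject data URIs only, a prefix check on the scheme is a narrower alternative. The helper name `is_data_uri` and the sample URLs below are hypothetical.

```python
def is_data_uri(src):
    """Return True if src uses the data: scheme (case-insensitive)."""
    return src.strip().lower().startswith('data:')


samples = [
    'data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAQAICRAEAOw==',
    'http://example.com/img1.jpg',
    'http://example.com/data/img1.jpg',  # ordinary URL with "data" in the path
]

for src in samples:
    # Compare the substring test with the prefix test for each sample.
    print('data' not in src, not is_data_uri(src), src[:40])
```

The two tests agree on the first two samples and disagree on the third, where only the prefix check accepts the URL. The function could be dropped into the attrs filter as `lambda x: x and not is_data_uri(x)`.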

`src` should not contain `data` – station

@user567797 the second solution is the more elegant one –

Also, in the second one we are passing in the HTML source – station
