2013-10-22

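Load all third-party scripts using requests or mechanize in Python

I'm loading a web page into an iframe, and I want to make sure all of its associated media is available. I currently download the page with requests and then do some find/replace, but that doesn't cover everything. Is there a way in Python to get a list of all the script, CSS, and image requests a page makes when it loads in a browser?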

Answers


BeautifulSoup

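Use BeautifulSoup4 to find all of the <img>, <link>, and <script> tags, then pull out the relevant attribute from each. Note that this only finds resources declared in the downloaded HTML, not requests that scripts make after the page loads.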

from bs4 import BeautifulSoup
import requests

resp = requests.get("http://www.yahoo.com")

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(resp.text, "html.parser")

# Pull the linked images (note: this also grabs base64-encoded data: URIs)
images = [img['src'] for img in soup.find_all('img', src=True)]

# Requiring src ensures that we skip inline (embedded) scripts
scripts = [script['src'] for script in soup.find_all('script', src=True)]

# favicon.ico and CSS
links = [link['href'] for link in soup.find_all('link', href=True)]

Example output:

In [30]: images = [img['src'] for img in soup.find_all('img', src=True)]

In [31]: images[:5] 
Out[31]: 
['http://l.yimg.com/dh/ap/default/130925/My_Yahoo_Defatul_HP_ad_300x250.jpeg', 
'http://l.yimg.com/os/mit/media/m/base/images/transparent-95031.png', 
'http://l.yimg.com/os/mit/media/m/base/images/transparent-95031.png', 
'http://l.yimg.com/os/mit/media/m/base/images/transparent-95031.png', 
'http://l.yimg.com/os/mit/media/m/base/images/transparent-95031.png']
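
The src and href values collected this way can be relative, protocol-relative (starting with //), or inline data: URIs, so they usually need to be resolved against the page URL before the files can be fetched. The following is a minimal sketch of that step using only the standard library. The helper name absolutize and the sample values are hypothetical and not part of the original answer.

```python
from urllib.parse import urljoin, urlparse


def absolutize(urls, base_url):
    """Resolve scraped src/href values against the URL of the page.

    Keeps only http(s) results, which drops inline data: URIs and
    javascript: pseudo-links, and removes duplicates while preserving
    the original order.
    """
    seen = set()
    resolved = []
    for url in urls:
        absolute = urljoin(base_url, url.strip())
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # nothing to download for data:, javascript:, etc.
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


# Hypothetical sample values, standing in for the images/scripts/links lists
sample = [
    "/static/app.js",
    "//cdn.example.com/site.css",
    "img/logo.png",
    "data:image/png;base64,AAAA",
    "/static/app.js",
]
for asset in absolutize(sample, "http://example.com/dir/page.html"):
    print(asset)
```

Passing images + scripts + links from the snippet above, with resp.url as base_url, should give one de-duplicated list of absolute URLs to download. It still covers only what the static HTML declares.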