Extracting HTML links using Python

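I want to extract the iframe src from a given set of websites using Python. For example, my input is A.com, B.com, C.com, and if each of these sites has an iframe that links to D.com, E.com, F.com respectively (with some marker for a site that has no iframe), then I would like the output to look something like this: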

Site Iframe Src 
A.com D.com 
B.com E.com 
C.com F.com

At the moment I have something like this:

from collections import defaultdict
import urllib2
import re

def PrintLinks(website):
    counter = 0
    regexp_link = r'''<frame src =((http|ftp)s?://.*?)'''
    pattern = re.compile(regexp_link)
    links = [None]*len(website)
    for x in website:
        html_page = urllib2.urlopen(website[counter])
        html = html_page.read()
        links[counter] = re.findall(pattern,html)
        counter += 1
    return links

def main():
    website=["A.com","B.com","C.com"]

Is this the best way to do it, and how do I get the output into the format I want? Thanks!

2014-05-19 user3330107

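You don't need to reinvent the wheel with regular expressions. There are great Python packages that do this for you, the best known being BeautifulSoup.

Install BeautifulSoup and httplib2 with pip, and try this: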

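# Python 2 with BeautifulSoup 3; the newer bs4 package uses
# `from bs4 import ...` and `parse_only=` instead of `parseOnlyThese=`.
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

sites=['http://www.site1.com', 'http://www.site2.com', 'http://www.site3.com']
http = httplib2.Http()

for site in sites:
    # http.request returns (response headers, body); the body is what gets parsed
    status, response = http.request(site)
    # SoupStrainer limits parsing to <iframe> tags only
    for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
        print site + ' ' + iframe['src']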
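
On Python 3, where urllib2 and the old BeautifulSoup module are not available, the same idea can be sketched with only the standard library. This swaps BeautifulSoup for html.parser, so it is not the code above, just a minimal sketch of the approach. The names IframeSrcParser, iframe_sources, format_table, fetch and report are made up for illustration, and printing None for a site without an iframe is an assumption about the output you want.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html):
    """Return the iframe src values found in an HTML string, in document order."""
    parser = IframeSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources


def format_table(results):
    """Render {site: [src, ...]} as 'site src' lines under a header row.

    A site with no iframe gets the placeholder 'None' (an assumption).
    """
    lines = ["Site Iframe Src"]
    for site, sources in results.items():
        for src in sources or ["None"]:
            lines.append(f"{site} {src}")
    return "\n".join(lines)


def fetch(url):
    """Download a page and decode it leniently (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def report(sites):
    """Fetch every site and return the formatted table as one string."""
    return format_table({site: iframe_sources(fetch(site)) for site in sites})
```

Calling print(report(sites)) with the sites list from above should print one line per site and iframe. Error handling for unreachable sites is left out to keep the sketch short.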

2014-05-20 00:06:08 SDude
