2014-05-19

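Extracting HTML links using Python

I want to extract the iframe src for a given set of websites using Python. For example, my input is A.com, B.com, C.com, and each of these sites has an iframe linking to D.com, E.com, F.com respectively (a site may also have no iframe at all). I would like the output to look something like this: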

Site   Iframe Src
A.com  D.com
B.com  E.com
C.com  F.com

Currently, I have something like this:

import urllib2
import re

def PrintLinks(website):
    # Capture the src URL of each <iframe> tag (quotes around the URL are optional)
    regexp_link = r'''<iframe[^>]*?\ssrc\s*=\s*["']?((?:http|ftp)s?://[^"'\s>]+)'''
    pattern = re.compile(regexp_link, re.IGNORECASE)
    links = []
    for url in website:
        # Each URL needs a scheme (e.g. http://) for urlopen to accept it
        html_page = urllib2.urlopen(url)
        html = html_page.read()
        # One list of matched src URLs per site (empty if no iframe was found)
        links.append(pattern.findall(html))
    return links

def main():
    website = ["A.com", "B.com", "C.com"]

Is this the best way to do it, and how can I get the output into the format I want? Thanks!

Answers


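You don't need to reinvent the wheel with regular expressions. There are great Python packages that do this for you, the best known being BeautifulSoup.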

Install BeautifulSoup and httplib2 with pip, and try this:


import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

sites = ['http://www.site1.com', 'http://www.site2.com', 'http://www.site3.com']
http = httplib2.Http()

for site in sites:
    # http.request returns a (response headers, body) pair
    response, content = http.request(site)
    # Parse only the <iframe> tags instead of the whole document
    for iframe in BeautifulSoup(content, parseOnlyThese=SoupStrainer('iframe')):
        print site + ' ' + iframe['src']
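
If you are on Python 3 and would rather not install anything, the same idea (use an HTML parser, not a regex) can be sketched with the standard library's html.parser. This is only a rough sketch under some assumptions: the names IframeSrcParser, iframe_sources, format_table and sample_pages are made up for illustration, the hard-coded HTML strings stand in for pages you would download yourself, and printing "None" for a site without an iframe is my guess at what the asker wants.

```python
from html.parser import HTMLParser


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html):
    """Return the iframe src values found in an HTML string, in document order."""
    parser = IframeSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources


def format_table(pages):
    """Build the 'Site / Iframe Src' table from a {site: html} mapping."""
    rows = [("Site", "Iframe Src")]
    for site, html in pages.items():
        sources = iframe_sources(html)
        # Assumption: a site without an iframe gets one row marked "None".
        for src in sources or ["None"]:
            rows.append((site, src))
    width = max(len(site) for site, _ in rows)
    return "\n".join(f"{site.ljust(width)}  {src}" for site, src in rows)


# Hypothetical, hard-coded pages stand in for downloaded HTML.
sample_pages = {
    "A.com": '<html><body><iframe src="http://D.com"></iframe></body></html>',
    "B.com": "<html><body><IFRAME SRC='http://E.com'></IFRAME></body></html>",
    "C.com": "<html><body><p>no frames here</p></body></html>",
}

print(format_table(sample_pages))
```

To use it on real sites, you would fill the dictionary with HTML you fetched yourself (for example with urllib.request.urlopen in Python 3) in place of the sample strings. Keeping the download step separate from the parsing step also makes the parsing easy to test offline.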