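Here is one way to do this. First download the page and scrape it to find the models you are looking for; from each match you can then get the link to a new page to scrape. No JavaScript is needed here. This example and the BeautifulSoup documentation should help you.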
from BeautifulSoup import BeautifulSoup
import urllib2
base_url = 'http://www.ksl.com'
url = base_url + '/index.php?nid=443'
model = "Honda" # this is the name of the model to look for
# Load the page and process with BeautifulSoup
handle = urllib2.urlopen(url)
html = handle.read()
soup = BeautifulSoup(html)
# Collect all the ad detail boxes from the page
divs = soup.findAll(attrs={"class" : "detailBox"})
# For each ad, get the title
# if it contains the word "Honda", get the link
for div in divs:
    title = div.find(attrs={"class" : "adTitle"}).text
    if model in title:
        link = div.find(attrs={"class" : "listlink"})["href"]
        link = base_url + link
        # Now you have a link that you can download and scrape
        print title, link
    else:
        print "No match: ", title
At the time of answering, this snippet looked for Honda models and returned the following:
1995- Honda Prelude http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817797
No match: 1994- Ford Escort
No match: 2006- Land Rover Range Rover Sport
No match: 2006- Nissan Maxima
No match: 1957- Volvo 544
No match: 1996- Subaru Legacy
No match: 2005- Mazda Mazda6
No match: 1995- Chevrolet Monte Carlo
2002- Honda Accord http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817784
No match: 2004- Chevrolet Suburban (Chevrolet)
1998- Honda Civic http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817779
No match: 2004- Nissan Titan
2001- Honda Accord http://www.ksl.com/index.php?sid=0&nid=443&tab=list/view&ad=8817770
No match: 1999- GMC Yukon
No match: 2007- Toyota Tacoma
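The snippet above is Python 2 with BeautifulSoup 3. For anyone on Python 3, here is a rough sketch of the same filter using only the standard library. It runs against a small inline HTML sample rather than the live site: `SAMPLE_HTML`, `AdParser`, `find_matches`, and the `ad=1`/`ad=2` links are all made up for illustration, and the only thing taken from the code above is the three class names (`detailBox`, `adTitle`, `listlink`). The real page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.ksl.com"

# Hypothetical markup that mimics the class names used above;
# the real page structure may differ.
SAMPLE_HTML = """
<div class="detailBox">
  <span class="adTitle">1995- Honda Prelude</span>
  <a class="listlink" href="/index.php?nid=443&amp;ad=1">details</a>
</div>
<div class="detailBox">
  <span class="adTitle">1994- Ford Escort</span>
  <a class="listlink" href="/index.php?nid=443&amp;ad=2">details</a>
</div>
"""


class AdParser(HTMLParser):
    """Collect one {"title", "href"} entry per detailBox element."""

    def __init__(self):
        super().__init__()
        self.ads = []           # one dict per ad box seen so far
        self._in_title = False  # True while inside an adTitle element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "detailBox" in classes:
            self.ads.append({"title": "", "href": None})
        elif "adTitle" in classes and self.ads:
            self._in_title = True
        elif "listlink" in classes and self.ads:
            self.ads[-1]["href"] = attrs.get("href")

    def handle_endtag(self, tag):
        # Assumes adTitle holds plain text only, so any end tag closes it.
        self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.ads[-1]["title"] += data.strip()


def find_matches(html, make, base_url=BASE_URL):
    """Return (title, absolute_link) for every ad whose title contains make."""
    parser = AdParser()
    parser.feed(html)
    return [(ad["title"], urljoin(base_url, ad["href"]))
            for ad in parser.ads
            if make in ad["title"] and ad["href"]]


matches = find_matches(SAMPLE_HTML, "Honda")
for title, link in matches:
    print(title, link)
```

To try it on a live page, the HTML string could be fetched with `urllib.request.urlopen(url).read()` and decoded before being passed to `find_matches`. `urljoin` is used instead of string concatenation so that both relative and absolute `href` values produce a valid link.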
Are you trying to extract the page's JavaScript by parsing the HTML with Python? That isn't very clear from your question. – MikeWyatt
I'm only interested in BMWs, so I want to filter my results before I try to parse the HTML. –
I'd like to take this opportunity to link to [the most popular answer in history](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)! –