2015-06-28 92 views
1

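Finding a specific frame in a URL to scrape data with Python BeautifulSoup

I am a beginner with HTML and web scraping, and I am trying to use Python's BeautifulSoup to get the data shown below.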

[
Theft      06/24/15 08:47 PM   2000 BLOCK OF S COLLEGE AV
Vandalism  06/24/15 07:32 PM   3600 BLOCK OF WELLBORN RD
Theft      06/24/15 07:30 PM   800 BLOCK OF RIO GRANDE LN
Theft      06/24/15 06:40 PM   1800 BLOCK OF FINFEATHER RD
]

However, when I parse the site http://spotcrime.com/#77801, the data is not in any div of the HTML I get back.

The code I am using is:

import urllib2
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4

html = urllib2.urlopen('http://spotcrime.com/#77801')

soup = BeautifulSoup(html.read())
print soup

Answers

0

Instead of the main crime container, urlopen receives only this placeholder:

<div id="table_container" class="list-group crime-list" style="margin-top: -30px;"> 
    <h3>Loading Crime Data...</h3> 
    <p>City and county crime map showing crime incident data down to neighborhood crime</p> 
</div> 

This is because the main container is built by JavaScript running in the browser, with the help of an additional API call to the http://api.spotcrime.com/crimes.json endpoint.
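One quick way to confirm that a page is filled in client-side is to check whether the container still holds its placeholder text in the raw HTML. The following is a minimal sketch using only the standard library; `ContainerText` is a hypothetical helper written for this illustration, and the HTML string is the snippet quoted above rather than a live response.

```python
from html.parser import HTMLParser

# The static HTML quoted above, as received without running any JavaScript.
STATIC_HTML = """
<div id="table_container" class="list-group crime-list" style="margin-top: -30px;">
    <h3>Loading Crime Data...</h3>
    <p>City and county crime map showing crime incident data down to neighborhood crime</p>
</div>
"""


class ContainerText(HTMLParser):
    """Collect the text found inside the element with a given id.

    Hypothetical helper: it assumes every tag inside the container is
    explicitly closed, which holds for the snippet above.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the target element
        self.chunks = []    # non-empty text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.chunks.append(text)


parser = ContainerText("table_container")
parser.feed(STATIC_HTML)

# If the container only holds a "Loading..." message, the real rows
# are added later by JavaScript and are not present in the raw HTML.
is_placeholder = any("Loading" in chunk for chunk in parser.chunks)
print(parser.chunks)
print(is_placeholder)
```

If the check comes back true, parsing the page harder will not help; the data has to come from whatever request the page's JavaScript makes.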

What you can do is simulate that API call in your own code with requests. Working example:

import requests

url = "http://spotcrime.com/#77801"
crimes_url = "http://api.spotcrime.com/crimes.json"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36'}
with requests.Session() as session:
    # send a browser-like User-Agent with every request in this session
    session.headers.update(headers)

    # load the main page first, as a browser would
    session.get(url)

    # query-string parameters for the API call
    params = {
        "lat": "30.6423514",
        "lon": "-96.3704778",
        "radius": "0.02",
        "key": "spotcrime-private-api-key",
        "_": "1435453242689"
    }
    # params= puts the values in the URL; data= would send them as a request body
    response = session.get(crimes_url, params=params)
    crimes = response.json()["crimes"]
    for item in crimes:
        print item

It prints one dictionary for each row of the crime table:

{u'cdid': 64482204, u'lon': -96.3661035, u'lat': 30.6507387, u'link': u'http://spotcrime.com/crime/64482204-6737a0085bd9aff31548993910efa35a', u'address': u'2000 BLOCK OF S COLLEGE AV', u'date': u'06/24/15 08:47 PM', u'type': u'Theft'} 
{u'cdid': 64482189, u'lon': -96.3594859, u'lat': 30.6299681, u'link': u'http://spotcrime.com/crime/64482189-345f4eca1c977f43e97ea4981f73d4de', u'address': u'3600 BLOCK OF WELLBORN RD', u'date': u'06/24/15 07:32 PM', u'type': u'Vandalism'} 
... 
{u'cdid': 64370976, u'lon': -96.361556, u'lat': 30.631685, u'link': u'http://spotcrime.com/crime/64370976-dc6e6dbb29fc7376c2b82356c45d281d', u'address': u'3600 BLOCK OF WELLBORN RD #802', u'date': u'06/18/15 12:37 PM', u'type': u'Arrest'} 
{u'cdid': 64371003, u'lon': -96.3539954, u'lat': 30.6434707, u'link': u'http://spotcrime.com/crime/64371003-d9934d9b9d83c1867871701874c45523', u'address': u'2900 BLOCK OF S TEXAS AVENUE', u'date': u'06/18/15 09:56 AM', u'type': u'Vandalism'} 
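To get from these dictionaries back to the rows shown in the question, it is enough to pick out the `type`, `date`, and `address` keys. The sketch below is written for Python 3 and uses two records copied from the output above instead of a live response; `to_row` and `parse_date` are hypothetical helper names, and the date format string is an assumption based on how the dates appear in that output.

```python
from datetime import datetime

# Two records copied from the output above (only the keys needed here).
crimes = [
    {"type": "Vandalism", "date": "06/24/15 07:32 PM",
     "address": "3600 BLOCK OF WELLBORN RD"},
    {"type": "Theft", "date": "06/24/15 08:47 PM",
     "address": "2000 BLOCK OF S COLLEGE AV"},
]


def parse_date(crime):
    """Parse the 'date' field; the format is inferred from the sample output."""
    return datetime.strptime(crime["date"], "%m/%d/%y %I:%M %p")


def to_row(crime):
    """Reduce one crime dictionary to (type, date, address)."""
    return (crime["type"], crime["date"], crime["address"])


# Newest incident first, matching the order of the table in the question.
rows = [to_row(c) for c in sorted(crimes, key=parse_date, reverse=True)]

for row in rows:
    print("{:<10} {:<18} {}".format(*row))
```

With a live response, `crimes` would be `response.json()["crimes"]` from the example above.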
1

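You can't find the div because it is loaded dynamically and inserted by JavaScript. What you can do in this case, however, is replicate the AJAX request to fetch all of the crime data.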
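A related detail, not stated in the question but true of URLs in general: the part after `#` is a fragment, which the client keeps to itself and never sends to the server. So the server's response to `http://spotcrime.com/#77801` cannot depend on `77801`; anything specific to that value has to be filled in by code running in the browser. A small sketch using the Python 3 standard library:

```python
from urllib.parse import urldefrag, urlsplit

page_url = "http://spotcrime.com/#77801"

# urlsplit exposes the fragment as a separate component of the URL.
parts = urlsplit(page_url)
print(parts.fragment)

# urldefrag returns the URL without its fragment, which is what an
# HTTP client effectively requests from the server.
requested_url, fragment = urldefrag(page_url)
print(requested_url)
```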

It looks like their internal API doesn't require any form of authentication, so you can go ahead and send the following API request: GET api.spotcrime.com/crimes.json?lat=30.639155&lon=-96.3647937&radius=0.02&key=spotcrime-private-api-key
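The request URL above can be assembled from its parts rather than typed by hand. This is a minimal Python 3 sketch that only builds the URL and does not send it, since the endpoint and key are taken from this 2015 answer and may no longer work; the `http://` scheme is assumed from the endpoint as written in the other answer.

```python
from urllib.parse import urlencode

# Endpoint as given in the other answer (the http:// scheme is taken from there).
crimes_url = "http://api.spotcrime.com/crimes.json"

# Query parameters exactly as they appear in the request above.
query = {
    "lat": "30.639155",
    "lon": "-96.3647937",
    "radius": "0.02",
    "key": "spotcrime-private-api-key",
}

# urlencode joins the pairs with '&' and escapes any unsafe characters.
request_url = crimes_url + "?" + urlencode(query)
print(request_url)
```

Changing `lat`, `lon`, or `radius` in the dictionary is then enough to target a different area.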

As a bonus, you don't need to scrape the site at all, since everything is returned neatly as a JSON object.
