2015-12-31 111 views
1

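Scraping hidden elements with BeautifulSoup

I am trying to scrape data from a website for my project. The problem is that my output does not contain the tags I can see in the browser's developer tools. Below is a snapshot of the DOM I want to scrape data from: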

<div class="bigContainer"> 
     <!-- ngIf: products.grid_layout.length > 0 --><div ng-if="products.grid_layout.length > 0"> 
     <div class="fl"> 
      <!-- ngRepeat: product in products.grid_layout --><!-- ngIf: $index%3==0 --> 
      <div ng-repeat="product in products.grid_layout" ng-if="$index%3==0" class="GridItems"> 
      <grid-item product="product" gakey="ga_key" idx="$index" ancestors="products.ancestors" is-search-item="isSearchItem" is-filter="isFilter"> 
       <a ng-href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" ng-click="searchProductTrack(product, idx+1)" tabindex="0" href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" class="" style=""> 
      </grid-item> 

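I can get the div tag with class "bigContainer", but I cannot scrape the tags inside it. For example, when I try to fetch the grid-item tag I get an empty list, which means no such tag was found. Why does this happen? Please help!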

+0

Please share the code you have written so far. – JRodDynamite

+0

r = requests.get(url); soup = BeautifulSoup(r.content, "html.parser"); plink = soup.find_all("div", {"class": "f1"})[0].find_all("grid-item")[0] – Nain

+2

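Check the HTML that is passed to `BeautifulSoup` (i.e. `r.content`). It may differ from the HTML shown in the developer tools. If the tags you are looking for are missing from it, JavaScript is probably being used to insert content into the page. If that is the case, you need [a JavaScript-capable browser (for example, one driven by Selenium)](http://stackoverflow.com/q/17436014/190597) to get the content. – unutbu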
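
A quick way to run the check suggested in this comment is to count the tags in the raw HTML before doing anything else. The following is only a minimal sketch: it uses the standard library's `html.parser` instead of BeautifulSoup to stay dependency-free, and `raw_html` and `rendered_html` are hypothetical strings standing in for `r.text` and for what the developer tools show.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times a given tag opens in an HTML document."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == self.tag_name:
            self.count += 1


def count_tag(html, tag_name):
    """Return the number of <tag_name> start tags found in html."""
    counter = TagCounter(tag_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical stand-in for r.text: the container arrives without its children.
raw_html = '<div class="bigContainer"><!-- ngIf: products.grid_layout.length > 0 --></div>'

# Hypothetical stand-in for the DOM after AngularJS has rendered the grid.
rendered_html = (
    '<div class="bigContainer"><div class="fl">'
    '<grid-item product="product"><a href="/shop/p/example"></a></grid-item>'
    '</div></div>'
)

print(count_tag(raw_html, "grid-item"))       # 0
print(count_tag(rendered_html, "grid-item"))  # 1
```

If the count for `r.text` is zero while the developer tools show the tag, the content is being added after the page loads, and no HTML parser will find it in the downloaded document.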

Answers

2

You can use the underlying web API to extract the grid item details. The grid is rendered by the AngularJS JavaScript framework, so the HTML is not static.

One way to get the data is to use Selenium, but identifying the web API with the browser's developer tools is quite simple.

Edit: I used the Firebug add-on for Firefox to see the GET requests the page makes, in the "Net" tab.

The GET request made by the page is:

https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2

It returns a JavaScript callback script (JSONP) that is almost entirely JSON data.

The JSON it returns contains the details of the grid items.

Each grid item is described by a JSON object like the following:

{ 
     "product_id": 23491960, 
     "complex_product_id": 7287171, 
     "name": "Samsung Galaxy Z1 (Black)", 
     "short_desc": "", 
     "bullet_points": { 
      "salient_feature": ["Screen: 10.16 cm (4\")", "Camera: 3.1 MP Rear/VGA Front", "RAM: 768 MB", "ROM: 4 GB", "Dual-core 1.2 GHz Cortex-A7", "Battery: 1500 mAh/Li-Ion"] 
     }, 
     "url": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745", 
     "seourl": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745", 
     "url_type": "product", 
     "promo_text": null, 
     "image_url": "https://assetscdn.paytm.com/images/catalog/product/M/MO/MOBSAMSUNG-Z1-BSMAR2320696B3C745/2.jpg", 
     "vertical_id": 18, 
     "vertical_label": "Mobile", 
     "offer_price": 5090, 
     "actual_price": 5799, 
     "merchant_name": "SMARTBUY", 
     "authorised_merchant": false, 
     "stock": true, 
     "brand": "Samsung", 
     "tag": "+5% Cashback", 
     "product_tag": "+5% Cashback", 
     "shippable": true, 
     "created_at": "2015-09-17T08:28:25.000Z", 
     "updated_at": "2015-12-29T05:55:29.000Z", 
     "img_width": 400, 
     "img_height": 400, 
     "discount": "12" 
    } 
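
To illustrate how such an object can be consumed once it has been parsed, here is a minimal sketch that picks a few display fields out of one item. The function name `summarize_item` and the `saving_percent` key are made up for this example, and `sample_item` is a trimmed copy of the object above. The computed saving happens to agree with the `"discount": "12"` field for this sample, but how the site derives that field is not stated.

```python
def summarize_item(item):
    """Pick a few display fields out of one grid item dict."""
    offer = item.get("offer_price")
    actual = item.get("actual_price")

    # Percentage saved relative to the listed price; None if it cannot be computed.
    saving = None
    if offer is not None and actual:
        saving = round(100 * (actual - offer) / actual)

    return {
        "brand": item.get("brand"),
        "name": item.get("name"),
        "offer_price": offer,
        "saving_percent": saving,
        "in_stock": bool(item.get("stock")),
    }


# Trimmed copy of the sample object shown above.
sample_item = {
    "product_id": 23491960,
    "name": "Samsung Galaxy Z1 (Black)",
    "offer_price": 5090,
    "actual_price": 5799,
    "stock": True,
    "brand": "Samsung",
    "discount": "12",
}

print(summarize_item(sample_item))
```

Using `dict.get` rather than indexing keeps the sketch from raising `KeyError` if an item lacks one of the fields, which the single sample cannot rule out.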

So you can get the details without even using BeautifulSoup, in the following way.

import json

import requests

# JSONP endpoint identified in the browser's network tab.
url = "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"
response = requests.get(url, timeout=10)
response.raise_for_status()

# Strip the "angular.callbacks._3( ... );" wrapper so that only JSON remains.
json_text = response.text.split("angular.callbacks._3(", 1)[1].rsplit(")", 1)[0]
data = json.loads(json_text)
grid_data = data["grid_layout"]  # one dict per product

for grid_item in grid_data: 
    print("Brand:", grid_item["brand"]) 
    print("Product Name:", grid_item["name"]) 
    print("Current Price: Rs", grid_item["offer_price"]) 
    print("==================") 

You will get output like the following:

Brand: Samsung 
Product Name: Samsung Galaxy Z1 (Black) 
Current Price: Rs 4990 
================== 
Brand: Samsung 
Product Name: Samsung Galaxy A7 (Gold) 
Current Price: Rs 22947 
================== 
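
The script above splits on the literal callback name `angular.callbacks._3`, which is taken from the `callback=` parameter of the URL and may differ between requests. As a hedged alternative, the sketch below strips any JSONP wrapper with a regular expression. The names `JSONP_PATTERN` and `parse_jsonp` are made up for this example, and the sample string is hypothetical; it does not cover every JSONP variant a server might emit.

```python
import json
import re

# Matches "<callback name>( <payload> )" with an optional trailing semicolon.
JSONP_PATTERN = re.compile(r"^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(text):
    """Return the JSON payload wrapped in a JSONP callback call."""
    match = JSONP_PATTERN.match(text)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Hypothetical response body; note the ");" inside a string value.
sample = 'angular.callbacks._3({"grid_layout": [{"brand": "Samsung", "name": "Demo (test);"}]});'
data = parse_jsonp(sample)
print(data["grid_layout"][0]["brand"])
```

Because the capture group is greedy and the pattern is anchored at the end of the text, the payload runs up to the last closing parenthesis, so a `);` inside a product name does not cut the JSON short.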

Hope this helps.

0

You can set a "User-Agent" to get the full data. Try something like this:

Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0").timeout(10*1000).get();
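
For readers following the Python code in this thread, a rough equivalent of the Jsoup call above can be sketched with the standard library's `urllib.request`. This is an illustration only: `build_request` and `fetch` are names made up for this example, `https://example.com/` is a placeholder URL, and the thread does not confirm that changing the User-Agent alters what this particular site returns.

```python
import urllib.request

# User-Agent string taken from the Jsoup example above.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0"


def build_request(url, user_agent=USER_AGENT):
    """Create a GET request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download url and return the body as text (10 s timeout, as in the Jsoup call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Build the request without sending it, just to inspect the header.
request = build_request("https://example.com/")
print(request.get_header("User-agent"))
```

Note that `urllib.request.Request` stores header names capitalized as `User-agent`, which is why the lookup uses that spelling.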