Web scraping with BeautifulSoup

I want to parse only a specific part of a website. My code is below. Is there a way to do this more efficiently?

from bs4 import BeautifulSoup
import requests

# Download the page and parse it with the built-in HTML parser.
soup = BeautifulSoup(requests.get("http://www.example.com").content, "html.parser")

# Print a fixed character slice from every inline JavaScript block.
for d in soup.select('script[type="text/javascript"]'):
    print(d.text[2300:2600])

Here is the output, which is the part I need:

> dataLayer = [{ 
>  'page':'ProductPage', 
>  'OAM':'False', 
>  'storeNum':'075', 
>  'brand':'Seagate', 
>  'productPrice':'69.99', 
>  'SKU':'106674', 
>  'productID':'467336', 
>  'mpn':'ST2000DM006', 
>  'ean':'763649110218', 
>  'category':'Internal Hard Drives', 
>  'isMobile':'False' }];

Source

2016-10-05 Burak

The indices may be different on other pages (I did not test with other pages).

# Take the 28th inline script and print lines 51-61 of its text.
for d in soup.select('script[type="text/javascript"]')[27].text.split('\n')[51:62]:
    print(d.strip())

Result

'page':'ProductPage', 
'OAM':'False', 
'storeNum':'029', 
'brand':'Microsoft', 
'productPrice':'129.99', 
'SKU':'883785', 
'productID':'456088', 
'mpn':'QC7-00001', 
'ean':'889842010060', 
'category':'Tablet Accessories', 
'isMobile':'False'
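
Because the script index 27 and the line range 51:62 may not hold on other pages, one alternative is to pick the script by its content instead of its position. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up stand-in for the real page, and `find_script_text` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for the real response body.
SAMPLE_HTML = """
<html><head>
<script type="text/javascript">var unrelated = 1;</script>
<script type="text/javascript">
dataLayer = [{
 'page':'ProductPage',
 'brand':'Microsoft',
 'productPrice':'129.99'
}];
</script>
</head><body></body></html>
"""

def find_script_text(soup, marker="dataLayer = [{"):
    """Return the text of the first <script> containing marker, or None."""
    for script in soup.find_all("script"):
        text = script.string or ""  # .string is None for empty scripts
        if marker in text:
            return text
    return None

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
script_text = find_script_text(soup)
print(script_text is not None)
```

This assumes the page assigns `dataLayer` in exactly that textual form; if the spacing differs, the marker string would need adjusting.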

EDIT: another version:

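# Use the last inline script inside <head>.
text = soup.select('head script[type="text/javascript"]')[-1].text

# Locate the body of the dataLayer literal between its delimiters.
start = text.find('dataLayer = [{') + len('dataLayer = [{')
end = text.rfind('}];')

rows = text[start:end].strip().split('\n')

for d in rows:
    print(d.strip())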
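
The rows printed this way are still plain strings. If the values are needed as a dictionary, one possible approach is to cut out the whole `[{ ... }]` literal with a regular expression and evaluate it with `ast.literal_eval`. This is a sketch under the assumption that every key and value is a simple quoted string, as in the output shown in the question; it would fail on JavaScript-only syntax such as `true` or `null`. `SAMPLE_SCRIPT` and `parse_data_layer` are illustrative names, not taken from the original code.

```python
import ast
import re

# Hypothetical script text, shaped like the output shown in the question.
SAMPLE_SCRIPT = """
var other = 1;
dataLayer = [{
 'page':'ProductPage',
 'OAM':'False',
 'storeNum':'029',
 'brand':'Microsoft',
 'productPrice':'129.99',
 'isMobile':'False' }];
"""

def parse_data_layer(script_text):
    """Return the first dataLayer entry as a dict, or None if it is absent."""
    match = re.search(r"dataLayer\s*=\s*(\[\{.*?\}\])\s*;", script_text, re.DOTALL)
    if match is None:
        return None
    # The literal uses single-quoted strings, so it is not valid JSON,
    # but it happens to be a valid Python list-of-dict literal.
    return ast.literal_eval(match.group(1))[0]

data = parse_data_layer(SAMPLE_SCRIPT)
print(data["brand"], data["productPrice"])
```

With a dictionary, a single field such as `data["productPrice"]` can be read directly instead of depending on line positions.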

Source

2016-10-05 21:30:20 furas

Thanks, it works perfectly. – Burak
