2017-02-14 131 views

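Extracting array elements from HTML

I am using urlopen and beautifulsoup4 to fetch the content of a web page. The page I want to scrape generates some dynamic JavaScript blocks, and I want to extract the contents of the entire array.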

The array has the following format:

<script type="text/javascript"> 
var jobmap = {}; 
jobmap[0]= {jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'}; 
jobmap[1]= {jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'}; 
</script> 

The array contains an unknown number of elements. How can I extract the entire contents of the array and save it to a JSON object?


That is an object (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object), not an array. – wpercy


If you have dynamic content, BeautifulSoup and urlopen are the wrong way to solve the problem –


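@cricket_007 I think it depends. Sometimes the JavaScript content is already present in the HTML (usually inside script tags), and it makes sense to go with the "simple" urlopen/requests approach to avoid the overhead and slowness of a browser-based or JavaScript-engine-based approach. That said, this route is usually more fragile. It may not be strictly "wrong", but more like "use with caution and understanding" :) – alecxe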

Answers


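BeautifulSoup can only help with part of the problem: locating the script element that contains the desired object. After that, you need either a JavaScript parser such as slimit or a regular expression, for example something along these lines: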

import json 
import re 
from bs4 import BeautifulSoup 


data = """ 
<script type="text/javascript"> 
var jobmap = {}; 
jobmap[0]= {jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'}; 
jobmap[1]= {jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'}; 
</script>""" 

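soup = BeautifulSoup(data, "html.parser")
# Find the script element that defines jobmap (the guard skips empty scripts).
script = soup.find("script", string=lambda text: text and "var jobmap" in text)

# Capture the object literal assigned to each jobmap[N] entry.
# A compiled pattern's findall() takes a start position, not flags,
# as its second argument, so re.MULTILINE is not passed here.
pattern = re.compile(r"jobmap\[\d+\]\s*=\s*({.*?})")
for item in pattern.findall(script.string):
    print(item)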

Prints:

{jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'} 
{jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'} 
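
The strings printed above are JavaScript object literals rather than JSON: the keys are unquoted and the values use single quotes. As a rough sketch of turning them into Python dictionaries with only the standard library, the following assumes (as in the sample data) that every value is a single-quoted string with no escaped quotes and no nested braces. The helper names `js_object_to_dict` and `extract_jobmap` and the shortened `sample` text are made up for illustration and are not part of any library:

```python
import json
import re

# One `key:'value'` pair. Assumes bare identifier keys and single-quoted
# values that contain no escaped quotes (true for the sample data only).
PAIR = re.compile(r"(\w+)\s*:\s*'([^']*)'")

# One `jobmap[N] = {...};` assignment; captures the index and the literal.
ENTRY = re.compile(r"jobmap\[(\d+)\]\s*=\s*({.*?})\s*;")


def js_object_to_dict(literal):
    """Convert a flat JS object literal such as {a:'1',b:'x'} to a dict."""
    return dict(PAIR.findall(literal))


def extract_jobmap(script_text):
    """Return {index: dict} for every jobmap[N] assignment in the text."""
    return {
        int(index): js_object_to_dict(literal)
        for index, literal in ENTRY.findall(script_text)
    }


# Shortened stand-in for the text of the script element.
sample = """
var jobmap = {};
jobmap[0]= {jk:'929a2508c8bf2c9c',num:'0',cmp:'City of Oshawa',loc:'Oshawa, ON',zip:''};
jobmap[1]= {jk:'2d06bbaac441e7d2',num:'1',cmp:'FGL Sports Ltd.',loc:'Ontario',zip:''};
"""

jobs = extract_jobmap(sample)
# json.dumps turns the integer indexes into string keys.
print(json.dumps(jobs, indent=2))
```

This handles any number of `jobmap[N]` entries, but it is only as robust as the two patterns; values containing quotes, nested objects, or numbers would need a real JavaScript parser.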

Note that each item value cannot be loaded directly with json.loads(). Use demjson.decode() or some other way of loading a JavaScript object string into a Python dictionary: