2016-12-07 1430 views

Answers


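The easiest approach is to use a regular expression pattern twice: once to locate the element via BeautifulSoup, and once to extract the desired substring: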

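import re

from bs4 import BeautifulSoup

data = """
<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>
"""

soup = BeautifulSoup(data, "html.parser")

# Capture the value assigned to `my`; with re.MULTILINE, `$` anchors at the end of a line
pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
# Find the <script> element whose contents match the pattern
# (`string=` is the current name of the older `text=` argument)
script = soup.find("script", string=pattern)

print(pattern.search(script.text).group(1))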

This prints hello.
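If more than one variable is needed, the same regex idea can be generalized to collect every simple declaration at once. The following is only a minimal sketch under stated assumptions: it uses just the standard-library re module and works on the script text directly (for example, the script.text obtained above); the names VAR_PATTERN and extract_js_string_vars are hypothetical, and it only handles plain quoted string literals without escaped quotes inside them.

```python
import re

# Matches simple declarations such as: var my = 'hello';
# Assumption: the value is a quoted string literal with no escaped quotes.
VAR_PATTERN = re.compile(
    r"""^\s*var\s+(\w+)\s*=\s*(['"])(.*?)\2\s*;\s*$""",
    re.MULTILINE,
)


def extract_js_string_vars(script_text):
    """Return a dict mapping variable names to their string values."""
    return {m.group(1): m.group(3) for m in VAR_PATTERN.finditer(script_text)}


script_text = """
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
"""

values = extract_js_string_vars(script_text)
print(values["my"])
```

Anything beyond such flat declarations (nested quotes, objects, concatenation) is better handled by a real JavaScript parser.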


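Another idea is to use a JavaScript parser: locate the variable declaration nodes, check that the identifier has the expected name, and extract the initializer. For example, using the slimit parser: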

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """
<script>
var my = 'hello';
var name = 'hi';
var is = 'halo';
</script>
"""

soup = BeautifulSoup(data, "html.parser")

# Find the <script> element that declares the variable
# (`string=` is the current name of the older `text=` argument)
script = soup.find("script", string=lambda text: text and "var my" in text)

# Parse the JavaScript and walk the syntax tree
parser = Parser()
tree = parser.parse(script.text)
for node in nodevisitor.visit(tree):
    # Look for the declaration of the variable named `my`
    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
        print(node.initializer.value)

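This prints hello (slimit may return the string literal with its surrounding quotes, i.e. 'hello'; strip them if needed).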