2016-07-24 88 views

Answers

6

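To add a bit more to @Bob's answer, and assuming you also need to locate the right script tag in HTML that may contain other script tags:

The idea is to define one regular expression and use it both for locating the element with BeautifulSoup and for extracting the email value: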

import re 

from bs4 import BeautifulSoup 


data = """ 
<body> 
    <script>jQuery(window).load(function() { 
     setTimeout(function(){ 
     jQuery("input[name=Email]").val("[email protected]"); 
     }, 1000); 
    });</script> 
</body> 
""" 
pattern = re.compile(r'\.val\("([^@]+@[^@]+\.[^@]+)"\);', re.MULTILINE | re.DOTALL) 
soup = BeautifulSoup(data, "html.parser") 

# string= replaces the deprecated text= argument (bs4 4.4.0+)
script = soup.find("script", string=pattern) 
if script: 
    match = pattern.search(script.text) 
    if match: 
        email = match.group(1) 
        print(email) 

Prints: [email protected]

Here we are using a simple regular expression for the email address. We could go further and make it stricter, but I doubt that is actually needed for this problem.
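As a rough sketch of what "stricter" could look like, the pattern below limits the local part and the domain to common characters and requires a top-level domain of at least two letters. The name `extract_email`, the exact character classes, and the placeholder address `user@example.com` are assumptions made for illustration; they are not from the original answer, and the pattern is still far from a full email validator.

```python
import re

# Assumed stricter pattern (illustrative, not a full RFC 5322 validator):
# local part and domain are limited to common characters, and the
# top-level domain must be at least two letters.
STRICT_EMAIL_CALL = re.compile(
    r'\.val\("([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"\)'
)


def extract_email(script_text):
    """Return the email passed to .val("...") in script_text, or None."""
    match = STRICT_EMAIL_CALL.search(script_text)
    return match.group(1) if match else None


# Hypothetical script bodies; user@example.com is a placeholder address.
good = 'jQuery("input[name=Email]").val("user@example.com");'
bad = 'jQuery("input[name=Email]").val("not an email");'

print(extract_email(good))
print(extract_email(bad))
```

The looser `[^@]+@[^@]+\.[^@]+` form accepts almost anything around the `@`, including spaces, which is usually fine when the surrounding `.val("...")` call already narrows the match.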

2

This is not possible with BeautifulSoup alone, but you can do it with, for example, BS plus a regular expression:

import re 
from bs4 import BeautifulSoup as BS 

html = """<script> ... </script>""" 

bs = BS(html, "html.parser")  # name the parser explicitly to avoid the bs4 warning 

txt = bs.script.get_text() 

# re.search rather than re.match: the .val(...) call is rarely on the first line of the script 
email = re.search(r'\.val\("(.+?)"\);', txt).group(1) 
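One detail worth checking when the script spans several lines: `re.match` only matches at the start of the string, and `.` does not cross newlines unless `re.DOTALL` is set. The small sketch below compares the two calls on a hypothetical multi-line script body; the sample text and the placeholder address `user@example.com` are assumptions for illustration.

```python
import re

# Hypothetical multi-line script body; user@example.com is a placeholder.
txt = """jQuery(window).load(function() {
    jQuery("input[name=Email]").val("user@example.com");
});"""

pattern = r'.+val\("(.+?)"\);'

# re.match anchors at the start, and "." stops at the first newline,
# so the call on the second line is never reached.
anchored = re.match(pattern, txt)

# re.search scans the whole string for the first position that matches.
scanned = re.search(pattern, txt)

print(anchored)
print(scanned.group(1) if scanned else None)
```

Checking the result for `None` before calling `.group(1)` avoids an `AttributeError` when the script does not contain the call at all.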

Or like this:

... 

email = txt.split('.val("')[1].split('");')[0]
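The chained `split` raises an `IndexError` when the `.val("` marker is missing. A slightly more defensive variant of the same idea can be sketched with `str.partition`; the helper name `between` and the sample strings (including the placeholder address `user@example.com`) are hypothetical and not part of the original answer.

```python
def between(text, start, end):
    """Return the text between the first start marker and the next end marker, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start marker not present
    value, found_end, _ = rest.partition(end)
    return value if found_end else None  # None if the end marker is missing


# Hypothetical script text; user@example.com is a placeholder address.
txt = 'jQuery("input[name=Email]").val("user@example.com");'

print(between(txt, '.val("', '");'))
print(between("no call here", '.val("', '");'))
```

Returning `None` instead of raising lets the caller decide how to handle pages where the script is absent or has changed shape.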