Python (BeautifulSoup) - Get href from <script>

I am working on a "Video Downloader" and I have a BeautifulSoup4 problem.

Here is the part of the HTML from which I want to get the a href:

<script src="/static/common.js?v7"></script> 
<script type="text/javascript"> 
      var c = 6; 
      window.onload = function() { 
       count(); 
      } 

      function closeAd(){ 
       $("#easy-box").hide(); 
      } 

      function notLogedIn(){ 
       $("#not-loged-in").html("You need to be logged in to download this movie!"); 
      } 

      function count() { 
       if(document.getElementById('countdown') != null){ 
        c -= 1; 
        //If the counter is within range we put the seconds remaining to the <span> below 
        if (c >= 0) 
         if(c == 0){ 
          document.getElementById('countdown').innerHTML = ''; 
         } 
         else { 
          document.getElementById('countdown').innerHTML = c; 
         } 
        else { 
         document.getElementById('download-link').innerHTML = '<a style="text-decoration:none;" href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi">Click here</a> to download requested file.'; 
         return; 
        }   
        //setTimeout('count()', 1000); 
       } 
      } 
     </script> 
<script type="text/javascript" src="/static/flowplayer/flowplayer-3.2.13.min.js"></script>

Here is the href I want to print:

href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi"

I tried this, but it does not work.

for a in soup3.find_all('a'):
    if 'href' in a.attrs:
        print(a['href'])

Source

2017-06-16 jestembotem

The href is inside JavaScript. You can grab the JS part and extract the href with the help of [regex](https://docs.python.org/3/howto/regex.html). Have a look at this [question](https://stackoverflow.com/questions/24333189/parsing-js-with-beautiful-soup) – trotta

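Beautiful Soup can parse HTML and XML, but not JavaScript. The <a> tag you want only exists as text inside a JavaScript string, so find_all('a') never sees it. You can use a regular expression to search this code instead.
With <a [^>]*?(href=\"([^\">]+)\") you can match the link inside this code. The pattern reads as follows:

<a - the start of an a tag
[^>]*? - followed by any characters that are not >, as few as possible
href=" - then the href attribute
[^\">]+ - then one or more characters other than " and >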
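To make the two capture groups concrete, here is a minimal sketch that runs the same pattern on a single line of JavaScript. The `js_line` string and its example.com URL are shortened, made-up stand-ins for the real script, and the pattern is written in a single-quoted raw string so the double quotes need no backslashes.

```python
import re

# Shortened, hypothetical stand-in for the JavaScript line that builds the link.
js_line = (
    "document.getElementById('download-link').innerHTML = "
    "'<a style=\"text-decoration:none;\" "
    "href=\"http://example.com/download/file.avi\">Click here</a>';"
)

# Same pattern as above, without the escaped quotes.
pattern = r'<a [^>]*?(href="([^">]+)")'
match = re.search(pattern, js_line)

whole = match.group(0)      # from "<a " up to the closing quote of the href
attribute = match.group(1)  # the whole attribute, quotes included
url = match.group(2)        # only the URL

print(whole)
print(attribute)
print(url)
```

Group 1 is the outer pair of parentheses and group 2 the inner pair, so group 2 is usually the one you want to pass on to a downloader.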

The script code can be extracted from the HTML with
script = soup.find('script', {'type': 'text/javascript'})
and then searched with
re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)
Remember to import re first.

print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[1]) 
# href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi"
print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[2]) 
# http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi

Read up on regular expressions. If you are going to use a pattern often, compile it first.
https://docs.python.org/3/library/re.html
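As a rough illustration of compiling first, the sketch below builds the pattern once with re.compile and reuses it across several script bodies. The `script_bodies` list and its URLs are invented for the example, and the pattern is a slight variant with a single capture group so that findall() returns the bare URLs.

```python
import re

# Compile once, reuse many times. With one capturing group,
# findall() returns just the captured URLs.
HREF_RE = re.compile(r'<a [^>]*?href="([^">]+)"')

# Hypothetical script bodies, e.g. collected from several downloaded pages.
script_bodies = [
    "x.innerHTML = '<a href=\"http://example.com/one.avi\">Click here</a>';",
    "y.innerHTML = '<a class=\"dl\" href=\"http://example.com/two.avi\">Click here</a>';",
    "var c = 6;",  # no link in this one
]

urls = []
for body in script_bodies:
    urls.extend(HREF_RE.findall(body))

print(urls)
```

A body without a link simply contributes nothing, because findall() returns an empty list when there is no match.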

Source

2017-06-16 09:40:38 Szymon

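Thanks for your answer, but I have an error: 'print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[1]) AttributeError: 'NoneType' object has no attribute 'text'' – jestembotem

It looks like BS did not find any 'script'. Are you sure you used the proper arguments for the 'soup.find()' function? – Szymon

Now I get this error: 'print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[1]) TypeError: 'NoneType' object is not subscriptable' – jestembotem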
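Both errors come from a None that was never checked: soup.find() returns None when no tag matches, and re.search() returns None when the pattern is not found, which can happen if the first matching <script> on the full page is not the one that contains the link (a guess, since the whole page is not shown here). One possible way around it is to search every script body and test each result before indexing it. The helper name `first_href` and the sample strings below are made up for this sketch.

```python
import re

PATTERN = r'<a [^>]*?(href="([^">]+)")'

def first_href(script_texts):
    """Return the first href found in any script body, or None if there is none."""
    for text in script_texts:
        match = re.search(PATTERN, text or "")
        if match is not None:  # re.search returns None when nothing matches
            return match.group(2)
    return None

# Hypothetical stand-ins for the text of each <script> on the page.
# With BeautifulSoup this list could be built as:
#   [s.get_text() for s in soup.find_all("script")]
scripts = [
    "",            # e.g. an external script such as /static/common.js?v7
    "var c = 6;",  # a script without any link
    "el.innerHTML = '<a style=\"x\" href=\"http://example.com/file.avi\">Click here</a>';",
]

href = first_href(scripts)
print(href if href is not None else "no href found")
```

Using find_all("script") instead of find() means it no longer matters which script tag comes first, and the explicit None checks turn both tracebacks into an ordinary "not found" result.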
