Beautiful Soup ignores important content on a web page

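I am using Beautiful Soup to parse this library hours page. Because of bad weather today, the page shows a warning message to all students. The HTML containing the alert message is as follows: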

<div id="alert-container"> 
    <div class="alert alert-error"> 
    <p>The University will resume normal operations on Wednesday, March 15.&nbsp; All Library facilities will be open according to the Spring Break 
schedule. &nbsp; 
    <a href="http://hours.cul.columbia.edu/">Library Hours » 
    </a> 
    </p> 
    </div> 
</div> 
<!-- 
<div class="alert alert-error" style="margin-bottom:15px;text align:center;"> 
<a href="http://library.columbia.edu/news/alert.html">Normal operations are expected to resume Monday, January 25. &nbsp; More information &raquo</a> 
</div> 
-->

I want to parse this alert message, but it turns out that whether I use lxml or html5lib, I get the wrong parse result:

<div id="alert-container"> 
</div> 
    <!-- 
<div class="alert alert-error" style="margin-bottom:15px;text-align:center;">\ 
    <a href="http://library.columbia.edu/news/alert.html">Normal operations are expected to resume Monday, January 25. &nbsp; More information &raquo 
    </a> 
</div> 
-->

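That is, it drops everything inside <div id="alert-container"></div>, which seems strange to me. I have parsed several websites before, and this is the first time I have run into a problem like this. I believe I am following the correct way to parse a website: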

import urllib2 
import html5lib 
from bs4 import BeautifulSoup 
url = "https://hours.library.columbia.edu" 
page = urllib2.urlopen(url) 
soup = BeautifulSoup(page, 'lxml') #or html5lib 
soup.find("div", {"id":"alert-container"})

The result of running the code above is:

<div id="alert-container"></div>

I would like to know whether this is a problem with the website itself or with the parser.

Thanks in advance!

Source

2017-03-15 lleiou

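The site probably uses Ajax to fetch the data, but urllib2.urlopen only returns the static page. How about using PhantomJS? It executes the JavaScript on the site, so you can get the page as it looks after the Ajax calls. – ikicha

@ikicha Thank you very much! PhantomJS is a very useful tool, and I will learn to use it! – lleiou

If you are new to JavaScript, CasperJS is one of the alternatives. – ikicha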
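
The point made in the comments above, that urlopen only sees the static HTML, can be checked before reaching for a headless browser. The following is a minimal sketch in Python 3 using only the standard library; the class name `ContainerProbe`, the helper `is_empty_in_static_html`, and both sample strings are hypothetical and are not taken from the real page. It reports whether the element with a given id contains any child tags or text in the markup it is fed.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContainerProbe(HTMLParser):
    """Counts child tags and text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # 0 means we are outside the target element
        self.child_tags = 0
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.child_tags += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        # Whitespace-only text between tags does not count as content.
        if self.depth and data.strip():
            self.text.append(data.strip())


def is_empty_in_static_html(html, element_id):
    """Return True if the element has no child tags and no visible text."""
    probe = ContainerProbe(element_id)
    probe.feed(html)
    probe.close()
    return probe.child_tags == 0 and not probe.text


# Hypothetical samples: what the server sends vs. what the browser shows.
STATIC_SAMPLE = '<div id="alert-container"> \n</div>\n<!-- <div>old alert</div> -->'
RENDERED_SAMPLE = ('<div id="alert-container"><div class="alert alert-error">'
                   '<p>All Library facilities will be open.</p></div></div>')

print(is_empty_in_static_html(STATIC_SAMPLE, "alert-container"))
print(is_empty_in_static_html(RENDERED_SAMPLE, "alert-container"))
```

If the same check on the text actually downloaded from the site reports an empty container, the content is most likely being filled in by JavaScript after the page loads, and switching parsers will not help.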

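This happens because the initial page has no elements inside "alert-container" at first. They are fetched by an Ajax request ("https://api.library.columbia.edu/query.json?qt=alerts"), which returns a string in JSON format.

This code should work.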

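import urllib2  # Python 2 only; in Python 3 use urllib.request
import json

# The page loads its alerts from this endpoint via Ajax.
url = "https://api.library.columbia.edu/query.json?qt=alerts"
alert = json.load(urllib2.urlopen(url))  # decode the JSON response
print(alert)  # the full decoded payload
print(alert["alerts"][0]["html"])  # HTML of the first alert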
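
Since urllib2 exists only in Python 2, here is a rough Python 3 sketch of the same idea. It assumes the payload has the shape implied by the code above, an "alerts" list whose entries carry an "html" field; the function names and the sample payload are hypothetical, and the endpoint may no longer respond the way it did in 2017.

```python
import json
from urllib.request import urlopen

# Endpoint quoted in the answer; it may have changed since then.
ALERTS_URL = "https://api.library.columbia.edu/query.json?qt=alerts"


def alert_html_fragments(payload):
    """Return the "html" field of every entry under "alerts" in the JSON text."""
    data = json.loads(payload)
    return [entry["html"] for entry in data.get("alerts", []) if "html" in entry]


def fetch_alert_html(url=ALERTS_URL):
    """Download the alerts JSON and return its HTML fragments (needs network)."""
    with urlopen(url) as response:
        return alert_html_fragments(response.read().decode("utf-8"))


# Hypothetical payload in the assumed shape, so the parsing runs offline.
SAMPLE_PAYLOAD = '{"alerts": [{"html": "<p>All Library facilities will be open.</p>"}]}'

for fragment in alert_html_fragments(SAMPLE_PAYLOAD):
    print(fragment)
```

Each returned fragment is ordinary HTML, so it can be handed to BeautifulSoup(fragment, "html.parser") if the text or links inside the alert are needed.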

Source

2017-03-15 01:48:48 klim

It works, thank you! I did not know about Ajax before, and now I will learn it! – lleiou
