Is there any way to parse the DOM tree of a website's content?

There are packages that parse a DOM tree from XML content, such as https://docs.python.org/2/library/xml.dom.minidom.html. Is there any way to do the same for the content of a website?

However, I don't want to target XML, only the page content of HTML websites.

from htmldom import htmldom 
dom = htmldom.HtmlDom("http://www.yahoo.com").createDom() 
# Find all the links present on a page and prints its "href" value 
a = dom.find("a") 
for link in a: 
    print(link.attr("href"))

But with this I get the following error:

Error while reading url: http://www.yahoo.com 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/usr/local/lib/python2.7/dist-packages/htmldom/htmldom.py", line 333, in createDom 
    raise Exception 
Exception

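I have already looked at BeautifulSoup, but it is not what I want. BeautifulSoup only works on static HTML pages; if the page content is loaded dynamically with JavaScript, it fails. I also don't want to look elements up with getElementByClassName and the like, but with something like dom.children(0).children(1).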

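So is there any way, for example with a headless browser or Selenium, to parse the whole DOM tree structure and reach the target element by walking through children and sub-children?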

Source

2015-11-03 user2129623

Yes, but it won't be simple, and not short enough for the code to fit in an SO post. Still, you are on the right track.

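Basically, you need to use the headless renderer of your choice (e.g. Selenium) to download all the resources and execute the JavaScript. There is really no point in trying to do that part by hand.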

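Then you need to write the HTML out of the headless renderer to a file on the page-ready event (every headless browser I have used offers this). At that point you can use BeautifulSoup on that file to navigate the DOM. BeautifulSoup does support the child-based traversal you want: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-down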
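As a rough illustration of child-index navigation over HTML that has already been rendered, here is a minimal sketch. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages. The Node and TreeBuilder classes, the parse_html helper and the sample page are all hypothetical, and the builder assumes well-formed markup with no error recovery. In practice the input string would be the HTML dumped from the headless browser.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal element node: tag, attributes, element children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.kids = []
        self.text = ""

    def children(self, index):
        # Mirrors the dom.children(0).children(1) style from the question.
        return self.kids[index]


class TreeBuilder(HTMLParser):
    """Builds a Node tree from well-formed HTML (assumption: no broken markup)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.kids.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Stand-in for rendered source such as driver.page_source.
page = ("<html><head><title>Demo</title></head>"
        "<body><div><p>first</p><a href='/x'>second</a></div></body></html>")
dom = parse_html(page)
# html -> body -> div -> second element child of the div
link = dom.children(0).children(1).children(0).children(1)
print(link.tag, link.attrs["href"], link.text)
```

Only element nodes are counted as children here, which matches how index-based access is usually expected to behave. BeautifulSoup's .contents also includes text nodes, so indexes can differ between the two.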

Source

2015-11-03 07:47:47 0x24a537r9

The Python Selenium API gives you everything you are likely to need. You can start with

html = driver.find_element_by_tag_name("html")

or

body = driver.find_element_by_tag_name("body")

and then continue from there with

body.find_element_by_xpath('./*[' + str(x) + ']')  # leading "." keeps the XPath relative to body

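This is equivalent to "body.children(x-1)". You don't need BeautifulSoup or any other DOM traversal framework, but you certainly can use one by grabbing the page source and having it parsed by another library such as BeautifulSoup: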

from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
soup.html.contents[0]  # .children is an iterator, so index .contents instead
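The child-index XPath lookup can be chained for deeper paths. Below is a small sketch of a helper that turns zero-based child indexes into one relative XPath; child_xpath is a hypothetical name, not part of Selenium. The leading "." matters, because an XPath starting with "/" is evaluated from the document root rather than from the element it is called on.

```python
def child_xpath(*indexes):
    """Turn zero-based child indexes into a relative XPath.

    child_xpath(0, 1) corresponds to element.children(0).children(1).
    """
    if not indexes:
        return "."
    for i in indexes:
        if i < 0:
            raise ValueError("child indexes must be non-negative")
    # XPath positions are one-based, hence the + 1.
    return "./" + "/".join("*[{}]".format(i + 1) for i in indexes)


print(child_xpath(0, 1))  # ./*[1]/*[2]
```

With the API used in this answer, the result would be passed as body.find_element_by_xpath(child_xpath(0, 1)). Newer Selenium releases dropped the find_element_by_* helpers, so there the call would be body.find_element(By.XPATH, child_xpath(0, 1)), with By imported from selenium.webdriver.common.by.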

Source

2015-11-03 09:38:54
