Parsing HTML elements by id and class with the XML package

Is it possible to extract elements from an HTMLInternalDocument object using id and class information? For example, take this document:

<!DOCTYPE html> 
<html> 
<head> 
    <title>R XML test</title> 
</head> 
<body> 
<div id="obj1"> 
    <p id="txt1">quidquid</p> 
    <p id="txt2">Latine dictum</p> 
</div> 
<div class="mystuff"> 
    <p>sit altum</p> 
    <p>videtur</p> 
</div> 
</body> 
</html>

which is read into R as follows:

require(XML) 
file <- "C:/filepath/index.html" 
datain <- htmlTreeParse(readLines(file), useInternalNodes = TRUE)

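I want to extract the content of the elements with id='txt2' and class='mystuff'.

I have tried various approaches without success; they all seem to involve painfully iterating over the tree. Is there a shortcut that uses class/id? My idea was that it might involve calling getNodeSet first and then some apply method (e.g. xmlApply and xmlAttrs), but nothing I tried works. Thanks for any pointers.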

2015-01-06 geotheory

What do you mean by "content", the text? Try `cat(sapply(datain['//*[@id="txt2"] | //*[@class="mystuff"]'], xmlValue))`. – lukeA

That looks promising. Forgive my ignorance, but I haven't seen the expression `datain['//*[@id="txt2"]']` before. Is this an XML library method? – geotheory

For details, see the help for `getNodeSet`: `getNodeSet(datain, '//*[@id="txt2"]')`. – lukeA
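
To make the comment above concrete, here is a minimal sketch of the getNodeSet route. It parses the question's sample page from a string rather than from a file so that it is self-contained; the variable names (html, by_id, by_class, id_text, class_text) are illustrative assumptions, not part of the original thread.

```r
library(XML)

# The sample page from the question, inlined as a string (assumption:
# parsing from text instead of from the file path used in the question)
html <- '<!DOCTYPE html>
<html>
<head><title>R XML test</title></head>
<body>
<div id="obj1">
    <p id="txt1">quidquid</p>
    <p id="txt2">Latine dictum</p>
</div>
<div class="mystuff">
    <p>sit altum</p>
    <p>videtur</p>
</div>
</body>
</html>'
datain <- htmlParse(html, asText = TRUE)

# getNodeSet evaluates an XPath expression and returns the matching nodes
by_id    <- getNodeSet(datain, '//*[@id="txt2"]')
by_class <- getNodeSet(datain, '//*[@class="mystuff"]')

# xmlValue collects the text inside each matched node
id_text    <- sapply(by_id, xmlValue)
class_text <- sapply(by_class, xmlValue)

print(id_text)
print(class_text)
```

Note that xmlValue on the div concatenates the text of everything inside it, so class_text is a single string that includes the line breaks between the two paragraphs.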

Try this, for example:

id_or_class_xp <- "//p[@id='txt2']//text() | //div[@class='mystuff']//text()" 
xpathSApply(doc,id_or_class_xp,xmlValue) 

[1] "Latine dictum" "\n "  "sit altum"  "\n "  "videtur"  "\n"
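
The whitespace-only entries in that output are the text nodes between the tags. If they are unwanted, one possible variation (a sketch, not part of the original answer) is to filter them out in the XPath itself with normalize-space(), or to select the p elements instead of their text nodes. The names xp, clean and paras are illustrative.

```r
library(XML)

# Same sample document as in the answer, parsed from a string
doc <- htmlParse('<html><body>
<div id="obj1">
    <p id="txt1">quidquid</p>
    <p id="txt2">Latine dictum</p>
</div>
<div class="mystuff">
    <p>sit altum</p>
    <p>videtur</p>
</div>
</body></html>', asText = TRUE)

# Option 1: keep only text nodes that are non-empty once whitespace is normalised
xp <- paste(
  "//p[@id='txt2']//text()[normalize-space()]",
  "//div[@class='mystuff']//text()[normalize-space()]",
  sep = " | "
)
clean <- xpathSApply(doc, xp, xmlValue)

# Option 2: select the <p> elements themselves and take their values
paras <- xpathSApply(doc, "//p[@id='txt2'] | //div[@class='mystuff']/p", xmlValue)

print(clean)
print(paras)
```

Both variants should leave just the three text strings; which one reads better depends on whether you think in terms of text nodes or of elements.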

where doc is:

doc <- htmlParse('<!DOCTYPE html> 
<html> 
<head> 
    <title>R XML test</title> 
</head> 
<body> 
<div id="obj1"> 
    <p id="txt1">quidquid</p> 
    <p id="txt2">Latine dictum</p> 
</div> 
<div class="mystuff"> 
    <p>sit altum</p> 
    <p>videtur</p> 
</div> 
</body> 
</html>', asText = TRUE)

2015-01-06 12:56:32 agstudy

Thanks agstudy (and @lukeA), this is very helpful. – geotheory
