2014-12-02 70 views
1

I'm working with the XML library in R and would like to split one HTML file into separate documents, chunk by chunk.

library(XML)
myHTML <- htmlTreeParse("myHTMLfile.HTML", useInternalNodes = TRUE)
unlist(xpathApply(myHTML, '//div', xmlValue))

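This works fine and gives me one long character vector for the whole file. Ideally, however, I'd like to split my HTML into chunks. The HTML structure is as follows: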

<DOC> 
     <div> 
      Document 1 - Element 1 
     </div> 

     <div> 
      Document 1 - Element 2 
     </div> 

     <div> 
      Document 1 - Element 3 
     </div> 

    </DOC> 

    <DOC> 
     <div> 
      Document 2 - Element 1 
     </div> 

     <div> 
      Document 2 - Element 2 
     </div> 

     <div> 
      Document 2 - Element 3 
     </div> 

    </DOC> 

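So I would like a list in which each element corresponds to one DOC, and each list element is a character vector containing elements 1, 2 and 3 of that DOC.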

I'm struggling (a) to even query 'DOC', perhaps because it is not part of the namespace? and (b) to get the output as a list of character vectors.

So instead of this output

[1] "Document 1 - Element 1" 
[2] "Document 1 - Element 2" 
[3] "Document 1 - Element 3" 
[4] "Document 2 - Element 1" 
[5] "Document 2 - Element 2" 
[6] "Document 2 - Element 3" 

I would like to get this:

[[1]] 
    [1] "Document 1 - Element 1" 
    [2] "Document 1 - Element 2" 
    [3] "Document 1 - Element 3" 
[[2]] 
    [1] "Document 2 - Element 1" 
    [2] "Document 2 - Element 2" 
    [3] "Document 2 - Element 3" 

Many thanks for your help!

Here is an example of the HTML file I want to process:

https://raw.githubusercontent.com/sytpp/sample-files/master/data_3.html 

Answers

0

Something like this:

dat <- c("Document 1 - Element 1", 
"Document 1 - Element 2", 
"Document 1 - Element 3", 
"Document 2 - Element 1", 
"Document 2 - Element 2", 
"Document 2 - Element 3") 

split(dat, sapply(strsplit(dat, " - "), "[", 1)) 

## $`Document 1` 
## [1] "Document 1 - Element 1" 
## [2] "Document 1 - Element 2" 
## [3] "Document 1 - Element 3" 
## 
## $`Document 2` 
## [1] "Document 2 - Element 1" 
## [2] "Document 2 - Element 2" 
## [3] "Document 2 - Element 3" 
2

How about this.

library(XML) 
dd<-xmlInternalTreeParse("<DOCS><DOC> 
     <div>Document 1 - Element 1</div> 
     <div>Document 1 - Element 2</div> 
     <div>Document 1 - Element 3</div> 
</DOC><DOC> 
     <div>Document 1 - Element 3</div> 
     <div>Document 1 - Element 3</div> 
     <div>Document 1 - Element 3</div> 
</DOC></DOCS>") 


xmlApply(dd["//DOC"], function(x) xpathSApply(x,".//div", xmlValue)) 

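We find all the DOC elements, then find all the divs within each DOC. So we combine an outer xmlApply over the DOC elements with an inner xpathSApply over the div elements.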
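A minimal sketch of the same nested approach, this time on a string parsed with the HTML parser rather than the XML parser. One assumption worth flagging: libxml2's HTML parser normalises element names to lowercase, so a `<DOC>` tag in an HTML file is generally matched by `//doc` rather than `//DOC`; the XPath below accepts both spellings to be safe. The sample string and the names `html`, `doc_nodes` and `chunks` are illustrative and do not come from the original post.

```r
library(XML)

# Illustrative input (not the asker's file): two DOC blocks with two divs each
html <- "<DOC><div> Document 1 - Element 1 </div><div> Document 1 - Element 2 </div></DOC>
<DOC><div> Document 2 - Element 1 </div><div> Document 2 - Element 2 </div></DOC>"

# Parse as HTML from a string
doc <- htmlParse(html, asText = TRUE)

# Outer step: select every DOC element, whichever case the parser kept
doc_nodes <- getNodeSet(doc, "//DOC | //doc")

# Inner step: for each DOC, collect the trimmed text of its div descendants
chunks <- lapply(doc_nodes, function(node) {
  xpathSApply(node, ".//div", xmlValue, trim = TRUE)
})

str(chunks)
```

If `chunks` comes back as an empty list on a real file, checking the element names the parser actually produced (for example with `xpathSApply(doc, "//*", xmlName)`) may help to find the right XPath.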

+0

Yes, this example makes sense, but when I apply it to my HTML I get an empty list(). Can I share a real example HTML file with 3 elements with you? – Sylvia 2014-12-02 22:06:44

+0

It would be better if you just updated your question with sample data that more accurately reflects your situation. – MrFlick 2014-12-02 22:15:13

+0

I added a link to the file, downloaded from LexisNexis: https://raw.githubusercontent.com/sytpp/sample-files/master/data_3.html – Sylvia 2014-12-03 10:49:25

0

Here is another possibility for extracting the text. We can pass readHTMLList as the function argument (fun) to getNodeSet:

library(XML) 
getNodeSet(xmlParseString(txt), "//DOC", fun = readHTMLList) 
#[[1]] 
#[1] "Document 1 - Element 1" "Document 1 - Element 2" "Document 1 - Element 3" 
# 
#[[2]] 
#[1] "Document 2 - Element 1" "Document 2 - Element 2" "Document 2 - Element 3" 

Alternatively, we can apply readHTMLList to each DOC node with lapply:

lapply(xmlParseString(txt)["DOC"], readHTMLList) 
# $DOC 
# [1] "Document 1 - Element 1" "Document 1 - Element 2" 
# [3] "Document 1 - Element 3" 
# 
# $DOC 
# [1] "Document 2 - Element 1" "Document 2 - Element 2" 
# [3] "Document 2 - Element 3" 

where txt is

txt <- "<DOC>\n  <div>\n   Document 1 - Element 1\n  </div>\n\n  <div>\n   Document 1 - Element 2\n  </div>\n\n  <div>\n   Document 1 - Element 3\n  </div>\n\n </DOC>\n\n <DOC>\n  <div>\n   Document 2 - Element 1\n  </div>\n\n  <div>\n   Document 2 - Element 2\n  </div>\n\n  <div>\n   Document 2 - Element 3\n  </div>\n\n </DOC>" 

From the URL you gave, I was able to get the following result

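library(RCurl)
# URL of the sample file linked in the question
url <- "https://raw.githubusercontent.com/sytpp/sample-files/master/data_3.html"
content <- getURL(url)
doc <- htmlTreeParse(content, useInternalNodes = TRUE)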
values <- getNodeSet(doc, "//div", fun = xmlValue, trim = TRUE) 
str(values[1:6]) 
# List of 6 
# $ : chr "1 of 3 DOCUMENTS" 
# $ : chr "The Daily Telegraph (London)" 
# $ : chr "November 1, 2014 Saturday Edition 1; National Edition" 
# $ : chr "THE WEEK IN WESTMINSTER" 
# $ : chr "SECTION: FEATURES; Pg. 26" 
# $ : chr "LENGTH: 500 words" 
length(values) 
#[1] 39