2009-12-04 30 views

Answers

21

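I don't really know what you want to do with that page, because it is extremely messy. As we re-learned in this famous Stack Overflow question, parsing HTML with regular expressions is not a good idea, so you will definitely want to parse this with the XML package.

Here's an example to get you started: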

require(RCurl) 
require(XML) 
webpage <- getURL("http://www.haaretz.com/") 
webpage <- readLines(tc <- textConnection(webpage)); close(tc) 
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE) 
# parse the tree by tables 
x <- xpathSApply(pagetree, "//*/table", xmlValue) 
# do some clean up with regular expressions 
x <- unlist(strsplit(x, "\n")) 
x <- gsub("\t","",x) 
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE) 
x <- x[!(x %in% c("", "|"))] 

This results in a character vector that is mostly just the text of the web page (along with some JavaScript):

> head(x) 
[1] "Subscribe to Print Edition"    "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time: 16:48 (EST+7)"   
[4] "  Make Haaretz your homepage"   "/*check the search form*/"    "function chkSearch()" 
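Since the result above still contains JavaScript lines, one possible refinement is to have XPath skip text that sits inside script and style elements. The following is only a sketch: the inline `html` string is a made-up stand-in for the downloaded page, and the XPath expression is my assumption about which text is wanted, not part of the original answer.

```r
library(XML)

# Hypothetical inline page standing in for the downloaded HTML
html <- paste0(
  "<html><head><style>p {color: red}</style></head><body>",
  "<table><tr><td> Subscribe to Print Edition </td></tr></table>",
  "<script>function chkSearch() {}</script>",
  "<p>Israel Time</p>",
  "</body></html>"
)

# Parse the string directly instead of fetching a URL
doc <- htmlParse(html, asText = TRUE)

# Keep text nodes under <body>, skipping anything inside <script> or <style>
txt <- xpathSApply(
  doc,
  "//body//text()[not(ancestor::script) and not(ancestor::style)]",
  xmlValue
)

# Trim surrounding whitespace and drop empty strings
txt <- gsub("^[[:space:]]+|[[:space:]]+$", "", txt)
txt <- txt[txt != ""]
print(txt)
```

On the real page you would pass the string returned by `getURL()` in place of `html`; how clean the result is will depend on how the page is marked up.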
3

Your best bet is probably the XML package; see, for example, this previous question.

+0

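But how do I strip out the HTML tags correctly? I know I could write a regular expression, but is there a package that makes the coding dramatically easier? – Mark 2009-12-04 05:56:41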

2

I know you asked about R, but maybe Python + BeautifulSoup is the way forward here? Scrape the page with BeautifulSoup first, then do your analysis in R.