Website source code missing when using htmlParse() in R

I am trying to download the complete source code of the following website: http://www.carnegiehall.org/Students/

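The information I want to extract is the following part:

Carnegie Hall Presents

Thursday, March 28, 2013 | 7:30 PM

Lawrence Brownlee

Martin Katz

Zankel Hall

View Source shows the following block of code for that text: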

<div class="info-col"> 
    <div class="up-lic">Carnegie Hall Presents</div> 
    <div class="date">Thursday, March 28, 2013 | 7:30 PM</div> 
    <div class="clearfix"></div> 
    <div class="title color"> 
     <a href="/Calendar/2013/3/28/0730/PM/Lawrence-Brownlee-Martin-Katz/">Lawrence Brownlee<BR>Martin Katz</a> 
    </div> 
    <div class="clearfix"></div> 
    <div class="location"> Zankel Hall</div> 
    <div class="clearfix"></div> 
    <br />

This block is missing from the result when I run the following in R:

htmlParse(getURL("http://www.carnegiehall.org/Students", .opts = curlOptions(followlocation=TRUE)), asText = TRUE)

Can anyone tell me what I am doing wrong?

Source

2013-03-25 Kim

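It looks like the problem lies in fetching the URL, not in parsing it. The information you are looking for never arrives in the response, as shown below: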

H <- getURL("http://www.carnegiehall.org/Students", .opts = curlOptions(followlocation=TRUE)) 

grepl("Zankel Hall", H) 
# [1] FALSE 

grepl("March 28", H) 
# [1] FALSE

If you look closely at the HTML, it appears that the calendar is loaded by a script.
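One way to confirm this kind of diagnosis is to check that the text is absent from the raw markup and then list the scripts the page references. The following is only a sketch: the inline HTML string, the `calendar` id, the `/js/calendar.js` path and the `loadCalendar` call are hypothetical stand-ins, not the actual Carnegie Hall markup.

```r
library(XML)

# Hypothetical stand-in for the fetched page: the calendar container is
# empty in the static HTML and a script is expected to fill it in later.
html <- '<html><body>
  <div id="calendar"></div>
  <script src="/js/calendar.js"></script>
  <script>loadCalendar("calendar");</script>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# The event text is not present in the static markup.
has_event <- grepl("Zankel Hall", html, fixed = TRUE)

# Collect the src attribute of every <script>; inline scripts yield NA.
script_src <- xpathSApply(doc, "//script", xmlGetAttr, "src", NA_character_)

print(has_event)
print(script_src)
```

If the text is missing and the page references scripts like these, the content is probably generated in the browser, so a plain HTTP fetch will not contain it.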

Source

2013-03-25 05:18:01

library(XML) 
hdata <- htmlParse('http://www.carnegiehall.org/Students/') 
xpathSApply(hdata,'//div[@class="info-col"]/div/text()|//div[@class="info-col"]/div/a/text()') 
#[[1]] 
#Carnegie Hall Presents 

#[[2]] 
#Thursday, March 28, 2013 | 7:30 PM 

#[[3]] 


#[[4]] 
#Lawrence Brownlee 

#[[5]] 
#Martin Katz 

#[[6]] 
# Zankel Hall 

#[[7]]
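The call above returns a list of text nodes, including whitespace-only ones. As a minimal sketch, assuming the markup from the question is parsed from an inline string rather than fetched live (with a closing `</div>` added for the outer block), passing `xmlValue` as the function gives a plain character vector that can be trimmed and filtered. The variable names are illustrative.

```r
library(XML)

# Markup copied from the question, parsed from a string instead of the live site.
html <- '<div class="info-col">
  <div class="up-lic">Carnegie Hall Presents</div>
  <div class="date">Thursday, March 28, 2013 | 7:30 PM</div>
  <div class="clearfix"></div>
  <div class="title color">
    <a href="/Calendar/2013/3/28/0730/PM/Lawrence-Brownlee-Martin-Katz/">Lawrence Brownlee<BR>Martin Katz</a>
  </div>
  <div class="clearfix"></div>
  <div class="location"> Zankel Hall</div>
</div>'

doc <- htmlParse(html, asText = TRUE)

# Same XPath as in the answer; xmlValue converts each text node to a string.
path <- '//div[@class="info-col"]/div/text()|//div[@class="info-col"]/div/a/text()'
vals <- xpathSApply(doc, path, xmlValue)

# Strip surrounding whitespace and drop entries that were whitespace only.
vals <- trimws(vals)
vals <- vals[nzchar(vals)]

print(vals)
```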

Source

2013-03-25 10:59:47 user1609452
