2016-12-21 54 views

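How to scrape all movie reviews from IMDb using rvest

I am new to web scraping and want to use it for sentiment analysis. I have successfully scraped the first 10 reviews. For the other 280 reviews, I am reluctant to repeat the process below more than 20 times. Is there a package or function that would let me scrape all the reviews more simply? Thanks a lot!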

library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

# Download the first page of reviews (10 reviews per page)
HouseofCards_IMDb <- read_html("http://www.imdb.com/title/tt1856010/reviews?ref_=tt_urv")

# Used SelectorGadget to find the CSS selector
reviews <- HouseofCards_IMDb %>%
    html_nodes("#pagecontent") %>%
    html_nodes("div+p") %>%
    html_text()

# Perform data cleaning on user reviews:
# replace line breaks with spaces, then replace every character that is
# not a letter, digit or space with a space and convert to lower case
reviews <- gsub("\r?\n|\r", " ", reviews)
reviews <- tolower(gsub("[^[:alnum:] ]", " ", reviews))
print(reviews)

Answer


Welcome to SO.

If you go to the second page of reviews, you will notice that the URL changes from http://www.imdb.com/title/tt1856010/reviews to http://www.imdb.com/title/tt1856010/reviews?start=10

The last page is http://www.imdb.com/title/tt1856010/reviews?start=290
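To make the URL pattern concrete, here is a minimal sketch in base R that builds every page URL from the `start` offsets. The names `base_url`, `offsets` and `page_urls` are only illustrative, and the step of 10 and the last offset of 290 are the values quoted above, not something the code discovers on its own.

```r
# Base URL from the question; only the "start" query parameter varies.
base_url <- "http://www.imdb.com/title/tt1856010/reviews"

# Offsets as described in this answer: 1 for the first page, then
# 10, 20, ..., 290 (assumes 10 reviews per page and 290 as the last offset).
offsets <- c(1, seq(10, 290, by = 10))

# paste0() is vectorised, so this yields one URL per offset.
page_urls <- paste0(base_url, "?start=", offsets)
```

`page_urls` then holds 30 URLs, one per page of reviews, which can be passed to `read_html()` one at a time.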

All you need to do is loop over the pages:

library(rvest)

result <- c()
# start = 1 for the first page, then 10, 20, ..., 290
for (i in c(1, seq(10, 290, 10))) {
    link <- paste0("http://www.imdb.com/title/tt1856010/reviews?start=", i)
    HouseofCards_IMDb <- read_html(link)

    # Used SelectorGadget to find the CSS selector
    reviews <- HouseofCards_IMDb %>%
        html_nodes("#pagecontent") %>%
        html_nodes("div+p") %>%
        html_text()

    # Perform data cleaning on user reviews
    reviews <- gsub("\r?\n|\r", " ", reviews)
    reviews <- tolower(gsub("[^[:alnum:] ]", " ", reviews))

    # Append this page's reviews to the overall result
    result <- c(result, reviews)
}

Note that we start from http://www.imdb.com/title/tt1856010/reviews?start=1, which is similar to http://www.imdb.com/title/tt1856010/reviews
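If you would rather not grow `result` inside a `for` loop, the same idea can be sketched with small functions and `lapply()`. The helper names `clean_reviews()`, `fetch_reviews()` and `scrape_reviews()` are hypothetical, not part of rvest. The CSS selectors are the ones from the question and are assumed to still match the page. Collapsing repeated spaces is an extra step that the original code does not have.

```r
# Hypothetical helper: the same cleaning as in the loop, plus one extra step.
clean_reviews <- function(x) {
  x <- gsub("\r?\n|\r", " ", x)                # line breaks -> spaces
  x <- tolower(gsub("[^[:alnum:] ]", " ", x))  # drop punctuation, lower-case
  # Extra step (not in the original code): collapse runs of spaces and trim.
  trimws(gsub(" +", " ", x))
}

# Hypothetical helper: download one page and return the raw review texts.
# Uses the selectors from the question; assumes rvest is installed.
fetch_reviews <- function(url) {
  page <- rvest::read_html(url)
  nodes <- rvest::html_nodes(rvest::html_nodes(page, "#pagecontent"), "div+p")
  rvest::html_text(nodes)
}

# Hypothetical helper: apply fetch + clean to every URL and flatten the result.
# The fetch argument can be swapped out, e.g. for offline testing.
scrape_reviews <- function(urls, fetch = fetch_reviews) {
  pages <- lapply(urls, function(u) clean_reviews(fetch(u)))
  unlist(pages, use.names = FALSE)
}
```

A call such as `result <- scrape_reviews(paste0("http://www.imdb.com/title/tt1856010/reviews?start=", c(1, seq(10, 290, 10))))` would then collect the reviews from all pages into one character vector. Passing a different `fetch` function makes it possible to try out the cleaning and combining logic without hitting the site.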