2016-05-31 21 views

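Using rvest to scrape data when a "Load More" option exists at the end of the page

I am learning web scraping and trying to scrape information from https://www.kununu.com/us/google1/reviews.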

Here is my code:

rm(list = ls())

library(httr) 
library(rvest) 
library(xml2) 
library(curl) 

url <- "https://www.kununu.com/us/google1/reviews" 

reviews <- url %>% 
    read_html() %>% 
    html_nodes(".panel-body") 

quote <- reviews %>% 
    html_nodes("h2 a") %>% 
    html_text() 

rating <- reviews %>% 
    html_nodes(".tile-heading") %>% 
    html_text() 

date <- reviews %>% 
    html_nodes("strong") %>% 
    html_text() 

a <- data.frame(quote, rating, date, stringsAsFactors = FALSE) 

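However, the code above scrapes only about 10 reviews. I found some suggestions online about using the RSelenium package for dynamic websites. Unfortunately, my computer throws an error when I call checkForServer() followed by startServer(). Any ideas on how to scrape all 56 reviews at once when the LOAD MORE option is at the bottom of the page?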

Answer


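If you hover over the Load More link, you'll see that it simply appends an integer to the end of your URL. So you can just loop through the pages to get everything. Start by extracting the number of reviews, then work out the number of pages, then use your own code to fetch the data from each page...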

library(httr) 
library(rvest) 
library(xml2) 
library(curl) 
library(plyr) 

url <- "https://www.kununu.com/us/google1/reviews" 
num_of_reviews <- read_html(url) %>% 
    html_nodes(".title-number") %>% 
    .[[1]] %>% 
    html_text() 
# round up to nearest 10s 
num_of_reviews_rounded <- num_of_reviews %>% 
    as.numeric() %>% 
    round_any(10, f = ceiling) 
pages <- 1 : (num_of_reviews_rounded/10) 

# Scrape one page of reviews and return it as a data frame
get_reviews <- function(url) {
    # each review sits inside a .panel-body node
    reviews <- url %>%
        read_html() %>%
        html_nodes(".panel-body")

    quote <- reviews %>%
        html_nodes("h2 a") %>%
        html_text()

    rating <- reviews %>%
        html_nodes(".tile-heading") %>%
        html_text()

    date <- reviews %>%
        html_nodes("strong") %>%
        html_text()

    a <- data.frame(quote, rating, date, stringsAsFactors = FALSE)
    return(a)
}

list_of_dfs <- lapply(pages, function(x)get_reviews(paste0(url, "/", x))) 
df <- do.call(rbind, list_of_dfs) 

> str(df) 
'data.frame': 56 obs. of 3 variables: 
$ quote : chr "Exceptional: 4.13 of 5" "Noteworthy: 3.75 of 5" "Remarkable: 5.00 of 5" "Exemplary: 4.25 of 5" ... 
$ rating: chr "\n  4.13\n " "\n  3.75\n " "\n  5.00\n " "\n  4.25\n " ... 
$ date : chr "Dec 30, 2015" "Dec 30, 2015" "Dec 30, 2015" "Dec 29, 2015" ... 
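
The page arithmetic and the whitespace in the rating column can be illustrated with a minimal base-R sketch that needs no network access. It assumes 10 reviews per page and the 56 reviews mentioned in the question. The names `reviews_per_page`, `page_urls` and `raw_rating` are made up for illustration, and the sample ratings are copied from the output above. It uses `ceiling()` in place of `plyr::round_any()`, which should give the same page count under these assumptions.

```r
# Assumed values: 56 reviews in total (from the question), 10 per page.
base_url <- "https://www.kununu.com/us/google1/reviews"
num_of_reviews <- 56
reviews_per_page <- 10

# ceiling() rounds up, so a partly filled last page is still counted.
num_pages <- ceiling(num_of_reviews / reviews_per_page)

# One URL per page: the base URL followed by "/" and the page number.
page_urls <- paste0(base_url, "/", seq_len(num_pages))

# Scraped ratings carry newlines and spaces; trim them and convert to numeric.
raw_rating <- c("\n  4.13\n ", "\n  3.75\n ")
rating <- as.numeric(trimws(raw_rating))
```

Passing `page_urls` to `lapply()` with a function like `get_reviews` would then fetch each page in turn. Whether the site still serves pages at these URLs is not something this sketch checks.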

Thanks Cory, I'll give it a try.