Scraping data with rvest when there is a "Load More" option at the end of the page

I am learning web scraping and trying to scrape information from https://www.kununu.com/us/google1/reviews.

Here is my code:

rm(list = ls())

library(httr) 
library(rvest) 
library(xml2) 
library(curl) 

url <- "https://www.kununu.com/us/google1/reviews" 

reviews <- url %>% 
    read_html() %>% 
    html_nodes(".panel-body") 

quote <- reviews %>% 
    html_nodes("h2 a") %>% 
    html_text() 

rating <- reviews %>% 
    html_nodes(".tile-heading") %>% 
    html_text() 

date <- reviews %>% 
    html_nodes("strong") %>% 
    html_text() 

a <- data.frame(quote, rating, date, stringsAsFactors = FALSE)

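However, the code above only scrapes about 10 reviews. I found some suggestions online about using the RSelenium package for dynamic websites. Unfortunately, my computer throws an error when I call checkForServer() followed by startServer(). Any ideas on how to scrape all 56 reviews in one go when the LOAD MORE option is at the bottom of the page?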

2016-05-31 mull_llum

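If you hover over the Load More link, you will see that it simply appends an integer to the end of your URL. So you can just loop over the pages to get everything. First extract the number of reviews, then work out the number of pages, then use your code to get the data...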

library(httr) 
library(rvest) 
library(xml2) 
library(curl) 
library(plyr) 

url <- "https://www.kununu.com/us/google1/reviews" 
num_of_reviews <- read_html(url) %>% 
    html_nodes(".title-number") %>% 
    .[[1]] %>% 
    html_text() 
# round up to nearest 10s 
num_of_reviews_rounded <- num_of_reviews %>% 
    as.numeric() %>% 
    round_any(10, f = ceiling) 
pages <- 1 : (num_of_reviews_rounded/10) 

# Scrape one page of reviews and return it as a data frame
get_reviews <- function(url){
    # each review sits inside a .panel-body node
    reviews <- url %>%
        read_html() %>%
        html_nodes(".panel-body")

    quote <- reviews %>%
        html_nodes("h2 a") %>%
        html_text()

    rating <- reviews %>%
        html_nodes(".tile-heading") %>%
        html_text()

    date <- reviews %>%
        html_nodes("strong") %>%
        html_text()

    a <- data.frame(quote, rating, date, stringsAsFactors = FALSE)
    return(a)
}

list_of_dfs <- lapply(pages, function(x)get_reviews(paste0(url, "/", x))) 
df <- do.call(rbind, list_of_dfs) 

> str(df) 
'data.frame': 56 obs. of 3 variables: 
$ quote : chr "Exceptional: 4.13 of 5" "Noteworthy: 3.75 of 5" "Remarkable: 5.00 of 5" "Exemplary: 4.25 of 5" ... 
$ rating: chr "\n  4.13\n " "\n  3.75\n " "\n  5.00\n " "\n  4.25\n " ... 
$ date : chr "Dec 30, 2015" "Dec 30, 2015" "Dec 30, 2015" "Dec 29, 2015" ...
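
As the str(df) output shows, the rating column still carries the newlines and spaces from the page markup. A minimal base R sketch of a cleanup step follows. The sample data frame is built by hand from the first two rows printed above so that the snippet runs without a network connection. The page size of 10 is taken from the "about 10 reviews" per request mentioned in the question, and the ceiling() call is offered only as a possible alternative to plyr's round_any, not as what the code above does.

```r
# Hand-built sample using values copied from the str(df) output above;
# in practice df would be the result of the scraping code.
df <- data.frame(
  quote  = c("Exceptional: 4.13 of 5", "Noteworthy: 3.75 of 5"),
  rating = c("\n  4.13\n ", "\n  3.75\n "),
  date   = c("Dec 30, 2015", "Dec 30, 2015"),
  stringsAsFactors = FALSE
)

# Strip the surrounding whitespace, then convert the rating to a number.
df$rating <- as.numeric(trimws(df$rating))

# Page count in base R: 56 reviews at an assumed 10 per page.
num_of_reviews <- 56
pages <- seq_len(ceiling(num_of_reviews / 10))
```

With the rating stored as a number, summaries such as mean(df$rating) work directly on the scraped data.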

2016-06-02 14:26:16 cory

Thanks Cory, I'll give it a try.
