Scraping a website with rvest (changing pages, clicking links)

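I am scraping a website with rvest for a research project, and I have run into two problems:

1) My loop seems to scrape the same page over and over again instead of moving on to the following pages.

2) I cannot access the full text behind the links I am scraping. In other words, I want to scrape not only the search results but also the content of every link they display. I have code that does this for a single page (see below), but since there are about 2,600 links, I would like to integrate their content into the scrape (as if rvest "clicked" on each link and scraped its content).

Background: this is a French government website. I am looking for all content containing the phrase "inegalites de sante". The search returns approximately 2,600 results, with 30 results displayed per page, so I ran a loop 88 times to collect them all. However, it gives me the same 30 results again and again, and it only scrapes the short excerpt shown under each result rather than the full text of each speech.

Website: http://www.vie-publique.fr/rechercher/recherche.php?replies=30&query=inegalites+de+sante&typeloi=&filter=&skin=cdp&date=1&auteur=&source=&typeDoc=&date=&sort=&filtreAuteurLibre=&dateDebut=&dateFin=&nbResult=2612&q=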

library(rvest) 
library(purrr) 

url_base <- "http://www.vie-publique.fr/rechercher/recherche.php?replies=30&query=inegalites+de+sante&typeloi=&filter=&skin=cdp&date=1&auteur=&source=&typeDoc=&date=&sort=&filtreAuteurLibre=&dateDebut=&dateFin=&nbResult=2612&q=" 

map_df(1:88, function(i) { 

    # Progress indicator 
    cat(".") 

    pg <- read_html(sprintf(url_base, i)) 

    data.frame(date=html_text(html_nodes(pg, ".date")), 
      text=html_text(html_nodes(pg, ".recherche_montrer")), 
      title=html_text(html_nodes(pg, ".titre a")), 
      stringsAsFactors=FALSE) 

}) -> viepublique_data 

dplyr::glimpse(viepublique_data) 

write.xlsx(viepublique_data, "/Users/Etc.Etc./viepublique_data.xlsx")

Here is the code I would use to scrape each individual page to get the full text, taking the first speech (no. "103000074") as an example:

#### Code to scrape each individual page 

website <- read_html("http://discours.vie-publique.fr/notices/103000074.html") 

section <- website %>% 
    html_nodes(".level1 a") 
section 

subsection <- website %>% 
    html_nodes(".level2 p") 
subsection 

person <- website %>% 
    html_nodes("p:nth-child(2) , .article p:nth-child(1)") 
person 

text <- website %>% 
    html_nodes(".col1 > p") 
text 

title <- website %>% 
    html_nodes("h2") 
title

Thank you very much for your help!

Source

2017-04-04 Evelyne1991

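`sprintf(url_base, 1:88)` returns the same URL 88 times. What did you expect `sprintf` to do? – MrFlick

@MrFlick I thought it would change the page 88 times (like clicking "Next"), which is clearly not the case here. – Evelyne1991

People generally think of Selenium when it comes to clicking links. –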
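To illustrate the point made in MrFlick's comment, here is a minimal sketch in base R. The URLs and the `page` parameter are hypothetical and only serve as an illustration; they are not the real site's pagination scheme.

```r
# Hypothetical, shortened URLs used only for illustration
url_no_placeholder   <- "http://example.com/search?query=abc"
url_with_placeholder <- "http://example.com/search?query=abc&page=%d"
pages <- 1:3

# Without a format placeholder, sprintf() has nowhere to put the page
# number, so every element is the same string. Recent R versions also
# warn about the unused argument, hence suppressWarnings().
same <- suppressWarnings(sprintf(url_no_placeholder, pages))

# With a %d placeholder, the page number is actually inserted
different <- sprintf(url_with_placeholder, pages)

print(same)
print(different)
```

Note that `sprintf` treats `%` as special, so a URL that already contains escapes such as `%20` is easier to extend with `paste0`.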

You can do the following:

require(rvest) 
require(tidyverse) 
require(stringr) 

# The URL parameter of interest is the "b" at the end.
# It is used for pagination. Just plug in 30*(0:87) there to get
# the URLs of your 88 pages
url_base <- "http://www.vie-publique.fr/rechercher/recherche.php?query=inegalites%20de%20sante&date=&dateDebut=&dateFin=&skin=cdp&replies=30&filter=&typeloi=&auteur=&filtreAuteurLibre=&typeDoc=&source=&sort=&q=&b=" 
l_out <- 88 
urls <- paste0(url_base, seq(0, by = 30, length.out = l_out))
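As a quick sanity check of the pagination offsets, the following base R sketch shows what the `seq()` call produces. The shortened `base` string is a hypothetical stand-in for `url_base`.

```r
# Offsets for 88 pages of 30 results each: 0, 30, 60, ..., 2610
offsets <- seq(0, by = 30, length.out = 88)

# The same values written as in the code comment, 30 * (0:87)
same_values <- all(offsets == 30 * (0:87))

# Hypothetical, shortened stand-in for url_base
base <- "http://example.com/recherche.php?q=&b="
urls_demo <- paste0(base, offsets)

print(head(urls_demo, 3))
print(tail(offsets, 1))
```

With 30 results per page, page n starts at offset 30 * (n - 1), so the last of the 88 pages starts at 2610.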

Define helper functions to scrape the website:

# Helper function for parsing overview 
parse_overview <- function(x){ 
    tibble(date = html_text(html_nodes(x, ".date"), TRUE), 
     text_1 = html_text(html_nodes(x, ".recherche_montrer"), TRUE), 
     title = html_text(html_nodes(x, ".titre a"), TRUE), 
     link = str_trim(html_attr(html_nodes(x, ".titre a"), "href"))) 
} 

# Helper function to collapse multi-line output like person and text
collapse_to_text <- function(x){ 
    p <- html_text(x, trim = TRUE) 
    p <- p[p != ""] # drop empty lines 
    paste(p, collapse = "\n") 
} 

# Parse the result itself 
parse_result <- function(x){ 
    tibble(section = html_text(html_node(x, ".level1 a"), trim = TRUE), 
     sub_section = html_text(html_node(x, ".level2 a"), trim = TRUE), 
     person = html_nodes(x, "p:nth-child(2) , .article p:nth-child(1)") %>% collapse_to_text, 
     text_2 = html_nodes(x, ".col1 > p") %>% collapse_to_text) 
}
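To see how these selectors and the collapsing step behave without hitting the website, here is a small offline sketch. The HTML fragment and the example.com links are made up for illustration; the real page markup is not shown in this thread, so only the class names are taken from the code above.

```r
library(rvest)

# Hypothetical HTML fragment that reuses the class names from the helpers
page <- read_html('
  <div class="titre"><a href=" http://example.com/notices/1.html ">First title</a></div>
  <div class="titre"><a href="http://example.com/notices/2.html">Second title</a></div>
  <div class="col1"><p>First paragraph.</p><p> </p><p>Second paragraph.</p></div>
')

# Overview part: one title and one link per result
links  <- html_nodes(page, ".titre a")
titles <- html_text(links, trim = TRUE)
hrefs  <- trimws(html_attr(links, "href"))

# Same idea as collapse_to_text(): drop empty paragraphs, join with newlines
p <- html_text(html_nodes(page, ".col1 > p"), trim = TRUE)
full_text <- paste(p[p != ""], collapse = "\n")

print(titles)
print(hrefs)
cat(full_text, "\n")
```

Collapsing to a single string matters because each detail page then yields exactly one row, which keeps it aligned with its row in the overview table.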

The actual scraping is then done as follows:

# Scrape overview  
overview_content <- urls %>% 
    map(read_html) %>% 
    map_df(parse_overview) 

# scrape all pages - that may take a while... slow website 
detail_content <- overview_content$link %>% 
    map(read_html) %>% 
    map_df(parse_result) 

out <- bind_cols(overview_content, detail_content)
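Because `bind_cols` matches rows purely by position, a single failed request among roughly 2,600 would either abort the run or shift the rows. The following is only a sketch of one possible safeguard and is not part of the original answer: `fetch_one` is a hypothetical stand-in for `function(u) parse_result(read_html(u))`, and the links are made up.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for: function(u) parse_result(read_html(u))
fetch_one <- function(url) {
  if (grepl("broken", url)) stop("could not fetch ", url)
  tibble(text_2 = paste("content of", url))
}

# On error, return a one-row placeholder so the row count still
# matches the number of links
safe_fetch <- possibly(fetch_one, otherwise = tibble(text_2 = NA_character_))

# Made-up links; the second one simulates a failing page
links <- c("http://example.com/a",
           "http://example.com/broken",
           "http://example.com/c")

detail <- map_df(links, safe_fetch)
print(detail)
```

Rows with `NA` can then be retried later, while the remaining rows stay aligned with the overview table.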

This gives you

Variables: 8 
$ date  <chr> "11/01/2010", "06/02/2014", "31/03/2011", "30/08/2010", "21/09/2010", "19/05/2010" 
$ text_1  <chr> "En effet, l' inégalité d'information n'est pas le moindre déterminant des inégalités de santé",... 
$ title  <chr> "1 - Déclaration de Mme Roselyne Bachelot-Narquin, ministre de la santé et des sports, sur la ré... 
$ link  <chr> "http://discours.vie-publique.fr/notices/103000074.html", "http://discours.vie-publique.fr/notic... 
$ section  <chr> "Discours publics", "Discours publics", "Discours publics", "Discours publics", "Discours public... 
$ sub_section <chr> "Les discours dans l'actualité", "Les discours dans l'actualité", "Les discours dans l'actualité... 
$ person  <chr> "Personnalité, fonction : BACHELOT-NARQUIN Roselyne.\nFRANCE. Ministre de la santé et des sports... 
$ text_2  <chr> "ti : Madame la ministre, chère Fadela,Monsieur le directeur général de la santé, cher Didier Ho...

Source

2017-04-04 09:57:53 Rentrop

Thank you so much! This is really cool. I (1) used `l_out <- 88` instead of `l_out <- 3`, (2) inserted 30*(0:87) after `b=` in the URL, and (3) changed the link (for some reason, the one you used gives 0 results for the search term, but the following works: http://www.vie-publique.fr/rechercher/recherche.php?query=inegalites%20de%20sante&date=&dateDebut=&dateFin=b=0&skin=cdp&replies=10&filter=&typeloi=&auteur=&filtreAuteurLibre=&typeDoc=&source=&sort=&q=). Do you know why that is? – Evelyne1991

Just edited my answer. Please run the code as is. @Evelyne1991 I don't fully understand which URL you mean... If you take my `url_base`, you see that it ends with `sort=&q=&b=`, so my function generates `sort=&q=&b=0`, `sort=&q=&b=30`, `sort=&q=&b=60` and so on. This is called a URL parameter, and here it handles the pagination. So, for example, page six ends with `b=150`: http://www.vie-publique.fr/rechercher/recherche.php?query=inegalites%20de%20sante&date=&dateDebut=&dateFin=&skin=cdp&replies=30&filter=&typeloi=&auteur=&filtreAuteurLibre=&typeDoc=&source=&sort=&q=&b=150 – Rentrop

Thank you so much! It works perfectly. I really appreciate your help. – Evelyne1991
