Using R to scrape data from different web pages

I would like to pull data from a large number of tables hosted on a website. The problem is that they are all on different web pages.

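Take the UK electoral constituencies as an example (here are links). As you can see, all the constituencies are listed there, and each one links to a separate page. If you go to an individual constituency page, you can either download a .csv file of the postcodes or view them on an HTML page.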

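I have found out how to do this when the various data sources are all on the same page, but is it possible to do something that would create one data file combining the postcode data for every constituency?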

For example, I got the data for the first constituency, Aberavon, using the following code, a version of which I found in the answer to this question.

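library(XML)
library(RCurl)
# install.packages("rlist")  # run once if rlist is not installed
library(rlist)

# Fetch the Aberavon page (ssl.verifypeer = FALSE turns off certificate checks)
theurl <- getURL("https://www.doogal.co.uk/ElectoralConstituencies.php?constituency=W07000049", .opts = list(ssl.verifypeer = FALSE))
# Parse every HTML table on the page, then drop the empty ones
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
# Row count of each remaining table
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))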

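I usually use R, so it would be good to know how to do this in R, but I appreciate that some other method might be better suited and I am happy to try others. I am very new to data mining, so if this is really obvious, I may just not understand the limitations of the instructions I have read!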

Source

2017-09-27 Megan

You'll need to roll up your sleeves and apply some elbow grease.

Before I expand on the answer, I did check the site's scraping policy, including the robots.txt file. Said file does not look well-formed:

User-agent: * 
Disallow:

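I suspect the site owner meant to put a / after Disallow:, but as it stands (an empty Disallow: blocks nothing) nothing says we can't scrape.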
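To make the reading of that file concrete, here is a minimal sketch in base R that lists the paths a robots.txt body actually blocks. It is only an illustration: the text is pasted in as a string rather than downloaded, the variable names are made up for this example, and User-agent grouping is ignored because the file has a single group.

```r
# The robots.txt body shown above, pasted in as a string (not fetched)
robots_txt <- "User-agent: *\nDisallow:"

# Split into trimmed lines
rule_lines <- trimws(strsplit(robots_txt, "\n", fixed = TRUE)[[1]])

# Keep the Disallow rules and strip the directive name
is_disallow <- grepl("^disallow:", rule_lines, ignore.case = TRUE)
disallowed <- trimws(sub("^disallow:", "", rule_lines[is_disallow], ignore.case = TRUE))

# An empty value blocks nothing, so only non-empty paths count
blocked_paths <- disallowed[nzchar(disallowed)]
length(blocked_paths)
```

Had the file said Disallow: /, blocked_paths would contain "/" and scraping would be off the table.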

Some libraries we'll need:

library(rvest) 
library(httr) 
library(tidyverse)

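I'm using different libraries than you. If you'd rather stick with the XML and RCurl packages and base R, you'll need to adapt this or wait for another answer.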

The following gets the initial page data, which we'll mine for links:

res <- httr::GET("https://www.doogal.co.uk/ElectoralConstituencies.php") 

pg <- httr::content(res, as="parsed")

In case you want the CSV of the main page data, here's the URL for it:

html_nodes(pg, "li > a[href*='CSV']") %>% 
    html_attr("href") %>% 
    sprintf("https://www.doogal.co.uk/%s", .) -> main_csv_url

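Now we need the links to the individual constituencies. I inspected the HTML page content with Chrome Developer Tools to figure out the CSS selector: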

constituency_nodes <- html_nodes(pg, "td > a[href*='constituency=']") 
constituency_names <- html_text(constituency_nodes) 
constituency_ids <- gsub("^E.*=", "", html_attr(constituency_nodes, "href"))

Note that we're only saving off the ID, not the full URL.
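To see what the selector and the gsub() pattern do without touching the live site, here is a small offline sketch. The HTML fragment is made up to imitate the constituency table; only the Aberavon row uses an ID that appears in the question, and the second row is purely hypothetical.

```r
library(rvest)

# Made-up fragment imitating the constituency table; the second row is hypothetical
sample_html <- '
<table>
  <tr><td><a href="ElectoralConstituencies.php?constituency=W07000049">Aberavon</a></td></tr>
  <tr><td><a href="ElectoralConstituencies.php?constituency=X00000000">Example</a></td></tr>
</table>'

pg_sample <- read_html(sample_html)

# Same selector as above: links that sit directly inside a table cell
sample_nodes <- html_nodes(pg_sample, "td > a[href*='constituency=']")
sample_names <- html_text(sample_nodes)

# Drop everything from the leading "E" up to the last "=", leaving only the ID
sample_ids <- gsub("^E.*=", "", html_attr(sample_nodes, "href"))

data.frame(id = sample_ids, name = sample_names)
```

The names and IDs come out in the same order, which is what makes it possible to pair them up later.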

We'll make a helper function to keep things short. httr::GET() lets us act like a browser would:

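get_constituency <- function(id) {

    # Request the CSV for a single constituency ID
    httr::GET(
        url = "https://www.doogal.co.uk/ElectoralConstituenciesCSV.php",
        query = list(constituency = id)
    ) -> res

    # Stop with an error on any HTTP failure status
    httr::stop_for_status(res)

    # Read the body as text so read.csv() always receives a character string
    res <- read.csv(
        text = httr::content(res, as = "text", encoding = "UTF-8"),
        stringsAsFactors = FALSE
    )
    as_tibble(res)

}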

And then we call our new function on all the constituencies. The following includes a progress bar for free:

pb <- progress_estimated(3) 
map_df(constituency_ids[1:3], ~{ 

    pb$tick()$print() 

    Sys.sleep(5) 

    get_constituency(.x) 

}) -> postcodes_df 

glimpse(postcodes_df) 
## Observations: 10,860 
## Variables: 12 
## $ Postcode <chr> "CF33 6PS", "CF33 6PT", "CF33 6PU", "CF33 6RA", "CF33 6R... 
## $ In.Use. <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "... 
## $ Latitude <dbl> 51.53863, 51.54013, 51.53815, 51.54479, 51.55091, 51.552... 
## $ Longitude <dbl> -3.700061, -3.699713, -3.690541, -3.684888, -3.673475, -... 
## $ Easting <int> 282191, 282219, 282850, 283259, 284066, 284886, 284613, ... 
## $ Northing <int> 183562, 183728, 183493, 184222, 184885, 185007, 183874, ... 
## $ Grid.Ref <chr> "SS821835", "SS822837", "SS828834", "SS832842", "SS84084... 
## $ Introduced <chr> "7/1/1995 12:00:00 AM", "1/1/1980 12:00:00 AM", "1/1/198... 
## $ Terminated <chr> "", "", "", "", "", "", "4/1/2002 12:00:00 AM", "", "1/1... 
## $ Altitude <int> 45, 47, 46, 76, 76, 131, 61, 27, 9, 7, 8, 8, 7, 8, 8, 8,... 
## $ Population <int> 34, NA, 7, 33, 48, 11, NA, 10, NA, NA, NA, NA, NA, NA, N... 
## $ Households <int> 12, NA, 3, 11, 19, 4, NA, 4, NA, NA, NA, NA, NA, NA, NA,...
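On a full run, a single failed request would abort the whole loop, because stop_for_status() raises an error. One possible way around that, sketched here under assumptions, is to wrap the fetcher in purrr::possibly(). The flaky_fetch() function and the "BAD" ID are hypothetical stand-ins so the example runs offline; in practice you would wrap get_constituency() instead.

```r
library(purrr)
library(dplyr)

# Hypothetical fetcher that fails for one ID, standing in for get_constituency()
flaky_fetch <- function(id) {
  if (id == "BAD") stop("request failed")
  tibble(Postcode = paste(id, "1AA"))
}

# possibly() returns NULL instead of raising an error when the call fails
safe_fetch <- possibly(flaky_fetch, otherwise = NULL)

results <- map(c("W07000049", "BAD"), safe_fetch)

# Drop the failed (NULL) entries, then stack whatever succeeded
combined <- bind_rows(compact(results))
nrow(combined)
```

The failed IDs are silently skipped in this sketch, so for real use you may want to record which ones returned NULL and retry them later.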

Notes:

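I only did 3 iterations since I don't need this data. Remove the limit for your needs.
When you do, please keep the delay code in there. You have time, and it's their bandwidth & CPU you'd be abusing.
There's enough data in the code above to add the constituency name as a data frame field, but that's an exercise left to the reader.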
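For the last note, one possible approach is to walk the IDs and names in parallel and tag each result with its name. This is a sketch, not the only way to do it: fetch_with_names() and the Constituency column are names chosen for this example, and fake_get_constituency() is a hypothetical stand-in so the code runs offline.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for get_constituency(), so the sketch runs offline
fake_get_constituency <- function(id) {
  tibble(Postcode = paste(id, c("1AA", "1AB")))
}

# Fetch each constituency and add its name as a column
fetch_with_names <- function(ids, names, fetch, delay = 5) {
  bind_rows(map2(ids, names, function(id, name) {
    Sys.sleep(delay)                        # keep the polite pause between requests
    mutate(fetch(id), Constituency = name)
  }))
}

demo_df <- fetch_with_names(
  ids = c("W07000049", "X00000000"),        # second ID is made up
  names = c("Aberavon", "Example"),
  fetch = fake_get_constituency,
  delay = 0                                 # no wait needed for the offline stand-in
)
demo_df
```

With the real data you would call fetch_with_names(constituency_ids, constituency_names, get_constituency) and leave the delay at its default.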

Source

2017-09-27 12:11:07 hrbrmstr
