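Web scraping with R and Selector Gadget

I want to scrape data from a website using R. I am using rvest and trying to follow an example that scrapes the IMDB page for The Lego Movie. The example recommends a tool called Selector Gadget to help identify the html_node associated with the data you want to extract.

Ultimately I am interested in building a data frame with the following schema/columns: rank, blog_name, facebook_fans, twitter_followers, alexa_rank.

I was able to use Selector Gadget to correctly identify the HTML tags used in the Lego example. However, following the same process and the same code structure as the Lego example, I get NAs ("...using first", "NAs introduced by coercion", "[1] NA"). My code is as follows: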

data2_html = read_html("http://blog.feedspot.com/video_game_news/") 
data2_html %>% 
    html_node(".stats") %>% 
    html_text() %>% 
    as.numeric()

I also tried html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column, since it reports 714 matches, but it returns only one number.

714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first
{xml_node}
<td>
[1] <span>997,669</span>

2017-05-27 user2205916

This might help you:

library(rvest) 

d1 <- read_html("http://blog.feedspot.com/video_game_news/") 

stats <- d1 %>% 
    html_nodes(".stats") %>% 
    html_text() 

blogname <- d1%>% 
    html_nodes(".tlink") %>% 
    html_text()

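Note the use of html_nodes (plural): html_node returns only the first match, whereas html_nodes returns all of them.

Result: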
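
To make the difference concrete, here is a minimal sketch that runs on a small inline HTML snippet rather than the live page. The snippet is a hypothetical stand-in for the real table (only the numbers are copied from the output in this thread), so it shows the behaviour of the two functions and nothing about the site's actual markup. In rvest 1.0 and later the same functions are also available as html_element and html_elements.

```r
library(rvest)

# Hypothetical stand-in for the real page: four cells with class "stats"
page <- read_html(paste0(
  '<table>',
  '<tr><td class="stats">997,669</td><td class="stats">1,209,029</td></tr>',
  '<tr><td class="stats">4,070,476</td><td class="stats">4,493,805</td></tr>',
  '</table>'
))

# html_node() keeps only the first node that matches the selector
first_only <- page %>% html_node(".stats") %>% html_text()

# html_nodes() keeps every node that matches the selector
all_stats <- page %>% html_nodes(".stats") %>% html_text()

print(first_only)
print(all_stats)
```

This is why the code in the question only ever saw one value: the single-node function silently discards every match after the first.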

> head(blogname) 
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games"   "Xbox Wire"     "Official PlayStation Blog" 
[5] "Nintendo Life "    "Game Informer" 

> head(stats,12) 
[1] "997,669" "1,209,029" "873"  "4,070,476" "4,493,805" "399"  "23,141,452" "10,210,993" "879"  
[10] "38,019,811" "12,059,607" "500"

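blogname yields the list of blog names, which is easy to manage. The stats, on the other hand, come out mixed together, because the Facebook fans, Twitter followers and Alexa rank cells all share the same stats class and cannot be told apart by the selector. The output vector therefore repeats in groups of three, i.e. stats = c(fb, tw, alx, fb, tw, alx, ...). You need to split it into one vector per column, taking every third element: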

FBstats = stats[seq(1,length(stats),3)] 

> head(stats[seq(1,length(stats),3)]) 
[1] "997,669" "4,070,476" "23,141,452" "38,019,811" "35,977"  "603,681"
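
As a sketch of how all three columns could be separated in one step, the interleaved vector can be reshaped into a three-column matrix filled row by row. The input below is just the first 12 values printed above, and the column names are taken from the question. It assumes every blog has exactly three stats cells; if any cell is missing on the real page, the columns would drift out of alignment.

```r
# First 12 values of the scraped vector, copied from the output above
stats <- c("997,669", "1,209,029", "873",
           "4,070,476", "4,493,805", "399",
           "23,141,452", "10,210,993", "879",
           "38,019,811", "12,059,607", "500")

# Fill row by row so that each row holds one blog's three values
m <- matrix(stats, ncol = 3, byrow = TRUE)

blogs <- data.frame(
  facebook_fans     = m[, 1],
  twitter_followers = m[, 2],
  alexa_rank        = m[, 3],
  stringsAsFactors  = FALSE
)

print(blogs)
```

The first column is the same vector that stats[seq(1, length(stats), 3)] returns; the values are still character strings with commas at this point.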

2017-05-27 01:53:54

This uses html_nodes (plural) and str_replace_all to remove the commas from the numbers. I am not sure whether these are all the stats you need.

library(rvest) 
library(stringr) 
data2_html = read_html("http://blog.feedspot.com/video_game_news/") 
data2_html %>% 
    html_nodes(".stats") %>% 
    html_text() %>% 
    str_replace_all(',', '') %>% 
    as.numeric()
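
The comma removal matters because as.numeric cannot parse strings with thousands separators, which is where the "NAs introduced by coercion" warning in the question comes from. A small sketch using base R's gsub in place of stringr, with three values copied from the output earlier in this thread:

```r
raw <- c("997,669", "1,209,029", "873")

# Strings containing commas cannot be coerced and become NA (with a warning)
direct <- suppressWarnings(as.numeric(raw))

# Stripping the commas first gives usable numbers
cleaned <- as.numeric(gsub(",", "", raw, fixed = TRUE))

print(direct)
print(cleaned)
```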

2017-05-27 01:55:41 epi99

You can use html_table to extract the whole table with minimal work:

library(rvest) 
library(tidyverse) 

# scrape html 
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html() 

game_blogs <- h %>% 
    html_node('table') %>% # select enclosing table node 
    html_table() %>% # turn table into data.frame 
    set_names(make.names) %>% # make names syntactic 
    mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>% # extract title from name info 
    mutate_at(3:5, parse_number) %>% # make numbers actually numbers 
    tbl_df() # for printing 

game_blogs 
#> # A tibble: 119 x 5 
#>  Rank     Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank 
#> <int>      <chr>   <dbl>    <dbl>  <dbl> 
#> 1  1 Kotaku - The Gamer's Guide  997669   1209029  873 
#> 2  2   IGN | Video Games  4070476   4493805  399 
#> 3  3     Xbox Wire  23141452   10210993  879 
#> 4  4 Official PlayStation Blog  38019811   12059607  500 
#> 5  5    Nintendo Life   35977    95044  17727 
#> 6  6    Game Informer  603681   1770812  10057 
#> 7  7   Reddit | Gamers  1003705   430017   25 
#> 8  8     Polygon  623808   485827  1594 
#> 9  9 Xbox Live's Major Nelson   65905   993481  23114 
#> 10 10      VG247  397798   202084  3960 
#> # ... with 109 more rows

It is worth checking that everything was parsed the way you want, but it should be usable at this point.

2017-05-27 05:59:46 alistaire

Looks very cool, but I can't reproduce your results. The call 'game_blogs <- h %>% html_node('table') %>% html_table() %>% set_names(make.names)' fails with: Error: 'x' and 'nm' must be the same length – user2205916

Ah! Sorry, this uses the development version of 'purrr::set_names', which can take a function. You can install it from [GitHub](https://github.com/tidyverse/purrr/), or use 'set_names(make.names(names(.)))' to do the same thing. – alistaire
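
For anyone who would rather avoid depending on a particular purrr version, the renaming step can also be done in base R. The sketch below runs on a hypothetical one-row table (the header names and values are copied from the output above, the markup itself is an assumption) and cleans one numeric column with gsub rather than parse_number.

```r
library(rvest)

# Hypothetical miniature of the table on the page
page <- read_html(paste0(
  '<table>',
  '<tr><th>Rank</th><th>Blog Name</th><th>Facebook Fans</th></tr>',
  '<tr><td>3</td><td>Xbox Wire</td><td>23,141,452</td></tr>',
  '</table>'
))

tbl <- page %>% html_node("table") %>% html_table()

# Base R equivalent of set_names(make.names(names(.)))
names(tbl) <- make.names(names(tbl))

# Remove thousands separators before converting to numeric
tbl$Facebook.Fans <- as.numeric(gsub(",", "", tbl$Facebook.Fans, fixed = TRUE))

print(tbl)
```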
