2017-06-30

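Web scraping in R using XML and rvest

I am trying to scrape the grades of members of Congress from the NIAC website. Here is a link for a sample representative: https://www.niacaction.org/legislator-bio/?bid=C001097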

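My ultimate goal is to build a data frame containing each member's name, state, and district, followed by their grades for the 113th through 115th Congresses. I am using XML and rvest to do this. Here is my code: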

####----- Load Packages -----#### 
library('rvest') 
library('XML') 

####----- Scrape -----#### 
url <- 'https://www.niacaction.org/legislator-bio/?bid=C001097' 

nodes <- read_html(url, xpath = '//h3 | //*[contains(concat(" ", @class, " "), concat(" ", "entry-title", " "))]')


page <- htmlTreeParse(nodes) 

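When I print what I called "page", I get far more information than I want. I don't understand why, since I clearly specified the XPath. Any advice would be greatly appreciated. Thanks.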

Answer

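XML::htmlTreeParse is the equivalent of xml2::read_html (which rvest uses), and neither accepts an XPath, so the xpath argument in your call is not used to select anything and the whole document is parsed. To select nodes, use rvest::html_nodes. Use one package or the other; mixing them gets messy. rvest also accepts CSS selectors, which can simplify the call. Tidied up: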

library(rvest) 
library(tidyverse) # for munging; translate if you like 

url <- 'https://www.niacaction.org/legislator-bio/?bid=C001097' 

page <- url %>% read_html() 

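# One row per grade; the single h1 value is recycled across rows.
# tibble() replaces the deprecated data_frame().
cardenas <- page %>% {
    tibble(member = html_node(., 'h1') %>% html_text(),
           grade = html_nodes(., 'h3') %>% html_text())
} %>%
    # split the h3 text at ' Grade: ' into congress and grade
    separate(grade, c('congress', 'grade'), sep = ' Grade: ') %>%
    # split the h1 text into the name and the parenthesised details
    separate(member, c('member', 'info'), sep = ' \\(') %>%
    # split the details into party, state and district
    separate(info, c('party', 'state', 'district'), extra = 'drop', convert = TRUE)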

cardenas 
#> # A tibble: 4 x 6
#>              member party state district       congress grade
#> *             <chr> <chr> <chr>    <int>          <chr> <chr>
#> 1 Rep Tony Cárdenas     D    CA       29        Current     A
#> 2 Rep Tony Cárdenas     D    CA       29 115th Congress     A
#> 3 Rep Tony Cárdenas     D    CA       29 114th Congress     C
#> 4 Rep Tony Cárdenas     D    CA       29 113th Congress     D
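
To see the read_html / html_nodes split in isolation, here is a minimal sketch that runs on an inline HTML string instead of the live page. The markup and the name "Rep Jane Doe" are made up for illustration and are only assumed to resemble the real page. It passes the XPath from the question to html_nodes and compares the result with what should be an equivalent CSS selector.

```r
library(rvest)

# Hypothetical markup, assumed to resemble the legislator page
html <- '<html><body>
  <h1 class="entry-title">Rep Jane Doe (D-CA-29)</h1>
  <p>Other content that should not be selected.</p>
  <h3>115th Congress Grade: A</h3>
  <h3>114th Congress Grade: C</h3>
</body></html>'

# read_html() parses the whole document; no selection happens here
page <- read_html(html)

# The XPath from the question, passed to html_nodes() instead of read_html()
xp <- '//h3 | //*[contains(concat(" ", @class, " "), concat(" ", "entry-title", " "))]'
by_xpath <- page %>% html_nodes(xpath = xp) %>% html_text()

# A CSS selector targeting the same elements
by_css <- page %>% html_nodes('h3, .entry-title') %>% html_text()

by_xpath
by_css
```

Both vectors should hold the heading text followed by the two grade lines, and nothing from the paragraph in between.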

This is exactly what I was looking for. Thank you, I appreciate it. – Jordan
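
For the multi-member data frame mentioned in the question, one possible approach is to wrap the parsing steps from the answer in a function and map it over legislator IDs. This is only a sketch under assumptions: `parse_legislator` and `scrape_legislator` are hypothetical helper names, the URL pattern is inferred from the single example link, and every page is assumed to share the h1/h3 layout of the example. The offline check uses made-up markup.

```r
library(rvest)
library(tidyverse)

# Hypothetical helper: turn one parsed page into a tidy tibble
parse_legislator <- function(page) {
  tibble(member = page %>% html_node('h1') %>% html_text(),
         grade  = page %>% html_nodes('h3') %>% html_text()) %>%
    separate(grade, c('congress', 'grade'), sep = ' Grade: ') %>%
    separate(member, c('member', 'info'), sep = ' \\(') %>%
    separate(info, c('party', 'state', 'district'), extra = 'drop', convert = TRUE)
}

# Hypothetical helper: build the URL from a legislator ID, then parse the page
# (URL pattern assumed from the one example link in the question)
scrape_legislator <- function(bid) {
  paste0('https://www.niacaction.org/legislator-bio/?bid=', bid) %>%
    read_html() %>%
    parse_legislator()
}

# Offline check on made-up markup assumed to mirror the real layout
sample_page <- read_html('<html><body>
  <h1>Rep Jane Doe (D-CA-29)</h1>
  <h3>115th Congress Grade: A</h3>
  <h3>114th Congress Grade: C</h3>
</body></html>')
grades <- parse_legislator(sample_page)
grades

# Live use (needs network access), with C001097 being the only ID from the question:
# bids <- c('C001097')
# all_grades <- map_dfr(bids, scrape_legislator)
```

Keeping the parsing separate from the download makes it possible to test the parsing on a saved or inline page before sending requests to the site.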