2015-11-13 42 views
Web scraping in R: accessing HTML nodes

A simple application of the rvest package: I am trying to scrape a class of HTML links from a website.

This code gets me to what looks like the right node of the site:

library(rvest) 
library(magrittr) 

foo <- "http://www.realclearpolitics.com/epolls/2010/house/2010_elections_house_map.html" %>% 
      read_html 

I then use a CSS selector to pick out the relevant nodes:

foo %>% 
    html_nodes("#states td") %>% 
    extract(2:4) 

which returns

{xml_nodeset (3)} 
[1] <td>\n <a class="dem" href="/epolls/2010/house/ar/arkansas_4th_district_rankin_vs_ross-1343.html">\n <span>AR4</span>\n </a>\n</td> 
[2] <td>\n <a class="dem" href="/epolls/2010/house/ct/connecticut_1st_district_brickley_vs_larson-1713.html">\n <span>CT1</span>\n </a>\n</td> 
[3] <td>\n <a class="dem" href="/epolls/2010/house/ct/connecticut_2nd_district_peckinpaugh_vs_courtney-1715.html">\n <span>CT2</span>\n </a>\n</td> 

OK, so the href attribute is what I am looking for. However, this

foo %>% 
    html_nodes("#states td") %>% 
    extract(2:4) %>% 
    html_attr("href") 

returns

[1] NA NA NA 

How do I access the underlying links?

+2

Try 'foo %>% html_nodes("#states td a") %>% extract(2:4) %>% html_attr("href")' – Jay

+1

@Jay you should make that an answer. Tom: you weren't targeting the anchor tags, and Jay's solution does. – hrbrmstr
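
To illustrate the point made in the comments, here is a minimal sketch that runs without network access. The inline HTML is a hypothetical stand-in for the page's table (the `/a.html` and `/b.html` links are made up), and it assumes `read_html()` is given a literal HTML string rather than a URL. `html_attr()` reads attributes from the selected nodes themselves, so selecting the `<td>` cells gives NA, while selecting the `<a>` inside them gives the links.

```r
library(rvest)

# Hypothetical stand-in for the page's table; not the real site markup.
html <- '<table id="states"><tr>
  <td><a class="dem" href="/a.html"><span>AR4</span></a></td>
  <td><a class="dem" href="/b.html"><span>CT1</span></a></td>
</tr></table>'

doc <- read_html(html)

# The <td> nodes carry no href attribute, so every result is NA.
td_href <- doc %>% html_nodes("#states td") %>% html_attr("href")

# The <a> nodes inside each <td> are the ones that carry href.
a_href <- doc %>% html_nodes("#states td a") %>% html_attr("href")

print(td_href)
print(a_href)
```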

Answer

Using xml_children(), you can do this:

foo %>% 
    html_nodes('#states td') %>% 
    xml_children %>% 
    html_attr('href') %>% 
    extract(2:4) 

which returns:

[1] "/epolls/2010/house/ar/arkansas_4th_district_rankin_vs_ross-1343.html"    
[2] "/epolls/2010/house/ct/connecticut_1st_district_brickley_vs_larson-1713.html"  
[3] "/epolls/2010/house/ct/connecticut_2nd_district_peckinpaugh_vs_courtney-1715.html" 

You can also put extract before html_attr, and there are probably other orderings that would work as well.
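
As a rough check of the ordering claim, the sketch below compares both orders on a hypothetical inline table (the `/1.html` to `/4.html` links are invented for illustration). It calls `xml2::xml_children()` with an explicit namespace, on the assumption that newer rvest versions may not attach xml2 automatically. The two orders agree here because every `<td>` has exactly one child element; with extra children per cell, the indices would no longer line up with the cells.

```r
library(rvest)
library(magrittr)  # for extract(), an alias of `[`

# Hypothetical markup: four cells, each wrapping a single link.
html <- '<table id="states"><tr>
  <td><a href="/1.html">A</a></td>
  <td><a href="/2.html">B</a></td>
  <td><a href="/3.html">C</a></td>
  <td><a href="/4.html">D</a></td>
</tr></table>'

cells <- read_html(html) %>% html_nodes("#states td")

# Read the attribute from every child, then subset the character vector.
attr_then_subset <- cells %>%
  xml2::xml_children() %>%
  html_attr("href") %>%
  extract(2:4)

# Subset the node set first, then read the attribute.
subset_then_attr <- cells %>%
  xml2::xml_children() %>%
  extract(2:4) %>%
  html_attr("href")

print(identical(attr_then_subset, subset_then_attr))
```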