Web scraping in R: accessing HTML nodes

A simple rvest application: I am trying to scrape a class of HTML links from a website.

This code reads the page and gives me what looks like the right node of the website:

library(rvest) 
library(magrittr) 

foo <- "http://www.realclearpolitics.com/epolls/2010/house/2010_elections_house_map.html" %>% 
      read_html

Further, I am using a CSS selector to pick out the appropriate nodes:

foo %>% 
    html_nodes("#states td") %>% 
    extract(2:4)

{xml_nodeset (3)} 
[1] <td>\n <a class="dem" href="/epolls/2010/house/ar/arkansas_4th_district_rankin_vs_ross-1343.html">\n <span>AR4</span>\n </a>\n</td> 
[2] <td>\n <a class="dem" href="/epolls/2010/house/ct/connecticut_1st_district_brickley_vs_larson-1713.html">\n <span>CT1</span>\n </a>\n</td> 
[3] <td>\n <a class="dem" href="/epolls/2010/house/ct/connecticut_2nd_district_peckinpaugh_vs_courtney-1715.html">\n <span>CT2</span>\n </a>\n</td>

OK, so the href attribute is what I am looking for. But this

foo %>% 
    html_nodes("#states td") %>% 
    extract(2:4) %>% 
    html_attr("href")

returns

[1] NA NA NA

How do I access the underlying links?

Source

2015-11-13 tomw

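Try `foo %>% html_nodes("#states td a") %>% extract(2:4) %>% html_attr("href")` – Jay

@jay you should make that an answer. Tom: you weren't targeting the anchor, and Jay's solution does. The href attribute belongs to the <a> element inside each <td>, not to the <td> itself, so html_attr("href") on the <td> nodes returns NA. – hrbrmstr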
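To illustrate the point made in the comments, here is a minimal sketch that reproduces the NA result without hitting the site. The HTML fragment and the names `page`, `td_href` and `a_href` are made up for illustration; the fragment only mimics the structure shown in the question's output.

```r
library(rvest)
library(magrittr)

# Hypothetical fragment mirroring the <td><a href=...> structure from the question
page <- read_html('
<table id="states">
  <tr>
    <td><a class="dem" href="/epolls/a.html"><span>AR4</span></a></td>
    <td><a class="dem" href="/epolls/b.html"><span>CT1</span></a></td>
  </tr>
</table>')

# The <td> elements carry no href attribute, so every result is NA
td_href <- page %>%
    html_nodes("#states td") %>%
    html_attr("href")

# Selecting the <a> elements inside the cells returns the links
a_href <- page %>%
    html_nodes("#states td a") %>%
    html_attr("href")

print(td_href)
print(a_href)
```

The only difference between the two pipelines is the trailing `a` in the CSS selector, which moves the selection from the table cell to the anchor that actually holds the attribute.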

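Using xml_children() (an xml2 function; with newer rvest releases you may need library(xml2) for it to be available), you can do: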

foo %>% 
    html_nodes('#states td') %>% 
    xml_children %>% 
    html_attr('href') %>% 
    extract(2:4)

[1] "/epolls/2010/house/ar/arkansas_4th_district_rankin_vs_ross-1343.html"    
[2] "/epolls/2010/house/ct/connecticut_1st_district_brickley_vs_larson-1713.html"  
[3] "/epolls/2010/house/ct/connecticut_2nd_district_peckinpaugh_vs_courtney-1715.html"

You could also put extract before html_attr, and there are probably other orderings that would work too.
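As a rough check of that claim, the sketch below compares the two orderings on a small inline fragment. The fragment, its URLs, and the names `anchors`, `after` and `before` are hypothetical. xml2 is attached explicitly on the assumption that the installed rvest does not attach it.

```r
library(rvest)
library(xml2)      # assumption: needed for xml_children() with newer rvest
library(magrittr)

# Hypothetical three-cell table, not taken from the real page
page <- read_html('<table id="states"><tr>
  <td><a href="/x/1.html">A1</a></td>
  <td><a href="/x/2.html">B2</a></td>
  <td><a href="/x/3.html">C3</a></td>
</tr></table>')

# Children of the cells, i.e. the <a> elements
anchors <- page %>%
    html_nodes("#states td") %>%
    xml_children()

# Ordering from the answer: read the attribute, then subset the character vector
after <- anchors %>%
    html_attr("href") %>%
    extract(2:3)

# Alternative ordering: subset the node set, then read the attribute
before <- anchors %>%
    extract(2:3) %>%
    html_attr("href")

print(identical(after, before))
```

Both orderings work here because `extract()` is magrittr's alias for `[`, which subsets node sets and character vectors alike.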

Source

2015-11-13 19:44:21 C8H10N4O2
