How can I use rvest to get the links inside html_table cells?

library("rvest") 
url <- "myurl.com" 
tables<- url %>% 
     read_html() %>% 
     html_nodes(xpath='//*[@id="pageContainer"]/table[1]') %>% 
     html_table(fill = TRUE) 
tables[[1]]

The HTML content of a cell looks like this:

<td><a href="http://somelink.com" target="_blank">Click Here</a></td>

But when I scrape the HTML, I only get the cell's text value, not the link:

Click Here

Source

2017-02-08 ishandutta2007

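Would you be satisfied with getting the hrefs of all the individual cells? Or do you specifically want the hrefs in a data.frame format? Because it should be easy to collect the href attributes: '%>% html_nodes("appropriate xpath or selector") %>% html_attr("href")' – Chrisss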
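The suggestion in the comment above can be sketched as follows. This is a minimal illustration, not the asker's actual page: the inline HTML and the second URL (http://example.com/other) are hypothetical stand-ins, and only the XPath and the first link come from the question.

```r
library(rvest)

# Hypothetical stand-in for the real page: one table inside #pageContainer
page <- read_html('
<div id="pageContainer">
  <table>
    <tr><th>Name</th><th>Link</th></tr>
    <tr><td>First</td><td><a href="http://somelink.com" target="_blank">Click Here</a></td></tr>
    <tr><td>Second</td><td><a href="http://example.com/other" target="_blank">Click Here</a></td></tr>
  </table>
</div>')

# Select every <a> inside the first table, then read its href attribute
links <- page %>%
  html_nodes(xpath = '//*[@id="pageContainer"]/table[1]//a') %>%
  html_attr("href")

print(links)
```

This returns a plain character vector of hrefs, so it loses the row and column position of each link; that is fine when only the list of links is needed.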

'XML::getHTMLLinks(url, xpQuery = "//*[@id=\"pageContainer\"]/table[1]//@href")' should be all you need –

To select the "href" attribute, use:

//*[@id="pageContainer"]/table[1]//@href

I tested this at http://xpather.com/RtnrY9fh (an online XPath tester).
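As a rough sketch of how that XPath could be used from rvest itself: the expression selects attribute nodes rather than elements, so the value is read with html_text() instead of html_attr(). The inline HTML below is a hypothetical stand-in built from the cell shown in the question.

```r
library(rvest)

# Hypothetical stand-in page containing the cell from the question
page <- read_html('
<div id="pageContainer">
  <table>
    <tr><td><a href="http://somelink.com" target="_blank">Click Here</a></td></tr>
  </table>
</div>')

# //@href selects the attribute nodes themselves; their text is the URL
hrefs <- page %>%
  html_nodes(xpath = '//*[@id="pageContainer"]/table[1]//@href') %>%
  html_text()

print(hrefs)
```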

Source

2017-02-08 17:50:10 kieraf

You can handle this by editing rvest::html_table with trace.

Example of the existing behavior:

library(rvest) 
x <- "https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture" %>% 
    read_html() %>% 
    html_nodes("#mw-content-text > table:nth-child(55)") 

html_table(x) 
#[[1]] 
#       Film Production company(s)       Producer(s) 
#1   The Great Ziegfeld  Metro-Goldwyn-Mayer      Hunt Stromberg 
#2    Anthony Adverse   Warner Bros.      Henry Blanke 
#3     Dodsworth Goldwyn, United Artists Samuel Goldwyn and Merritt Hulbert 
#4    Libeled Lady  Metro-Goldwyn-Mayer     Lawrence Weingarten 
#5  Mr. Deeds Goes to Town    Columbia       Frank Capra 
#6   Romeo and Juliet  Metro-Goldwyn-Mayer      Irving Thalberg 
#7    San Francisco  Metro-Goldwyn-Mayer John Emerson and Bernard H. Hyman 
#8 The Story of Louis Pasteur   Warner Bros.      Henry Blanke 
#9  A Tale of Two Cities  Metro-Goldwyn-Mayer     David O. Selznick 
#10   Three Smart Girls    Universal Joe Pasternak and Charles R. Rogers

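html_table essentially extracts the cells of the HTML table and runs html_text on them. All we need to do is replace that step: extract the <a> tag from each cell and run html_attr(., "href") on it instead.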

trace(rvest:::html_table.xml_node, quote({ 
    values  <- lapply(lapply(cells, html_node, "a"), html_attr, name = "href") 
    values[[1]] <- html_text(cells[[1]]) 
}), at = 14)

New behavior:

html_table(x) 
#Tracing html_table.xml_node(X[[i]], ...) step 14 
#[[1]] 
#          Film Production company(s)     Producer(s) 
#1    /wiki/The_Great_Ziegfeld     NA   /wiki/Hunt_Stromberg 
#2     /wiki/Anthony_Adverse     NA    /wiki/Henry_Blanke 
#3     /wiki/Dodsworth_(film)     NA   /wiki/Samuel_Goldwyn 
#4      /wiki/Libeled_Lady     NA  /wiki/Lawrence_Weingarten 
#5   /wiki/Mr._Deeds_Goes_to_Town     NA    /wiki/Frank_Capra 
#6  /wiki/Romeo_and_Juliet_(1936_film)     NA   /wiki/Irving_Thalberg 
#7   /wiki/San_Francisco_(1936_film)     NA /wiki/John_Emerson_(filmmaker) 
#8  /wiki/The_Story_of_Louis_Pasteur     NA    /wiki/Henry_Blanke 
#9 /wiki/A_Tale_of_Two_Cities_(1935_film)     NA  /wiki/David_O._Selznick 
#10    /wiki/Three_Smart_Girls     NA   /wiki/Joe_Pasternak
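The trace approach patches a specific step inside an unexported rvest function, so the step index (at = 14) may not line up in other rvest versions. As an alternative sketch that avoids touching internals, the hrefs can be collected column by column and assembled into a data.frame. The inline table and its id "films" are hypothetical; the selectors would need adapting to the real page.

```r
library(rvest)

# Hypothetical two-row excerpt shaped like the Wikipedia table above
page <- read_html('
<table id="films">
  <tr><th>Film</th><th>Producer(s)</th></tr>
  <tr><td><a href="/wiki/The_Great_Ziegfeld">The Great Ziegfeld</a></td>
      <td><a href="/wiki/Hunt_Stromberg">Hunt Stromberg</a></td></tr>
  <tr><td><a href="/wiki/Anthony_Adverse">Anthony Adverse</a></td>
      <td>Henry Blanke</td></tr>
</table>')

tbl <- html_node(page, "#films")

# One nodeset per column; header cells are <th>, so they are skipped
film_cells     <- html_nodes(tbl, "tr td:nth-child(1)")
producer_cells <- html_nodes(tbl, "tr td:nth-child(2)")

# html_node() keeps one result per cell, so cells without a link yield NA
links <- data.frame(
  film          = html_text(film_cells),
  film_href     = html_attr(html_node(film_cells, "a"), "href"),
  producer_href = html_attr(html_node(producer_cells, "a"), "href"),
  stringsAsFactors = FALSE
)

print(links)
```

Using html_node() rather than html_nodes() on each cell is what keeps the rows aligned: it returns exactly one entry per cell, taking the first link when a cell has several.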

Source

2017-02-08 22:09:59 Chrisss
