How do I extract a specific element from a Wikipedia table in R using rvest?

For example, for NYC I want to extract the website from the infobox (the table on the right).

I am using this:

library(rvest)

url <- "https://en.wikipedia.org/wiki/New_York_City"
page <- read_html(url)

links <- page %>%
    html_nodes("table tr a")

However, this is wrong: the selector matches links in every table on the page, not only the infobox.

2017-10-21 Petr

Done. Sorry. – Petr

Consider posting an answer or deleting the question. – hrbrmstr

Using XPath, you can first select the infobox by its class name, infobox, and then get all the links inside it by their tag name, a.

library("rvest") 

url <- "https://en.wikipedia.org/wiki/New_York_City" 
infobox <- url %>% 
    read_html() %>% 
    html_nodes(xpath='//table[contains(@class, "infobox")]//a') 

print(infobox)

Output

{xml_nodeset (81)} 
[1] <a href="/wiki/City_(New_York)" class="mw-redirect" title="City (New York)">City</a> 
[2] <a href="/wiki/File:NYC_Montage_2014_4_-_Jleon.jpg" class="image" title="Clockwise, from top: Midtow ... 
[3] <a href="/wiki/Midtown_Manhattan" title="Midtown Manhattan">Midtown Manhattan</a> 
[4] <a href="/wiki/Times_Square" title="Times Square">Times Square</a> 
[5] <a href="/wiki/Unisphere" title="Unisphere">Unisphere</a> 
...
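
To narrow this down to the website link only, one option is to select the infobox row whose header cell mentions "Website" and read the href of the link in that row. The following is only a minimal sketch: it assumes the infobox lays out each entry as a th/td row and that the header text contains "Website", which may not hold for every article. The inline HTML and the example.org address are hypothetical stand-ins so the snippet runs without a network connection; for the real page, pass the Wikipedia URL to read_html() instead.

```r
library(rvest)

# Hypothetical stand-in for the page: a tiny infobox with a "Website" row.
# The URL below is a placeholder, not the real value from Wikipedia.
html <- '<table class="infobox geography vcard">
  <tr><th>Country</th><td><a href="/wiki/United_States">United States</a></td></tr>
  <tr><th>Website</th><td><a href="https://example.org/">example.org</a></td></tr>
</table>'

page <- read_html(html)

# Keep only the infobox row whose header cell contains "Website",
# then take the links inside that row's data cell.
website_links <- page %>%
    html_nodes(xpath = '//table[contains(@class, "infobox")]//tr[th[contains(., "Website")]]/td//a')

# html_attr() returns the link target, html_text() the visible label.
website_url <- html_attr(website_links, "href")
website_label <- html_text(website_links)

print(website_url)
```

If the row is missing, website_links is an empty node set and website_url has length zero, so it is worth checking length(website_url) before using the result.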

2017-10-21 11:46:21
