R: Using the rvest package instead of the XML package to get links from a URL

I use the XML package to get the links from this url.

library(XML)

# Parse the HTML at the URL (v1URL holds the page address)
v1WebParse <- htmlParse(v1URL)
# Read the links and get the company quote pages from the href attributes
t1Links <- data.frame(xpathSApply(v1WebParse, '//a', xmlGetAttr, 'href'))

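Although this method is very efficient, I have also used rvest, and it seems to parse a web page faster than XML does. I tried html_nodes and html_attrs, but I could not make it work.

Source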

2014-12-04 capm

'rvest' uses the 'XML' package for node extraction. It really shouldn't be faster. – hrbrmstr 2014-12-04 17:18:41

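Despite my comment, here is how you can do it with rvest. Note that we need to read the page in with htmlParse first, because the site sets the content type of that file to text/plain, and that throws rvest into a tizzy.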

library(rvest) 
library(XML) 

pg <- htmlParse("http://www.bvl.com.pe/includes/empresas_todas.dat") 
pg %>% html_nodes("a") %>% html_attr("href") 

## [1] "/inf_corporativa71050_JAIME1CP1A.html" "/inf_corporativa10400_INTEGRC1.html" 
## [3] "/inf_corporativa66100_ACESEGC1.html" "/inf_corporativa71300_ADCOMEC1.html" 
## ... 
## [273] "/inf_corporativa64801_VOLCAAC1.html" "/inf_corporativa58501_YURABC11.html" 
## [275] "/inf_corporativa98959_ZNC.html"

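This further illustrates that rvest (in the version current at the time of this answer) is built on top of the XML package.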

UPDATE

rvest::read_html() can now handle this directly:

pg <- read_html("http://www.bvl.com.pe/includes/empresas_todas.dat")
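The question mentions trying html_nodes and html_attrs without success. One possible stumbling block (an assumption, since the failing code is not shown) is the difference between html_attr(), which extracts a single named attribute, and html_attrs(), which returns every attribute of every node as a list. A minimal sketch on a made-up inline page, so that it runs without network access:

```r
library(rvest)

# Hypothetical inline page standing in for the real URL
page <- read_html('<html><body>
  <a href="/first.html" class="x">First</a>
  <a href="/second.html">Second</a>
  <a>No link here</a>
</body></html>')

anchors <- html_nodes(page, "a")

# html_attr(): one named attribute per node, NA where the attribute is missing
hrefs <- html_attr(anchors, "href")

# html_attrs(): all attributes of each node, returned as a list
all_attrs <- html_attrs(anchors)

# Keep only the anchors that actually carry an href
hrefs <- hrefs[!is.na(hrefs)]
print(hrefs)
```

For a plain list of links, html_attr("href") is usually the more convenient of the two, because the result is already a character vector.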

Source

2014-12-04 17:25:30 hrbrmstr

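You're right, 'rvest' uses 'XML' for node extraction. I will discuss in chat the timing differences for the sites where I use the packages. Thanks for the reply. – capm 2014-12-30 06:02:15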
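The speed question raised in these comments can be checked locally rather than argued. The sketch below is only an illustration under stated assumptions: the test page is generated in memory (it is not the site from the question), the helper names links_with_xml and links_with_rvest are made up, and newer rvest releases are built on xml2 rather than XML, so timings may not match what was observed in 2014.

```r
library(XML)
library(rvest)

# Hypothetical test page: 2000 generated anchors, no network access needed
html_text <- paste0(
  "<html><body>",
  paste0('<a href="/page', seq_len(2000), '.html">x</a>', collapse = ""),
  "</body></html>"
)

# XML package: parse the text, then pull href from every <a> via XPath
links_with_xml <- function(txt) {
  doc <- htmlParse(txt, asText = TRUE)
  as.character(xpathSApply(doc, "//a", xmlGetAttr, "href"))
}

# rvest: parse the text, select <a> nodes, extract href
links_with_rvest <- function(txt) {
  txt %>% read_html() %>% html_nodes("a") %>% html_attr("href")
}

# Time 20 repetitions of each approach on the same input
print(system.time(for (i in 1:20) links_with_xml(html_text)))
print(system.time(for (i in 1:20) links_with_rvest(html_text)))
```

Because both functions parse the same in-memory string, this measures parsing and extraction only; download time, which may dominate for a real site, is deliberately left out.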

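I know you are looking for an rvest answer, but here is another way, using the XML package, that might be more efficient than what you are doing.

Have you seen the getLinks() function in example(htmlParse)? I use this modified version from the examples to get the href links. It is a handler function, so the values are collected while the data is being read, which saves memory and improves efficiency.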

library(XML)

links <- function(URL)
{
    # Closure that accumulates href values as <a> nodes are parsed
    getLinks <- function() {
        links <- character()
        list(a = function(node, ...) {
                 links <<- c(links, xmlGetAttr(node, "href"))
                 node
             },
             links = function() links)
    }
    h1 <- getLinks()
    htmlTreeParse(URL, handlers = h1)
    h1$links()
}

links("http://www.bvl.com.pe/includes/empresas_todas.dat") 
# [1] "/inf_corporativa71050_JAIME1CP1A.html" 
# [2] "/inf_corporativa10400_INTEGRC1.html" 
# [3] "/inf_corporativa66100_ACESEGC1.html" 
# [4] "/inf_corporativa71300_ADCOMEC1.html" 
# [5] "/inf_corporativa10250_HABITAC1.html" 
# [6] "/inf_corporativa77900_PARAMOC1.html" 
# [7] "/inf_corporativa77935_PUCALAC1.html" 
# [8] "/inf_corporativa77600_LAREDOC1.html" 
# [9] "/inf_corporativa21000_AIBC1.html"  
# ... 
# ...

Source

2014-12-04 15:29:59

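Great help. I had not checked the examples in 'htmlParse', but I modified my code following your suggestion. In this case 'XML' works fine, but getting historical prices from this [web](http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=20100101&fec_fin=20141130&nemonico=SIDERC1) takes more time than 'rvest' does. – capm 2014-12-04 16:22:05

Prices? Your question says you are trying to get links – 2014-12-22 05:58:25

Yes, from [this web page](http://www.bvl.com.pe/includes/empresas_todas.dat) I am trying to get all the links on the site, while on [this site](http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=20100101&fec_fin=20141130&nemonico=SIDERC1) I am trying to parse the table that contains the historical prices of the SIDERC1 quotes. I used 'XML' on both sites, but I could only use 'rvest' on the latter. – capm 2014-12-30 05:23:55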

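# Option 1 
library(RCurl) 
library(XML) # getHTMLLinks() is exported by the XML package
getHTMLLinks('http://www.bvl.com.pe/includes/empresas_todas.dat') 

# Option 2 
library(rvest) 
library(pipeR) # %>>% will be faster than %>% 
# The original answer called html(), which rvest later deprecated in favour of read_html()
read_html("http://www.bvl.com.pe/includes/empresas_todas.dat") %>>% html_nodes("a") %>>% html_attr("href")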

Source

2015-01-29 19:26:23

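Option 1 no longer seems to work with the current version of RCurl. – 2017-03-27 17:17:06

Richard's answer works for HTTP pages, but not for the HTTPS pages I needed (Wikipedia). I substituted RCurl's getURL function, as follows: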

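library(RCurl) 
library(XML) # provides htmlTreeParse() and xmlGetAttr()

links <- function(URL) 
{ 
    # Closure that accumulates href values as <a> nodes are parsed
    getLinks <- function() { 
        links <- character() 
        list(a = function(node, ...) { 
                 links <<- c(links, xmlGetAttr(node, "href")) 
                 node 
             }, 
             links = function() links) 
    } 
    h1 <- getLinks() 
    # Download the page as text with RCurl (handles HTTPS), then parse that text
    xData <- getURL(URL) 
    htmlTreeParse(xData, handlers = h1, asText = TRUE) 
    h1$links() 
}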
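The idea in this answer is to separate downloading from parsing: once the page is available as a character string, the same handler closure can collect the links from it. A minimal offline sketch of that second step follows. The function name collect_links and the inline HTML are made up for illustration, and asText = TRUE is passed so that the string is treated as markup rather than as a file name or URL.

```r
library(XML)

# Hypothetical helper: collect href values from HTML that is already in memory
collect_links <- function(html_text) {
  links <- character()
  handlers <- list(a = function(node, ...) {
    # xmlGetAttr() returns NULL when href is absent, so c() adds nothing
    links <<- c(links, xmlGetAttr(node, "href"))
    node
  })
  # asText = TRUE: parse the string itself instead of treating it as a path
  htmlTreeParse(html_text, handlers = handlers, asText = TRUE)
  links
}

# Made-up page standing in for text returned by a download function
page_text <- '<html><body>
  <a href="/a.html">A</a>
  <a name="anchor-only">no href</a>
  <a href="/b.html">B</a>
</body></html>'

result <- collect_links(page_text)
print(result)
```

Anchors without an href attribute are skipped silently, which is why the result holds only two entries for the three anchors in the sample page.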

Source

2016-04-26 20:43:58 bshor
