Reading XML data from HTML source into R

I want to import data from a given web page into R, for example this one.

In the page source (but not on the rendered page), the data I want is stored on a single line of JavaScript code that starts like this:

chart_Line1.setDataXML("<graph rotateNames (stuff omitted) > 
<set value='699.99' name='16.02.2013' /> 
<set value='731.57' name='18.02.2013' /> 
<set value='more values' name='more dates' /> 
... 
<trendLines> (now a different command starts, stuff omitted) 
</trendLines></graph>")

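(Note that I added line breaks here for readability; in the original file the data is on a single line. Only the line that begins with chart_Line1.setDataXML needs to be imported. If you want to look for yourself, it is line 56 of the source.)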

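I can read the entire HTML file into a string using scan("URLofFile", what="raw"), but how do I extract the data from it?

Can I specify the data format with what="...", keeping in mind that there are no line breaks separating the data items, but there are several line breaks in the irrelevant prefix and suffix?

Is this something that can be done in a nice way with R tools, or would you suggest doing this data acquisition with a different script?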

Source

2014-02-08 Roland

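With some trial and error, I was able to find the exact line that contains the data. I read the whole HTML file and then discard all the other lines.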
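As a hedged alternative to picking the line by its position, the line could be located by its content, since the question states that it begins with chart_Line1.setDataXML. The following is a minimal sketch in base R; the `page` vector is a hypothetical stand-in for the downloaded lines, not the real page.

```r
# Hypothetical stand-in for the lines of the downloaded page; in practice
# these would come from readLines(theurl) or scan(..., sep = "\n").
page <- c(
  "<html>",
  "<script>",
  "chart_Line1.setDataXML(\"<graph><set value='699.99' name='16.02.2013' /></graph>\")",
  "</script>",
  "</html>"
)

# Find the line containing the call, wherever it appears in the file.
# fixed = TRUE treats the search text literally instead of as a regex.
hit <- grep("chart_Line1.setDataXML", page, fixed = TRUE)

# Fail loudly if the page layout changed and the line is missing or duplicated.
stopifnot(length(hit) == 1)

sec <- page[hit]
```

Selecting by content should keep working if the site adds or removes lines above the chart, which a fixed index would not.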

library(zoo)
library(stringr)
# get the HTML source, discard all lines but the interesting one
theurl <- "https://www.magickartenmarkt.de/Black_Lotus_Unlimited.c1p5093.prod"
sec <- scan(file = theurl, what = "character", sep = "\n")
sec <- sec[45]
# extract all strings of the form "value='X'", where X is a 1 to 3 digit number, a decimal point, and 2 decimal places
values <- str_extract_all(sec, "value='[0-9]{1,3}\\.[0-9]{2}'")
# strip everything except digits and the decimal point
values <- str_replace_all(unlist(values), "[^0-9.]", "")
# get all dates of the form "name='DD.MM.YYYY'"
dates <- str_extract_all(sec, "name='[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}'")
# strip everything except digits and the dots
dates <- str_replace_all(unlist(dates), "[^0-9.]", "")
# convert the dates to Date objects
dates <- as.Date(dates, format = "%d.%m.%Y")
# combine values and dates into an ordered time series, converting the values from character to numeric first
MyZoo <- zoo(as.numeric(values), dates)
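The code above extracts values and dates in two separate passes, so the two vectors could end up with different lengths if one pattern misses an element (for instance a price with more than three digits before the decimal point). A possible variation, sketched here in base R under the assumption that every data point looks like the `<set value='...' name='...' />` elements quoted in the question, captures both attributes with a single pattern. The `sec` string below is a hypothetical sample built from that excerpt, not the real page.

```r
# Hypothetical single line, shaped like the excerpt in the question.
sec <- paste0(
  "chart_Line1.setDataXML(\"<graph>",
  "<set value='699.99' name='16.02.2013' />",
  "<set value='731.57' name='18.02.2013' />",
  "</graph>\")"
)

# One pattern with two capture groups, so each value stays with its date.
# [0-9]+ places no limit on the number of digits before the decimal point.
pattern <- "<set value='([0-9]+\\.[0-9]{2})' name='([0-9]{2}\\.[0-9]{2}\\.[0-9]{4})'"

# gregexpr() locates every matching <set ...> element in the line;
# regmatches() returns the matched text.
elements <- regmatches(sec, gregexpr(pattern, sec))[[1]]

# regexec() on each element recovers the capture groups
# (column 1 = full match, column 2 = value, column 3 = date).
parts <- do.call(rbind, regmatches(elements, regexec(pattern, elements)))

prices <- data.frame(
  date  = as.Date(parts[, 3], format = "%d.%m.%Y"),
  value = as.numeric(parts[, 2])
)
```

The resulting data frame can be handed to zoo in the same way as before, for example with zoo(prices$value, prices$date).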

Source

2014-02-09 21:56:59 Roland
