html
  • xml
  • r
  • 2014-02-08

    Reading XML data from HTML source into R

    I want to import data from a given web page into R, for example this one.

    In the page source (but not on the rendered page), the data I want is stored in a single line of JavaScript code that begins like this:

    chart_Line1.setDataXML("<graph rotateNames (stuff omitted) > 
    <set value='699.99' name='16.02.2013' /> 
    <set value='731.57' name='18.02.2013' /> 
    <set value='more values' name='more dates' /> 
    ... 
    <trendLines> (now a different command starts, stuff omitted) 
    </trendLines></graph>") 
    

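    (Note that I added line breaks for readability; in the original file the data is all on one line. Only the line starting with chart_Line1.setDataXML needs to be imported. If you want to look for yourself, it is line 56 of the source.)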

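    I can read the whole html file into a string with scan("URLofFile", what="raw"), but how do I extract the data from it?

    Can I specify the data format with what="...", bearing in mind that no line breaks separate the data items, while the irrelevant prefix and suffix do contain several line breaks?

    Is this something that can be done nicely with R tools, or would you suggest doing this data acquisition with a different script?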

    Answer

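    With some trial and error, I was able to find the exact line that contains the data. I read the whole HTML file and then discard all the other lines.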

    library(zoo)
    library(stringr)
    # read the HTML source line by line and keep only the line that holds the data
    theurl <- "https://www.magickartenmarkt.de/Black_Lotus_Unlimited.c1p5093.prod"
    sec <- scan(file = theurl, what = "character", sep = "\n")
    # index 45 was found by trial and error; scan() skips blank lines by default,
    # so it need not match the line number shown in a source viewer
    sec <- sec[45]
    # extract all strings of the form "value='X'", where X has 1 to 3 digits, a decimal point and 2 decimal places
    values <- str_extract_all(sec, "value='[0-9]{1,3}\\.[0-9]{2}'")
    # strip everything except digits and the decimal point
    values <- str_replace_all(unlist(values), "[^0-9.]", "")
    # get all dates of the form "name='DD.MM.YYYY'"
    dates <- str_extract_all(sec, "name='[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}'")
    # strip everything except digits and the dots
    dates <- str_replace_all(unlist(dates), "[^0-9.]", "")
    # convert the dates to Date objects
    dates <- as.Date(dates, format = "%d.%m.%Y")
    # build an ordered time series, converting the values from character to numeric first
    MyZoo <- zoo(as.numeric(values), dates)
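
    A fixed line index breaks as soon as the page layout changes. As a rough sketch of an alternative, the line could be located by its content, since the question says it starts with chart_Line1.setDataXML. The version below uses base R only and runs on a small made-up sample (page_lines) in place of the downloaded page; extract_group is a hypothetical helper name, and the value pattern here also accepts more than three digits before the decimal point.

```r
# Made-up stand-in for the result of scan(theurl, what = "character", sep = "\n")
page_lines <- c(
  "<html>",
  "<script type='text/javascript'>",
  "chart_Line1.setDataXML(\"<graph><set value='699.99' name='16.02.2013' /><set value='731.57' name='18.02.2013' /><trendLines></trendLines></graph>\")",
  "</script>",
  "</html>"
)

# Locate the data line by its content instead of by a fixed position
data_line <- page_lines[grepl("chart_Line1.setDataXML", page_lines, fixed = TRUE)][1]

# Hypothetical helper: return the first capture group of every match of `pattern`
extract_group <- function(pattern, text) {
  matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  sub(pattern, "\\1", matches, perl = TRUE)
}

# Pull out the attribute contents and convert them to numbers and dates
values <- as.numeric(extract_group("value='([0-9]+\\.[0-9]{2})'", data_line))
dates  <- as.Date(extract_group("name='([0-9]{2}\\.[0-9]{2}\\.[0-9]{4})'", data_line),
                  format = "%d.%m.%Y")

prices <- data.frame(date = dates, value = values)
print(prices)
```

    The resulting data frame can be passed to zoo(prices$value, prices$date) to get the same kind of series as above.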
    