2015-11-02 72 views

I'm trying to use rvest to web-scrape the past 3 months of NASDAQ closing prices so I can play around with the data. Scraping Yahoo Finance in R (with rvest)

The problem is that I can't seem to find the right XPath to return the table. I've tried many, using Chrome's "Inspect Element" as well as the "SelectorGadget" plugin for Chrome.

It seems most people have done this with Python, but I'm more comfortable in R, especially web scraping with rvest, so I hope I'm not alone!

I've posted my code below. I believe the problem lies in identifying the XPath. Here is a sample page... http://finance.yahoo.com/q/hp?s=CSV

Once I get one page working, I hope to put it in a loop, which appears below my problem code....

Thanks!

library("rvest") 
library("data.table") 
library("xlsx") 


#Problem Code 

company <- 'CSV' 
url <- paste0("http://finance.yahoo.com/q/hp?s=", company) 
page <- read_html(url)    # read_html() replaces the deprecated html() 
select_table <- '//table' #this is the line I think is incorrect 
fnames <- html_nodes(page, xpath = select_table) %>% html_table(fill = TRUE) 
STOCK <- fnames[[1]] 
STOCKS <- rbind(STOCK, STOCKS) # STOCKS must be initialized first (see loop below) 



#--------------------------------------------------------------------- 
#Loop for use later 

companylist <- read.csv('companylist.csv') #a list of all company tickers in the NASDAQ 
STOCKS <- data.frame(Date=character(), Open=character(), High=character(), Low=character(), 
                     Close=character(), Volume=character(), AdjClose=character()) 
for (i in 1:nrow(companylist)) { 
    company <- companylist[i, 1] 
    url <- paste0("http://finance.yahoo.com/q/hp?s=", company) 
    page <- read_html(url)    # read_html() replaces the deprecated html() 
    select_table <- '//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[4]' 
    fnames <- html_nodes(page, xpath = select_table) %>% html_table(fill = TRUE) 
    STOCK <- fnames[[1]] 
    STOCKS <- rbind(STOCK, STOCKS) 
} 
View(STOCKS) 

If your goal is just to get the prices, take a look at the 'quantmod' package, which lets you request a lot of data. – etienne


@etienne That is exactly what I was looking for. Wish I had known about that package before! Thanks. – bpheazye
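For reference, the `quantmod` route suggested in the comment above might look like the following (a minimal, untested sketch assuming a working network connection; this code is not from the original thread):

```r
# Minimal quantmod sketch: download NASDAQ price history without scraping HTML.
library(quantmod)

# auto.assign = FALSE returns the data instead of creating a variable named "CSV"
csv_prices <- getSymbols("CSV", src = "yahoo", auto.assign = FALSE)
tail(Cl(csv_prices))  # Cl() extracts the closing-price column from the xts object
```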

Answers


Are you trying to grab stock prices?

https://gist.github.com/jaehyeon-kim/356cf62b61248193db25#file-downloadstockdata

# assumes codes are known beforehand 
codes <- c("ABT", "ABBV", "ACE", "ACN", "ACT", "ADBE", "ADT", "AES", "AET", "AFL", "AMG", "A", "GAS", "APD", "ARG", "AKAM", "AA") 
urls <- paste0("http://www.google.com/finance/historical?q=NASDAQ:", 
codes,"&output=csv") 
paths <- paste0(codes, ".csv") 
missing <- !(paths %in% dir(".")) # compare against bare file names, not full paths 
missing 

# simple error handling in case a file doesn't exist 
downloadFile <- function(url, path, ...) { 
  # remove file if it exists already 
  if(file.exists(path)) file.remove(path) 
  # download file 
  tryCatch( 
    download.file(url, path, ...), error = function(c) { 
      # remove file if error 
      if(file.exists(path)) file.remove(path) 
      # create error message 
      c$message <- paste(substr(path, 1, 4), "failed") 
      message(c$message) 
    } 
  ) 
} 
# wrapper of mapply 
Map(downloadFile, urls[missing], paths[missing]) 

You can try this . . .

{% highlight r %} 
library(knitr) 
library(lubridate) 
library(stringr) 
library(plyr) 
library(dplyr) 
{% endhighlight %} 

The script begins by creating a folder to save the data files.


{% highlight r %} 
# create data folder 
dataDir <- paste0("data","_","2014-11-20-Download-Stock-Data-1") 
if(file.exists(dataDir)) { 
     unlink(dataDir, recursive = TRUE) 
     dir.create(dataDir) 
} else { 
     dir.create(dataDir) 
} 
{% endhighlight %} 

After creating the URLs and file paths, the files are downloaded using the `Map` function - it is a wrapper of `mapply`. Note that, in case the function breaks with an error (e.g. when a file doesn't exist), `download.file` is wrapped in another function that includes an error handler (`tryCatch`). 


{% highlight r %} 
# assumes codes are known beforehand 
codes <- c("MSFT", "TCHC") # codes <- c("MSFT", "1234") for testing 
urls <- paste0("http://www.google.com/finance/historical?q=NASDAQ:", 
       codes,"&output=csv") 
paths <- paste0(dataDir,"/",codes,".csv") # back slash on windows (\\) 

# simple error handling in case a file doesn't exist 
downloadFile <- function(url, path, ...) { 
     # remove file if exists already 
     if(file.exists(path)) file.remove(path) 
     # download file 
     tryCatch(   
      download.file(url, path, ...), error = function(c) { 
        # remove file if error 
        if(file.exists(path)) file.remove(path) 
        # create error message 
        c$message <- paste(substr(basename(path), 1, 4), "failed") # basename() drops the folder prefix 
        message(c$message) 
      } 
    ) 
} 
# wrapper of mapply 
Map(downloadFile, urls, paths) 
{% endhighlight %} 


Finally, the files are read back using `llply` and combined using `rbind_all`. Note that, as the merged data has multiple stocks' records, a `Code` column is created. 



{% highlight r %} 
# read all csv files and merge 
files <- dir(dataDir, full.name = TRUE) 
dataList <- llply(files, function(file){ 
     data <- read.csv(file, stringsAsFactors = FALSE) 
     # get code from file path 
     pattern <- "/[A-Z][A-Z][A-Z][A-Z]" 
     code <- substr(str_extract(file, pattern), 2, nchar(str_extract(file, pattern))) 
     # first column's name is funny 
     names(data) <- c("Date","Open","High","Low","Close","Volume") 
     data$Date <- dmy(data$Date) 
     data$Open <- as.numeric(data$Open) 
     data$High <- as.numeric(data$High) 
     data$Low <- as.numeric(data$Low) 
     data$Close <- as.numeric(data$Close) 
     data$Volume <- as.integer(data$Volume) 
     data$Code <- code 
     data 
}, .progress = "text") 

data <- rbind_all(dataList) 
{% endhighlight %} 

Any idea how to add to this code so it selects a specific range of dates? The site has the ability to select dates, but I don't know how to change that through the code. Thanks for your help! – bpheazye
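One possible direction (an untested sketch, not an answer from the thread): the `quantmod` package mentioned in the comments takes a date range directly, via the `from` and `to` arguments of `getSymbols`, so no URL manipulation is needed:

```r
# Sketch: fetching a specific date range with quantmod (requires network access).
library(quantmod)

prices <- getSymbols("CSV", src = "yahoo",
                     from = "2015-08-01",   # start of the desired range
                     to   = "2015-11-01",   # end of the desired range
                     auto.assign = FALSE)
head(prices)  # date-indexed Open/High/Low/Close/Volume/Adjusted columns
```

For the Google Finance CSV approach used in the answer, the equivalent would be extra query parameters on the download URL; the exact parameter names would need to be checked against the site.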