Recursive ftp download, then extract gz files

I have a multi-step file download process that I would like to do within R. I have the middle step, but not the first and third...

# STEP 1 Recursively find all the files at an ftp site 
# ftp://prism.oregonstate.edu//pub/prism/pacisl/grids 
all_paths <- #### a recursive listing of the ftp path contents??? #### 

# STEP 2 Choose all the ones whose filename starts with "hi" 
all_files <- sapply(sapply(strsplit(all_paths, "/"), rev), "[", 1) 
hawaii_log <- substr(all_files, 1, 2) == "hi" 
hi_paths <- all_paths[hawaii_log] 
hi_files <- all_files[hawaii_log] 

# STEP 3 Download & extract from gz format into a single directory 
mapply(download.file, url = hi_paths, destfile = hi_files) 
## and now how to extract from gz format?

Source

2011-03-08 J. Win.

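Does it have to be R? R works well with HTTP, but it is not as good with FTP. A more general-purpose language, such as Python, would be better suited to this kind of problem. – chmullig 2011-03-08 02:35:41

Yes, I am trying to avoid adding any external tools... For now I have a workaround that calls command-line wget from R, but I was hoping to be able to hand this to someone as a standalone R script. – 2011-03-08 02:52:29

It is simple enough to copy and paste the file names as text and use download.file in a loop - that way it is hard-coded for your users, but still standalone (or you could ftp into the site and mget ...) – mdsumner 2011-03-08 03:02:09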

If I start R with the internet2 option, I can read the contents of the ftp page, i.e.

C:\Program Files\R\R-2.12\bin\x64\Rgui.exe --internet2

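(The shortcut that starts R on Windows can be modified to add the internet2 argument: right-click / Properties / Target, or just run the line above at the command line. On GNU/Linux the equivalent is obvious.)

The text of the page can then be read like this: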

download.file("ftp://prism.oregonstate.edu//pub/prism/pacisl/grids", "f.txt") 
txt <- readLines("f.txt")

It takes a bit more work to parse out the directory listing and then read the files underneath it recursively.

## (something like) 
dirlines <- txt[grep("Directory <A HREF=", txt)] 

## split and extract text after "grids/" 
split1 <- sapply(strsplit(dirlines, "grids/"), function(x) rev(x)[1]) 

## split and extract remaining text after "/" 
sapply(strsplit(split1, "/"), function(x) x[1]) 
## [1] "dem" "ppt" "tdmean" "tmax" "tmin"

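This is about where it stops looking very attractive and starts to get laborious, so I would really recommend a different option. There is no doubt a better solution, perhaps using RCurl, and I would recommend that you and your users learn to use an ftp client. Command-line ftp, anonymous login and mget all work pretty easily.

The internet2 option is explained for a similar ftp site here: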

https://stat.ethz.ch/pipermail/r-help/2009-January/184647.html

Source

2011-03-08 04:12:25 mdsumner

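The first part is handy to know. Besides the startup option, there is also 'setInternet2(TRUE)'. I think a recursive function over the subdirectories is the way to go from there, but at least now I can get the text from the page. – 2011-03-08 19:14:33

For part 1, RCurl may help. The getURL function retrieves one or more URLs; dirlistonly lists the contents of a directory without retrieving the files. The rest of the function builds the URLs for the next level down.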

library(RCurl) 
getContent <- function(dirs) { 
    urls <- paste(dirs, "/", sep="") 
    fls <- strsplit(getURL(urls, dirlistonly=TRUE), "\r?\n") 
    ok <- sapply(fls, length) > 0 
    unlist(mapply(paste, urls[ok], fls[ok], sep="", SIMPLIFY=FALSE), 
      use.names=FALSE) 
}

So, starting with

dirs <- "ftp://prism.oregonstate.edu//pub/prism/pacisl/grids"

we can call this function and look for things that look like directories, continuing until done

fls <- character() 
while (length(dirs)) { 
    message(length(dirs)) 
    urls <- getContent(dirs) 
    isgz <- grepl("gz$", urls) 
    fls <- append(fls, urls[isgz]) 
    dirs <- urls[!isgz] 
}
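
To see how this loop behaves without touching the network, here is a small sketch that runs the same kind of level-by-level walk over an in-memory stand-in for the ftp site. The `fake_site` list and the `list_dir` function are hypothetical and only play the role of `getContent`; the file names are made up. Entries ending in `.gz` are collected as files and everything else is treated as a directory to descend into, in the same spirit as the loop above.

```r
# Hypothetical stand-in for an ftp server: directory -> entries it contains
fake_site <- list(
    "root"        = c("dem", "ppt"),
    "root/dem"    = c("hi_dem.gz"),
    "root/ppt"    = c("us", "hi_ppt_01.gz"),
    "root/ppt/us" = c("us_ppt_01.gz")
)

# Plays the role of getContent(): full paths one level below each directory
list_dir <- function(dirs) {
    unlist(lapply(dirs, function(d) {
        entries <- fake_site[[d]]
        if (length(entries)) paste(d, entries, sep = "/") else character()
    }), use.names = FALSE)
}

dirs <- "root"
fls <- character()
while (length(dirs)) {
    found <- list_dir(dirs)
    isgz <- grepl("\\.gz$", found)
    fls <- c(fls, found[isgz])   # keep the files
    dirs <- found[!isgz]         # descend into everything else
}

# Step 2 of the question: keep only files whose name starts with "hi"
hi_fls <- fls[grepl("^hi", basename(fls))]
print(fls)
print(hi_fls)
```

Note that any entry that is neither a `.gz` file nor a directory would be treated as a directory here, so a real listing with other file types would need a stricter test.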

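We could then use getURL again, this time on fls (or on the elements of fls, in a loop), to retrieve the actual files. Or, perhaps better, open a URL connection and use gzcon to decompress and process the file on the fly. Along the lines of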

con <- gzcon(url(fls[1], "r")) 
meta <- readLines(con, 7) 
data <- scan(con, integer())
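
If the files have already been saved with download.file, as in step 3 of the question, they can also be unpacked on disk with base R alone. The following is a minimal sketch, not something taken from the answers here: the helper name `gunzip_file` and the chunk size are assumptions made for illustration. It reads the compressed file through `gzfile()` in binary mode and writes the decompressed bytes to a new file.

```r
# Hypothetical helper (not from the thread): decompress one .gz file with base R
gunzip_file <- function(gz_path, out_path = sub("\\.gz$", "", gz_path),
                        chunk = 1e6) {
    stopifnot(out_path != gz_path)   # refuse to overwrite the input
    inn <- gzfile(gz_path, "rb")     # decompresses while reading
    on.exit(close(inn))
    out <- file(out_path, "wb")
    on.exit(close(out), add = TRUE)
    repeat {
        bytes <- readBin(inn, what = raw(), n = chunk)
        if (length(bytes) == 0) break
        writeBin(bytes, out)
    }
    invisible(out_path)
}

# Self-contained demonstration on a temporary file with made-up contents
tmp_gz <- tempfile(fileext = ".txt.gz")
con <- gzfile(tmp_gz, "w")
writeLines(c("ncols 3", "nrows 2"), con)
close(con)

unpacked <- gunzip_file(tmp_gz)
print(readLines(unpacked))
```

Applied to the question's variables, this would be something like `invisible(lapply(hi_files, gunzip_file))` after the download step.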

Source

2011-03-08 17:59:15

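This does not work for me: I get a '1', a '5' and then 'Error in dots[[1L]][[1L]] : subscript out of bounds'. Stepping through it, the result of the first 'fls' assignment does not look like a valid directory; there is a '\r' at the end of the urls. Interestingly, 'dirlistonly' does not appear on the 'getURL()' help page. – 2011-03-08 19:27:13

I guess on Windows the strsplit pattern should be "\r\n*". RCurl relies on a system library, and the specific options available depend on the version of the library that is installed. See 'listCurlOptions()'; on Linux/MacOS you can use 'man curl_easy_setopt'; not sure about Windows. – 2011-03-08 20:13:19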
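
The line-ending issue raised in these comments can be checked without an ftp server. The sketch below uses a made-up listing string (an assumption for illustration, not real server output) whose entries are separated by "\r\n", and compares two split patterns.

```r
listing <- "dem\r\nppt\r\ntmax\r\n"   # made-up listing text

# Splitting on "\n" alone leaves a trailing carriage return on each entry
naive <- strsplit(listing, "\n")[[1]]

# Allowing an optional "\r" before "\n" removes it
clean <- strsplit(listing, "\r?\n")[[1]]

print(naive)
print(clean)
```

A stray "\r" at the end of each entry ends up inside the URLs built for the next level, which would be consistent with the failure described above.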

library(RCurl)

# Placeholders: replace with real locations
# (the code below assumes each one ends in "/")
ftp.root <- "<where the files are>"
dropbox.root <- "<where to put the files>"

#=====================================================================
# Function that downloads files from URL
#=====================================================================

fdownload <- function(sourcelink) {

    # local directory that mirrors the remote path below ftp.root
    targetlink <- paste(dropbox.root, substr(sourcelink, nchar(ftp.root)+1,
                        nchar(sourcelink)), sep = '')
    dir.create(targetlink, recursive = TRUE, showWarnings = FALSE)

    # list of contents ("\r?\n" also copes with CRLF line endings)
    filenames <- getURL(sourcelink, ftp.use.epsv = FALSE, dirlistonly = TRUE)
    filenames <- strsplit(filenames, "\r?\n")
    filenames <- unlist(filenames)

    # names containing a dot are treated as files, everything else as folders
    files <- filenames[grep('\\.', filenames)]
    dirs <- setdiff(filenames, files)
    if (length(dirs) != 0) {
        dirs <- paste(sourcelink, dirs, '/', sep = '')
    }

    # files
    for (filename in files) {
        sourcefile <- paste(sourcelink, filename, sep = '')
        targetfile <- paste(targetlink, filename, sep = '')
        download.file(sourcefile, targetfile, mode = "wb")
    }

    # subfolders
    for (dirname in dirs) {
        fdownload(dirname)
    }
}

Source

2012-06-15 18:27:51

To call the function: fdownload(ftp.root) – 2012-06-15 18:30:12
