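How can I vectorize this R code using plyr, apply, or similar?
I wrote the following R code to identify duplicate files in a directory. How can I vectorize the for loop using the plyr package (or similar)? I would like a more idiomatic R solution than the one I came up with.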
library("digest") # to compute the MD5 digest
test_dir = "/Users/user/Dropbox/kaggle/r_projects/test_photo"
filelist <- dir(test_dir, pattern = "JPG|AVI", recursive = TRUE,
                all.files = TRUE, full.names = TRUE)
fl = list() # create an empty list to hold md5's and filenames
for (itm in filelist) {
file_digest = digest(itm, file=TRUE, algo="md5")
fl[[file_digest]]= c(fl[[file_digest]],itm)
}
fl
Output (using a small test directory):
> fl
$`5715b719723c5111b3a38a6ff8b7ca56`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480.JPG"
$`24fd4d7d252ca66c8d7a88b539c55112`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481.JPG"
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3481.JPG"
$`2a1d668c874dc856b9df0fbf3f2e81ec`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482.JPG"
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482 copy.JPG"
[4] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482.JPG"
I tried:
h=ldply(filelist, digest, file=TRUE, algo="md5")
h$filenames=filelist
but I ended up with one row for each key-value pair (MD5, filename). I could not get the compact output I wanted.
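(Background: as an exercise, I converted the Python code that Raymond Hettinger presented in his PyCon AU 2011 keynote, "What Makes Python Awesome". The slides are here: http://slidesha.re/WKkh9M. I was able to halve the LOC, but I think I can do better, and learn more about vectorization along the way.)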
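For reference, here is a minimal sketch of one loop-free way to get the compact output described above: compute all the digests with vapply, then group the file names by hash with split. This is only one possible approach. The demo directory and its three files are hypothetical stand-ins created in tempdir() so the snippet runs anywhere; point demo_dir at a real directory to use it.

```r
library(digest)  # provides digest() for MD5 hashing

# Hypothetical demo directory with a few small files (an assumption for
# illustration); replace demo_dir with a real path such as test_dir.
demo_dir <- file.path(tempdir(), "dup_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("photo one", file.path(demo_dir, "a.JPG"))
writeLines("photo one", file.path(demo_dir, "a copy.JPG"))
writeLines("photo two", file.path(demo_dir, "b.JPG"))

filelist <- dir(demo_dir, pattern = "JPG|AVI", recursive = TRUE,
                all.files = TRUE, full.names = TRUE)

# One MD5 string per file, computed without an explicit for loop.
md5s <- vapply(filelist, digest, character(1), file = TRUE, algo = "md5")

# Group the file names by hash: a named list keyed by MD5,
# each element holding every file that shares that hash.
fl <- split(filelist, md5s)
fl
```

split sorts the list by hash rather than by first appearance, so the order of the elements may differ from the for-loop version even though the groups are the same.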
Or follow your ldply command with split(h, h$digest)? –
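Arun and Ben - my goal is to have a list whose keys are the md5 hashes and whose values are the lists of filenames corresponding to each unique key (see the sample output). When I run ldply(seq_along(filelist), function(idx) c(digest(filelist[idx], file = TRUE, algo = "md5"), filelist[idx])), the result repeats the md5 key alongside each associated filename value. I tried stumbling through with melt, to no avail. – goplayer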
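To illustrate the split idea from the comments, here is a rough sketch that starts from a data frame with one row per (md5, filename) pair and collapses it into the desired list. The data frame h below is a toy stand-in with made-up hashes and paths, and the column names digest and filenames are assumptions; a data frame produced by ldply may name its columns differently (for example V1).

```r
# Toy stand-in for the one-row-per-pair data frame (hypothetical values).
h <- data.frame(
  digest = c("aaa", "aaa", "bbb", "ccc", "ccc", "ccc"),
  filenames = c("a/1.JPG", "a/1 copy.JPG", "a/2.JPG",
                "a/3.JPG", "b/3.JPG", "b/3 copy.JPG"),
  stringsAsFactors = FALSE
)

# Splitting the filename column (not the whole data frame) by hash gives
# a named list of character vectors, one element per unique hash.
by_hash <- split(h$filenames, h$digest)

# Keep only hashes shared by more than one file, i.e. the actual duplicates.
dups <- Filter(function(files) length(files) > 1, by_hash)
dups
```

Splitting h$filenames rather than h itself is what produces the compact form: split(h, h$digest) would return a list of small data frames instead of a list of filename vectors.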