How do I compute the correlation between several files?


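I have two directories, DIR1 and DIR2, each containing 365 binary files. All of the files have the same format, size in bytes, extent, and so on. How do I compute the correlation between several files?

The code below reads the files in DIR1 and DIR2 as vectors and then computes the correlation. Basically I want a correlation map: we simply compute the R value for every grid pixel. To compute the global correlation map between dir1 and dir2, each pixel has two series of data, one from dir1 and one from dir2, from which that pixel's R value can be computed; we then simply loop over all pixels of the globe.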

dir1 <- list.files("C:\\cor", "*.bin", full.names = TRUE)
dir2 <- list.files("C:\\cor2", "*.bin", full.names = TRUE)
results <- list()
for (.files in dir1) {
    # read in the 365 files as a vector of numbers for dir1
    file1 <- do.call(rbind, (lapply(.files, readBin, integer(), size = 2,
                                    n = 360 * 720, signed = T)))
}
for (.files in dir2) {
    # read in the 365 files as a vector of numbers for dir2
    file2 <- do.call(rbind, (lapply(.files, readBin, integer(), size = 2,
                                    n = 360 * 720, signed = T)))
}
# calculate the correlation so we will get a correlation map
for (.files in seq_along(dir1)) {
    results[[length(results) + 1L]] <- cor(file1, file2)
}

I get this error: Error in cor(file1, file2) : allocMatrix: too many elements specified

Source

2012-12-03 sacvf

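@Downvoter: downvoting without leaving a comment doesn't help. –

How do 'files.group' and 'files.group2' differ from 'dir1' and 'dir2'? – plannapus

Could you show what 'file1' and 'file2' look like (using 'dput', 'head' or 'str')? – plannapus

I would rewrite your code as follows (assuming I have understood correctly that what you want is to compare every row of file1 with every row of file2):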

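# List the binary files in each directory (the pattern is a regular expression).
dir1 <- list.files("C:\\cor", "\\.bin$", full.names = TRUE)
dir2 <- list.files("C:\\cor2", "\\.bin$", full.names = TRUE)
# One row per file, one column per grid cell (360 * 720 two-byte signed integers).
file1 <- do.call(rbind, lapply(dir1, readBin, integer(), size = 2,
                               n = 360 * 720, signed = TRUE))
file2 <- do.call(rbind, lapply(dir2, readBin, integer(), size = 2,
                               n = 360 * 720, signed = TRUE))
# Correlate every row of file1 with every row of file2.
# apply() stacks its results column-wise, so t() puts file1 along the rows.
results <- t(apply(file1, 1, function(x) apply(file2, 1, function(y) cor(x, y))))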

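results will be a matrix (365 x 365) in which, for example, the correlation coefficient between the x-th row of file1 (hence the x-th file in DIR1) and the y-th row of file2 (hence the y-th file in dir2) is results[x,y]. You can then plot it directly as a heatmap with the function image(results).

Edit: to clarify the last line of the code, it corresponds exactly to the following for loop: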

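# file1 and file2 are matrices, so count rows with nrow();
# length() would return the total number of elements.
results <- array(dim = c(nrow(file1), nrow(file2)))
for (i in seq_len(nrow(file1))) {
    for (j in seq_len(nrow(file2))) {
        results[i, j] <- cor(file1[i, ], file2[j, ])
    }
}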
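As a side note, the double loop can usually be replaced by a single call, because cor() applied to two matrices correlates their columns. The sketch below uses small synthetic matrices as stand-ins for file1 and file2 (the sizes and random data are assumptions, chosen only so that it runs quickly) and checks that cor(t(file1), t(file2)) agrees with the loop. This column-wise behaviour is also the likely reason for the original error: without the transpose, cor() would try to return one entry per pair of grid cells, that is a 259200 x 259200 matrix.

```r
# Synthetic stand-ins for file1 and file2: one row per file,
# one column per grid cell (hypothetical sizes, not the real 365 x 259200).
set.seed(1)
n_files <- 5
n_cells <- 12
file1 <- matrix(rnorm(n_files * n_cells), nrow = n_files)
file2 <- matrix(rnorm(n_files * n_cells), nrow = n_files)

# Explicit double loop over files, as in the answer.
loop_res <- matrix(NA_real_, nrow(file1), nrow(file2))
for (i in seq_len(nrow(file1))) {
  for (j in seq_len(nrow(file2))) {
    loop_res[i, j] <- cor(file1[i, ], file2[j, ])
  }
}

# cor() on matrices correlates columns, so transpose to correlate rows (files).
vec_res <- cor(t(file1), t(file2))

dim(vec_res)                  # n_files x n_files
all.equal(loop_res, vec_res)  # check that both approaches agree
```

With the real data the same call would give the 365 x 365 file-by-file matrix, assuming both matrices fit in memory.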

Edit following the comments: @PaulHiemstra was quicker than me, but I was indeed about to propose something similar:

dir1 <- list.files("C:\\cor", "\\.bin$", full.names = TRUE)
dir2 <- list.files("C:\\cor2", "\\.bin$", full.names = TRUE)
# Dimensions: 360 x 720 grid, one slice per file, 2 datasets.
file_tot <- array(dim = c(360, 720, length(dir1), 2))
for (i in seq_along(dir1)) {
    file_tot[, , i, 1] <- readBin(dir1[i], integer(), size = 2, n = 360 * 720, signed = TRUE)
    file_tot[, , i, 2] <- readBin(dir2[i], integer(), size = 2, n = 360 * 720, signed = TRUE)
}
# For each grid cell, correlate its time series in dataset 1 with that in dataset 2.
results <- apply(file_tot, c(1, 2), function(x) cor(x[, 1], x[, 2]))
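Since the real files are not available, here is a minimal, self-contained sketch of the same pipeline on synthetic data. It writes a few small files of 2-byte signed integers to temporary directories with writeBin(), reads them back with readBin(), and computes the per-cell correlation map. The grid size, number of files, directory names and file names are all assumptions made for illustration.

```r
# Hypothetical small grid and file count so the sketch runs quickly.
nx <- 4; ny <- 6; nt <- 10
set.seed(42)
tmp1 <- file.path(tempdir(), "cor_demo_1")
tmp2 <- file.path(tempdir(), "cor_demo_2")
dir.create(tmp1, showWarnings = FALSE)
dir.create(tmp2, showWarnings = FALSE)

# Write nt synthetic files of 2-byte signed integers into each directory.
for (i in seq_len(nt)) {
  fname <- sprintf("day%03d.bin", i)
  writeBin(sample(-1000:1000, nx * ny, replace = TRUE),
           file.path(tmp1, fname), size = 2)
  writeBin(sample(-1000:1000, nx * ny, replace = TRUE),
           file.path(tmp2, fname), size = 2)
}

files1 <- list.files(tmp1, pattern = "\\.bin$", full.names = TRUE)
files2 <- list.files(tmp2, pattern = "\\.bin$", full.names = TRUE)
stopifnot(length(files1) == length(files2))

# Dimensions: (nx, ny, time step, dataset).
file_tot <- array(dim = c(nx, ny, length(files1), 2))
for (i in seq_along(files1)) {
  file_tot[, , i, 1] <- readBin(files1[i], integer(), size = 2,
                                n = nx * ny, signed = TRUE)
  file_tot[, , i, 2] <- readBin(files2[i], integer(), size = 2,
                                n = nx * ny, signed = TRUE)
}

# One correlation coefficient per grid cell.
results <- apply(file_tot, c(1, 2), function(x) cor(x[, 1], x[, 2]))
dim(results)  # nx x ny
```

Note that list.files() returns names in alphabetical order, so this pairing only makes sense if the i-th file in one directory covers the same time step as the i-th file in the other.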

Source

2012-12-03 10:38:38 plannapus

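No, that's not normal; but again, I can't test it, because I don't have your files or anything similar. – plannapus

You said there are 365 files in each directory. So the result of correlating the files from each directory has to be 365x365. Otherwise I have misunderstood your question. – plannapus

So what do you want to correlate? – plannapus

If you want to compute the temporal correlation at each x, y position (as it seems), I would convert the data into a multidimensional array with dimensions (nx, ny, ntsteps, ndatasets). For example, with a smaller example dataset: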

  # nx ny nsteps ndatasets 
dat = runif(20 * 30 * 100 * 2) 
dim(dat) = c(20, 30, 100, 2) 
> str(dat) 
num [1:20, 1:30, 1:100, 1:2] 0.969 0.482 0.974 0.682 0.856 ...

Now we take advantage of the fact that apply also works on multidimensional arrays, not only on matrices:

cor_result = apply(dat, c(1,2), function(x) cor(x[,1], x[,2])) 
> str(cor_result) 
num [1:20, 1:30] 0.06673 0.00943 -0.11265 -0.01157 -0.0024 ...

We use apply to loop over all x, y pairs and compute the temporal correlation.
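If apply() turns out to be slow on the full grid, one possible alternative is to compute the Pearson coefficient for all pixels at once with column-wise sums. This is a sketch of a different technique from the one in the answer, on the same kind of synthetic array; the names a, b, a_c, b_c and cor_vec are introduced here for illustration only.

```r
set.seed(1)
dat <- array(runif(20 * 30 * 100 * 2), dim = c(20, 30, 100, 2))

# Reference result with apply(), as above.
cor_apply <- apply(dat, c(1, 2), function(x) cor(x[, 1], x[, 2]))

# Put time first so that each column holds one pixel's time series.
a <- matrix(aperm(dat[, , , 1], c(3, 1, 2)), nrow = dim(dat)[3])
b <- matrix(aperm(dat[, , , 2], c(3, 1, 2)), nrow = dim(dat)[3])

# Centre each column, then apply the Pearson formula column by column.
a_c <- sweep(a, 2, colMeans(a))
b_c <- sweep(b, 2, colMeans(b))
r <- colSums(a_c * b_c) / sqrt(colSums(a_c^2) * colSums(b_c^2))

# Fold the per-pixel values back into an nx x ny map.
cor_vec <- matrix(r, nrow = dim(dat)[1], ncol = dim(dat)[2])
all.equal(cor_apply, cor_vec)  # check that both approaches agree
```

The trade-off is memory: the centred copies a_c and b_c are each as large as one dataset, so this version needs more RAM than the apply() loop.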

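Regarding your large dataset: loading it takes about 1.4 Gb. The rule of thumb in R is that you need about 3 times the size of the dataset in RAM to be able to work with it. So if you have 8 Gb of RAM and 64-bit R, this should work fine. Alternatively, I often do these computations in chunks, since I only have 4 Gb. For example, you could first process the first 5 rows (y coordinates), then the second 5 rows, and so on.

Source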
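A rough sketch of the chunked approach follows, again on a small synthetic array. The block size of 5 and the variable names are assumptions. Here the whole array already sits in memory, so the sketch only illustrates the bookkeeping; with the real files you would read just the cells belonging to the current block from each file.

```r
# Size of the full array if stored as 8-byte doubles, in GiB (about 1.41).
360 * 720 * 365 * 2 * 8 / 1024^3

set.seed(1)
nx <- 20; ny <- 30; nt <- 100
dat <- array(runif(nx * ny * nt * 2), dim = c(nx, ny, nt, 2))

chunk_size <- 5  # hypothetical block size along the y dimension
cor_result <- matrix(NA_real_, nx, ny)

for (start in seq(1, ny, by = chunk_size)) {
  cols <- start:min(start + chunk_size - 1, ny)
  # Keep all four dimensions even if a block has a single column.
  block <- dat[, cols, , , drop = FALSE]
  cor_result[, cols] <- apply(block, c(1, 2),
                              function(x) cor(x[, 1], x[, 2]))
}

dim(cor_result)  # nx x ny, filled block by block
```

Each pass only needs the block and its slice of the result, which is what keeps peak memory down when the blocks are read from disk one at a time.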

2012-12-03 13:15:02

Did you use my solution? –
