PCoA error in R due to a large dataset

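For a project at work I have to perform a PCoA (principal coordinate analysis, also known as multidimensional scaling). However, I run into problems when I use R to perform this analysis.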

The function cmdscale only accepts a matrix or a dist object as input, and the dist function gives this error:

Error: cannot allocate vector of size 4.2 Gb 
In addition: Warning messages: 
1: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) : 
    Reached total allocation of 4020Mb: see help(memory.size) 
2: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) : 
    Reached total allocation of 4020Mb: see help(memory.size) 
3: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) : 
    Reached total allocation of 4020Mb: see help(memory.size) 
4: In dist(mydata[c(3, 4)], method = "euclidian", diag = FALSE, upper = FALSE) : 
    Reached total allocation of 4020Mb: see help(memory.size)

When I used matrix instead, it changed the input into this:

 [,1]   
[1,] Integer,33741 
[2,] Integer,33741

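The contents of the dataset cannot be published online, but I can give you its dimensions: the dataset is 33741 rows long and 11 columns wide. The first column is an ID, and the other 10 values need to be used for the PCoA.

As you can see from the error, I am using only 2 columns and I already get a memory error.

Now my questions:
Is it possible to manipulate the data in such a way that I can stay within the memory limit with the dist function?
What am I doing wrong with the matrix function, so that it turns the columns into a 2-row output instead of a numeric matrix?

Things I have tried: cleaning up with garbage collection, restarting the GUI, restarting the system.

System: Windows 7 64-bit, Core i7 920qm 1.8GHz, 4GB DDR3 RAM

Code used: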

mydata <- read.table(file, header=TRUE) 

mydist <- dist(mydata[c(3,4)], method="euclidian", diag=FALSE, upper=FALSE) 
mymatrix <- matrix(mydata[c(3,4)], byrow=FALSE) 
mymatrix <- matrix(cbind(mydata[c(3,4)])) 

mycmdscale <- cmdscale(mydist, k=2, eig=FALSE, add=FALSE, x.ret=FALSE) 
mycmdscale <- cmdscale(mymatrix, k=2, eig=FALSE, add=FALSE, x.ret=FALSE) 

plot(mycmdscale)

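Of course I did not run the code in this order, but it contains all the ways in which I tried to load the data.

Thanks in advance for your answers.

Source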

2013-05-14 Sinshz

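You have too little memory to do this in R, which holds all objects in memory. I may not have the calculation exactly right (I forget how large the overhead of R objects is), but just to hold the full dissimilarity matrix you would need ~9GB of RAM.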

> print(object.size(matrix(0, ncol = 34000, nrow = 34000)), units = "Gb") 
8.6 Gb

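dist gets away with less in its internal representation, because it really only stores 0.5 * (nr * (nr - 1)) doubles (nr being the number of rows in the input data):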

> print(object.size(numeric(length = 0.5 * 34000 * 33999)), units = "Gb") 
4.3 Gb

[This is most likely where the error you are seeing comes from.]
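As a rough check of that arithmetic, here is a minimal sketch that estimates both sizes for the 33741 rows mentioned in the question. The helper names dist_bytes and full_bytes are made up for illustration, and the estimate ignores R's per-object overhead and any temporary copies.

```r
# Hypothetical helpers (not part of any package): raw storage for n observations.
# dist() keeps only the lower triangle: n * (n - 1) / 2 doubles of 8 bytes each.
dist_bytes <- function(n) 8 * n * (n - 1) / 2
# A full square matrix of doubles needs n * n values.
full_bytes <- function(n) 8 * n^2

n   <- 33741      # number of rows given in the question
gib <- 1024^3     # bytes per Gb as reported by R

dist_gb <- dist_bytes(n) / gib   # roughly 4.2, in line with the error message
full_gb <- full_bytes(n) / gib   # roughly twice that
round(c(dist = dist_gb, full = full_gb), 2)

# Sanity check on a tiny example: 10 rows give 10 * 9 / 2 = 45 stored distances.
length(dist(matrix(rnorm(20), nrow = 10)))
```

The lower-triangle figure alone already exceeds the 4020Mb allocation limit shown in the warnings, before cmdscale does any work.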

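Realistically, you are going to need upwards of 20-30GB of RAM to do anything useful once the dissimilarity matrix has been computed. Even if you could compute it, the eigenvectors of the PCoA solution would require ~9Gb of RAM just to hold them.

So a more pertinent question is: what do you hope to do with c. 34000 samples/observations?

To get a matrix from mydata[3:4] you can use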

as.matrix(mydata[3:4])

or, if you have factors and want to retain their numeric interpretation,

data.matrix(mydata[3:4])
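To illustrate the difference, here is a small sketch on a made-up data frame (the object df and its columns are hypothetical, not the OP's data). A data frame is a list of columns, so matrix() wraps each column as one list element, which is what produces the "Integer,33741" output shown in the question.

```r
# Hypothetical stand-in for mydata: an id column plus two value columns.
df <- data.frame(id = 1:5, a = c(1.5, 2, 3.5, 4, 5), b = 6:10)

# matrix() on a data frame: each column becomes a single list element,
# so the result is a 2 x 1 list-matrix rather than a 5 x 2 numeric matrix.
bad <- matrix(df[c(2, 3)])
dim(bad)
typeof(bad)

# as.matrix() converts the selected columns into a proper numeric matrix.
good <- as.matrix(df[c(2, 3)])
dim(good)
is.numeric(good)

# data.matrix() replaces factors by their integer codes.
f_codes <- data.matrix(data.frame(f = factor(c("x", "y", "x"))))
```

Note that neither result is a distance matrix, so it still has to go through dist before cmdscale will accept it.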

Source

2013-05-14 15:10:55

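Well, I want to perform a principal coordinate analysis on the samples, but the company I work for does not have a server that can handle these calculations, so I was trying to find a workaround on a local machine. I have the day off today, so I cannot try data.matrix, but I will get back to you tomorrow on whether it works. Thanks already for your time, as this gives me the information I need to tell my boss. – Sinshz 2013-05-15 11:21:42

So I tried loading the matrix with your methods, and they do create a matrix. However, the function that performs the PCoA (cmdscale) does not accept this kind of matrix. It reports: Error in cmdscale(testmatrix, k = 2, eig = FALSE, add = FALSE, x.ret = FALSE) : distances must be result of 'dist' or a square matrix. I doubt the analysis is possible with my limited memory, but any ideas are welcome. – Sinshz 2013-05-17 06:48:29

@Sinshz Those two things are not related. I treated your two questions as separate issues; apologies for the confusion. What I showed was how to get a matrix from selected columns/components of a data frame. 'dist' will still fail, because storing the distances for that many rows needs more RAM than your machine has available. – 2013-05-17 14:34:55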
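The distinction in these comments can be sketched on a small simulated matrix (the 50 x 2 object x is an assumption for illustration only): cmdscale rejects a raw data matrix, but accepts the dist object computed from it.

```r
set.seed(42)
# Hypothetical raw data: 50 observations of 2 variables.
x <- matrix(rnorm(50 * 2), ncol = 2)

# Passing the raw (non-square) data matrix reproduces the error from the comment;
# tryCatch captures the message instead of stopping the script.
res <- tryCatch(cmdscale(x, k = 2), error = function(e) conditionMessage(e))
res

# Passing the dist object works and returns one row of coordinates per observation.
coords <- cmdscale(dist(x), k = 2)
dim(coords)
```

On 50 rows the dist object holds only 1225 values, which is why this runs where the full dataset does not.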

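I know this is old, but I thought I'd pitch in with what I've got...

I'm a bit surprised @Gavin Simpson didn't mention that a principal coordinate analysis of a Euclidean distance matrix is the same as a principal component analysis (at least when both use scaling = 1).

This is according to p. Borcard, D., Gillet, F., & Legendre, P. (2011). Chapter 5 Unconstrained Ordination (pp. 115-151). New York, NY: Springer New York. DOI: 10.1007/978-1-4419-7976-6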
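The equivalence can be checked numerically on a small simulated dataset. This is only a sketch using base R's cmdscale and prcomp (not the vegan functions), and the 200 x 4 matrix x is made up for illustration.

```r
set.seed(1)
# Hypothetical small dataset: 200 observations of 4 variables.
x <- matrix(rnorm(200 * 4), ncol = 4)

# PCoA (classical MDS) on the Euclidean distance matrix.
pcoa <- cmdscale(dist(x), k = 2)

# PCA on the centred, unscaled data; keep the first two score columns.
pca <- prcomp(x, center = TRUE, scale. = FALSE)$x[, 1:2]

# Each axis is only defined up to its sign, so compare absolute values.
max_diff <- max(abs(abs(pcoa) - abs(pca)))
max_diff   # should be numerically zero
```

The practical point is that prcomp works on the n x p data matrix directly and never builds the n x n distance matrix. If the variables are standardised in the PCA (scale. = TRUE), the matching PCoA would be the one computed on dist(scale(x)).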

I can run this fine on my local machine. System: Windows 7 64-bit, i5-2500 3.3GHz, 8GB RAM

library(vegan) # to perform PCA and associated operations 
library(ggplot2) # plotting (not necessary, but nice) 
library(grid) # arrow() 

#make a big test set like OP's 
test<-data.frame(id=seq(34000), var1=rnorm(34000), var2=rnorm(34000), 
       var3=rnorm(34000),var4=rnorm(34000),var5=rnorm(34000), 
       var6=rnorm(34000),var7=rnorm(34000),var8=rnorm(34000), 
       var9=rnorm(34000),var10=rnorm(34000)) 
#calculate PCA on the 10 value columns only (drop the id column) 
test.pca<-rda(test[, -1], scale=TRUE) 

#calculate percent variation on each axis 
test.pca.percExp<-round(eigenvals(test.pca)/sum(eigenvals(test.pca))*100, 2) 

#extract scores for plotting 
test.pca.sc<-scores(test.pca, choices=c(1,2), 
          display=c("sites", "species"), scaling=1) 

test.pca.site<-data.frame(test.pca.sc$sites) 
test.pca.spe<-data.frame(test.pca.sc$species) 
test.pca.spe$VAR<-rownames(test.pca.spe) 

#make the plot 
test.pca.p<-ggplot(test.pca.site, aes(PC1, PC2)) + 
    xlab(sprintf("PC1 %s%s", test.pca.percExp[1], "%")) + 
    ylab(sprintf("PC2 %s%s", test.pca.percExp[2], "%")) 

#add points and biplot arrows to plot 
test.pca.p + 
    geom_point() + 
    geom_segment(data = test.pca.spe, 
       aes(x = 0, xend = PC1, y = 0, yend = PC2), 
       arrow = arrow(length = unit(0.25, "cm")), colour = "grey") + 
    geom_text(data=test.pca.spe, 
      aes(x=PC1,y=PC2,label=VAR), 
      size=3, position=position_jitter(width=-2, height=0.1))+ 
    guides(color = guide_legend(title = "Var"))


#hard to see the points with arrows, so plot without the arrows 
test.pca.p + 
    geom_point()


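I stumbled across this question because I have the same problem with a Manhattan distance matrix, for which my answer will not help (as far as I know; there may be a way to transform the data before the PCA that gives the same result). For Euclidean distances, though, this answer should basically give the result I believe the OP is looking for. Hope this helps someone...

Source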

2014-11-26 02:05:10
