2016-07-13

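Below is the code for applying PCA country-wise to multiple data frames stored in myfiles. How can I delete the columns whose sum is zero from every data frame in the list?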

## Get file names for a working directory ### 
temp = list.files(pattern="*.csv") 

## Read files ### 
myfiles = lapply(temp, read.csv) 

### Name the files ### 

names(myfiles)<-c("mCRC_2015_Q1","mCRC_2015_Q2","mCRC_2015_Q3","mCRC_2015_Q4") 

##### to check the names of the columns ####### 
names(myfiles$mCRC_2015_Q1) 

##### to change the names of the columns ###### 

colnames = c("Insufficient efficacy","Issues around safety/tolerability","Inconvenient dosage regimen/administration","Price issues","Not reimbursed","Not included on hospital/government medicines formulary","Insufficient clinical data available for acceptance","Previously used for this patient","Prescription only possible in selected cases with detailed justification to authorities/payers ","I don’t have enough scientific information about it","Lack of experience in this setting","Involved in clinical trial with other drugs","Patient not appropriate for Targeted therapy","Patient not appropriate for cetuximab (Erbitux)","Others","Country") 


for (i in seq_along(myfiles)){ 
    colnames(myfiles[[i]]) <- colnames 
} 

##### Delete all those columns which have zero sum from each dataframe ##### 
for(i in 1:length(myfiles)){ 

    myfiles[[i]] <- myfiles[[i]][,which(!lapply(myfiles,FUN = function(x){colSums(x!=0)>0}))] 

} 

####### Run PCA for each dataframe country wise #### 
Myfiles<- split(myfiles, myfiles$Country) 
for(i in 1:length(Myfiles)){ 
    assign(paste0("pca", i), prcomp(Myfiles[[i]][which(names(myfiles)!="Country")], center=T, scale.=T)) 
} 

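These are the problems I am facing:
1) How do I delete all the columns that contain only zero values?
2) How can I apply the prcomp command to each data frame country-wise (Country is one of the variables in each data frame)?
3) From the loading matrix, how can I get the top 4 most relevant variables (regardless of sign) for each data frame?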

+1

That is too many questions. One at a time, please. –

+0

@RichardScriven Please answer the first one! Thanks! – Kavya

+0

@Kavya Can you give an example? It would make it easier to help you. – Learner

Answers

0

I will answer question 1, how to delete the columns of a data.frame that contain only zero values:

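exampledat <- data.frame(zero = rep(0, 20), one = rep(1, 20), 
         two = rep(1, 20), 
         mixed = c(0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1), 
         zeroagain = rep(0, 20)) 

# Keep only the columns that contain at least one non-zero value; 
# drop = FALSE keeps the result a data.frame even if a single column remains 
del_zero_cols <- function(datafr){ 
    datafr[, apply(datafr, 2, function(value) any(value != 0, na.rm = TRUE)), drop = FALSE] 
} 

del_zero_cols(exampledat) 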
+0

Thanks! But my question is how I can apply the same thing to multiple data frames stored in a list. – Kavya

+0

I gave you a function, del_zero_cols(), which you can easily apply to every member of the list with lapply(). – Bernhard
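
A minimal sketch of that suggestion, assuming a small made-up list in place of the real myfiles. Here del_zero_cols() is re-defined with vapply() (a variant, not the exact function from the answer) so that non-numeric columns such as Country are always kept:

```r
# Hypothetical toy list standing in for the real `myfiles`
myfiles <- list(
  q1 = data.frame(a = c(0, 0, 0), b = c(1, 0, 2), Country = c("A", "B", "C")),
  q2 = data.frame(a = c(0, 3, 0), b = c(0, 0, 0), Country = c("A", "B", "C"))
)

# Keep a column if it is non-numeric or has at least one non-zero value
del_zero_cols <- function(datafr) {
  keep <- vapply(
    datafr,
    function(col) !is.numeric(col) || any(col != 0, na.rm = TRUE),
    logical(1)
  )
  datafr[, keep, drop = FALSE]
}

# Apply the function to every data frame in the list
cleaned <- lapply(myfiles, del_zero_cols)
lapply(cleaned, names)
```

Because lapply() returns a new list, assigning the result back to myfiles replaces the original data frames with the cleaned ones; no explicit for loop is needed.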

0

There are a few options. I would handle the values with something like this:

myfiles <- 
    list(
    q1 = data.frame(r_1 = rep(0, times = 3), r_2 = c(0,0,1), country = c(LETTERS[1:3])), 
    q2 = data.frame(r_1 = rep(0, times = 3), r_2 = c(0,0,0), country = c(LETTERS[1:3])), 
    q3 = data.frame(r_1 = rep(0, times = 3), r_2 = c(1,0,0), country = c(LETTERS[1:3])) 
) 

# Merge the dataframes into one 
merged_myfiles = do.call(rbind, myfiles) 
merged_myfiles$file = gsub("\\.[0-9]+$", "", rownames(merged_myfiles)) 

# Clean columns that are all 0 
cleaned_data = merged_myfiles[,!sapply(merged_myfiles, function(col) all(col == 0))] 

# by() is a neat base function that applies a function to subsets of a data frame; 
# the output is a list 
by(cleaned_data, cleaned_data$country, function(df){ 
    mean(df$r_2) 
}) 

# Use dplyr for grouped analyses 
library(dplyr) 
library(magrittr) 
cleaned_data %>% 
    group_by(country) %>% 
    do({ 
    data.frame(mean = mean(.$r_2), file = .$file[1]) 
    }) 

The by option gives you:

cleaned_data$country: A 
[1] 0.3333333 
------------------------------------------------------------------------------------------------------------------------------------------ 
cleaned_data$country: B 
[1] 0 
------------------------------------------------------------------------------------------------------------------------------------------ 
cleaned_data$country: C 
[1] 0.3333333 

while dplyr gives:

Source: local data frame [3 x 3] 
Groups: country [3] 

    country  mean file 
    <fctr>  <dbl> <fctr> 
1  A 0.3333333  q1 
2  B 0.0000000  q1 
3  C 0.3333333  q1 
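
The same split-and-apply idea can be carried over to prcomp. The following is only a sketch under assumptions: the data frame dat and its columns x1, x2, x3 are invented for illustration, and the data are assumed to contain no missing values.

```r
# Hypothetical data: three numeric columns plus a Country column
set.seed(1)
dat <- data.frame(
  x1 = rnorm(30),
  x2 = rnorm(30),
  x3 = rnorm(30),
  Country = rep(c("A", "B", "C"), each = 10)
)

# One data frame per country
by_country <- split(dat, dat$Country)

# Run prcomp on the numeric columns of each piece
pca_list <- lapply(by_country, function(df) {
  num <- df[, names(df) != "Country", drop = FALSE]
  # Drop constant columns: prcomp(scale. = TRUE) cannot rescale them
  num <- num[, vapply(num, function(col) var(col) > 0, logical(1)), drop = FALSE]
  prcomp(num, center = TRUE, scale. = TRUE)
})

names(pca_list)
```

Keeping the results in a named list (pca_list$A, pca_list$B, ...) tends to be easier to work with than creating separate pca1, pca2, ... objects with assign().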

To select the largest prcomp loadings, I would suggest the following approach:

prcomp(USArrests) %>% 
    extract("rotation") %>% 
    unlist() %>% 
    abs() %>% 
    order(decreasing = TRUE) %>% 
    extract(1:4) %>% 
    # convert column-major positions to 1-based (row, col) indices 
    data.frame(row = (. - 1) %% ncol(USArrests) + 1, 
      col = ceiling(. / ncol(USArrests))) 

The original prcomp(USArrests)$rotation looks like this:

                PC1         PC2         PC3         PC4 
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173 
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830 
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914 
Rape     0.07515550  0.20071807  0.97408059  0.07232502 

and the output of the magrittr pipeline pinpoints the loadings of interest (Assault on PC1, Murder on PC4, UrbanPop on PC2, Rape on PC3):

   . row col 
1  2   2   1 
2 13   1   4 
3  7   3   2 
4 12   4   3
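
As an alternative sketch in base R only, arrayInd() can translate the positions into row and column indices, which makes it straightforward to report variable and component names directly. The object names pca, load, top, pos and res are chosen here for illustration:

```r
pca  <- prcomp(USArrests)
load <- abs(pca$rotation)

# Positions of the four largest absolute loadings (column-major order)
top <- order(load, decreasing = TRUE)[1:4]

# Convert those positions to (row, col) index pairs
pos <- arrayInd(top, dim(load))

res <- data.frame(
  variable  = rownames(load)[pos[, 1]],
  component = colnames(load)[pos[, 2]],
  loading   = pca$rotation[top],   # signed loading, for reference
  stringsAsFactors = FALSE
)
res
```

For a list of PCA results, the same steps could be wrapped in a function and applied with lapply().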