2013-12-17 44 views
36

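Convert all data frame character columns to factors

Given a (pre-existing) data frame with columns of various types, what is the simplest way to convert all of its character columns to factors without affecting columns of any other type?

Here is an example data.frame: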

df <- data.frame(A = factor(LETTERS[1:5]), 
       B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)), 
       D = letters[1:5], 
       E = paste(LETTERS[1:5], letters[1:5]), 
       stringsAsFactors = FALSE) 
df 
# A B  C D E 
# 1 A 1 TRUE a A a 
# 2 B 2 TRUE b B b 
# 3 C 3 FALSE c C c 
# 4 D 4 FALSE d D d 
# 5 E 5 TRUE e E e 
str(df) 
# 'data.frame': 5 obs. of 5 variables: 
# $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5 
# $ B: int 1 2 3 4 5 
# $ C: logi TRUE TRUE FALSE FALSE TRUE 
# $ D: chr "a" "b" "c" "d" ... 
# $ E: chr "A a" "B b" "C c" "D d" ... 

I know I can do:

df$D <- as.factor(df$D) 
df$E <- as.factor(df$E) 

Is there a way to automate this process a bit more?

+0

@AnandaMahto Thanks. I usually try hard to avoid conversion to factors and am often forced to set the global option. So this idea came easily to me. – Roland

Answers

45
DF <- data.frame(x=letters[1:5], y=1:5, stringsAsFactors=FALSE) 

str(DF) 
#'data.frame': 5 obs. of 2 variables: 
# $ x: chr "a" "b" "c" "d" ... 
# $ y: int 1 2 3 4 5 

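Before R 4.0.0, the (annoying) default of as.data.frame was to convert all character columns to factor columns. Since R 4.0.0 the default is stringsAsFactors = FALSE, so pass stringsAsFactors = TRUE explicitly, which works in either version. You can use that here:

DF <- as.data.frame(unclass(DF), stringsAsFactors = TRUE) 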
str(DF) 
#'data.frame': 5 obs. of 2 variables: 
# $ x: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5 
# $ y: int 1 2 3 4 5 
77

Roland's answer is good for this specific problem, but I thought I would share a more generalized approach.

DF <- data.frame(x = letters[1:5], y = 1:5, z = LETTERS[1:5], 
       stringsAsFactors=FALSE) 
str(DF) 
# 'data.frame': 5 obs. of 3 variables: 
# $ x: chr "a" "b" "c" "d" ... 
# $ y: int 1 2 3 4 5 
# $ z: chr "A" "B" "C" "D" ... 

## The conversion 
DF[sapply(DF, is.character)] <- lapply(DF[sapply(DF, is.character)], 
             as.factor) 
str(DF) 
# 'data.frame': 5 obs. of 3 variables: 
# $ x: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5 
# $ y: int 1 2 3 4 5 
# $ z: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5 

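For the conversion, the left-hand side of the assignment (DF[sapply(DF, is.character)]) subsets the columns that are character. On the right-hand side, for that same subset, you use lapply to perform whatever conversion you need. R is smart enough to replace the original columns with the results.

The handy thing about this is that if you want to go the other way, or do other conversions, it is as simple as changing what you look for on the left and specifying what you want to change it to on the right.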
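As a minimal sketch of going the other way, the same pattern can turn factor columns back into character. The data frame and the helper name is_fac below are made up for illustration; vapply is used instead of sapply only so that the column test always returns a logical vector.

```r
# Hypothetical example data: one factor column, one integer, one character
DF <- data.frame(x = factor(letters[1:3]), y = 1:3, z = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# Left-hand side: pick the factor columns
is_fac <- vapply(DF, is.factor, logical(1))

# Right-hand side: convert just that subset back to character
DF[is_fac] <- lapply(DF[is_fac], as.character)

str(DF)
```

Only x changes type here; y and z are left untouched.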

+0

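Thanks, very useful, especially after an RMySQL request that returns a data frame containing only character vectors. Don't forget (like I did) to first set the proper types (numeric, logical, etc.) on the columns that are not meant to be character. –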

16

As @Raf Z commented on this question, dplyr now has mutate_if. Super useful, simple and readable.

> str(df) 
'data.frame': 5 obs. of 5 variables: 
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5 
$ B: int 1 2 3 4 5 
$ C: logi TRUE TRUE FALSE FALSE TRUE 
$ D: chr "a" "b" "c" "d" ... 
$ E: chr "A a" "B b" "C c" "D d" ... 

> library(dplyr) 
> df <- df %>% mutate_if(is.character, as.factor) 

> str(df) 
'data.frame': 5 obs. of 5 variables: 
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5 
$ B: int 1 2 3 4 5 
$ C: logi TRUE TRUE FALSE FALSE TRUE 
$ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5 
$ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5 
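
In dplyr 1.0.0 and later, mutate_if is superseded by across(). A rough equivalent, assuming dplyr >= 1.0.0 is installed, would look like the sketch below; the data frame is a cut-down copy of the one in the question.

```r
library(dplyr)

# Cut-down version of the question's data frame (for illustration only)
df <- data.frame(A = factor(LETTERS[1:5]),
                 B = 1:5,
                 D = letters[1:5],
                 stringsAsFactors = FALSE)

# Convert only the character columns; other column types are left alone
df <- df %>% mutate(across(where(is.character), as.factor))

str(df)
```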
1

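I used to do this with a simple for loop. As in @A5C1D2H2I1M1N2O1R2T1's answer, lapply is a nice solution. But if you convert all columns at once, you need to wrap the result in data.frame, otherwise you end up with a list. The differences in execution time are small.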

mm2N=mm2New[,10:18] 
str(mm2N) 
'data.frame': 35487 obs. of 9 variables: 
$ bb : int 4 6 2 3 3 2 5 2 1 2 ... 
$ vabb : int -3 -3 -2 -2 -3 -1 0 0 3 3 ... 
$ bb55 : int 7 6 3 4 4 4 9 2 5 4 ... 
$ vabb55: int -3 -1 0 -1 -2 -2 -3 0 -1 3 ... 
$ zr : num 0 -2 -1 1 -1 -1 -1 1 1 0 ... 
$ z55r : num -2 -2 0 1 -2 -2 -2 1 -1 1 ... 
$ fechar: num 0 -1 1 0 1 1 0 0 1 0 ... 
$ varr : num 3 3 1 1 1 1 4 1 1 3 ... 
$ minmax: int 3 0 4 6 6 6 0 6 6 1 ... 

# For solution 
t1=Sys.time() 
for(i in 1:ncol(mm2N)) mm2N[,i]=as.factor(mm2N[,i]) 
Sys.time()-t1 
Time difference of 0.2020121 secs 
str(mm2N) 
'data.frame': 35487 obs. of 9 variables: 
$ bb : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ... 
$ vabb : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ... 
$ bb55 : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ... 
$ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ... 
$ zr : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ... 
$ z55r : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ... 
$ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ... 
$ varr : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ... 
$ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ... 

#lapply solution 
mm2N=mm2New[,10:18] 
t1=Sys.time() 
mm2N <- lapply(mm2N, as.factor) 
Sys.time()-t1 
Time difference of 0.209012 secs 
str(mm2N) 
List of 9 
$ bb : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ... 
$ vabb : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ... 
$ bb55 : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ... 
$ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ... 
$ zr : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ... 
$ z55r : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ... 
$ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ... 
$ varr : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ... 
$ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ... 

#data.frame lapply solution 
mm2N=mm2New[,10:18] 
t1=Sys.time() 
mm2N <- data.frame(lapply(mm2N, as.factor)) 
Sys.time()-t1 
Time difference of 0.2010119 secs 
str(mm2N) 
'data.frame': 35487 obs. of 9 variables: 
$ bb : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ... 
$ vabb : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ... 
$ bb55 : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ... 
$ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ... 
$ zr : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ... 
$ z55r : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ... 
$ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ... 
$ varr : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ... 
$ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ... 
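
Another way to avoid ending up with a list, sketched here on made-up data since mm2New is not available, is to assign into mm[] so that the columns are replaced in place and the object stays a data frame. The name mm is hypothetical.

```r
# Hypothetical stand-in for the numeric columns above
mm <- data.frame(a = c(1L, 2L, 2L), b = c(0.5, 0.5, 1))

# The empty brackets keep the data.frame structure; only the columns change
mm[] <- lapply(mm, as.factor)

str(mm)
```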
0

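The simplest way is to use the code given below. It automates the whole process of converting all the variables in an R data frame to factors. Note that it converts every column, not only the character ones. It worked perfectly well for me. Here food_cat is the dataset I am using; change it to the one you are working with.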

# Convert every column of food_cat to a factor, one column at a time
for (i in seq_len(ncol(food_cat))) {
  food_cat[[i]] <- as.factor(food_cat[[i]])
}
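
If only the character columns should be converted, as the question asks, the loop can check each column's type first. This is a sketch on a made-up food_cat; the column names are assumptions.

```r
# Hypothetical stand-in for the food_cat dataset
food_cat <- data.frame(name = c("apple", "bread", "apple"),
                       calories = c(52, 265, 52),
                       stringsAsFactors = FALSE)

# Convert a column only when it is character; leave the rest as they are
for (col in names(food_cat)) {
  if (is.character(food_cat[[col]])) {
    food_cat[[col]] <- as.factor(food_cat[[col]])
  }
}

str(food_cat)
```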