Converting vector elements to strings in a data.frame

I have a huge data frame df in which one column holds year-month values in the form "YYYYMM". The column is currently stored as a number. Snapshot:

> df[[1]][1:10] 
[1] 201001 201001 201001 201001 201001 201001 201001 201001 201001 201001 
> str(df) 
'data.frame': 2982393 obs. of 11 variables: 
$ YearMonth : int 201001 201001 201001 201001 201001 201001 201001 201001 201001 201001 ... 
$ ...

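I would like to convert these values to strings of the form "YYYY-MM" (and eventually to a factor), so that they can be compared with other data frames.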

I am struggling to find a simple way to transform the values.

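I tried the as.Date and format functions. Because the values have no day component, this does not work for strings. With numerics (the type of the data frame column) I ran into other problems as well.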

> as.Date("201001", format = "%Y%m") 
[1] NA 

> as.Date(201001, format = "%Y%m") 
Error in as.Date.numeric(201001, format = "%Y%m") : 
    'origin' must be supplied 
> as.Date(df[[1]], format = "%Y%m") 
Error in as.Date.numeric(df[[1]], format = "%Y%m") : 
    'origin' must be supplied

I was able to transform a single value using substring and string concatenation. I wrote the following function to handle one element:

transformString <- function(x) { # x = value 
    return (paste(cbind(substring(x, 1, 4),"-",substring(x,5,6)), collapse = '')) 
}

Problem: I have not found a simple way to apply this function to an entire column of the data.frame, other than looping over all elements:

transformStringVector <- function(x) { # x = vector 
    for(i in 1:length(x)) { 
     x[i]<-transformString(x[i]) 
    } 
    return (x) 
}

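This is far from elegant and performs poorly. I tried something like apply (see below) but ran into errors... (I admit I do not really understand the apply function.)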

> temp <- apply(df[[1]], 1, transformString) 
Error in apply(df[[1]], 1, transformString) : 
    dim(X) must have a positive length

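Does anyone have an alternative for this kind of transformation on a huge data frame? Or, more generally, a simple way to apply a string-like transformation to the elements of a data.frame?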

2012-04-10 FBE

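To address the question of explicitly applying this to a data.frame: you can access a column with the $ operator. So you can use any of the functions provided here (I will use the substr variant) to do it. If you plan to convert to a factor, I would do that first.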

> df <- data.frame(a=1:5,b=5:1,d=200101:200105) 
> df 
  a b      d
1 1 5 200101
2 2 4 200102
3 3 3 200103
4 4 2 200104
5 5 1 200105
> #Convert to a factor now for performance reasons. 
> df$d <- as.factor(df$d) 
> df$d <- paste(substr(df$d, 1, 4), "-", substr(df$d, 5,6), sep="") 
> df 
  a b       d
1 1 5 2001-01
2 2 4 2001-02
3 3 3 2001-03
4 4 2 2001-04
5 5 1 2001-05

> typeof(df$d) 
[1] "character" 
> df$d <- as.factor(df$d) 
> df 
  a b       d
1 1 5 2001-01
2 2 4 2001-02
3 3 3 2001-03
4 4 2 2001-04
5 5 1 2001-05
> typeof(df$d) 
[1] "integer"

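Note that, depending on how "huge" your data.frame is, you may get better performance by converting to a factor first and then converting the levels into hyphenated dates.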

> df <- data.frame(a=rep(1:5,1000000),b=rep(5:1,1000000),d=rep(200101:200105, 1000000)) 
> nrow(df) 
[1] 5000000 
> # Hyphenate first 
> system.time(df$d <- paste(substr(df$d, 1, 4), "-", substr(df$d, 5,6), sep="")) + system.time(df$d <- as.factor(df$d)) 
   user  system elapsed 
   9.65    0.61   10.31 
> 
> # Factor first (start again from a freshly created df, so that d is still an integer column)
> system.time(df$d <- as.factor(df$d)) + system.time(levels(df$d) <- paste(substr(levels(df$d), 1, 4), "-", substr(levels(df$d), 5,6), sep="")) 
   user  system elapsed 
   0.68    0.25    0.93

So, depending on the properties of your data.frame, you can improve performance roughly tenfold by converting to a factor first.
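The factor-first idea can be wrapped in a small helper. This is only a sketch: the name hyphenate_levels is hypothetical (not from the answer above), and it assumes every value is a six-digit YYYYMM number.

```r
# Hypothetical helper: turn YYYYMM values into a factor with "YYYY-MM" levels.
# Assumes every value has exactly six digits.
hyphenate_levels <- function(x) {
  f <- as.factor(x)              # levels are the distinct values, as strings
  lv <- levels(f)
  # The string work happens once per distinct level, not once per row.
  levels(f) <- paste(substr(lv, 1, 4), substr(lv, 5, 6), sep = "-")
  f
}

df <- data.frame(d = rep(200101:200105, 3))
df$d <- hyphenate_levels(df$d)
levels(df$d)
```

Whether this is faster than hyphenating every row first depends on how few distinct months the column contains compared with its number of rows.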

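P.S. If performance really is a concern, you could probably speed up the conversion to a factor (the slowest part of the fast solution) by using a hash-backed factor.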

2012-04-10 15:11:28

Nice! This is really helpful! It gave me more insight into performance and factors. Thanks! – FBE 2012-04-11 09:11:00

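The reason why

> as.Date("201001", format = "%Y%m") 
[1] NA

does not work is that R dates need a day component. Since your dates do not provide one, you get a missing value. To get around this, just add a day component: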

R> x = c("201001","201102") 
R> x = paste(x, "01", sep="")

So I have set every date to the first day of its month:

R> (y = as.Date(x, "%Y%m%d"))
[1] "2010-01-01" "2011-02-01"

You can then use format to get what you want:

R> format(y, "%Y-%m") 
[1] "2010-01" "2011-02"
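For an integer column such as the one in the question, the same steps can be chained as follows. This is a minimal sketch; the variable names ym, dates and formatted are made up for illustration.

```r
# Sketch: integer YYYYMM -> Date (first day of the month) -> "YYYY-MM" string.
ym <- c(201001L, 201102L)                 # example values in the question's format
dates <- as.Date(paste0(ym, "01"), format = "%Y%m%d")
formatted <- format(dates, "%Y-%m")
formatted
```

Keeping the intermediate dates vector is only worthwhile if you need real Date objects later, for example for date arithmetic.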

2012-04-10 14:58:02 csgillespie

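If you just want to convert the column values to strings in a given format and do not care about having a Date class, substr() and paste() both accept vectors as arguments: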

xx<-c(201011,201003,201002,201010,201009,201005,201001,201001,201001,201001) 

paste(substr(xx,1,4),substr(xx,5,6),sep="-") 
# [1] "2010-11" "2010-03" "2010-02" "2010-10" "2010-09" "2010-05" "2010-01" 
# [8] "2010-01" "2010-01" "2010-01"

This way, you do not have to use apply().
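Applied directly to a data frame column, this might look like the sketch below. The small df is a stand-in for the question's data (only the column name YearMonth is taken from it), and one_value is a hypothetical single-element function. The apply() call in the question failed because apply() expects an object with dimensions, such as a matrix, while df[[1]] is a plain vector; vapply() is one option when a function really can only handle one element at a time.

```r
# Small stand-in for the question's data frame.
df <- data.frame(YearMonth = c(201011L, 201003L, 201001L))

# Vectorised: substr() and paste() process the whole column in one call.
df$YearMonth <- paste(substr(df$YearMonth, 1, 4),
                      substr(df$YearMonth, 5, 6), sep = "-")

# Element-wise alternative: vapply() iterates over a plain vector and
# checks that each result is a single string.
one_value <- function(x) paste(substring(x, 1, 4), substring(x, 5, 6), sep = "-")
via_vapply <- vapply(c(201011L, 201003L, 201001L), one_value, character(1))
```

Both routes give the same strings; the vectorised form avoids a function call per element.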

2012-04-10 15:05:41 BenBarnes
