How to add a column to a data.table in R based on a string in another column?

I want to add columns to a data.table based on a string in another column. This is my data and what I have tried so far:

 
                        Params 
1:         { clientID : 459; time : 1386868908703; version : 6} 
2: { clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001} 
3:             { clientID : 988; time : 1388939739771} 
4: { clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001} 
5:             { clientID : 459; time : 1388090530634}

Code to create this table:

library(data.table)
DT = data.table(Params=c("{ clientID : 459; time : 1386868908703; version : 6}","{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}","{ clientID : 988; time : 1388939739771}","{ clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}","{ clientID : 459; time : 1388090530634}"))

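I want to parse the text in the "Params" column and create new columns based on it. For example, I would like a new column named "user" that holds only the number after "user :" in the Params string. The added column should look like this: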

 
                        Params   user 
1:         { clientID : 459; time : 1386868908703; version : 6} NA 
2: { clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001} 459001 
3:             { clientID : 988; time : 1388939739771} NA 
4: { clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001} 459001 
5:             { clientID : 459; time : 1388090530634} NA

I wrote the following function to do the parsing (in this example for "user"):

myparse <- function(searchterm, s) { 
    s <-gsub("{","",s, fixed = TRUE) 
    s <-gsub(" ","",s, fixed = TRUE) 
    s <-gsub("}","",s, fixed = TRUE) 
    s <-strsplit(s, '[;:]') 
    s <-unlist(s) 
    if (length(s[which(s==searchterm)])>0) {s[which(s==searchterm)+1]} else {NA} 
}

Then I add the column with the following call:

DT <- transform(DT, user = myparse("user", Params))

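This works for "time", which is contained in every row, but it does not work for "user", which is contained in only two rows. The following error is returned: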

Error in data.table(list(Params = c("{ clientID : 459; time : 1386868908703; version : 6}", : 
    argument 2 (nrow 2) cannot be recycled without remainder to match longest nrow (5)

How can I solve this? Thanks!

Source

2014-01-22 Miriam

Here is a way to do this with regular expressions:

myparse <- function(searchterm, s) { 
    res <- rep(NA_character_, length(s)) # NA vector 
    idx <- grepl(searchterm, s) # index for strings including the search term 
    pattern <- paste0(".*", searchterm, " : ([^;}]+)[;}].*") # regex pattern 
    res[idx] <- sub(pattern, "\\1", s[idx]) # extract target string 
    return(res) 
}

You can use this function to add a new column, for example for user:

DT[, user := myparse("user", Params)]

The new column contains NA for rows without a user field:

DT[, user] 
# [1] NA  "459001" NA  "459001" NA
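The error in the question arises because the original myparse calls unlist on the split of all rows at once, so it returns one value per match (2 for "user") instead of one value per row (5). If you would rather keep the split-based approach than switch to a regex, a minimal sketch is to parse each string separately. The names parse_one and myparse_rowwise are made up for this illustration, and the sketch assumes every string is a flat list of "key : value" pairs separated by ";":

```r
# Hypothetical helper: parse a single string and return one value (or NA).
parse_one <- function(searchterm, s) {
  s <- gsub("[{} ]", "", s)             # drop braces and spaces
  parts <- strsplit(s, "[;:]")[[1]]     # key, value, key, value, ...
  keys <- parts[c(TRUE, FALSE)]         # odd positions are keys
  vals <- parts[c(FALSE, TRUE)]         # even positions are values
  vals[match(searchterm, keys)]         # NA when the key is absent
}

# Apply the helper row by row; vapply enforces exactly one string per row.
myparse_rowwise <- function(searchterm, s) {
  vapply(s, function(x) parse_one(searchterm, x), character(1),
         USE.NAMES = FALSE)
}

params <- c("{ clientID : 459; time : 1386868908703; version : 6}",
            "{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}")
myparse_rowwise("user", params)
# [1] NA       "459001"
```

Because the result always has the same length as the input, it can be assigned with DT[, user := myparse_rowwise("user", Params)] without the recycling error.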

Source

2014-01-22 12:18:46

Thanks a lot. It works for the data I provided. How would I adjust the regex to allow strings like "{ clientID : 461; time : 1386770861254; type : new; newUser : 461002}", which contain an entry such as "type : new"? – Miriam

@Miriam What should the result be for this example, "type : new" or "new"? –

The column should be named "type" and have the value "new" (like "459001" for user above). – Miriam
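For the "type : new" case raised in these comments, the capture group in the regex answer takes everything up to the next ";" or "}", so it is not limited to digits and myparse("type", Params) should already return "new". One thing that may be worth tightening is the key match, so that a search term cannot match inside a longer key such as "newUser". The sketch below is one possible variant, not the answerer's code; the name myparse_key is hypothetical and it assumes the search term contains no regex metacharacters:

```r
# Hypothetical variant: the key must directly follow "{" or ";".
myparse_key <- function(searchterm, s) {
  pattern <- paste0("(^|[{;]) *", searchterm, " *: *([^;}]+)")
  m <- regmatches(s, regexec(pattern, s))   # one match vector per string
  vapply(m, function(x) {
    # x holds: full match, the delimiter group, the value group
    if (length(x) == 3) trimws(x[3]) else NA_character_
  }, character(1))
}

s <- c("{ clientID : 461; time : 1386770861254; type : new; newUser : 461002}",
       "{ clientID : 988; time : 1388939739771}")
myparse_key("type", s)
myparse_key("newUser", s)
myparse_key("User", s)   # no standalone "User" key, so NA is expected
```

As with the original answer, the result can be assigned with DT[, type := myparse_key("type", Params)].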

I would use an external parser, for example:

library(yaml) 

DT = data.frame(
    Params=c("{ clientID : 459; time : 1386868908703; version : 6}","{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}","{ clientID : 988; time : 1388939739771}","{ clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}","{ clientID : 459; time : 1388090530634}"), 
    stringsAsFactors=F 
    ) 

conv.to.yaml <- function(x){ 
    gsub('; ','\n',substr(x, 3, nchar(x)-1)) 
} 

tmp <- lapply(DT$Params, function(x) yaml.load(conv.to.yaml(x)))

The parsed lists can then be combined into a data frame:

unames <- unique(unlist(sapply(tmp, names))) 
res <- as.data.frame( do.call(rbind, lapply(tmp, function(x)x[unames]))) 
colnames(res) <- unames 
res

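The result is very close to what you have in mind, but the time values need better handling: they do not fit in a 32-bit integer and wrap around (for example, 1386868908703 shows up as -405527905):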

> res 
  clientID       time version                       id   user
1      459 -405527905       6                     NULL   NULL
2      459 -405612269    NULL 52a9ea8b534b2b0b5000575f 459001
3      988 1665303163    NULL                     NULL   NULL
4      459 -405626089    NULL 52a9ec00b73cbf0b210057e9 459001
5      459  816094026    NULL                     NULL   NULL
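One way to sidestep both the wrapped time values and the NULL entries is to split the key/value pairs by hand, keep every value as a string, and convert the time column to a double afterwards. This is only a sketch of that idea and not part of the yaml approach above; parse_params is a made-up name, and it assumes flat "key : value" pairs separated by ";" with no ";" or ":" inside the values:

```r
library(data.table)

params <- c("{ clientID : 459; time : 1386868908703; version : 6}",
            "{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}")

# Hypothetical helper: turn one string into a named list of character values.
parse_params <- function(x) {
  body  <- trimws(gsub("[{}]", "", x))                      # strip the braces
  pairs <- strsplit(strsplit(body, " *; *")[[1]], " *: *")  # list of c(key, value)
  vals  <- vapply(pairs, function(p) p[2], character(1))
  names(vals) <- vapply(pairs, function(p) p[1], character(1))
  as.list(vals)
}

# fill = TRUE pads keys that are missing in a row with NA.
res <- rbindlist(lapply(params, parse_params), fill = TRUE)

# Doubles represent these millisecond timestamps exactly.
res[, time := as.numeric(time)]
res
```

The other columns stay character here; convert them as needed, for example with as.integer for clientID.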

Source

2014-01-22 13:27:52 df239
