2017-04-04

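Does anyone know a fast way to calculate the relative overlap of two columns? I want to know how many of the elements of 'a' are in 'b'. Ideally, a column 'c' would be generated that stores this comparison value for each row. I am really stuck on this one: set operations on character data (strings of integers).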

b <- c("20", "1, 8, 19, 20, 22, 23, 28, 34, 41", 
     "3, 8, 10, 11, 18, 20, 26, 37", 
     "1, 3, 6, 18, 21, 35", "NA", "1, 21, 33", "14, 37", 
     "4, 14, 18, 23, 33, 37, 40", "14", 
     "4, 14, 20, 23, 33, 37, 40", 
     "2, 3, 5, 7, 8, 10, 14, 16, 18, 23, 25, 34, 40", 
     "6, 8, 10, 14, 19, 29, 33, 35, 36, 39, 41", 
     "1, 20", "1, 28, 36", "14", 
     "1, 6, 33, 12, 39", "28", 
     "1, 6, 11, 13, 18, 19, 21, 28, 33, 35, 36, 39", 
     "35, 40", "20", "20, 38", "6, 8, 19, 22, 29, 32, 33, 34, 40", 
     "1, 10, 21, 25, 33, 35, 36, 39, 40", "36") 

a <- c("14", "10", "8, 39", "26, 39", "14, 20", "33, 36", "14", 
     "NA", "8, 39", "33, 36", "8, 39", "1, 36", "10", "28, 33", 
     "14, 20", "33, 40", "28, 34", "1, 36", 
     "8, 39", "20", "14, 20", "29, 33", "36", "14") 

df <- data.frame(a, b) 

df$a <- as.character(df$a) 
df$b <- as.character(df$b) 

This works fine for row 18, but it is not easy to extend with sapply or an equivalent.

length(intersect(as.numeric(unlist(strsplit(df$a[18], ", "))),   
       as.numeric(unlist(strsplit(df$b[18], ", ")))))/
length(as.numeric(unlist(strsplit(df$b[18], ", ")))) 
# gives 
[1] 0.1666667 

length(intersect(as.numeric(unlist(strsplit(df$a[5], ", "))), 
       as.numeric(unlist(strsplit(df$b[5], ", ")))))/
length(as.numeric(unlist(strsplit(df$b[5], ", ")))) 
# gives 
[1] 0 
Warning messages: 
1: In intersect(as.numeric(unlist(strsplit(df$a[5], ", "))), as.numeric(unlist(strsplit(df$b[5], : 
    NAs introduced by coercion 
2: NAs introduced by coercion 

Answer

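I don't see why the conversion with as.numeric is needed. That is what gives you the warning. "NA" is treated as a character value in your data frame, and it is a character value that cannot be converted to a number.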

Note that a warning is not an error, so your code actually works for row 5 as well (unless you were expecting NA).
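As a minimal sketch of the point about as.numeric: the base R set functions work on character vectors directly, so the items can be compared as strings. The values below are row 18 of the question's data copied by hand, and the variable names (a18, b18, a_items, b_items, share) are made up for illustration. This assumes the numbers are formatted consistently, since "8" and "08" would not match as strings.

```r
# Row 18 of the question's data, copied by hand (names are illustrative)
a18 <- "1, 36"
b18 <- "1, 6, 11, 13, 18, 19, 21, 28, 33, 35, 36, 39"

# Split on ", " and keep the pieces as character strings
a_items <- strsplit(a18, ", ")[[1]]
b_items <- strsplit(b18, ", ")[[1]]

# intersect() compares strings just as well as numbers
share <- length(intersect(a_items, b_items)) / length(b_items)
share
# [1] 0.1666667
```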

I would do the following:

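getCounts <- function(x, y) {
    # split each comma-separated string into its individual items
    x <- strsplit(x, ", ")[[1]]
    y <- strsplit(y, ", ")[[1]]
    # fraction of the items in y that also occur in x
    mean(y %in% x)
}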
# gives 
> getCounts(df$a[5],df$b[5]) 
[1] 0 

This is basically what you did, but written a bit more cleanly, using mean(.. %in% ..) in place of length(intersect(.., ..))/...
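The two formulations agree as long as the second vector has no repeated entries. Here is a small check on made-up vectors (x, y and y_dup are hypothetical, not taken from the question's data):

```r
x <- c("1", "36")
y <- c("1", "6", "11", "36")

via_mean      <- mean(y %in% x)                       # share of y's entries found in x
via_intersect <- length(intersect(x, y)) / length(y)  # same share via set intersection
all.equal(via_mean, via_intersect)
# [1] TRUE

# With a repeated entry in y the two differ, because intersect() drops duplicates
y_dup <- c("1", "1", "6")
dup_mean      <- mean(y_dup %in% x)                           # 2 of 3 entries match
dup_intersect <- length(intersect(x, y_dup)) / length(y_dup)  # 1 distinct match out of 3
```

In the question's data each string appears to list a value only once, so the difference should not matter there.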

To do this over the two vectors a and b, you can use mapply:

out <- mapply(getCounts,df$a, df$b)
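To get the column 'c' that the question asks for, the mapply result can be assigned back to the data frame. The sketch below uses only rows 1, 5 and 18 of the question's data to stay short. Two things in it are additions rather than part of the answer: USE.NAMES = FALSE, which stops mapply from naming the result after the strings in a, and the last step, which assumes that rows containing the string "NA" should end up as missing values.

```r
# Three rows taken from the question's data (rows 1, 5 and 18)
df <- data.frame(
  a = c("14", "14, 20", "1, 36"),
  b = c("20", "NA", "1, 6, 11, 13, 18, 19, 21, 28, 33, 35, 36, 39"),
  stringsAsFactors = FALSE
)

getCounts <- function(x, y) {
  x <- strsplit(x, ", ")[[1]]
  y <- strsplit(y, ", ")[[1]]
  mean(y %in% x)  # fraction of the items in y that also occur in x
}

# One value per row, stored as a new column
df$c <- mapply(getCounts, df$a, df$b, USE.NAMES = FALSE)

# Assumption: rows where either side is the string "NA" should be missing
df$c[df$a == "NA" | df$b == "NA"] <- NA
df$c
```

If a 0 is preferred for those rows, as the original code returns, the last assignment can simply be left out.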