2016-12-11 22 views
0

Find similar strings in two columns

For each row, I want to count how many of the strings in the first column are similar to string2.

df1<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L, 1L, 1L, 
1L), .Label = c("gdyijq,udyhfs,gqdtr", "hdydg", "hishsgd,gugddf", 
"ydis"), class = "factor")), .Names = "string1", class = "data.frame", row.names = c(NA, 
-7L)) 

df2<- structure(list(string2 = structure(c(3L, 1L, 4L, 2L), .Label = c("0", 
"gqdtr", "hishsgd,gugddf", "ydis"), class = "factor")), .Names = "string2", class = "data.frame", row.names = c(NA, 
-4L)) 

I tried to combine the two into dfM, without success:

dfM <- cbind(df1,df2) 

df1 looks like

              string1
1      hishsgd,gugddf
2               hdydg
3                ydis
4 gdyijq,udyhfs,gqdtr
5 gdyijq,udyhfs,gqdtr
6 gdyijq,udyhfs,gqdtr
7 gdyijq,udyhfs,gqdtr

and df2 looks like

 string2 
1 hishsgd,gugddf 
2    0 
3   ydis 
4   gqdtr 

I would like to end up with something like this:

dfN<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L, 1L, 1L, 
1L), .Label = c("gdyijq,udyhfs,gqdtr", "hdydg", "hishsgd,gugddf", 
"ydis"), class = "factor"), string2 = structure(c(4L, 2L, 5L, 
3L, 1L, 1L, 1L), .Label = c("", "0", "gqdtr", "hishsgd,gugddf", 
"ydis"), class = "factor")), .Names = c("string1", "string2"), class = "data.frame", row.names = c(NA, 
-7L)) 
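
The cbind call above most likely fails because df1 has 7 rows and df2 has 4. A minimal sketch of one way to get the padded layout of dfN, assuming base R only: pad the shorter column with empty strings before building the data frame. The helper name pad_col is hypothetical, and the sample data are rebuilt with data.frame() for brevity rather than copied from the structure() calls.

```r
# Sample data from the question, rebuilt with data.frame() for brevity
df1 <- data.frame(
  string1 = c("hishsgd,gugddf", "hdydg", "ydis", rep("gdyijq,udyhfs,gqdtr", 4)),
  stringsAsFactors = TRUE
)
df2 <- data.frame(
  string2 = c("hishsgd,gugddf", "0", "ydis", "gqdtr"),
  stringsAsFactors = TRUE
)

# pad_col is a hypothetical helper: convert to character and
# append `fill` until the vector has n elements
pad_col <- function(x, n, fill = "") {
  x <- as.character(x)
  c(x, rep(fill, n - length(x)))
}

# Pad both columns to the length of the longer one, then combine
n <- max(nrow(df1), nrow(df2))
dfM <- data.frame(
  string1 = pad_col(df1$string1, n),
  string2 = pad_col(df2$string2, n),
  stringsAsFactors = TRUE
)
dfM
```

Padding with "" rather than NA mirrors the empty level in the desired dfN; pass fill = NA to pad_col if missing values are preferred.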


################## second part ############### 

The second part is:

dfN<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L), .Label = c("gdyijq,udyhfs,gqdtr", 
    "hdydg", "hishsgd,gugddf", "ydis"), class = "factor"), string2 = structure(c(3L, 
    1L, 4L, 2L), .Label = c("0", "gqdtr", "hishsgd,gugddf", "ydis" 
    ), class = "factor")), .Names = c("string1", "string2"), class = "data.frame", row.names = c(NA, 
    -4L)) 

For example, in the first row

string1   string2 
hishsgd,gugddf hishsgd,gugddf 

both strings match, so the count should be 2.

In the second row

string1   string2 
hdydg     0 

nothing matches, so the count should be 0, and so on. The expected output looks like this:

renew<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L), .Label = c("gdyijq,udyhfs,gqdtr", 
"hdydg", "hishsgd,gugddf", "ydis"), class = "factor"), string2 = structure(c(3L, 
1L, 4L, 2L), .Label = c("0", "gqdtr", "hishsgd,gugddf", "ydis" 
), class = "factor"), similar = c(2L, 0L, 1L, 1L)), .Names = c("string1", 
"string2", "similar"), class = "data.frame", row.names = c(NA, 
-4L)) 

Answers

2

We can use strsplit to split the strings in each column, get the elements common to each pair of list elements with intersect via Map, and then count them with lengths:

lst <- lapply(dfN, function(x) strsplit(as.character(x), ",")) 
renew1 <- transform(dfN, similar = lengths(Map(intersect, lst[[1]], lst[[2]]))) 
identical(renew, renew1) 
#[1] TRUE 
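
To make the Map(intersect, ...) step concrete, here is a small sketch of what happens for a single row, using the fourth row of dfN as input. The variable names a, b and common are only for illustration.

```r
# Fourth row of dfN: string1 = "gdyijq,udyhfs,gqdtr", string2 = "gqdtr"
a <- strsplit("gdyijq,udyhfs,gqdtr", ",")[[1]]  # three pieces
b <- strsplit("gqdtr", ",")[[1]]                # one piece

# intersect() keeps the pieces that occur in both vectors
common <- intersect(a, b)
common
#[1] "gqdtr"

# the number of shared pieces is the "similar" value for this row
length(common)
#[1] 1
```

Map applies the same intersect call to every pair of rows, and lengths counts the shared pieces in each result.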
+0

@nik I'm not sure what you are after, but try 'library(rowr); cbind.fill(df1, df2)' – akrun

+2

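'library(rowr); cbind.fill(df1, df2)' fills the empty slots with other strings at random. I showed an example above of what I want as output. By the way, I accepted and liked your answer – nik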

+0

@nik You can use the 'fill' argument, i.e. 'cbind.fill(df1, df2, fill = NA)' – akrun

1

Or you can use %in% to do the matching:

dfN<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L), .Label = c("gdyijq,udyhfs,gqdtr", 
    "hdydg", "hishsgd,gugddf", "ydis"), class = "factor"), string2 = structure(c(3L, 
    1L, 4L, 2L), .Label = c("0", "gqdtr", "hishsgd,gugddf", "ydis" 
    ), class = "factor")), .Names = c("string1", "string2"), class = "data.frame", row.names = c(NA, 
    -4L)) 
renew<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L), .Label = c("gdyijq,udyhfs,gqdtr", 
"hdydg", "hishsgd,gugddf", "ydis"), class = "factor"), string2 = structure(c(3L, 
1L, 4L, 2L), .Label = c("0", "gqdtr", "hishsgd,gugddf", "ydis" 
), class = "factor"), similar = c(2L, 0L, 1L, 1L)), .Names = c("string1", 
"string2", "similar"), class = "data.frame", row.names = c(NA, 
-4L)) 

dfN 
renew 

# use strsplit to break up the cell values 
col1<- strsplit(as.character(dfN$string1),",") 
col2<- strsplit(as.character(dfN$string2),",") 

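# use %in% to flag which pieces of string1 also occur in string2;
# SIMPLIFY = FALSE keeps the result a list even when every row has the same number of pieces
res<- mapply(FUN="%in%", col1, col2, SIMPLIFY = FALSE) 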

#sum up the TRUE values 
res2<- lapply(res,sum) 

# merge the result 
resultDF<- data.frame(dfN, newcol= unlist(res2)) 

#test 
resultDF== renew #data.frame(dfN, newcol= 1:4 ) 
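
The two answers agree on this data, but they can differ when a string repeats within a cell: intersect drops duplicates, while summing %in% counts every matching piece. A small sketch with made-up vectors (a and b are hypothetical values, not taken from the question):

```r
# Hypothetical pieces: "x" appears twice on the left-hand side
a <- c("x", "x", "y")
b <- c("x", "z")

# intersect() returns unique shared values, so "x" is counted once
n_unique <- length(intersect(a, b))
n_unique
#[1] 1

# %in% tests every element of a, so both copies of "x" are counted
n_all <- sum(a %in% b)
n_all
#[1] 2
```

Which count is right depends on whether repeated strings in a cell should count once or every time.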
+0

Thanks, I liked your answer – nik
