Finding similar strings in two columns

For each row, I want to count how many of the strings in the first column also appear in string2.

df1<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L, 1L, 1L, 
1L), .Label = c("gdyijq,udyhfs,gqdtr", "hdydg", "hishsgd,gugddf", 
"ydis"), class = "factor")), .Names = "string1", class = "data.frame", row.names = c(NA, 
-7L)) 

df2<- structure(list(string2 = structure(c(3L, 1L, 4L, 2L), .Label = c("0", 
"gqdtr", "hishsgd,gugddf", "ydis"), class = "factor")), .Names = "string2", class = "data.frame", row.names = c(NA, 
-4L))

I tried to combine the two into dfM, without success:

dfM <- cbind(df1,df2)

df1 looks like

 string1 
1 hishsgd,gugddf 
2 hdydg 
3 ydis 
4 gdyijq,udyhfs,gqdtr 
5 gdyijq,udyhfs,gqdtr 
6 gdyijq,udyhfs,gqdtr 
7 gdyijq,udyhfs,gqdtr

and df2 looks like

 string2 
1 hishsgd,gugddf 
2    0 
3   ydis 
4   gqdtr

I would like to get something like this:

dfN<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L, 1L, 1L, 
1L), .Label = c("gdyijq,udyhfs,gqdtr", "hdydg", "hishsgd,gugddf", 
"ydis"), class = "factor"), string2 = structure(c(4L, 2L, 5L, 
3L, 1L, 1L, 1L), .Label = c("", "0", "gqdtr", "hishsgd,gugddf", 
"ydis"), class = "factor")), .Names = c("string1", "string2"), class = "data.frame", row.names = c(NA, 
-7L)) 


################## second part ###############

The second part is:

dfN<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L), .Label = c("gdyijq,udyhfs,gqdtr", 
    "hdydg", "hishsgd,gugddf", "ydis"), class = "factor"), string2 = structure(c(3L, 
    1L, 4L, 2L), .Label = c("0", "gqdtr", "hishsgd,gugddf", "ydis" 
    ), class = "factor")), .Names = c("string1", "string2"), class = "data.frame", row.names = c(NA, 
    -4L))

For example, in the first row

string1   string2 
hishsgd,gugddf hishsgd,gugddf

both strings match, so the count should be 2.

In the second row

string1   string2 
hdydg     0

the strings are not similar, so the count should be 0,

and so on. The expected output looks like this:

renew<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L), .Label = c("gdyijq,udyhfs,gqdtr", 
"hdydg", "hishsgd,gugddf", "ydis"), class = "factor"), string2 = structure(c(3L, 
1L, 4L, 2L), .Label = c("0", "gqdtr", "hishsgd,gugddf", "ydis" 
), class = "factor"), similar = c(2L, 0L, 1L, 1L)), .Names = c("string1", 
"string2", "similar"), class = "data.frame", row.names = c(NA, 
-4L))

Source

2016-12-11 nik

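We can use strsplit to split the strings in each column, use Map with intersect to get the elements common to each pair of corresponding list elements, and then count them with lengths.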

lst <- lapply(dfN, function(x) strsplit(as.character(x), ",")) 
renew1 <- transform(dfN, similar = lengths(Map(intersect, lst[[1]], lst[[2]]))) 
identical(renew, renew1) 
#[1] TRUE
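
To make the intermediate steps visible, here is a minimal sketch of the same idea on plain character vectors. The names s1, s2, tok1, tok2 and common are illustrative only and are not part of the answer above; the values are copied from dfN in the question.

```r
# Values copied from dfN in the question, as plain character vectors.
s1 <- c("hishsgd,gugddf", "hdydg", "ydis", "gdyijq,udyhfs,gqdtr")
s2 <- c("hishsgd,gugddf", "0", "ydis", "gqdtr")

# Split each cell on commas; fixed = TRUE treats "," literally.
tok1 <- strsplit(s1, ",", fixed = TRUE)
tok2 <- strsplit(s2, ",", fixed = TRUE)

# Pair up row i of tok1 with row i of tok2 and keep the shared tokens.
common <- Map(intersect, tok1, tok2)

# Number of shared tokens per row.
similar <- lengths(common)
print(similar)
```

Inspecting common shows which tokens were matched in each row, which can help when a count is not what you expected.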

Source

2016-12-11 13:55:49 akrun

@nik I am not sure what you are after, but try 'library(rowr); cbind.fill(df1, df2)' – akrun

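'library(rowr); cbind.fill(df1, df2)' fills the empty slots with other strings, seemingly at random. I showed above an example of what I want as output. By the way, I accepted and liked your answer – nik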

@nik You can use the 'fill' argument, i.e. 'cbind.fill(df1, df2, fill = NA)' – akrun
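
If you would rather not depend on an extra package, the padded layout of dfN from the question can be sketched in base R. This is only one possible approach: pad_to is a hypothetical helper defined here, and the columns are built as plain character vectors rather than the factors used in the question.

```r
# Rebuild the question's data as plain character columns (not factors).
df1 <- data.frame(
  string1 = c("hishsgd,gugddf", "hdydg", "ydis",
              rep("gdyijq,udyhfs,gqdtr", 4)),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  string2 = c("hishsgd,gugddf", "0", "ydis", "gqdtr"),
  stringsAsFactors = FALSE
)

# pad_to() is a hypothetical helper: extend x to length n with a filler.
pad_to <- function(x, n, fill = "") {
  x <- as.character(x)
  c(x, rep(fill, n - length(x)))
}

# Pad both columns to the longer length, then combine them.
n <- max(nrow(df1), nrow(df2))
dfM <- data.frame(
  string1 = pad_to(df1$string1, n),
  string2 = pad_to(df2$string2, n),
  stringsAsFactors = FALSE
)
print(dfM)
```

Passing fill = NA to pad_to instead of the default "" would give missing values in the padded rows, similar in spirit to the fill = NA suggestion above.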

Alternatively, you can use %in% to do the matching:

dfN<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L), .Label = c("gdyijq,udyhfs,gqdtr", 
    "hdydg", "hishsgd,gugddf", "ydis"), class = "factor"), string2 = structure(c(3L, 
    1L, 4L, 2L), .Label = c("0", "gqdtr", "hishsgd,gugddf", "ydis" 
    ), class = "factor")), .Names = c("string1", "string2"), class = "data.frame", row.names = c(NA, 
    -4L)) 
renew<- structure(list(string1 = structure(c(3L, 2L, 4L, 1L), .Label = c("gdyijq,udyhfs,gqdtr", 
"hdydg", "hishsgd,gugddf", "ydis"), class = "factor"), string2 = structure(c(3L, 
1L, 4L, 2L), .Label = c("0", "gqdtr", "hishsgd,gugddf", "ydis" 
), class = "factor"), similar = c(2L, 0L, 1L, 1L)), .Names = c("string1", 
"string2", "similar"), class = "data.frame", row.names = c(NA, 
-4L)) 

dfN 
renew 

# use strsplit to break up the cell values 
col1<- strsplit(as.character(dfN$string1),",") 
col2<- strsplit(as.character(dfN$string2),",") 

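# use %in% to find matches; SIMPLIFY = FALSE keeps the result a list
# even when every row happens to have the same number of tokens
res<- mapply(FUN="%in%", col1, col2, SIMPLIFY = FALSE) 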

#sum up the TRUE values 
res2<- lapply(res,sum) 

# merge the result 
resultDF<- data.frame(dfN, newcol= unlist(res2)) 

#test 
resultDF== renew #data.frame(dfN, newcol= 1:4 )
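
One difference between the two answers is worth keeping in mind: summing %in% counts every occurrence in the first column, while intersect counts each distinct shared token once. The two agree on the question's data because no cell repeats a token. A small sketch with made-up values (x and y are illustrative, not from the question):

```r
# Made-up cells in which the first one repeats a token.
x <- strsplit("a,a,b", ",", fixed = TRUE)[[1]]
y <- strsplit("a,c", ",", fixed = TRUE)[[1]]

# %in% tests each element of x, so the repeated "a" is counted twice.
n_in <- sum(x %in% y)

# intersect() returns unique shared tokens, so "a" is counted once.
n_intersect <- length(intersect(x, y))

print(c(n_in = n_in, n_intersect = n_intersect))
```

Which behaviour is right depends on whether repeated strings in a cell should count once or several times.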

Source

2016-12-11 14:58:55

Thanks, I like your answer – nik
