2017-09-29 46 views
0
I want to compare two string vectors as follows: 

Test1<-c("Everything is normal","It is all sunny","Its raining cats and dogs","Mild") 

Test2<-c("Everything is normal","It is thundering","Its raining cats and dogs","Cloudy") 

Filtered<-data.frame(Test1,Test2) 

Expected output (how do I count the statements that match between two string vectors?):

Number the same: 2 
Number present in Test1 and not in Test2: 2 
Number present in Test2 and not in Test1: 2 

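I would also like to see which strings differ, so the other expected outputs would be as follows (ideally also as part of the original data frame):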

Same<-c("Everything is normal","Its raining cats and dogs") 
OnlyInA<-c("It is all sunny","Mild") 
OnlyInB<-c("It is thundering","Cloudy") 

I have tried:

Filtered$Same<-intersect(Filtered$Test1,Filtered$Test2) 
Filtered$InAButNotB<-setdiff(Filtered$Test1,Filtered$Test2) 

But when I run the last line I get the error "replacement has 127 rows, data has 400" (this is with a longer data set).

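I think this is because I am only returning the rows that differ, so the column lengths don't match. How can I fill in NA for the rows where setdiff finds no difference, so that I can keep the result in the original data frame?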

+0

Which package is the function Filtered from? I don't see it in base R. –

+0

Apologies for the typo. I have edited it. –

+0

In your Filtered data frame, do you set the missing values to NA when the vectors are of unequal length? –

Answers

1

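The base R outer function applies a function to every combination of elements from two vectors. So using outer with '==' compares every element of one vector with every element of the other: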

Test1<-c("Everything is normal","It is all sunny","Its raining cats and dogs") 
Test2<-c("Everything is normal","It is thundering","Its raining cats and dogs","Cloudy") 

# test each element in Test1 for equality with each element in Test2 
compare <- outer(Test1, Test2, '==') 

# calculate overlaps and uniques 
overlaps <- sum(compare) # number of overlaps: 2 
unique.test1 <- (rowSums(compare) == 0) # in Test1 but not Test2 
unique.test2 <- (colSums(compare) == 0) # in Test2 but not Test1 

# return uniques 
OnlyInA <- Test1[unique.test1] 
OnlyInB <- Test2[unique.test2] 
same <- Test1[rowSums(compare) > 0] # in both; > 0 also holds if Test2 has duplicates 

# counts 
n.unique.a <- sum(unique.test1) 
n.unique.b <- sum(unique.test2) 
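
One caveat with the outer approach is worth a small sketch: sum(compare) counts matching pairs, not distinct shared values, so duplicated strings inflate the overlap count. The vectors a and b below are made up for illustration and are not from the question; de-duplicating with unique() first is one possible workaround, not the only one.

```r
# Hypothetical vectors with a duplicated value in `a`
a <- c("x", "y", "y")
b <- c("y", "z")

# One row per element of a, one column per element of b
compare <- outer(a, b, "==")
dim(compare)

# Counts matching pairs: both copies of "y" in a match "y" in b
n_pairs <- sum(compare)

# De-duplicate first to count distinct shared values instead
compare_u <- outer(unique(a), unique(b), "==")
n_shared <- sum(compare_u)
```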

Alternatively, the %in% operator is also useful for this sort of thing:

Test1[Test1 %in% Test2] 
[1] "Everything is normal"  "Its raining cats and dogs" 

Test1[!(Test1 %in% Test2)] 
[1] "It is all sunny" 

Test2[!(Test2 %in% Test1)] 
[1] "It is thundering" "Cloudy"  
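
To keep the results inside the original data frame, one option is to use %in% as a row-wise test and put NA in the rows that do not qualify, so every new column has the same number of rows as the data. This is a minimal sketch using the question's four-element vectors; the column name InBButNotA is an assumption added for symmetry with the question's InAButNotB.

```r
Test1 <- c("Everything is normal", "It is all sunny", "Its raining cats and dogs", "Mild")
Test2 <- c("Everything is normal", "It is thundering", "Its raining cats and dogs", "Cloudy")
Filtered <- data.frame(Test1, Test2, stringsAsFactors = FALSE)

# Each value stays on its own row; rows that do not qualify become NA
in_both <- Filtered$Test1 %in% Filtered$Test2
Filtered$Same       <- ifelse(in_both, Filtered$Test1, NA)
Filtered$InAButNotB <- ifelse(in_both, NA, Filtered$Test1)
Filtered$InBButNotA <- ifelse(Filtered$Test2 %in% Filtered$Test1, NA, Filtered$Test2)

# Counts can then be read off the non-NA entries
n_same <- sum(!is.na(Filtered$Same))
```

Because every new column has one entry per row, the assignment cannot trigger the "replacement has ... rows" error.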
+0

How can I use %in% to compare two columns in a data frame? –

+0

Take a look at dplyr::summarize() – jdobres

0

Using tidyverse functions, you can try something like:

library(dplyr) # provides %>% and summarise() 

Filtered %>% 
    summarise(comm = sum(Test1 %in% Test2), 
      InA = sum(!(Test1 %in% Test2)), 
      InB = sum(!(Test2 %in% Test1))) 

However, when working with plain vectors, if you are only interested in the aggregate counts, you can also try the following:

length(intersect(Test1,Test2)) 
length(setdiff(Test1,Test2)) 
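
The counts above are single numbers, but intersect() and setdiff() themselves return vectors that are usually shorter than the data frame, which is what produces the "replacement has ... rows" error. If the results do not need to line up row by row, one possible approach is to pad each vector with NA up to a common length before building or extending the data frame. In this sketch, pad is a hypothetical helper written for illustration, not a base R function, and InBButNotA is an assumed column name.

```r
Test1 <- c("Everything is normal", "It is all sunny", "Its raining cats and dogs")
Test2 <- c("Everything is normal", "It is thundering", "Its raining cats and dogs", "Cloudy")

# Hypothetical helper: extend x to length n, filling the tail with NA
pad <- function(x, n) {
  length(x) <- n
  x
}

n <- max(length(Test1), length(Test2))
Filtered <- data.frame(Test1 = pad(Test1, n), Test2 = pad(Test2, n),
                       stringsAsFactors = FALSE)

# Compute on the unpadded vectors so the padding NAs never count as matches
Filtered$Same       <- pad(intersect(Test1, Test2), n)
Filtered$InAButNotB <- pad(setdiff(Test1, Test2), n)
Filtered$InBButNotA <- pad(setdiff(Test2, Test1), n)
```

Note that the padded columns are packed at the top, so a value in Same is not necessarily on the same row as its source in Test1.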
+0

But how do I return the results to the original data frame without getting the error? –

+1

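What error? As you mentioned in the comments, if both columns have the same number of rows, it will not give you an error. If you are getting a different error message, please update the question accordingly. – Aramis7d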

+1

So the question is how to handle the case where the columns do not have the same number of rows? –
