2017-01-26 56 views
1

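Iterating over groups and counting matches between data frames

Apologies if the title of the question is not very clear.

I have two data frames, as follows: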

df1
NAME  FOLLOWS
san   big supa
san   EAU
san   simulate
san   spang
glyn  guido
glyn  claire
glyn  vincent
glyn  dan
glyn  peter
glyn  EAU


df2 
FOLLOWS 
guido 
vincent 
EAU 
EUSC 
brian 
simulate 
peter 

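I want to group by each NAME in df1 and, for each NAME, get the length of df1$FOLLOWS together with the count of matches between df1$FOLLOWS and df2$FOLLOWS. For these data frames, I expect output like this: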

df3 
NAME LENGTH_FOLLOWS COUNT_Match 
san  4   2 
glyn  6   4   

Answers

1

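You can first left-join df1 with df2. This keeps every row of df1 and leaves the joined column NULL where FOLLOWS has no match in df2. Then you can simply count the instances, since count() on a column skips NULL values.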

library(sqldf) 
sqldf('select NAME, count(NAME) as LENGTH_FOLLOWS , count(Actual_F) as COUNT_Match from (select t1.*, t2.FOLLOWS as Actual_F from df1 t1 left join df2 t2 on t1.FOLLOWS=t2.FOLLOWS) group by NAME') 

Or with base R:

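# match() gives the position of each df1$FOLLOWS value in df2$FOLLOWS, or NA if it is absent
df1$index <- match(df1$FOLLOWS, df2$FOLLOWS)
# For each NAME, count the non-NA entries in each column (output columns are Group.1, V1, V2)
aggregate(cbind(df1$FOLLOWS, df1$index), by = list(df1$NAME), FUN = function(x) length(x[!is.na(x)]))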
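If named output columns are preferred, the same counts can be sketched with %in% and tapply(). This is only a minimal sketch, not a replacement for the code above: the data frames are rebuilt from the question's sample data, stringsAsFactors = FALSE is an assumption, and the helper names matched and groups are made up for illustration.

```r
# Sample data copied from the question (stringsAsFactors = FALSE is an assumption)
df1 <- data.frame(
  NAME = c(rep("san", 4), rep("glyn", 6)),
  FOLLOWS = c("big supa", "EAU", "simulate", "spang",
              "guido", "claire", "vincent", "dan", "peter", "EAU"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  FOLLOWS = c("guido", "vincent", "EAU", "EUSC", "brian", "simulate", "peter"),
  stringsAsFactors = FALSE
)

# TRUE where a df1$FOLLOWS value also appears in df2$FOLLOWS
matched <- df1$FOLLOWS %in% df2$FOLLOWS

# Group labels in order of first appearance
groups <- as.character(unique(df1$NAME))

# tapply() returns results sorted by group name, so reindex by `groups`
df3 <- data.frame(
  NAME = groups,
  LENGTH_FOLLOWS = as.vector(tapply(matched, df1$NAME, length)[groups]),
  COUNT_Match = as.vector(tapply(matched, df1$NAME, sum)[groups]),
  stringsAsFactors = FALSE
)
print(df3)
```

Summing the logical vector counts the TRUE values, so COUNT_Match is the number of matched rows per NAME, and df1 itself is left unmodified.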
+0

Thanks for the sum over the logical vector of non-NA elements. Using base R worked well for me. – Santosh

1

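Here is an option using data.table. Convert the first data.frame to a 'data.table' (setDT(df1)) and join with 'df2' on 'FOLLOWS' to create an indicator column ('ind'). Then, grouped by 'NAME', we get the number of rows (.N) and the number of non-NA elements in 'ind'.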

library(data.table) 
setDT(df1)[df2, ind := 1, on = .(FOLLOWS)] 
df1[, .(LENGTH_FOLLOWS = .N, COUNT_MATCH = sum(!is.na(ind))), NAME] 
# NAME LENGTH_FOLLOWS COUNT_MATCH 
#1: san    4   2 
#2: glyn    6   4 
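The join above adds the 'ind' column to df1 by reference. If leaving the input untouched matters, a variant that tests membership directly inside the grouped expression might look like the sketch below. It assumes the data.table package is installed; the names dt1, dt2, and res are illustrative, and the tables are rebuilt from the question's sample data.

```r
library(data.table)

# Sample data copied from the question
dt1 <- data.table(
  NAME = c(rep("san", 4), rep("glyn", 6)),
  FOLLOWS = c("big supa", "EAU", "simulate", "spang",
              "guido", "claire", "vincent", "dan", "peter", "EAU")
)
dt2 <- data.table(
  FOLLOWS = c("guido", "vincent", "EAU", "EUSC", "brian", "simulate", "peter")
)

# Per NAME: .N is the row count; summing the logical %in% result counts the matches
res <- dt1[, .(LENGTH_FOLLOWS = .N,
               COUNT_MATCH = sum(FOLLOWS %in% dt2$FOLLOWS)),
           by = NAME]
print(res)
```

Because no column is assigned with :=, dt1 keeps its original two columns.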
+1

Thanks for the option. This looks good too. – Santosh
