2017-05-25 26 views
0

Determine joint occurrences of row values in a data.table

I have a dataset with the following structure:

dput(structure(foc[1:50])) 
structure(list(firm_id = c("Texas", "Texas", "Texas", "Micron", 
"Micron", "DowCor", "DowCor", "DowCor", "DowCor", "DowCor", "DowCor", 
"Altera", "Altera", "Texas", "Texas", "Texas", "Molex", "Molex", 
"DowCor", "DowCor", "DowCor", "NSC", "NSC", "Micron", "Micron", 
"AAV", "AAV", "AAV", "AMD", "AMD", "DowCor", "DowCor", "Molex", 
"Molex", "Molex", "NSC", "NSC", "NSC", "Micron", "Micron", "CORN", 
"CORN", "DowCor", "DowCor", "Zilog", "Zilog", "CORN", "CORN", 
"CORN", "Micron"), pnum = c(5351876, 5351876, 5351876, 5362632, 
5362632, 5364633, 5364633, 5364633, 5364633, 5364633, 5364633, 
5369314, 5369314, 5370301, 5370301, 5370301, 5370551, 5370551, 
5371128, 5371128, 5371128, 5372410, 5372410, 5376577, 5376577, 
5383340, 5383340, 5383340, 5384272, 5384272, 5384383, 5384383, 
5384435, 5384435, 5384435, 5385861, 5385861, 5385861, 5387534, 
5387534, 5387558, 5387558, 5389365, 5389365, 5389565, 5389565, 
5392376, 5392376, 5392376, 5393694), date = structure(c(8769, 
8769, 8769, 8804, 8804, 8838, 8838, 8838, 8838, 8838, 8838, 8818, 
8818, 8769, 8769, 8769, 8772, 8772, 8779, 8779, 8779, 8798, 8798, 
8946, 8946, 8848, 8848, 8848, 8944, 8944, 8796, 8796, 8793, 8793, 
8793, 8839, 8839, 8839, 8890, 8890, 8887, 8887, 8803, 8803, 8772, 
8772, 8866, 8866, 8866, 8931), class = "Date"), PRIM = c("228", 
"257", "269", "257", "438", "264", "424", "428", "514", "521", 
"977", "326", "714", "228", "257", "269", "220", "439", "424", 
"427", "524", "188", "303", "257", "438", "257", "361", "62", 
"257", "438", "528", "556", "174", "361", "439", "148", "257", 
"438", "257", "438", "106", "501", "424", "528", "257", "438", 
"385", "428", "501", "257"), N = c(3L, 3L, 3L, 2L, 2L, 6L, 6L, 
6L, 6L, 6L, 6L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 2L)), .Names = c("firm_id", 
"pnum", "date", "PRIM", "N"), sorted = "pnum", class = c("data.table", 
"data.frame"), row.names = c(NA, -50L)) 

It looks like this:

foc 
       firm_id    pnum       date PRIM N 
    1:   Texas 5351876 1994-01-04  228 3 
    2:   Texas 5351876 1994-01-04  257 3 
    3:   Texas 5351876 1994-01-04  269 3 
    4:  Micron 5362632 1994-02-08  257 2 
    5:  Micron 5362632 1994-02-08  438 2 
   ---                                   
91731:   Intel 7472285 2003-06-25  713 3 
91732:   Intel 7472289 2004-12-21  381 2 
91733:   Intel 7472289 2004-12-21  713 2 
91734:   Intel 7472390 2003-10-01  712 2 
91735:   Intel 7472390 2003-10-01  718 2 

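I have a larger data.table, df, of which the above is a subset. Specifically, the above starts in 1994, while df goes back to 1980. Also, for clarity, the column names in df are lower-case, so the PRIM column is called prim there.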

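I want to determine how often pairs of PRIMs occur in the larger dataset. A pair exists when two PRIMs co-occur on the same pnum. No two identical PRIMs can occur on the same pnum, and every pnum in the dataset has between 2 and 8 PRIMs. In addition, I want to impose a time restriction using "date": I only want to consider pnums dated less than 5 years before the focal pnum.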

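For example, the first pnum in the data above is 5351876. It has three distinct PRIMs and therefore three pairs: (228,257), (228,269) and (257,269). The data.table sample also contains one pnum with 6 distinct PRIMs, which therefore has 15 distinct pairs. Note that the order within a pair is irrelevant, so (228,257) = (257,228).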
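The pair counts above can be checked with base R. This is only a small sketch: the variable names are made up for illustration, and sorting before calling combn() is one possible way to make the pair representation independent of order.

```r
# PRIM values of the first pnum (5351876) in the sample data
prims <- c("228", "257", "269")

# combn() returns one combination per column; sorting first means that
# ("257", "228") and ("228", "257") end up as the same pair
pairs <- t(combn(sort(prims), 2))
pair_keys <- paste(pairs[, 1], pairs[, 2], sep = "_")
print(pair_keys)

# Number of unordered pairs among 6 distinct PRIMs
n_pairs_six <- choose(6, 2)
print(n_pairs_six)
```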

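The code below does something simpler that I also need. It counts how many times each PRIM occurred in the 5 years before the focal date, but I am not sure how to determine how often a specific pair occurred.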

# For each (pnum, PRIM) in foc, count the rows of df with the same prim 
# that are dated within the 5 years before the focal pnum's date 
findpairs <- foc[, {print(.GRP) ; p = PRIM ; y = unique(date) 
             df[(date < y & date > (y - (5*365 + 1)) & prim == p), .N]} 
              , by = .(pnum, PRIM)] 

Any suggestions are very welcome.

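PS: In a second stage, I would also like to be able to add two "firm_id" conditions: excluding the focal firm_id, or looking at only one firm_id. That is why this variable is kept in the data.table for now, even though it is not used.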

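Edit 1: After the first attempted answer, I should clarify the desired output. There may be better solutions that produce a different output, but this is what I think would be great: a data.table with 5 columns: pnum, date (the date of the pnum), prim, paired prim, and pair incidence in the 5 years before date. Remember that a pair is agnostic about which prim comes first, and that a pair only exists when both PRIM values are found within the same pnum in the df data.table.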

Hope that clarifies things!

Answers

0

Here is the solution I came up with. You can create the combinations with the following function.

make_prim_pairs <- function(values,n=2){ 
    combinations <- (apply(t(combn(values,min(n,length(values)))),1,paste,collapse=",")) 
    return(combinations) 
} 

So, if you want to find the pairs for the whole dataset:

findpairs <- foc[,.(primPairs = make_prim_pairs(PRIM)),by=pnum] # the column is PRIM in foc (prim in df) 

This should find all pairs for each pnum. You can also add conditions on the data before making the pairs.

y <- some_date 
findpairs <- foc[date < y & date > (y - (5*365 + 1)),.(primPairs = make_prim_pairs(PRIM)),by=pnum] 
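
As a rough illustration of what these calls return, the sketch below runs the function on a small, entirely hypothetical toy table (the name toy and its values are not from the real data). The function only lists the pairs of each pnum, so counting them takes one more aggregation step. One change is an assumption on my part: the values are sorted before combn(), so that a pair gets the same label regardless of row order.

```r
library(data.table)

# Same idea as make_prim_pairs above, with sort() added (an assumption)
# so that "257","228" and "228","257" produce the same label
make_prim_pairs <- function(values, n = 2) {
  apply(t(combn(sort(values), min(n, length(values)))), 1, paste, collapse = ",")
}

# Hypothetical toy data, not taken from the real dataset
toy <- data.table(
  pnum = c(1, 1, 1, 2, 2, 3, 3),
  PRIM = c("228", "257", "269", "257", "228", "257", "438")
)

# Step 1: identify the pairs within each pnum
pairs_by_pnum <- toy[, .(primPairs = make_prim_pairs(PRIM)), by = pnum]

# Step 2: count in how many pnums each pair occurs
pair_counts <- pairs_by_pnum[, .N, by = primPairs]
print(pair_counts)
```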

Let me know if this helps.

+0

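Hi, thanks for the suggestion. I will have to implement it at work to see what the result actually is, so I am not 100% sure this comment makes sense, but does this function count the occurrences of the pairs or does it only identify them? I don't know exactly what the 'combn' function does, which is why I ask. I'll try it later and get back to you! Thanks for your help :) – SJDS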

+0

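Hi again, I am at the office and the function you suggested does not seem to work, or at least it does not give results I can interpret. I have updated the question with the desired output, which hopefully helps clarify what I am after. – SJDS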

+0

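@SJDS It would be very helpful if you could show what the final output should look like (by creating it manually), as I am still struggling to understand what you are trying to achieve. –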

0

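I found a solution that works on a small dataset, but it has now been running for more than 18 hours on the larger dataset. I don't know how close it is to finishing, but I thought I would share it. Maybe someone can understand and improve it.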

# Create all possible distinct pairs of prim classes that exist in the dataset df 
setkey(df, pnum) 
a <- df[df, allow.cartesian = TRUE] # cartesian self-join on pnum to combine all possible pairs 
a <- a[a$prim != a$i.prim] # delete pairs consisting of the same prim values 
a[, pair := paste0(pmin(prim, i.prim), "_", pmax(prim, i.prim))] # vectorized, order-agnostic pair key: 1_2 must be same as 2_1 
DT1 <- a[, .N, by = .(firm_id, pnum, date, pair)] # this is to delete the repeated pairs 
rm(a) 

# Create all possible distinct pairs of prim classes that exist in the subset foc 
setkey(foc, pnum) 
a <- foc[foc, allow.cartesian = TRUE] # cartesian self-join on pnum to combine all possible pairs 
a <- a[a$PRIM != a$i.PRIM] # delete pairs consisting of the same prim values 
a[, pair := paste0(pmin(PRIM, i.PRIM), "_", pmax(PRIM, i.PRIM))] # vectorized, order-agnostic pair key: 1_2 must be same as 2_1 
DT2 <- a[, .N, by = .(firm_id, pnum, date, pair)] # this is to delete the repeated pairs 

rm(a) 
DT1[, N:= NULL] ; DT2[, N:= NULL] # unwanted columns 

setnames(DT2, "pair", "PAIR") # only for clarity purposes in the formula below. This is the post 1994 data set. 

# For each focal (pnum, PAIR), count the rows of DT1 with the same pair in the 5 years before the focal date 
couples <- DT2[, {p = PAIR ; y = unique(date) 
             .(n_pairs = DT1[(date < y & date > (y - (5*365 + 1)) & pair == p), .N])} 
              , by = .(pnum, PAIR)] 

# This formula gives, I think, the number of times each unique pair occurred in the previous 5 years.

couples$lowp <- sub("_.+","", couples$PAIR) # split up the pair 
couples$highp <- sub(".+_","", couples$PAIR) # split up the pair 

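That does it. The next step is to match in the number of times lowp and highp each appear in the database (computed as with findpairs in the question; the code below assumes that result is stored as counts, with the count in column n_p), which is a simple match, and then to compute the desired variable.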

coup <- couples 
coup$n_lowp <- counts$n_p[match(paste(coup$pnum,"",coup$lowp), paste(counts$pnum,"",counts$PRIM))] 
coup$n_highp <- counts$n_p[match(paste(coup$pnum,"",coup$highp), paste(counts$pnum,"",counts$PRIM))] 

coup$yaya <- with(coup, n_pairs/(n_lowp + n_highp - n_pairs)) 
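
A note on the ratio above: because each prim occurs at most once per pnum, n_pairs / (n_lowp + n_highp - n_pairs) has the form of an intersection over a union, i.e. a Jaccard index over the sets of pnums containing each prim. The sketch below shows the arithmetic on made-up numbers; the function name jaccard_from_counts is hypothetical.

```r
# Intersection over union from three counts:
#   n_pairs = pnums containing both prims
#   n_low, n_high = pnums containing each prim separately
# Returns NaN when neither prim occurred in the window (0 / 0)
jaccard_from_counts <- function(n_pairs, n_low, n_high) {
  n_pairs / (n_low + n_high - n_pairs)
}

# Made-up example: the pair occurred on 2 pnums, the two prims on 5 and 3
example_value <- jaccard_from_counts(2, 5, 3)
print(example_value)
```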

I am sure there are more efficient ways to implement this, but it works (slowly).
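
One possibly faster direction, which I have not run on the real data, is to replace the row-by-row lookup with a data.table non-equi join. The sketch below uses hypothetical toy data; for simplicity its foc uses the lower-case column name prim, unlike the real foc. The helper make_pairs and the column window_start are names invented for this illustration. It relies on by = .EACHI returning one row per focal pair, in the same order as foc_pairs.

```r
library(data.table)

# Hypothetical toy data standing in for df (full history)
df <- data.table(
  pnum = c(1, 1, 2, 2, 3, 3, 3),
  date = as.Date(c("1990-01-01", "1990-01-01", "1992-06-01", "1992-06-01",
                   "1994-01-04", "1994-01-04", "1994-01-04")),
  prim = c("228", "257", "257", "228", "228", "257", "269")
)
foc <- df[date >= as.Date("1994-01-01")]  # focal subset (1994 onwards)

# All unordered pairs per pnum; keeping only lowp < highp stores each pair once
make_pairs <- function(dt) {
  left  <- dt[, .(pnum, date, lowp = prim)]
  right <- dt[, .(pnum, highp = prim)]
  merge(left, right, by = "pnum", allow.cartesian = TRUE)[lowp < highp]
}

hist_pairs <- make_pairs(df)
foc_pairs  <- make_pairs(foc)
foc_pairs[, window_start := date - (5 * 365 + 1)]

# Non-equi join: same pair, historical date strictly inside the 5-year window
# before the focal date; .N with by = .EACHI counts matches per focal row
matches <- hist_pairs[foc_pairs,
                      on = .(lowp, highp, date < date, date > window_start),
                      .N, by = .EACHI]
foc_pairs[, n_pairs := matches$N]
print(foc_pairs[, .(pnum, date, lowp, highp, n_pairs)])
```

Whether this is actually faster on the full dataset would need to be measured. A firm_id condition could presumably be added by carrying firm_id through make_pairs and into the join or a filter.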