Speeding up a lookup procedure

I have two tables: coc_data and DT. The coc_data table contains co-occurrence frequencies for pairs of words. Its structure looks like this:

  word1 word2 freq
1     A     B    1
2     A     C    2
3     A     D    3
4     A     E    2

The second table, DT, contains the frequency of each word in different years, for example:

  word year weight
1    A 1966      9
2    A 1967      3
3    A 1968      1
4    A 1969      4
5    A 1970     10
6    B 1966      9

In reality, coc_data currently has 150,000 rows and DT has about 450,000 rows. The R code below simulates the two data sets.

# Prerequisites 
library(data.table) 
set.seed(123) 
n <- 5 

# Simulate co-occurrence data [coc_data] 
words <- LETTERS[1:n] 
# Times each word used 
freq <- sample(10, n, replace = TRUE) 
# Co-occurrence data.frame 
coc_data <- setNames(data.frame(t(combn(words,2))),c("word1", "word2")) 
coc_data$freq <- apply(combn(freq, 2), 2, function(x) sample(1:min(x), 1)) 

# Simulate frequency table [DT] 
years <- (1965 + 1):(1965 + 5) 
word <- sort(rep(LETTERS[1:n], 5)) 
year <- rep(years, 5) 
weight <- sample(10, 25, replace = TRUE) 
freq_data <- data.frame(word = word, year = year, weight = weight) 
# Combine to data.table for speed 
DT <- data.table(freq_data, key = c("word", "year"))

My task is to normalize the frequencies in the coc_data table by the frequencies in the DT table, using the following function:

my_fun <- function(x, freq_data, years) { 
    word1 <- x[1] 
    word2 <- x[2] 
    freq12 <- as.numeric(x[3]) 
    freq1 <- sum(DT[word == word1 & year %in% years]$weight) 
    freq2 <- sum(DT[word == word2 & year %in% years]$weight) 
    ei <- (freq12^2)/(freq1 * freq2) 
    return(ei) 
}

I then use apply() to apply my_fun to each row of the coc_data table:

apply(X = coc_data, MARGIN = 1, FUN = my_fun, freq_data = DT, years = years)

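Because the DT lookup table is very large, the whole mapping process takes a long time. I would like to know how to improve my code to speed up the computation.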

Asked 2017-02-27 by Andrej

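I don't think it will be a big improvement, but you can change 'freq1 <- sum(DT[word == word1 & year %in% years]$weight)' to 'freq1 <- DT[word == word1 & year %in% years, sum(weight)]' if you want to use the data.table functionality (I haven't tested it and I haven't used data.table much, so check that it works in an equivalent way). – Llopis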
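To make the comment's suggestion concrete, here is a minimal sketch that compares the two forms on a small table. The table DT_demo, the vector years_demo and all values are invented for illustration; only the two expressions come from the discussion above.

```r
library(data.table)

# Hypothetical toy table (values invented for illustration)
DT_demo <- data.table(
  word   = c("A", "A", "A", "B", "B"),
  year   = c(1966, 1967, 1968, 1966, 1967),
  weight = c(9, 3, 1, 9, 2),
  key    = c("word", "year")
)
years_demo <- 1966:1967
word1 <- "A"

# Form used in the question: subset the rows, then sum the extracted column
freq1_a <- sum(DT_demo[word == word1 & year %in% years_demo]$weight)

# Form suggested in the comment: compute the sum directly in j
freq1_b <- DT_demo[word == word1 & year %in% years_demo, sum(weight)]
```

Both forms should return the same total. The second one avoids building the intermediate subset, but it still runs once per row of coc_data, so it is unlikely to remove the main cost.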

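Since the years argument is constant in your actual use of my_fun with apply(), you can precompute the frequencies of all words first (this assumes that years covers every year in DT, as it does in the simulated data):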

f<-aggregate(weight~word,data=DT,FUN=sum)

Now turn it into a hash (a named vector), for example:

hs<-f$weight 
names(hs)<-f$word

Now, inside my_fun, use the precomputed frequencies by looking up hs[word]. This should be faster.
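As a rough sketch of that lookup step, the function below replaces the two DT subsets with lookups in a named vector. The name my_fun_fast, the *_demo objects and all values are made up for illustration. The sketch also filters by year before aggregating, which is an assumption added here so that the totals still respect years when DT contains other years.

```r
# Toy frequency table (values invented for illustration)
freq_demo <- data.frame(
  word   = c("A", "A", "B", "B", "C"),
  year   = c(1966, 1967, 1966, 1967, 1966),
  weight = c(2, 3, 4, 1, 10)
)
years_demo <- 1966:1967

# Precompute one total per word, restricted to the years of interest
f_demo  <- aggregate(weight ~ word,
                     data = freq_demo[freq_demo$year %in% years_demo, ],
                     FUN = sum)
hs_demo <- setNames(f_demo$weight, as.character(f_demo$word))

# Hypothetical replacement for my_fun: two vector lookups instead of two subsets
my_fun_fast <- function(x, hs) {
  freq12 <- as.numeric(x[3])
  freq1  <- hs[[x[1]]]   # apply() passes each row as a character vector
  freq2  <- hs[[x[2]]]
  (freq12^2) / (freq1 * freq2)
}

coc_demo <- data.frame(word1 = c("A", "A"), word2 = c("B", "C"), freq = c(1, 2))
res <- apply(X = coc_demo, MARGIN = 1, FUN = my_fun_fast, hs = hs_demo)
```

The per-row work drops from two filtered scans of the frequency table to two named-vector lookups, while the apply() call itself keeps the same shape as in the question.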

Even better, the answer you are looking for is simply

(coc_data$freq)^2/(hs[coc_data$word1] * hs[coc_data$word2])

The data.table implementation of this would be:

f <- DT[, sum(weight), word] 
vec <- setNames(f$V1, f$word) 

setDT(coc_data)[, freq_new := freq^2/(vec[word1] * vec[word2])]

This gives the following result:

> coc_data 
    word1 word2 freq  freq_new 
1:  A  B 1 0.0014792899 
2:  A  C 1 0.0016025641 
3:  A  D 1 0.0010683761 
4:  A  E 1 0.0013262599 
5:  B  C 5 0.0434027778 
6:  B  D 1 0.0011574074 
7:  B  E 1 0.0014367816 
8:  C  D 4 0.
9:  C  E 1 0.0009578544 
10:  D  E 2 0.0047562426
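One caveat worth checking when using the vectorised lookups: indexing a named vector with a factor uses the factor's integer codes, not its labels. In R versions before 4.0, data.frame() converts strings to factors by default, so word1 and word2 may be factors, and the printed values above may be affected by this. The sketch below uses invented names and values to show the difference and a defensive as.character() conversion.

```r
# Toy totals per word (values invented for illustration)
vec_demo <- c(A = 10, B = 20, C = 30)

# A factor whose levels do not start at "A"
word2_fac <- factor(c("B", "C"))   # levels "B", "C" -> integer codes 1, 2

# Indexing by factor is positional: picks elements 1 and 2 (A and B)
by_code <- unname(vec_demo[word2_fac])

# Indexing by character is by name: picks B and C as intended
by_name <- unname(vec_demo[as.character(word2_fac)])
```

Under that assumption, wrapping the index in as.character(), for example vec[as.character(word1)] * vec[as.character(word2)], keeps the lookup by name whether or not the columns are factors.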

Answered 2017-02-27 11:34:18
