快速匹配,一個由R核心人員組成的小包,對查找進行哈希處理,以便初始化和特別是隨後的搜索更快。
你真正在做什麼是制定一個與每個句子共同的預定義級別的因素。他的C代碼中的緩慢步驟是對因子水平進行排序,通過爲因子函數的快速版本提供(唯一)因子水平列表,您可以避免這些因子水平。
如果你只是想要整數位置,你可以很容易地從因子轉換爲整數:許多人無意中這樣做。
你根本不需要一個因素,因爲你想要的只是match
。您的代碼也會生成一個邏輯向量,然後重新計算它的位置:match
只是直接進入位置。
library(fastmatch)
library(microbenchmark)
WORDS <- read.table("https://dotnetperls-controls.googlecode.com/files/enable1.txt", stringsAsFactors = FALSE)[[1]]
words_factor <- as.factor(WORDS)
# generate 100 sentences of between 5 and 15 words:
SENTENCES <- lapply(c(1:100), sample, x = WORDS, size = sample(c(5:15), size = 1))
bench_fun <- function(fun)
lapply(SENTENCES, fun)
# poster's slow solution:
hg_convert <- function(sentence)
return(which(WORDS %in% sentence))
jw_convert_match <- function(sentence)
match(sentence, WORDS)
jw_convert_match_factor <- function(sentence)
match(sentence, words_factor)
jw_convert_fastmatch <- function(sentence)
fmatch(sentence, WORDS)
jw_convert_fastmatch_factor <- function(sentence)
fmatch(sentence, words_factor)
message("starting benchmark one")
print(microbenchmark(bench_fun(hg_convert),
bench_fun(jw_convert_match),
bench_fun(jw_convert_match_factor),
bench_fun(jw_convert_fastmatch),
bench_fun(jw_convert_fastmatch_factor),
times = 10))
# now again with big samples
# generating the SENTENCES is quite slow...
SENTENCES <- lapply(c(1:1e6), sample, x = WORDS, size = sample(c(5:15), size = 1))
message("starting benchmark two, compare with factor vs vector of words")
print(microbenchmark(bench_fun(jw_convert_fastmatch),
bench_fun(jw_convert_fastmatch_factor),
times = 10))
我把這個https://gist.github.com/jackwasey/59848d84728c0f55ef11
結果不格式化得非常好,我只想說,fastmatch帶或不帶要素投入是大大加快。
# starting benchmark one
Unit: microseconds
expr min lq mean median uq max neval
bench_fun(hg_convert) 665167.953 678451.008 704030.2427 691859.576 738071.699 777176.143 10
bench_fun(jw_convert_match) 878269.025 950580.480 962171.6683 956413.486 990592.691 1014922.639 10
bench_fun(jw_convert_match_factor) 1082116.859 1104331.677 1182310.1228 1184336.810 1198233.436 1436600.764 10
bench_fun(jw_convert_fastmatch) 203.031 220.134 462.1246 289.647 305.070 2196.906 10
bench_fun(jw_convert_fastmatch_factor) 251.474 300.729 1351.6974 317.439 362.127 10604.506 10
# starting benchmark two, compare with factor vs vector of words
Unit: seconds
expr min lq mean median uq max neval
bench_fun(jw_convert_fastmatch) 3.066001 3.134702 3.186347 3.177419 3.212144 3.351648 10
bench_fun(jw_convert_fastmatch_factor) 3.012734 3.149879 3.281194 3.250365 3.498593 3.563907 10
因此,我不會去並行實現的麻煩。
你用'mclapply'試過了嗎? – hrbrmstr
Thkx,但沒有我在Windows上,我只有一個核心 –
另外你有嘗試過'pmatch'而不是'which(%%..)'? –