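Millions of tiny matches in R: need performance

I have a character vector of one million words called WORDS. I have a list of 9 million objects called SENTENCES. Each object in my list is a sentence, represented as a vector of 10-50 words. Here is an example: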

head(WORDS) 
[1] "aba" "accra" "ada" "afrika" "afrikan" "afula" "aggamemon" 

SENTENCES[[1]] 
[1] "how" "to" "interpret" "that" "picture"

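I want to convert each sentence in my list into a numeric vector whose elements are the positions of the sentence's words in the big WORDS vector. In fact, I know how to do this with the following command: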

convert <- function(sentence){ 
    return(which(WORDS %in% sentence)) 
} 

SENTENCES_NUM <- lapply(SENTENCES, convert)

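The problem is that it takes far too long. I mean, my RStudio blows up even though I have a computer with 16 GB of RAM. So the question is: do you have any ideas for speeding up the computation?

Source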

2015-10-04 hans glick

Have you tried 'mclapply'? – hrbrmstr

Thanks, but no: I'm on Windows and I only have one core –

Also, have you tried 'pmatch' instead of 'which(%in% ..)'? –

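fastmatch, a small package written by a member of R core, hashes the lookup table so that the initial search, and especially every subsequent search, is faster.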
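To illustrate the idea of building a hash table once and reusing it for every lookup, here is a minimal base-R sketch that uses an environment as the hash table. This is only an illustration of the principle, not how fastmatch works internally (it does its hashing in C). The names `words`, `index` and `lookup` are made up for the example.

```r
words <- c("aba", "accra", "ada", "afrika")   # toy stand-in for WORDS

# Build the hash table once: each word maps to its position in `words`.
index <- list2env(setNames(as.list(seq_along(words)), words), hash = TRUE)

# Look up a whole sentence; words missing from the table come back as NA.
lookup <- function(sentence) {
    unlist(mget(sentence, envir = index, ifnotfound = NA_integer_),
           use.names = FALSE)
}

res <- lookup(c("ada", "zzz", "aba"))
print(res)
```

In practice fmatch is the more convenient choice, because it is called in the same way as match.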

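What you are really doing is making a factor with predefined levels that are common to every sentence. The slow step in creating a factor is sorting the factor levels, which you can avoid by supplying the (unique) list of factor levels to a fast version of the factor function.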

If you just want the integer positions, you can easily convert from factor to integer: many people do this inadvertently.
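A small sketch of the factor route, assuming a toy `words` vector in place of the real WORDS: because the levels are supplied up front, `factor()` does not have to derive and sort them itself, and `as.integer()` then gives the positions.

```r
words <- c("aba", "accra", "ada", "afrika")   # toy stand-in for WORDS
sentence <- c("ada", "aba", "ada")

# Levels are given explicitly, so the integer codes are positions in `words`.
f <- factor(sentence, levels = words)
codes <- as.integer(f)
print(codes)
```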

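You don't actually need a factor at all, because all you want is match. Your code also builds a logical vector the length of WORDS and then works out the positions from it: match goes straight to the positions.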
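It is worth noting that the two approaches do not return the same thing, as this small sketch with made-up data shows. `which(words %in% sentence)` gives the sorted positions of the distinct words that occur, while `match(sentence, words)` gives one position per word, in sentence order, with NA for words that are not in the table.

```r
words <- c("aba", "accra", "ada", "afrika")   # toy stand-in for WORDS
sentence <- c("ada", "zzz", "aba", "ada")

# Original approach: scans all of `words`, loses order and repetitions.
pos_which <- which(words %in% sentence)

# match(): one result per element of `sentence`, in the same order.
pos_match <- match(sentence, words)

print(pos_which)
print(pos_match)
```

If the word order inside each sentence matters downstream, the match form is the one that preserves it.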

library(fastmatch) 
library(microbenchmark) 

WORDS <- read.table("https://dotnetperls-controls.googlecode.com/files/enable1.txt", stringsAsFactors = FALSE)[[1]] 

words_factor <- as.factor(WORDS) 

# generate 100 sentences of between 5 and 15 words each
# (the length is drawn separately for every sentence):
SENTENCES <- replicate(100, sample(WORDS, size = sample(5:15, size = 1)), simplify = FALSE)

bench_fun <- function(fun) 
    lapply(SENTENCES, fun) 

# poster's slow solution: 
hg_convert <- function(sentence) 
    return(which(WORDS %in% sentence)) 

jw_convert_match <- function(sentence) 
    match(sentence, WORDS) 

jw_convert_match_factor <- function(sentence) 
    match(sentence, words_factor) 

jw_convert_fastmatch <- function(sentence) 
    fmatch(sentence, WORDS) 

jw_convert_fastmatch_factor <- function(sentence) 
    fmatch(sentence, words_factor) 

message("starting benchmark one") 
print(microbenchmark(bench_fun(hg_convert), 
        bench_fun(jw_convert_match), 
        bench_fun(jw_convert_match_factor), 
        bench_fun(jw_convert_fastmatch), 
        bench_fun(jw_convert_fastmatch_factor), 
        times = 10)) 

# now again with a big sample of one million sentences
# generating the SENTENCES is quite slow...
SENTENCES <- replicate(1e6, sample(WORDS, size = sample(5:15, size = 1)), simplify = FALSE)
message("starting benchmark two, compare with factor vs vector of words") 
print(microbenchmark(bench_fun(jw_convert_fastmatch), 
        bench_fun(jw_convert_fastmatch_factor), 
        times = 10))

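I put this in a gist: https://gist.github.com/jackwasey/59848d84728c0f55ef11

The results don't format very well, but suffice it to say that fastmatch, with or without factor input, is dramatically faster.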

# starting benchmark one
Unit: microseconds
                                  expr         min          lq         mean      median          uq         max neval
                 bench_fun(hg_convert)  665167.953  678451.008  704030.2427  691859.576  738071.699  777176.143    10
           bench_fun(jw_convert_match)  878269.025  950580.480  962171.6683  956413.486  990592.691 1014922.639    10
    bench_fun(jw_convert_match_factor) 1082116.859 1104331.677 1182310.1228 1184336.810 1198233.436 1436600.764    10
       bench_fun(jw_convert_fastmatch)     203.031     220.134     462.1246     289.647     305.070    2196.906    10
bench_fun(jw_convert_fastmatch_factor)     251.474     300.729    1351.6974     317.439     362.127   10604.506    10

# starting benchmark two, compare with factor vs vector of words
Unit: seconds
                                  expr      min       lq     mean   median       uq      max neval
       bench_fun(jw_convert_fastmatch) 3.066001 3.134702 3.186347 3.177419 3.212144 3.351648    10
bench_fun(jw_convert_fastmatch_factor) 3.012734 3.149879 3.281194 3.250365 3.498593 3.563907    10

So I wouldn't go to the trouble of a parallel implementation.

Source
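If the overhead of one function call per sentence still matters at 9 million sentences, one further option is to flatten the list, do a single vectorised lookup, and split the result back up. This is a sketch only and was not part of the benchmark above; `words` and `sentences` are toy stand-ins for WORDS and SENTENCES, and `fmatch` could be substituted for `match`.

```r
words <- c("aba", "accra", "ada", "afrika")
sentences <- list(c("ada", "aba"), c("afrika", "zzz", "accra"))

# One lookup over every word of every sentence.
flat <- match(unlist(sentences, use.names = FALSE), words)

# Group index: which sentence each flattened word came from.
# Explicit levels keep empty sentences as empty vectors.
grp <- factor(rep(seq_along(sentences), lengths(sentences)),
              levels = seq_along(sentences))

# Cut the flat result back into one integer vector per sentence.
sentences_num <- unname(split(flat, grp))
print(sentences_num)
```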

2015-10-04 12:05:46

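Oh, thanks. Let's assume I just want to map the words to integers, whatever those integers are, because I don't actually care about using the word positions to convert the sentences into numeric vectors. Do you see something easier? –

I assume you want the same word to be represented by the same number in every sentence. If that is not the case, it does simplify the problem, but I doubt that is what you are after. –

Even if you don't care about the order of the words in each sentence, how would they then be stored by R (since there is no equivalent of C++'s 'std::unordered_set')? –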
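For the case raised in these comments, where any integer will do as long as the same word always gets the same one, a possible sketch is to build the vocabulary from the sentences themselves rather than from WORDS. The data and the names `vocab`, `ids` and `grp` are made up for illustration.

```r
sentences <- list(c("how", "to", "interpret"), c("to", "interpret", "that"))

flat <- unlist(sentences, use.names = FALSE)

# Vocabulary in order of first appearance; a word's ID is its index here.
vocab <- unique(flat)
ids <- match(flat, vocab)

# Split the IDs back into one integer vector per sentence.
grp <- factor(rep(seq_along(sentences), lengths(sentences)),
              levels = seq_along(sentences))
sentences_num <- unname(split(ids, grp))
print(sentences_num)
```

Keeping `vocab` around allows the IDs to be mapped back to words later with `vocab[ids]`.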

It won't be any faster, but it is a tidy way to get the job done.

library(dplyr) 
library(tidyr) 

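# one row per word, tagged with the ID of the sentence it came from
sentence =
    tibble(word.name = SENTENCES,
           sentence.ID = seq_along(SENTENCES)) %>%
    unnest(word.name)

# lookup table: each word and its position in WORDS
word = tibble(
    word.name = WORDS,
    word.ID = seq_along(WORDS))

# attach the word ID to every word of every sentence
sentence__word =
    sentence %>%
    left_join(word, by = "word.name")

Source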

2015-10-04 16:38:03 bramtayl
