Efficiently deriving a term co-occurrence matrix from Google Ngrams

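I need to use the lexical data from Google Books N-grams to construct a (sparse!) term co-occurrence matrix (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The resulting tcm would then be used to measure a number of lexical statistics and serve as input to vector semantics methods (GloVe, LSA, LDA).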

For reference, the Google Books (v2) dataset is formatted as follows (tab-separated):

ngram  year match_count volume_count 
some word 1999 32    12   # example bigram

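The problem, of course, is that these data are huge. However, I only need a subset covering certain decades (about 20 years' worth of ngrams), and I am happy with a context window of up to 2 (i.e. using the trigram corpus). I have a few ideas, but none of them seems particularly good.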

-Idea 1- initially was more or less this:

# preprocessing (pseudo) 
for file in trigram-files: 
    download $file 
    filter $lines where 'year' tag matches one of years of interest 
    find the frequency of each of those ngrams (match_count) 
    cat those $lines * $match_count >> file2 
    # (write the same line x times according to the match_count tag) 
    remove $file 

# tcm construction (using R) 
grams <- # read lines from file2 into list 
library(text2vec) 
# treat lines (ngrams) as documents to avoid unrelated ngram overlap 
it   <- itoken(grams) 
vocab  <- create_vocabulary(it) 
vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2) 
tcm  <- create_tcm(it, vectorizer) # nice and sparse

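However, I have a hunch that this might not be the best solution. The ngram data files already contain co-occurrence data in the form of n-grams, and there is a tag that gives the frequency. I have a feeling there should be a more direct way.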

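-Idea 2- I was also thinking of cat'ing each filtered ngram only once into the new file (instead of replicating it match_count times), then creating an empty tcm, and then looping over the whole (year-filtered) ngram dataset and recording the instances (using the match_count tag) where any two words co-occur, to populate the tcm. But, again, the data is big, and this kind of looping would probably take ages.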
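The loop described in Idea 2 could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name `stream_pair_counts`, the year range, and the three sample lines are made up for the example, and the input is assumed to use the four tab-separated fields shown above. It reads one line at a time, so memory grows with the number of distinct word pairs rather than with the size of the file.

```python
import io
from collections import Counter
from itertools import combinations

def stream_pair_counts(lines, years):
    """Sum match_count over every within-ngram word pair, one line at a time."""
    counts = Counter()
    for line in lines:
        # Assumed layout: ngram <TAB> year <TAB> match_count <TAB> volume_count
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        if int(year) not in years:
            continue  # skip years outside the range of interest
        # combinations() keeps left-to-right order: (w1, w2), (w1, w3), (w2, w3)
        for pair in combinations(ngram.split(" "), 2):
            counts[pair] += int(match_count)
    return counts

# Made-up sample lines standing in for a real (much larger) trigram file.
sample = io.StringIO(
    "as we know\t1999\t32\t12\n"
    "as we know\t1850\t7\t3\n"
    "i know you\t2001\t54\t20\n"
)
counts = stream_pair_counts(sample, years=range(1990, 2010))
print(counts[("as", "know")])
```

The resulting dictionary of (word, word) -> count pairs is already a sparse representation. Note that a pair is counted once for every ngram that contains it, so overlapping trigrams from the same stretch of text contribute to the same cell more than once.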

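-Idea 3- I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given that most entries are 0), and (if I read it correctly) it simply loops through everything (and I assume a Python loop over this much data would be super slow), so it seems to be aimed more at rather smaller subsets of the data.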
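To put the dense-versus-sparse concern in numbers, here is a back-of-the-envelope sketch. The vocabulary size and the number of non-zero cells are hypothetical figures chosen only for illustration, and the per-entry sizes assume 8-byte values and 4-byte integer indices.

```python
def dense_bytes(vocab_size, bytes_per_cell=8):
    """Memory for a full vocab_size x vocab_size matrix of 8-byte values."""
    return vocab_size * vocab_size * bytes_per_cell

def triplet_bytes(nonzero_cells, bytes_per_entry=16):
    """Memory for (row, column, value) triplets: two 4-byte indices plus one 8-byte value."""
    return nonzero_cells * bytes_per_entry

vocab_size = 100_000        # hypothetical vocabulary size
nonzero_cells = 50_000_000  # hypothetical number of non-zero co-occurrence cells

print(dense_bytes(vocab_size) / 1e9)       # dense size in GB
print(triplet_bytes(nonzero_cells) / 1e9)  # sparse triplet size in GB
```

Under these assumptions the dense matrix needs 80 GB while the triplet form needs 0.8 GB, which is why a sparse result matters here.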

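Edit -Idea 4- I came across this old SO question asking about using Hadoop and Hive for a similar task, with a short answer containing a broken link and a comment about MapReduce (I am not familiar with any of these, so I would not know where to start).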

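But I figure I cannot be the first one needing to tackle such a task, given the popularity of the Ngram dataset and the popularity of (non-word2vec) distributional semantics methods that operate on a tcm or dtm input; hence ->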

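...the question: what would be a more reasonable/efficient way of constructing a term-term co-occurrence matrix from Google Books Ngram data? (Be it a variation of the ideas proposed above or something completely different; R preferred but not required.)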

Source

2017-01-25 user3554004

Could you give an example of how you would count co-occurrences for a trigram? What should it look like? –

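Well, using the (possibly naive) ngrams-as-documents approach, it would be something like 'x <- list(c("this", "is", "example"), c("example", "it", "is")); it <- itoken(x); vocab <- create_vocabulary(it); vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2); tcm <- create_tcm(it, vectorizer); print(vocab); print(tcm)', but this feels like the long way round (books/documents -> ngrams -> import ngrams as documents -> create skip-grams from the ngrams -> create_tcm), while the ngrams basically state the co-occurrences already, and the data gives the number of times any ngram occurred – user3554004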

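I will give you an idea of how you can do this, though it can be improved in several places. I deliberately wrote it in "spaghetti style" so that it is easier to follow, but it can be generalized to more than tri-grams.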

library(data.table)
library(magrittr)  # provides the %>% pipe used below

ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54)) 
# here we split tri-grams to obtain words 
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = T) %>% simplify2array() 

# vocab here is the vocabulary of this chunk only; you may instead want to
# build the vocabulary from the whole corpus of ngrams first and filter out
# uninteresting/rare words

vocab = unique(tokens_matrix) 
# convert char matrix to integer matrix for faster downstream calculations 
tokens_matrix_int = match(tokens_matrix, vocab) 
dim(tokens_matrix_int) = dim(tokens_matrix) 

ngram_dt[, token_1 := tokens_matrix_int[1, ]] 
ngram_dt[, token_2 := tokens_matrix_int[2, ]] 
ngram_dt[, token_3 := tokens_matrix_int[3, ]] 

dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)] 
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)] 
# note here 0.5 - discount for more distant word - we follow text2vec discount of 1/distance 
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)] 

# the three tables have different key column names, so bind them by position
dt = rbindlist(list(dt_12, dt_13, dt_23), use.names = FALSE) 
# "reduce" by word indices again - sum pair co-occurrences which were in different tri-grams 
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)] 

tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt, dims = rep(length(vocab), 2), index1 = T, 
        giveCsparse = F, check = F, dimnames = list(vocab, vocab))
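As a language-neutral illustration of how the same idea might generalize beyond tri-grams, here is a minimal Python sketch. It is not the answer's R code: the function name `weighted_pairs` is hypothetical, and it assumes the same 1/distance discount and the same two example ngrams used above. It returns an integer vocabulary plus a dictionary of (row, column) -> weight cells, which is effectively the triplet form a sparse matrix constructor expects.

```python
from collections import defaultdict

def weighted_pairs(ngrams):
    """Accumulate match_count / distance for every ordered word pair in each ngram."""
    vocab = {}                  # word -> integer id, in order of first appearance
    cells = defaultdict(float)  # (row id, column id) -> discounted co-occurrence weight
    for ngram, match_count in ngrams:
        ids = [vocab.setdefault(tok, len(vocab)) for tok in ngram.split(" ")]
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                # adjacent words get full weight, words two apart get 1/2, and so on
                cells[(ids[i], ids[j])] += match_count / (j - i)
    return vocab, cells

vocab, cells = weighted_pairs([("as we know", 32), ("i know you", 54)])
for (i, j), weight in sorted(cells.items()):
    print(i, j, weight)
```

For the two example tri-grams this gives 32 for ("as", "we") and ("we", "know"), 16 for ("as", "know"), 54 for ("i", "know") and ("know", "you"), and 27 for ("i", "you"). The nested loop works for any ngram length, so the three hand-written dt_12 / dt_13 / dt_23 steps collapse into one.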

Source

2017-01-25 18:33:17
