2017-08-13

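Joining the top ten 1-grams of a quanteda dfm with all 2- to 5-grams of another dfm

To save memory when working with a very large corpus sample, I want to keep only the top 10 1-grams and combine them with all of the 2- to 5-grams to form a single quanteda::dfmSparse object that will be used for natural language processing (NLP) predictions. Carrying around all of the 1-grams would be pointless, because only the top ten (or twenty) will ever be used by the simple back-off model I am using.

I could not find a quanteda::dfm(corpusText, ...) parameter that tells it to return only the top ## features. So, based on comments from the package author @KenB in other threads, I am using the dfm_select/dfm_remove functions to extract the top ten 1-grams. Based on the "quanteda dfm join" search hit "concatenate dfm matrices in 'quanteda' package", I am using the rbind.dfmSparse function to join the results.

As far as I can tell, everything looks right so far. I thought I would run this plan past the SO community to see whether I am overlooking a more efficient route to this result, or whether there is a flaw in the solution I have arrived at.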

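library(quanteda)   # dfm(), dfm_sort(), dfm_select(), dfm_remove(), featnames()
library(data.table) # data.table()

corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n", 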
    "and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n", 
    "adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n")) 
corpusObject$documents 
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = T, stem = F, ngrams = 1)) 
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = T, stem = F, ngrams = 2:5) 
dfm1gramsSorted; dfm2to5grams 
#featnames(dfm1gramsSorted); featnames(dfm2to5grams) 
#colSums(dfm1gramsSorted); colSums(dfm2to5grams) 

dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted)) 
# option1 - select top 10 features from dfm1gramsSorted 
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10]) 
dfmTopTen1grams; featnames(dfmTopTen1grams) 
# option2 - drop all but top 10 features from dfm1gramsSorted 
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen]) 
dfmTopTen1grams; featnames(dfmTopTen1grams) 

dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams) 
dfmTopTen1gramsAndAll2to5grams; 
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams) 
data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50], frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50], 
keep.rownames = F, stringsAsFactors = F) 

Answer

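For extracting the top 10 unigrams, this strategy will work fine:

  1. Sort the dfm in (the default) decreasing order of overall feature frequency, which you have already done, and then add a step that slices out the first 10 columns.

  2. Combine this with the 2- to 5-gram dfm using cbind() (not rbind()).

This should do it: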

dfmCombined <- cbind(dfm1gramsSorted[, 1:10], dfm2to5grams) 
head(dfmCombined, nfeat = 15) 
# Document-feature matrix of: 1 document, 195 features (0% sparse).
# (showing first document and first 15 features)
#        features
# docs    some corpus text of to very large top ten no some_corpus corpus_text text_of of_no no_consequence
#   text1    2      2    2  2  2    2     2   2   2  1           2           2       1     1              1
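
As a rough illustration of why cbind() is the right call here: a dfm has one row per document and one column per feature, so adding more features for the same documents means adding columns, whereas rbind() stacks rows, that is, documents. The sketch below uses plain base R matrices as stand-ins for the two dfm objects; the feature names and counts are made up for illustration and are not taken from the corpus above.

```r
# Stand-ins for dfm objects: rows are documents, columns are features.
# All names and counts here are illustrative assumptions.
unigrams <- matrix(c(3, 2, 1), nrow = 1,
                   dimnames = list("text1", c("some", "corpus", "very")))
bigrams  <- matrix(c(2, 1), nrow = 1,
                   dimnames = list("text1", c("some_corpus", "corpus_text")))

# Keep the two most frequent unigrams: order columns by total count, then slice.
ord <- order(colSums(unigrams), decreasing = TRUE)
top_unigrams <- unigrams[, ord[1:2], drop = FALSE]

# cbind() adds columns (features) for the same document.
combined <- cbind(top_unigrams, bigrams)
dim(combined)       # 1 document, 4 features
colnames(combined)
```

The same shape logic applies to the real objects: the result keeps one row per document and gains one column per retained n-gram.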

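Your example code also makes some use of data.table, although that is not part of the question. In v0.99 we added a new function, textstat_frequency(), which produces a data.frame of frequencies in "long"/"tidy" format and may be helpful: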

head(textstat_frequency(dfmCombined), 10) 
#        feature frequency rank docfreq
# 1         some         2    1       1
# 2       corpus         2    2       1
# 3         text         2    3       1
# 4           of         2    4       1
# 5           to         2    5       1
# 6         very         2    6       1
# 7        large         2    7       1
# 8          top         2    8       1
# 9          ten         2    9       1
# 10 some_corpus         2   10       1

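Thanks for the detailed answer. For n-grams with the same feature length (i.e. the same 1/2/3/n-gram size) and the same frequency, does the textstat_frequency()$rank column break ties in any meaningful way? – myusrn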

I think it's random; it uses `data.table::setorder()`.
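
If a reproducible order among tied features matters, one option is to sort the frequency table yourself with an explicit secondary key. The sketch below is an assumption-laden illustration, not a description of what textstat_frequency() does. It uses a hand-built data.frame with the same feature and frequency columns as the output above (the values are illustrative) and breaks ties alphabetically by feature name.

```r
# Illustrative frequency table; the values are made up for this sketch.
freq <- data.frame(
  feature   = c("text", "some_corpus", "some", "no", "corpus"),
  frequency = c(2, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Sort by decreasing frequency, then by feature name to break ties.
# method = "radix" compares strings in the C locale, so the order does not
# depend on the session's locale settings.
freq <- freq[order(-freq$frequency, freq$feature, method = "radix"), ]
freq$rank <- seq_len(nrow(freq))  # rank now follows the deterministic order
rownames(freq) <- NULL
freq
```

Any other secondary key, such as n-gram length, could be swapped in the same way by adding it to the order() call.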