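Join quanteda dfm top ten 1-grams with all dfm 2- through 5-grams
To save memory when working with a very large corpus sample, I want to take just the top ten 1-grams and combine them with all of the 2- through 5-grams to form a single quanteda::dfmSparse object that will be used for natural language processing [nlp] predictions. Carrying all of the 1-grams around would be pointless, because only the top ten [or twenty] will ever be used by the simple back-off model I am using.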
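I could not find a quanteda::dfm(corpusText, ...) parameter that tells it to return only the top ## features. So, based on comments from the package author @KenB in other threads, I am using the dfm_select/remove functions to extract the top ten 1-grams, and based on the "quanteda dfm join" search hit "concatenate dfm matrices in 'quanteda' package", I am using the rbind.dfmSparse() function to join the results.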
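So far everything looks right from what I can tell. I thought I would bounce this game plan off the SO community to see whether I am overlooking a more efficient route to this result, or whether there is a flaw in the solution I have arrived at so far.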
library(quanteda)
# small stand-in corpus; in practice the text is very large
corpusObject <- quanteda::corpus(paste("some corpus text of no consequence that in practice is going to be very large\n",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten\n",
"adding some corpus text word repeats to ensure 1gram top ten selection approaches are working\n"))
corpusObject$documents
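# 1-grams sorted by descending overall frequency (the dfm_sort default), plus all 2- to 5-grams
dfm1gramsSorted <- dfm_sort(dfm(corpusObject, tolower = TRUE, stem = FALSE, ngrams = 1))
dfm2to5grams <- quanteda::dfm(corpusObject, tolower = TRUE, stem = FALSE, ngrams = 2:5)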
dfm1gramsSorted; dfm2to5grams
#featnames(dfm1gramsSorted); featnames(dfm2to5grams)
#colSums(dfm1gramsSorted); colSums(dfm2to5grams)
dfm1gramsSortedLen <- length(featnames(dfm1gramsSorted))
# option1 - select top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_select(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[1:10])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# option2 - drop all but top 10 features from dfm1gramsSorted
dfmTopTen1grams <- dfm_remove(dfm1gramsSorted, pattern = featnames(dfm1gramsSorted)[11:dfm1gramsSortedLen])
dfmTopTen1grams; featnames(dfmTopTen1grams)
# join the top ten 1-grams with all 2- to 5-grams
dfmTopTen1gramsAndAll2to5grams <- rbind(dfmTopTen1grams, dfm2to5grams)
dfmTopTen1gramsAndAll2to5grams
#featnames(dfmTopTen1gramsAndAll2to5grams); colSums(dfmTopTen1gramsAndAll2to5grams)
# first 50 features with their total counts
data.table::data.table(ngram = featnames(dfmTopTen1gramsAndAll2to5grams)[1:50], frequency = colSums(dfmTopTen1gramsAndAll2to5grams)[1:50],
keep.rownames = FALSE, stringsAsFactors = FALSE)
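For comparison, here is a minimal sketch of the same plan written against a more recent quanteda release, where (as far as I understand) n-grams are built with tokens_ngrams() and dfm() is called on a tokens object rather than on the corpus. The sample text and the object names (txt, toks, dfmTop10, combined) are made up for illustration. It slices the first ten columns of the sorted 1-gram dfm and joins with cbind(), on the assumption that both dfms describe the same document and should therefore be combined by features (columns) rather than stacked as rows.

```r
library(quanteda)

# Hypothetical sample text; any corpus with at least ten distinct words works.
txt <- "the cat sat on the mat and the dog sat on the log while a bird sang to the cat and the dog"
toks <- tokens(txt)

# 1-grams, sorted by descending overall feature frequency (dfm_sort default)
dfm1 <- dfm_sort(dfm(toks, tolower = TRUE))

# all 2- to 5-grams
dfm2to5 <- dfm(tokens_ngrams(toks, n = 2:5), tolower = TRUE)

# keep only the ten most frequent 1-grams by slicing the sorted columns
dfmTop10 <- dfm1[, 1:10]

# combine by features: one row per document, columns from both dfms
combined <- cbind(dfmTop10, dfm2to5)

featnames(dfmTop10)
dim(combined)
```

Column slicing avoids passing feature names back through dfm_select() as patterns, which may matter if the pattern matching treats any of the names as globs.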
/EOQ
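Thanks for the detailed answer. For n-grams with the same feature length (i.e. the same 1/2/3/n-gram size) and the same frequency, does the textstat_frequency()$rank column do anything interesting to break ties? – myusrn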
I think it is random; it uses 'data.table::setorder()'. –
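If a reproducible order among tied n-grams matters, one option is to sort explicitly instead of relying on the rank column. A minimal base R sketch, assuming a hypothetical frequency table (freqTable, with made-up features and counts, not actual textstat_frequency() output): order by descending frequency, then alphabetically by feature name as the tie-breaker.

```r
# Hypothetical frequency table; three features are tied at frequency 3.
freqTable <- data.frame(
  feature = c("of_the", "in_the", "the", "a", "to_the"),
  frequency = c(3, 3, 7, 3, 5),
  stringsAsFactors = FALSE
)

# method = "radix" compares strings in the C locale, so the result
# does not depend on the machine's collation settings.
ord <- order(-freqTable$frequency, freqTable$feature, method = "radix")
ranked <- freqTable[ord, ]
ranked$rank <- seq_len(nrow(ranked))
ranked
```

Any other secondary key (n-gram length, for example) could be added to the order() call in the same way.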