How to apply a custom function to a quanteda corpus

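I'm trying to migrate a script from tm to quanteda. Reading the quanteda documentation, there is a philosophy of applying changes "downstream" so that the original corpus stays unchanged. OK.

I previously wrote a script to find spelling mistakes in our tm corpus, and our team helped build a manual lookup. So I have a csv file with 2 columns: the first column is the misspelt term and the second column is the correct version of that term.

With the tm package I previously did this: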

# Write a custom function to pass to tm_map
# "spellingdoc" is the 2 column csv
library(stringr)
library(stringi)
library(tm)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc) {
    stri_replace_all_regex(str = x,
            pattern = paste0("\\b", lut[,1], "\\b"),
            replacement = lut[,2],
            vectorize_all = FALSE)
})

Then, within my tm corpus transformations, I did this:

mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))

What is the equivalent way to apply this custom function to my quanteda corpus?


2017-08-30 Doug Fir

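It's impossible to know whether this will work from your example, which leaves some parts out, but in general:

If you want to access the texts in a quanteda corpus, you can use texts(), and to replace those texts, texts()<-.

So in your case, assuming mycorpus is a tm corpus, you could do this: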

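library("quanteda")

# Plain function (no content_transformer wrapper) that works on a character vector
stringi_spelling_update2 <- function(x, lut = spellingdoc) {
    stringi::stri_replace_all_regex(str = x,
            pattern = paste0("\\b", lut[,1], "\\b"),
            replacement = lut[,2],
            vectorize_all = FALSE)
}

# Convert the tm corpus, then replace the texts of the quanteda corpus
myquantedacorpus <- corpus(mycorpus)
texts(myquantedacorpus) <- stringi_spelling_update2(texts(myquantedacorpus), spellingdoc)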
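
To see what the replacement function does on its own, here is a minimal sketch that runs it on a plain character vector, which is the kind of object texts() hands back. The lookup table, the example sentences and the name fix_spelling are made up for illustration; only the stri_replace_all_regex() call mirrors the code above.

```r
library(stringi)

# Hypothetical lookup table standing in for the 2-column csv:
# column 1 = misspelt term, column 2 = corrected term
spellingdoc <- data.frame(
  wrong = c("recieve", "teh"),
  right = c("receive", "the"),
  stringsAsFactors = FALSE
)

# Same replacement logic as stringi_spelling_update2, with the
# lookup table passed explicitly (fix_spelling is an illustrative name)
fix_spelling <- function(x, lut) {
  stri_replace_all_regex(
    str = x,
    pattern = paste0("\\b", lut[, 1], "\\b"),  # match whole words only
    replacement = lut[, 2],
    vectorize_all = FALSE                      # apply every pair to every text
  )
}

# Made-up documents
txt <- c("I recieve teh mail", "teh cat sat")
fixed <- fix_spelling(txt, spellingdoc)
print(fixed)
```

With vectorize_all = FALSE, every pattern/replacement pair is applied in turn to each text, rather than pairing the first pattern with the first text, the second with the second, and so on. The terms in the first column are treated as regular expressions, so a misspelling containing characters such as "." or "(" would need escaping.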


2017-08-30 16:05:30

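Hi @Ken, mycorpus is actually a quanteda corpus. I'm only just learning this package. I think your second sentence is what I was looking for? For this particular problem, though, I noticed the dictionary functionality you provide for dfm(), so I used that instead. Still, it's good to know that if I need to apply a custom function to each document I would go `texts(mycorpus) <- myCustomFunction(mycorpus)` (although, if sticking to the quanteda philosophy of not changing the corpus, I should avoid doing this)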

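Cleaning the texts in a corpus is still consistent with **quanteda**'s non-destructive workflow principles if the corpus contains misspellings that you are never interested in (for example, the product of OCR errors). What we want to discourage is people applying stemmers or removing stopwords from the corpus itself.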
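
If you would rather keep the correction downstream, one possible route is to fix the spellings at the tokens stage and leave the corpus alone. The sketch below assumes a quanteda version that provides tokens_replace(); it is not the dfm() dictionary approach mentioned in the comment above, just an alternative. The lookup table and example texts are hypothetical.

```r
library(quanteda)

# Hypothetical lookup table standing in for the 2-column csv
spellingdoc <- data.frame(
  wrong = c("recieve", "teh"),
  right = c("receive", "the"),
  stringsAsFactors = FALSE
)

# Made-up corpus with two short documents
corp <- corpus(c(doc1 = "I recieve teh mail", doc2 = "teh cat sat"))

# Tokenise first, then swap misspelt tokens for their corrections;
# the corpus object itself is never modified
toks <- tokens(corp)
toks_fixed <- tokens_replace(
  toks,
  pattern = spellingdoc$wrong,
  replacement = spellingdoc$right,
  valuetype = "fixed"   # treat the terms literally, not as glob or regex
)

print(toks_fixed)
```

Because the replacement happens on the tokens object, the original texts remain available for any analysis that needs them uncorrected.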

I think I found an indirect answer here.

texts(myCorpus) <- myFunction(myCorpus)


2017-08-30 08:49:26
