2013-08-22
1

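How to use regular expressions in a TermDocumentMatrix for text mining?

I know I can count the occurrences of specific words in a corpus with the tm package by using the dictionary function: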

require(tm) 
data(crude) 

dic <- Dictionary("crude") 
tdm <- TermDocumentMatrix(crude, control = list(dictionary = dic, removePunctuation = TRUE)) 
inspect(tdm) 

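Is there a facility for supplying a regular expression to Dictionary instead of a fixed word?

Sometimes stemming is not what I want (for example, I might want to pick up misspellings), so I would like to do something like this: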

dic <- Dictionary(c("crude", 
                    "\\bcrud[[:alnum:]]+", 
                    "\\bcrud[de]")) 

and thereby keep using the facilities of the tm package?

Answers

3

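I'm not sure you can put a regular expression in the dictionary function, since it only accepts a character vector or a term-document matrix. The workaround I'd suggest is to build the full term-document matrix, use a regex to subset its terms, and then do the word count: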

# What I would do instead 
tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE)) 
# subset the tdm according to the criteria 
# this is where you can use regex 
crit <- grep("cru", tdm$dimnames$Terms) 
# have a look to see what you got 
inspect(tdm[crit, ]) 
     A term-document matrix (2 terms, 20 documents) 

    Non-/sparse entries: 10/30 
    Sparsity   : 75% 
    Maximal term length: 7 
    Weighting   : term frequency (tf) 

      Docs 
    Terms  127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 
     crucial 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 
     crude  2 0 2 3 0 2 0 0 0 0 5 2 0 2 0 0 0 2 
      Docs 
    Terms  704 708 
     crucial 0 0 
     crude  0 1 
# and count the number of times that criteria is met in each doc 
colSums(as.matrix(tdm[crit, ])) 
127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708 
    2 0 2 3 0 2 2 0 0 0 5 2 0 2 0 0 0 2 0 1 
# count the total number of times in all docs 
sum(colSums(as.matrix(tdm[crit, ]))) 
[1] 23 
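
The same subsetting idea should carry over to the patterns from the question. Below is a minimal base-R sketch of that, assuming a small hand-made count matrix (the terms and numbers are invented for illustration, not taken from the crude corpus). Since each term is matched as a whole string, the sketch swaps the question's `\\b` for `^` and `$` anchors; that substitution is my assumption about what is wanted.

```r
# Toy term-by-document count matrix (made-up data, for illustration only)
counts <- matrix(
  c(2, 0, 1,
    0, 1, 0,
    1, 0, 0,
    0, 2, 0,
    3, 1, 4),
  nrow = 5, byrow = TRUE,
  dimnames = list(Terms = c("crude", "crudde", "crud", "crucial", "oil"),
                  Docs  = c("d1", "d2", "d3"))
)

# Patterns adapted from the question, anchored to match whole terms
patterns <- c("^crude$", "^crud[[:alnum:]]+$", "^crud[de]$")

# Join the patterns into a single alternation and flag matching terms
regex <- paste(patterns, collapse = "|")
hits  <- grepl(regex, rownames(counts))

matched_terms <- rownames(counts)[hits]               # terms that matched
per_doc <- colSums(counts[hits, , drop = FALSE])      # matches per document
total   <- sum(per_doc)                               # matches overall
```

With a real TermDocumentMatrix, `rownames(counts)` would correspond to `tdm$dimnames$Terms` and `counts` to `as.matrix(tdm)`.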

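If that's not what you're after, go ahead and edit your question to include some example data that properly represents your actual use case, along with an example of your desired output.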

2

The text analysis package quanteda allows regular expressions for feature selection if you specify valuetype = "regex".

require(tm) 
require(quanteda) 
data(crude) 

dfm(corpus(crude), keptFeatures = "^cru", valuetype = "regex", verbose = FALSE) 
# Document-feature matrix of: 20 documents, 2 features. 
# 20 x 2 sparse Matrix of class "dfmSparse" 
#  features 
# docs crude crucial 
# 127  2  0 
# 144  0  0 
# 191  2  0 
# 194  3  0 
# 211  0  0 
# 236  2  0 
# 237  0  2 
# 242  0  0 
# 246  0  0 
# 248  0  0 
# 273  5  0 
# 349  2  0 
# 352  0  0 
# 353  2  0 
# 368  0  0 
# 489  0  0 
# 502  0  0 
# 543  2  0 
# 704  0  0 
# 708  1  0 

See also ?selectFeatures
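
One thing worth noting when comparing the two answers: the first uses the unanchored pattern "cru" and the second uses "^cru". On the crude corpus both happen to pick out the same two terms, but in general they can differ. A small base-R sketch, using a hypothetical vector of terms chosen for illustration:

```r
# Hypothetical terms, chosen to show the effect of the ^ anchor
terms <- c("crude", "crucial", "recruit", "scrub")

# Unanchored: matches "cru" anywhere inside the term
anywhere <- grep("cru", terms, value = TRUE)

# Anchored: matches only terms that start with "cru"
at_start <- grep("^cru", terms, value = TRUE)
```

Here `anywhere` also picks up "recruit" and "scrub", while `at_start` keeps only "crude" and "crucial", so the anchored form is probably the safer default when counting word families.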