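Creating a dfm of words by letter

I am trying to create a dfm of letters from strings. The problem I am facing is that the dfm fails to create features for punctuation such as "/", "-", "." or "'".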

require(quanteda) 
dict = c('a','b','c','d','e','f','/',".",'-',"'") 
dict <- quanteda::dictionary(sapply(dict, list)) 

x<-c("cab","baa", "a/de-d/f","ad") 
x<-sapply(x, function(x) strsplit(x,"")[[1]]) 
x<-sapply(x, function(x) paste(x, collapse = " ")) 

mat <- dfm(x, dictionary = dict, valuetype = "regex") 
mat <- as.matrix(mat) 
mat

(1) For "a/de-d/f", I want to capture the characters "/" and "-" too.
(2) Why is the "." feature acting as a rowsum? How can I keep it as an individual feature?

2016-11-20 SuperSatya

Like 'tokens <- tokenize(x, what = "character"); mat <- dfm(tokens, dictionary = dict, valuetype = "fixed")'? In regular expressions ("regex"), "." stands for any character. – lukeA

Thanks. This is exactly what I was looking for. – SuperSatya

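The problem (as @lukeA points out in a comment) is that your valuetype is using the wrong kind of pattern matching. You are using a regular expression, in which . stands for any character, so the "." feature ends up counting every token in the document (what you call a rowsum).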
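The difference between the two kinds of matching can be seen in base R, without quanteda. This is only a minimal sketch: `tokens`, `regex_hits` and `fixed_hits` are names made up for illustration, and `grepl()` stands in for whatever matching dfm() does internally.

```r
# Single-character tokens obtained by splitting "a/de-d/f"
tokens <- c("a", "/", "d", "e", "-", "d", "/", "f")

# As a regular expression, "." matches any single character,
# so every token is a hit for the "." key.
regex_hits <- sum(grepl(".", tokens))

# As a fixed (literal) pattern, "." matches only an actual period.
fixed_hits <- sum(grepl(".", tokens, fixed = TRUE))

regex_hits  # 8
fixed_hits  # 0
```

In the dfm output the "." count for this document is 5 rather than 8, because the punctuation tokens are dropped by default before the dictionary is applied.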

Let's first look at x, which dfm() will tokenise on whitespace, so that each character becomes a token.

x 
#               cab               baa          a/de-d/f                ad 
#           "c a b"           "b a a" "a / d e - d / f"             "a d" 

To answer (2) first, this is the result you get with a "regex" match:

dfm(x, dictionary = dict, valuetype = "regex", verbose = FALSE) 
## Document-feature matrix of: 4 documents, 10 features. 
## 4 x 10 sparse Matrix of class "dfmSparse" 
##           features
## docs       a b c d e f / . - '
##   cab      1 1 1 0 0 0 0 3 0 0
##   baa      2 1 0 0 0 0 0 3 0 0
##   a/de-d/f 1 0 0 2 1 1 0 5 0 0
##   ad       1 0 0 1 0 0 0 2 0 0

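That is close, but it does not answer (1). To solve that, you need to change the default tokenisation behaviour of dfm() so that it does not remove punctuation, and switch to "fixed" matching.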

dfm(x, dictionary = dict, valuetype = "fixed", removePunct = FALSE, verbose = FALSE) 
## Document-feature matrix of: 4 documents, 10 features. 
## 4 x 10 sparse Matrix of class "dfmSparse" 
##           features
## docs       a b c d e f / . - '
##   cab      1 1 1 0 0 0 0 0 0 0
##   baa      2 1 0 0 0 0 0 0 0 0
##   a/de-d/f 1 0 0 2 1 1 2 0 1 0
##   ad       1 0 0 1 0 0 0 0 0 0

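Now the / and - are being counted. The . and ' remain as features because they are dictionary keys, but they have a zero count in every document.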
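As a cross-check that does not depend on quanteda, the same per-character counts can be reproduced in base R. This is a rough sketch rather than a replacement for dfm(); `dict_chars`, `words` and `counts` are names introduced here for illustration.

```r
# Characters to count (the dictionary keys) and the original words
dict_chars <- c("a", "b", "c", "d", "e", "f", "/", ".", "-", "'")
words <- c("cab", "baa", "a/de-d/f", "ad")

# For each word, split into single characters and count how often
# each dictionary character occurs (literal matching via match()).
counts <- t(vapply(words, function(w) {
  chars <- strsplit(w, "", fixed = TRUE)[[1]]
  tabulate(match(chars, dict_chars), nbins = length(dict_chars))
}, integer(length(dict_chars))))
colnames(counts) <- dict_chars

counts
```

The row for "a/de-d/f" should show 2 for / and 1 for -, with . and ' at zero in every row, in line with the "fixed" output above.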

2016-11-20 14:51:35

Thanks. I had already fixed it with the 'valuetype = "fixed"' argument, without removePunct. I guess it does not matter, since it captures all the punctuation anyway. – SuperSatya
