Counting how many elements of a character vector appear in each document of a corpus

I have two character vectors, one with positive words and one with negative words. For example:

pos <- c("good", "accomplished", "won", "happy") 
neg <- c("bad", "loss", "damaged", "sued", "disaster")

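I now have a corpus of thousands of news articles, and for each article I want to know how many elements of my vectors pos and neg appear in it.

For example (I'm not sure how the Corpus function works here, but you get the idea: my corpus contains two articles):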

mycorpus <- Corpus("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")

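I would like to get something like this:

article 1: 2 elements of pos and 0 elements of neg
article 2: 0 elements of pos and 2 elements of neg

It would also be nice if I could get the following for each article:

(number of pos words - number of neg words) / (total number of words in the article)

Thank you very much!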

Edit:

@Victorp: this doesn't seem to work.

The matrix I get looks fine:

mytdm[1:6,1:10]
           Docs
Terms       1 2 3 4 5 6 7 8 9 10
  aaron     0 0 0 0 0 1 0 0 0  0
  abandon   1 1 0 0 0 0 0 0 0  0
  abandoned 0 0 0 3 0 0 0 0 0  0
  abbey     0 0 0 0 0 0 0 0 0  0
  abbott    0 0 0 0 0 0 0 0 0  0
  abbotts   0 0 1 0 0 0 0 0 0  0

But when I run your command, I get zero for every document!

colSums(mytdm[rownames(mytdm) %in% pos, ]) 
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Why is that?


2014-02-25 cptn

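Run 'sum(rownames(mytdm) %in% pos)' to check whether your matrix contains any of the positive words. This is an exact match, so the words must be written in exactly the same way. – Victorp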
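The check suggested in the comment above can be sketched on a small stand-in for the matrix. The matrix below is a hand-built copy of the first six terms and first three documents printed in the question, not the real data, and `hits`, `matched` and `pos_counts` are names made up for this illustration.

```r
pos <- c("good", "accomplished", "won", "happy")

# Hand-built stand-in: first six terms and first three documents
# from the matrix printed in the question (assumption: not the full data)
terms <- c("aaron", "abandon", "abandoned", "abbey", "abbott", "abbotts")
mytdm <- matrix(
  c(0, 0, 0,
    1, 1, 0,
    0, 0, 0,
    0, 0, 0,
    0, 0, 0,
    0, 0, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(Terms = terms, Docs = as.character(1:3))
)

# How many row names are exactly equal to one of the positive words?
hits <- sum(rownames(mytdm) %in% pos)

# Which positive words are present as terms?
matched <- intersect(rownames(mytdm), pos)

# With no matching rows, the column sums are zero for every document
pos_counts <- colSums(mytdm[rownames(mytdm) %in% pos, , drop = FALSE])

print(hits)
print(matched)
print(pos_counts)
```

If `matched` comes back empty on the full matrix, the terms are probably spelled differently from the words in `pos`. The thread does not show how the corpus was preprocessed, so this is only a guess, but stemming (which would store "happy" as "happi") or a difference in letter case would both produce this result.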

Hi, you can do this with a TermDocumentMatrix:

library(tm)

mycorpus <- Corpus(VectorSource(c("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")))
mytdm <- TermDocumentMatrix(mycorpus, control = list(removePunctuation = TRUE))
mytdm <- as.matrix(mytdm)

# Positive words per document
# (drop = FALSE keeps a matrix even if only one term matches)
colSums(mytdm[rownames(mytdm) %in% pos, , drop = FALSE])
1 2
2 0

# Negative words per document
colSums(mytdm[rownames(mytdm) %in% neg, , drop = FALSE])
1 2
0 2

# Total number of words per document
# (by default tm drops words shorter than three characters, hence 9 and 5)
colSums(mytdm)
1 2
9 5
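The score asked for in the question, (pos - neg) / total, follows from the same three counts. Below is a minimal sketch in base R rather than tm, so that it runs without extra packages; the tokenization (lowercase, strip punctuation, split on whitespace) is an assumption of this sketch and not what tm does internally, and `articles`, `tokens`, `pos_n`, `neg_n`, `total` and `score` are illustrative names.

```r
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

articles <- c("The CEO is happy that they finally won the case.",
              "The disaster caused a huge loss.")

# Assumed tokenization: lowercase, remove punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(articles)), "\\s+")

# Count exact matches against each word list, per article
pos_n <- vapply(tokens, function(w) sum(w %in% pos), integer(1))
neg_n <- vapply(tokens, function(w) sum(w %in% neg), integer(1))

# Total number of tokens per article
total <- lengths(tokens)

# (number of pos words - number of neg words) / (total number of words)
score <- (pos_n - neg_n) / total

print(data.frame(article = seq_along(articles), pos = pos_n, neg = neg_n,
                 total = total, score = score))
```

Here the totals are 10 and 6 words, so the scores are 0.2 and about -0.33. With the tm matrix above the totals are 9 and 5, because tm leaves out words shorter than three characters by default; the equivalent there would be the positive column sums minus the negative column sums, divided by `colSums(mytdm)`.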


2014-02-25 15:00:17 Victorp

Here is another approach:

## pos <- c("good", "accomplished", "won", "happy") 
## neg <- c("bad", "loss", "damaged", "sued", "disaster") 
## 
## mycorpus <- Corpus(VectorSource(
##  list("The CEO is happy that they finally won the case.", 
##  "The disaster caused a huge loss."))) 

library(qdap) 
with(tm_corpus2df(mycorpus), termco(text, docs, list(pos=pos, neg=neg))) 

## docs word.count  pos  neg 
## 1 1   10 2(20.00%)   0 
## 2 2   6   0 2(33.33%)


2014-02-25 15:40:51

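Hey Tyler, thanks for your answer! What does tm_corpus2df(mycorpus) do? Do you use this function to convert the corpus into a data frame? And "text" and "docs" are the column names, I guess? – cptn

Yes, 'tm_corpus2df(mycorpus)' converts the corpus to a data.frame, which is the data format used in qdap. And yes, "text" and "docs" are the names given by 'tm_corpus2df'. You can change them if you like. –

I tried it with my own data frame and its column names in that command, but I get an error: Error in gsub(paste0(".*?($|'|", paste(paste0("\\", char.keep), collapse = "|"), : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 627. I don't know what this means! – cptn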
