I have a data frame with a single column, "text", that holds all the phrases whose frequencies I want to count:
"text"
"User Interfaces"
"Twitter"
"Text Normalization"
"Term weighting"
"Teenagers"
"Team member replacement"
I want to get a data frame with the frequency of each phrase, like this:
"User Interfaces",1
"Twitter",1
"Text Normalization",1
"Term weighting",1
"Teenagers",1
"Team member replacement",1
To do this, I use:
library(tm)
df <- read.csv("C:/Users/acel/Desktop/myphr.csv", header=TRUE, sep=",")
corpusD <- Corpus(VectorSource(df$text))
corpusD <- tm_map(corpusD, tolower)
corpusD <- tm_map(corpusD, removeWords, stopwords('english'))
corpusD <- tm_map(corpusD, removeNumbers)
corpusD <- tm_map(corpusD, stripWhitespace)
corpusD <- tm_map(corpusD, PlainTextDocument)
corpusD <- tm_map(corpusD, stemDocument, language = "english")
corpusC <- Corpus(VectorSource(corpusD))
matrixD <- TermDocumentMatrix(corpusC)
matrixD <- removeSparseTerms(matrixD, 0.75)
MatrixDfreq <- rowSums(as.matrix(matrixD))
MatrixDfreq<-sort(MatrixDfreq, decreasing = TRUE)
MatrixDtop30<- MatrixDfreq [1:30]
But when I check the result, I see single words such as user,1
and interface,1
instead of "user interface",1
Any idea why this happens?
Do you mean ***"generate a DTM of phrases rather than words"***? – smci
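By default, `TermDocumentMatrix` splits each document into single words, which is why "User Interfaces" shows up as separate terms. If each row of the `text` column should count as one unit, one possible approach is to skip tokenization and tabulate the column directly in base R. The following is only a minimal sketch: the sample data frame stands in for `myphr.csv`, and the names `freq`, `phrase` and `count` are made up for illustration.

```r
# Hypothetical sample data standing in for the contents of myphr.csv
df <- data.frame(
  text = c("User Interfaces", "Twitter", "Text Normalization",
           "Term weighting", "Teenagers", "Team member replacement"),
  stringsAsFactors = FALSE
)

# Count whole cell values, so multi-word phrases are never split into words
freq <- as.data.frame(table(df$text), stringsAsFactors = FALSE)
names(freq) <- c("phrase", "count")   # illustrative column names

# Most frequent phrases first, ties broken alphabetically
freq <- freq[order(-freq$count, freq$phrase), ]
rownames(freq) <- NULL

print(freq)
```

This counts exact matches only, so "User Interfaces" and "user interfaces" would be separate entries unless the column is normalized first, for example with `tolower(df$text)`. Stemming and stop-word removal are left out here because they operate on individual words and would change the phrases themselves.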