2017-05-12 30 views
0

Frequency of every phrase in a data frame with a single column, "text"

"text" 
"User Interfaces" 
"Twitter" 
"Text Normalization" 
"Term weighting" 
"Teenagers" 
"Team member replacement" 

I want to get a data frame with the frequency of each phrase, like this:

"User Interfaces",1 
"Twitter",1 
"Text Normalization",1 
"Term weighting",1 
"Teenagers",1 
"Team member replacement",1 

To do this I use:

library(tm) 
df <- read.csv("C:/Users/acel/Desktop/myphr.csv", header=TRUE, sep=",") 
corpusD <- Corpus(VectorSource(df$text)) 
corpusD <- tm_map(corpusD, tolower) 
corpusD <- tm_map(corpusD, removeWords, stopwords('english')) 
corpusD <- tm_map(corpusD, removeNumbers) 
corpusD <- tm_map(corpusD, stripWhitespace) 
corpusD <- tm_map(corpusD, PlainTextDocument) 
corpusD <- tm_map(corpusD, stemDocument, language = "english") 
corpusC <- Corpus(VectorSource(corpusD)) 
matrixD <- TermDocumentMatrix(corpusC) 
matrixD <- removeSparseTerms(matrixD, 0.75) 
MatrixDfreq <- rowSums(as.matrix(matrixD)) 
MatrixDfreq<-sort(MatrixDfreq, decreasing = TRUE) 
MatrixDtop30<- MatrixDfreq [1:30] 

But when I check the results, I see single words, such as user,1 and interface,1, instead of "user interface",1.

Any idea why this happens?

+1

Do you mean ***"generate a DTM of phrases instead of words"***? – smci

Answers

1

I think this would be much easier with data.table.

library(data.table) 
df = data.frame(text = c("test", "test" ,"test" , "test2", "test3", "test2")) 

> df 
    text 
1 test 
2 test 
3 test 
4 test2 
5 test3 
6 test2 

setDT(df) 
df = df[ , .(Number = .N), by = .(text)] 

> df 
    text Number 
1: test  3 
2: test2  2 
3: test3  1 

Edit

We can include stemming like this:

library(data.table) 
library(SnowballC) 
df = data.frame(text = c("test", "testing" ,"test" , "test2", "test3", "test2")) 

> df 
    text 
1 test 
2 testing 
3 test 
4 test2 
5 test3 
6 test2 

df$text = wordStem(df$text, language = "porter") 

> df 
    text 
1 test 
2 test 
3 test 
4 test2 
5 test3 
6 test2 

setDT(df) 
df = df[ , .(Number = .N), by = .(text)] 

> df 
    text Number 
1: test  3 
2: test2  2 
3: test3  1 
+0

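Thank you. Very nice workaround, but I want to have stemming. When I try to convert corpusD to a data frame with the statement `data.frame(text = sapply(corpusD, as.character), stringsAsFactors = FALSE)` I get this error: `Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 6, 7`. That is why I was trying to do it from tm – Keri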

+0

@Keri I don't have any experience with that package. Another approach is to run `library(SnowballC); wordStem(df$text, language = "porter")` – Kristofersen

+0

Obviously save the result back to df, but this will stem every word in the data.frame – Kristofersen
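The stemming idea from these comments can be sketched for multi-word phrases as follows. This is only an illustration, not the approach either commenter confirmed: the helper `stem_phrase` and the variables `phrases`, `stemmed` and `freq` are hypothetical names, and the phrases are typed in directly rather than read from the CSV. It assumes that lowercasing and splitting on whitespace is an acceptable way to break a phrase into words before calling `wordStem`.

```r
library(SnowballC)

# Sample phrases taken from the question (typed in here instead of read from the CSV)
phrases <- c("User Interfaces", "Twitter", "Text Normalization",
             "Term weighting", "Teenagers", "Team member replacement")

# Hypothetical helper: lowercase a phrase, stem each word, glue the words back together
stem_phrase <- function(phrase) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  paste(wordStem(words, language = "porter"), collapse = " ")
}

stemmed <- vapply(phrases, stem_phrase, character(1), USE.NAMES = FALSE)

# Count each stemmed phrase as a whole, most frequent first
freq <- as.data.frame(table(text = stemmed), stringsAsFactors = FALSE)
freq <- freq[order(-freq$Freq), ]
print(freq)
```

Because each phrase is reassembled before counting, the frequencies refer to whole phrases rather than to the individual words.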

1

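From the example output you gave, it doesn't look like you are doing any transformation on the text, such as stemming, lowercasing, or removing stop words, and you are keeping the phrases as they are? If so, you can easily count the number of unique phrases with the tidyverse.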

library(dplyr) 
library(readr) 

# tibble() replaces the deprecated data_frame(); the missing closing parenthesis is added
df <- tibble(text = c("User Interfaces", "Twitter", "Text Normalization", "Term weighting", "Teenagers", "Team member replacement"))
count(df, text)
                     text     n
                    <chr> <int>
1 Team member replacement     1
2               Teenagers     1
3          Term weighting     1
4      Text Normalization     1
5                 Twitter     1
6         User Interfaces     1

text_df <- read_csv("C:/Users/acel/Desktop/myphr.csv") 
count(text_df, text, sort = TRUE) 

If you need to perform transformations on the text, look at the stringr and tidytext packages.
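As a rough sketch of what such a transformation might look like with stringr, the snippet below lowercases each phrase and collapses repeated whitespace before counting, so that variants of the same phrase are grouped together. The sample data and the name `phrase_counts` are made up for illustration, and whether this normalization is appropriate depends on the real data.

```r
library(dplyr)
library(stringr)

# Hypothetical sample: two spellings of the same phrase plus one other phrase
df <- tibble(text = c("User Interfaces", "user  interfaces", "Twitter"))

phrase_counts <- df %>%
  # Lowercase, then trim and collapse runs of whitespace to a single space
  mutate(text = str_squish(str_to_lower(text))) %>%
  # Count whole phrases, most frequent first
  count(text, sort = TRUE)

print(phrase_counts)
```

Each cell is still treated as one phrase here. Only the spelling is normalized before `count()` runs.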