2016-05-28

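Empty document matrix

I seem to run into a problem whenever I try to inspect my frequent words and associations.

When I build the TDM, I get output like this: TermDocumentMatrix

I can see that I have many terms, used across a large number of documents. But!

When I try to inspect the contents of the TDM, I get this: Inspecting the TDM

How come the TDM is suddenly empty?

Hope someone can help.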

library(twitteR)  # provides userTimeline() and twListToDF(); assumes OAuth is already set up
tweets <- userTimeline("RDataMining", n = 1000) 

(n.tweet <- length(tweets)) 
tweets[1:3] 

#convert tweets to a data frame 
tweets.df <- twListToDF(tweets) 
dim(tweets.df) 


##Text cleaning 
library(tm) 
#build a corpus and specify the source to be a character vector 
myCorpus <- Corpus(VectorSource(tweets.df$text)) 

#convert to lower case 
myCorpus <- tm_map(myCorpus, content_transformer(tolower)) 

#remove URLs 
removeURL <- function(x) gsub ("http[^[:space:]]*","",x) 
myCorpus <- tm_map(myCorpus,content_transformer(removeURL)) 

#remove anything other than English letters or space 
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*","",x) 
myCorpus <- tm_map(myCorpus,content_transformer(removeNumPunct)) 

#remove stopwords + 2 
myStopwords <- c(stopwords('english'),"available","via") 
#remove "r" and "big" from stopwords 
myStopwords <- setdiff(myStopwords, c("r","big")) 
#remove stopwords from corpus 
myCorpus <- tm_map(myCorpus,removeWords,myStopwords) 
#remove extra whitespace 
myCorpus <- tm_map(myCorpus, stripWhitespace) 

#keep a copy of corpus to use later as a dictionary for stem completion 
myCorpusCopy <- myCorpus 

#stem words 
library(SnowballC) 
myCorpus <- tm_map(myCorpus,stemDocument) 
stemCompletion2 <- function(x,dictionary) { 
x <- unlist(strsplit(as.character(x),"")) 

#because stemCompletion completes an empty string to a word in dict. Remove empty string to avoid this 

x <- x[x !=""] 
x <- stemCompletion(x, dictionary = dictionary) 
x <- paste (x,sep = "",collapse = "") 
PlainTextDocument(stripWhitespace(x)) 
} 

myCorpus <- lapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy) 
myCorpus <- Corpus(VectorSource(myCorpus)) 

#count freq of "mining" 
miningCases <- lapply(myCorpusCopy, 
        function(x) {grep(as.character(x),pattern = "\\<mining")}) 
sum(unlist(miningCases)) 

#count freq of "miner" 
miningCases <- lapply(myCorpusCopy, 
        function(x) {grep(as.character(x),pattern = "\\<miner")}) 
sum(unlist(miningCases)) 

#count freq of "r" 
miningCases <- lapply(myCorpusCopy, 
        function(x) {grep(as.character(x),pattern = "\\<r")}) 
sum(unlist(miningCases)) 

#replace "miner" with "mining" 
myCorpus <- tm_map(myCorpus,content_transformer(gsub), 
       pattern = "miner", replacement = "mining") 

tdm <- TermDocumentMatrix(myCorpus, control = list(removePunctuation = TRUE,stopwords = TRUE)) 
tdm 

##Freq words and associations 
idx <- which(dimnames(tdm)$Terms == "r") 
inspect(tdm[idx + (0:5), 101:110]) 

#inspect frequent words 
(freq.terms <- findFreqTerms(tdm, lowfreq = 15)) 
term.freq <- rowSums(as.matrix(tdm)) 
term.freq <- subset(term.freq,term.freq >= 15) 
df <- data.frame(term = names(term.freq), freq = term.freq) 
If I use inspect(tdm), I get one long string.

Answer

I tested the code with the following Twitter query:

tweets = searchTwitter("r data mining", n=10) 

I think the problem is in your stemCompletion2 function, which should look something like this:

stemCompletion2 <- function(x,dictionary) { 
    x <- unlist(strsplit(as.character(x)," ")) 
    print("before:") 
    print(x) 

    #because stemCompletion completes an empty string to a word in dict. Remove empty string to avoid this 
    x <- x[x !=""] 
    x <- stemCompletion(x, dictionary = dictionary) 
    print("after:") 
    print(x) 
    # collapse = " " joins the completed words back into a single string
    x <- paste(x, sep = "", collapse = " ") 
    PlainTextDocument(stripWhitespace(x)) 
} 

The changes are as follows. You had

x <- unlist(strsplit(as.character(x),"")) 

which splits each document into a list of its individual characters. I changed it to

x <- unlist(strsplit(as.character(x)," ")) 

which creates a list of words instead. Likewise, when rebuilding the document, you had

x <- paste (x,sep = "",collapse = "") 

which creates the long string you mentioned in your post. I changed it to:

x <- paste(x, sep = "", collapse = " ") 

so that the words are rebuilt with spaces between them.

An example of the completion on my data:

[1] "before:" 
[1] "rt"    "ebookdealalert" "r"    "datamin"  "project"  "learn"   "data"   "mine"   
[9] "realworld"  "project"  "book"   "solv"   "predict"  "model"   
[1] "after:" 
       rt ebookdealalert     r   datamin   project    learn    data    mine 
      "rt" "ebookdealalerts"    "r"  "datamining"  "projects"   "learn"   "data"    "" 
     realworld   project    book    solv   predict    model 
     "realworld"  "projects"   "book"   "solve"  "predictive"  "modeling" 

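The effect of the two changes can be reproduced with base R alone. This is only a minimal sketch: the string doc is a made-up stand-in for one stemmed tweet, not real data. Note that paste() joins the elements of a vector into one string only when collapse is given.

```r
# Hypothetical stand-in for one stemmed tweet (not taken from the real data)
doc <- "r datamin project learn"

# Splitting on "" yields single characters, spaces included
chars <- unlist(strsplit(doc, ""))

# Splitting on " " yields whole words
words <- unlist(strsplit(doc, " "))

# collapse = "" glues all words into one long string
glued <- paste(words, collapse = "")

# collapse = " " rebuilds a normal space-separated document
rebuilt <- paste(words, collapse = " ")

# Without collapse, paste() returns one element per word
uncollapsed <- paste(words, sep = " ")

print(head(chars))
print(words)
print(glued)
print(rebuilt)
print(length(uncollapsed))
```

With the empty-string split and collapse, every document ends up as one unbroken run of letters, which matches the long string reported by inspect(tdm).
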
After this step, TermDocumentMatrix should work as expected.
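If inspect() still prints an empty matrix afterwards, it may be worth checking idx itself. As far as I know, TermDocumentMatrix drops terms shorter than three characters by default (the wordLengths control option defaults to c(3, Inf)), so "r" would not be among the terms unless you pass wordLengths = c(1, Inf) in control. In that case which() returns an empty index and the subset has no rows. A minimal sketch, with a plain base R matrix standing in for the TDM (the toy matrix m and its values are made up for illustration, and I am assuming a tm object subsets the same way):

```r
# Hypothetical toy term-document matrix (plain base R matrix, not a tm object)
m <- matrix(1:6, nrow = 3,
            dimnames = list(Terms = c("data", "mining", "project"),
                            Docs = c("1", "2")))

# "r" is not among the terms, so which() returns integer(0)
idx <- which(dimnames(m)$Terms == "r")

# integer(0) + (0:5) is still integer(0), so the subset has zero rows
sub <- m[idx + (0:5), , drop = FALSE]

print(length(idx))
print(dim(sub))
```

Printing length(idx) before calling inspect() is a cheap way to tell an empty selection apart from a matrix that is really empty.
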

Hope it helps.