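How to preserve the original form of words when text mining
I want to build a list of the words that appear at least twice on a particular web page. I can fetch the data and get a count for each word, but I need words that contain capital letters to stay that way. Right now the code only produces a list of lowercase words. For example, the word "Miami" becomes "miami", whereas I need it to stay "Miami".
How can I get the words in their original form?
Code attached: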
library(XML)
# Download and parse the page, then pull the text out of every <p> node
web_page <- htmlTreeParse("http://www.larryslist.com/artmarket/the-talks/dennis-scholls-multiple-roles-from-collecting-art-to-winning-emmy-awards/",
                          useInternalNodes = TRUE)
doctext <- unlist(xpathApply(web_page, '//p', xmlValue))
doctext <- gsub('\\n', ' ', doctext)
doctext <- paste(doctext, collapse = ' ')
library(tm)
SampCrps <- Corpus(VectorSource(doctext))
corp <- tm_map(SampCrps, PlainTextDocument)
oz <- tm_map(corp, removePunctuation, preserve_intra_word_dashes = FALSE) # remove punctuation
oz <- tm_map(oz, removeWords, stopwords("english")) # remove stopwords (applied to oz, so the punctuation step is kept)
dtm <- DocumentTermMatrix(oz)
findFreqTerms(dtm, 2) # words that appear at least 2 times
dtmMatrix <- as.matrix(dtm)
wordsFreq <- colSums(dtmMatrix)
wordsFreq <- sort(wordsFreq, decreasing = TRUE)
head(wordsFreq)
# One row per word, with its count
wordsFreq <- data.frame(word = names(wordsFreq), count = as.vector(wordsFreq), row.names = NULL)
head(wordsFreq, 50)
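For reference, the lowercasing most likely comes from the term-matrix step: tm's `termFreq` control list has a `tolower` option that is on by default. Below is a minimal sketch on a toy string (the text and the use of `VCorpus` are assumptions for illustration, not part of the original code) showing how `tolower = FALSE` keeps the capitals:

```r
library(tm)

# Toy text standing in for the scraped page (assumption, for illustration only)
txt <- "Miami is warm. Miami hosts art fairs, and art collectors visit Miami."

corp <- VCorpus(VectorSource(txt))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))

# tolower = FALSE tells the term matrix not to lowercase the tokens
dtm <- DocumentTermMatrix(corp, control = list(tolower = FALSE))

# Word counts, keeping only words that appear at least twice
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
freq[freq >= 2]
```

One thing to keep in mind with this approach: `removeWords` matches case-sensitively, so a stopword that is capitalised at the start of a sentence (such as "The") would not be removed once the text is no longer lowercased.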
The same problem shows up when I use these lines to get three-word n-grams:
library(RWeka)
# min = max = 3 produces three-word n-grams, so the tokenizer is named accordingly
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(oz, control = list(tokenize = TrigramTokenizer))
inspect(tdm)
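As a dependency-free way to check what case-preserving three-word n-grams should look like, here is a rough base R sketch. The helper name `ngrams_keep_case` and the sample sentence are hypothetical, and it is not a replacement for the RWeka tokenizer, only a way to see the expected result:

```r
# Build word n-grams without changing case (base R only).
# ngrams_keep_case is a hypothetical helper written for this illustration.
ngrams_keep_case <- function(text, n = 3) {
  # Split on runs of non-alphanumeric characters and drop empty pieces
  words <- unlist(strsplit(text, "[^[:alnum:]]+"))
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  # Slide a window of n words across the text
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Sample sentence (assumption, for illustration only)
txt <- "Art Basel Miami Beach draws crowds; Art Basel Miami Beach returns yearly."

tri <- table(ngrams_keep_case(txt, 3))
tri[tri >= 2]  # trigrams that appear at least twice, capitals intact
```

With tm itself, adding `tolower = FALSE` to the same `control` list that carries `tokenize` should have a similar effect, since both are options of `termFreq`.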
Many thanks, @Ken Benoit. The quanteda package seems great. – mql4beginner