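R tm removeWords function not removing words

I am trying to remove some words from a corpus I have built, but it doesn't seem to be working. I first run through everything and create a data frame that lists my words in order of frequency. I use this list to identify words I am not interested in, and then try to create a new list with those words removed. However, the words remain in my dataset. What am I doing wrong, and why are the words not being removed? I have included the full code below: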
install.packages("rvest")
install.packages("tm")
install.packages("SnowballC")
install.packages("stringr")
library(stringr)
library(tm)
library(SnowballC)
library(rvest)
# Pull in the data I have been using.
paperList <- read_html("http://journals.plos.org/plosone/search?q=nutrigenomics&sortOrder=RELEVANCE&filterJournals=PLoSONE&resultsPerPage=192")
paperURLs <- paperList %>%
  html_nodes(xpath="//*[@class='search-results-title']/a") %>%
  html_attr("href")
paperURLs <- paste("http://journals.plos.org", paperURLs, sep = "")
# Fetch each article page (read_html() replaces the deprecated html()).
paper_html <- lapply(paperURLs, read_html)
# Extract the text of each article; index with x, not 1, so every paper is read.
paperText <- sapply(seq_along(paper_html), function(x) paper_html[[x]] %>%
  html_nodes(xpath="//*[@class='article-content']") %>%
  html_text() %>%
  str_trim(.))
# Create corpus
paperCorp <- Corpus(VectorSource(paperText))
for (j in seq(paperCorp)) {
  paperCorp[[j]] <- gsub(":", " ", paperCorp[[j]])
  paperCorp[[j]] <- gsub("\n", " ", paperCorp[[j]])
  paperCorp[[j]] <- gsub("-", " ", paperCorp[[j]])
}
paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)
paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))
paperCorp <- tm_map(paperCorp, stemDocument)
paperCorp <- tm_map(paperCorp, stripWhitespace)
paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)
dtm <- DocumentTermMatrix(paperCorpPTD)
termFreq <- colSums(as.matrix(dtm))
head(termFreq)
tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
head(tf)
# After having identified words I am not interested in
# create new corpus with these words removed.
paperCorp1 <- tm_map(paperCorp, removeWords,
                     c("also", "article", "Article",
                       "download", "google", "figure",
                       "fig", "groups", "Google", "however",
                       "high", "human", "levels",
                       "larger", "may", "number",
                       "shown", "study", "studies", "this",
                       "using", "two", "the", "Scholar",
                       "pubmedncbi", "PubMedNCBI",
                       "view", "View", "the", "biol",
                       "via", "image", "doi", "one",
                       "analysis"))
paperCorp1 <- tm_map(paperCorp1, stripWhitespace)
paperCorpPTD1 <- tm_map(paperCorp1, PlainTextDocument)
dtm1 <- DocumentTermMatrix(paperCorpPTD1)
termFreq1 <- colSums(as.matrix(dtm1))
tf1 <- data.frame(term = names(termFreq1), freq = termFreq1)
tf1 <- tf1[order(-tf1[,2]),]
head(tf1, 100)
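If you look through tf1, you will see that many of the words I specified for removal have not actually been removed.
I would like to know what I am doing wrong, and how I can remove these words from my data.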
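Note: removeWords is doing something, because the output of head(tf, 100) and head(tf1, 100) is not exactly the same. So removeWords seems to remove some instances of the words I want to get rid of, but not all of them.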
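One way to narrow this down is to call removeWords() on a plain string instead of the whole corpus. The sketch below uses a made-up sentence (not the scraped text) and assumes tm and SnowballC are installed. It illustrates two likely causes: removeWords() matches case-sensitively, and words that have already been stemmed no longer look like the entries in the removal list. As far as I can tell from the tm documentation, DocumentTermMatrix() also lower-cases terms by default, which would explain why a surviving "The" shows up in tf1 as "the".

```r
library(tm)
library(SnowballC)

# Toy sentence, purely illustrative (not taken from the PLOS ONE pages).
x <- "The study shows the Study design"
words_to_drop <- c("the", "study")

# removeWords() also works on plain character vectors and is case-sensitive.
removed <- removeWords(x, words_to_drop)

# The capitalised forms are still present after removal.
survivors <- c(grepl("The", removed, fixed = TRUE),
               grepl("Study", removed, fixed = TRUE))

# Lower-casing first lets the same word list match every instance.
removed_lower <- removeWords(tolower(x), words_to_drop)

# Stemming changes the surface form, so "studies", "analysis" and "using"
# in a removal list will not match text that has already been stemmed.
stems <- wordStem(c("studies", "analysis", "using"))

print(removed)
print(removed_lower)
print(stems)
```

If the capitalised words survive in the first result and the stems differ from the original words, the removal list is probably being applied too late in the pipeline and to text that has not been lower-cased.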
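There is a typo in your code. 'paperCorp1 <- tm_map(paperCorp, removeWords, c("the"))' should be 'paperCorp1 <- tm_map(paperCorp1, removeWords, c("the"))' – phiver
Hi @phiver, thanks for picking that up. I accidentally left it in while I was trying to work out the problem. After removing that line of code I still have the same problem. Many of the words I am trying to remove, including "the", are still in 'tf1'. – Adam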
This is probably because of capital letters. Try: 'paperCorp <- tm_map(paperCorp, tolower)' – scoa
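A minimal sketch of that suggestion on a toy corpus follows, under a few assumptions: the two documents and the unwanted word list are hypothetical stand-ins for the scraped text, VCorpus is used instead of Corpus, and tolower is wrapped in content_transformer(), which recent tm versions (0.6 and later) expect for base functions so that the documents stay tm documents. The removal list is applied after lower-casing and before stemming.

```r
library(tm)

# Two toy documents standing in for the scraped article text (illustrative only).
docs <- c("The Study shows high levels of gene expression. View Article",
          "This study, however, is using Google Scholar to find nutrition papers")
corp <- VCorpus(VectorSource(docs))

# Hypothetical removal list; lower-case only, because the text is lower-cased first.
unwanted <- c("study", "article", "view", "google", "scholar",
              "however", "using", "high", "levels")

corp <- tm_map(corp, content_transformer(tolower))  # lower-case everything first
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, c(stopwords("english"), unwanted))  # before stemming
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, stripWhitespace)

dtm <- DocumentTermMatrix(corp)
remaining <- Terms(dtm)
print(remaining)
```

With this ordering a single lower-case entry per word should be enough, so pairs such as "article"/"Article" or "view"/"View" in the removal list become unnecessary.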