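I have been trying to follow this example by Norbert Ryciak, whom I have not been able to contact: R - Automatic categorization of Wikipedia articles.
The article was written in 2014 and some things in R have changed since then. I have managed to update parts of the code, but I am stuck on the last part.
Here is the code I have working so far: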
library(tm)
library(stringi)
library(proxy)
wiki <- "https://en.wikipedia.org/wiki/"
titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative",
            "Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko",
            "Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien")
articles <- character(length(titles))
for (i in seq_along(titles)) {
  articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col = " ")
}
docs <- Corpus(VectorSource(articles))
docs[[1]]
docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))
docs4 <- tm_map(docs3, PlainTextDocument)
docs5 <- tm_map(docs4, stripWhitespace)
docs6 <- tm_map(docs5, removeWords, stopwords("english"))
docs7 <- tm_map(docs6, removePunctuation)
docs8 <- tm_map(docs7, content_transformer(tolower))
docs8[[1]]
docsTDM <- TermDocumentMatrix(docs8)
docsTDM2 <- as.matrix(docsTDM)
docsdissim <- dist(docsTDM2, method = "cosine")
But I have not been able to get past this part:
docsdissim2 <- as.matrix(docsdissim)
rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles
docsdissim2
h <- hclust(docsdissim, method = "ward.D")
plot(h, labels = titles, sub = "")
I tried running `hclust` directly, and I was able to plot the result, but the plot was unreadable.
This is the error I get:
rownames(docsdissim2) <- titles
Error in `rownames<-`(`*tmp*`, value = c("Integral", "Riemann_integral", :
length of 'dimnames' [1] not equal to array extent
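My tentative reading of this error, which I have not confirmed on the full data, is that `dist()` compares the rows of its input, and the rows of a term-document matrix are terms rather than documents. The distance matrix would then have one row per term, not one per title. Below is a minimal sketch of that idea on made-up data; the toy matrix and its names are hypothetical, and it uses base R's Euclidean `dist()` instead of the cosine method from `proxy`.

```r
# Toy term-document matrix: 5 terms (rows) x 3 documents (columns).
# The counts and names are invented purely for illustration.
tdm <- matrix(c(2, 0, 1,
                1, 1, 0,
                0, 3, 1,
                4, 0, 0,
                0, 1, 2),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("term", 1:5),
                              c("docA", "docB", "docC")))

# dist() measures distances between ROWS, so this compares terms.
d_terms <- as.matrix(dist(tdm))
dim(d_terms)   # one row/column per term

# Transposing puts documents in the rows, so documents are compared.
d_docs <- as.matrix(dist(t(tdm)))
dim(d_docs)    # one row/column per document

# Assigning one label per document now fits the matrix extent.
doc_labels <- c("Integral", "Derivative", "Edvard_Munch")
rownames(d_docs) <- doc_labels
colnames(d_docs) <- doc_labels
d_docs
```

If this reading is right, the equivalent change in my code would be something like `dist(t(docsTDM2), method = "cosine")`, or building a `DocumentTermMatrix` instead of a `TermDocumentMatrix`. I have not verified either on the Wikipedia data.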
And also:
plot(h, labels = titles, sub = "")
Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
invalid dendrogram input
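This second error looks consistent with the same cause: if the clustering was done on terms, the 11 titles passed as `labels` would not match the number of clustered objects. That is a guess on my part. Here is a small sketch of what I expect the final step to look like when the distances are between documents. The data are hypothetical, and the cosine dissimilarity is computed by hand in base R rather than with `proxy`, so this is not the article's exact method.

```r
# Hypothetical toy data: 4 documents (rows) x 3 terms (columns).
dtm <- matrix(c(3, 1, 0,
                2, 1, 0,
                0, 1, 4,
                0, 2, 3),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("doc1", "doc2", "doc3", "doc4"),
                              c("t1", "t2", "t3")))

# Cosine dissimilarity between rows (documents), computed by hand.
norms    <- sqrt(rowSums(dtm^2))
cos_sim  <- (dtm %*% t(dtm)) / (norms %o% norms)
cos_dist <- as.dist(1 - cos_sim)

# Hierarchical clustering of the documents.
h <- hclust(cos_dist, method = "ward.D")

# One label per clustered object, so the label count matches the tree.
plot(h, labels = rownames(dtm), sub = "")

# Cut the tree into two groups to inspect the result.
groups <- cutree(h, k = 2)
groups
```

In this toy case the first two documents share terms, as do the last two, so I would expect them to pair up in the dendrogram.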
Could anyone give me a hand to finish this example?
Best regards,