R - Automatic categorization of Wikipedia articles

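I have been trying to follow this example by Norbert Ryciak, whom I have not been able to get in touch with.

Since the article was written in 2014, some things in R have changed. I have been able to update parts of the code, but I am stuck on the last part.

Here is the code I have working so far: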

library(tm) 
library(stringi) 
library(proxy) 

wiki <- "https://en.wikipedia.org/wiki/" 

titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative", 
    "Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko", 
    "Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien") 

articles <- character(length(titles)) 

for (i in 1:length(titles)) { 
    articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col = " ") 
} 

docs <- Corpus(VectorSource(articles)) 

docs[[1]] 
docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " ")) 
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " ")) 
docs4 <- tm_map(docs3, PlainTextDocument) 
docs5 <- tm_map(docs4, stripWhitespace) 
docs6 <- tm_map(docs5, removeWords, stopwords("english")) 
docs7 <- tm_map(docs6, removePunctuation) 
docs8 <- tm_map(docs7, content_transformer(tolower)) 
docs8[[1]] 

docsTDM <- TermDocumentMatrix(docs8) 
docsTDM2 <- as.matrix(docsTDM) 
docsdissim <- dist(docsTDM2, method = "cosine")

But I have not been able to get past this part:

docsdissim2 <- as.matrix(docsdissim) 
rownames(docsdissim2) <- titles 
colnames(docsdissim2) <- titles 
docsdissim2 
h <- hclust(docsdissim, method = "ward.D") 
plot(h, labels = titles, sub = "")

I tried running "hclust" directly, and then I was able to plot, but nothing readable came out.

These are the errors I am getting:

rownames(docsdissim2) <- titles 
Error in `rownames<-`(`*tmp*`, value = c("Integral", "Riemann_integral", : 
    length of 'dimnames' [1] not equal to array extent

And:

plot(h, labels = titles, sub = "") 
Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, : 
    invalid dendrogram input

Could anyone give me a hand completing this example?

Best regards,


2015-12-22 tomcontr

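I was able to solve the problem, thanks to Norbert Ryciak (the author of the tutorial).

He used an older version of "tm" (probably the latest one at the time), and it is not compatible with the version I am using.

The solution was to replace "docsTDM <- TermDocumentMatrix(docs8)" with "docsTDM <- DocumentTermMatrix(docs8)".

So the final code is: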
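To illustrate why the orientation of the matrix likely matters here, below is a minimal sketch that uses only base R and a made-up count matrix of 3 documents and 5 terms. The names `dtm`, `tdm`, and `cosine_dissim` are hypothetical, and the hand-written helper only stands in for `dist(..., method = "cosine")` from "proxy". Distances are computed between rows, so a document-term matrix gives one row per document, while a term-document matrix gives one row per term. In the latter case the result no longer has 11 rows, which would explain why assigning the 11 titles as row names fails.

```r
# Toy counts: 3 documents (rows) x 5 terms (columns); all values are made up.
dtm <- matrix(
  c(2, 1, 0, 0, 1,
    1, 2, 0, 1, 0,
    0, 0, 3, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("integral", "limit", "painting", "novel", "sum"))
)
tdm <- t(dtm)  # term-document orientation: terms in rows

# Hypothetical helper: cosine dissimilarity between the ROWS of a matrix.
cosine_dissim <- function(m) {
  norms <- sqrt(rowSums(m^2))
  sim <- (m %*% t(m)) / (norms %o% norms)
  as.dist(1 - sim)
}

d_docs  <- as.matrix(cosine_dissim(dtm))  # one row and column per document
d_terms <- as.matrix(cosine_dissim(tdm))  # one row and column per term

dim(d_docs)
dim(d_terms)
```

Comparing `dim(docsdissim2)` with `length(titles)` in the real script is a quick way to check which orientation you ended up with.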

library(tm) 
library(stringi) 
library(proxy) 

wiki <- "https://en.wikipedia.org/wiki/" 

titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative", 
    "Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko", 
    "Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien") 

articles <- character(length(titles)) 

for (i in 1:length(titles)) { 
    articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col = " ") 
} 

docs <- Corpus(VectorSource(articles)) 

docs[[1]] 
docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " ")) 
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " ")) 
docs4 <- tm_map(docs3, PlainTextDocument) 
docs5 <- tm_map(docs4, stripWhitespace) 
docs6 <- tm_map(docs5, removeWords, stopwords("english")) 
docs7 <- tm_map(docs6, removePunctuation) 
docs8 <- tm_map(docs7, content_transformer(tolower)) 
docs8[[1]] 

docsTDM <- DocumentTermMatrix(docs8) 
docsTDM2 <- as.matrix(docsTDM) 
docsdissim <- dist(docsTDM2, method = "cosine") 

docsdissim2 <- as.matrix(docsdissim) 
rownames(docsdissim2) <- titles 
colnames(docsdissim2) <- titles 
docsdissim2 
h <- hclust(docsdissim, method = "ward.D") 
plot(h, labels = titles, sub = "")
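As a way to try the last steps without downloading any pages, the sketch below runs `hclust` on a small made-up dissimilarity matrix. The labels and values are invented for illustration. `cutree` is not part of the tutorial code and is shown only as one possible way to turn the dendrogram into groups. The method name `"ward.D"` is what R has called the former `"ward"` method since version 3.1.0.

```r
# Made-up symmetric dissimilarities between four hypothetical articles.
labels <- c("Integral", "Derivative", "Franz_Kafka", "Lev_Tolstoj")
m <- matrix(
  c(0.0, 0.2, 0.9, 0.8,
    0.2, 0.0, 0.8, 0.9,
    0.9, 0.8, 0.0, 0.3,
    0.8, 0.9, 0.3, 0.0),
  nrow = 4, byrow = TRUE, dimnames = list(labels, labels)
)
d <- as.dist(m)  # keeps the row names as labels

h <- hclust(d, method = "ward.D")
groups <- cutree(h, k = 2)  # named integer vector: one cluster id per article
groups

# plot(h, sub = "")  # uncomment to draw the dendrogram; labels come from d
```

Because the labels travel with the "dist" object, `plot(h)` does not need a separate `labels` argument in this sketch.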


2016-01-04 15:54:18 tomcontr
