2015-12-22

R - Automatic categorization of Wikipedia articles

I have been trying to follow this example by Norbert Ryciak, whom I have not been able to get in touch with.

Since the article was written in 2014, some things in R have changed. I have been able to update parts of the code accordingly, but I am stuck on the last part.

This is the code I have working so far:

library(tm) 
library(stringi) 
library(proxy) 

wiki <- "https://en.wikipedia.org/wiki/" 

titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative", 
    "Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko", 
    "Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien") 

articles <- character(length(titles)) 

for (i in 1:length(titles)) { 
    articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col = " ") 
    } 

docs <- Corpus(VectorSource(articles)) 

docs[[1]] 
docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " ")) 
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " ")) 
docs4 <- tm_map(docs3, PlainTextDocument) 
docs5 <- tm_map(docs4, stripWhitespace) 
docs6 <- tm_map(docs5, removeWords, stopwords("english")) 
docs7 <- tm_map(docs6, removePunctuation) 
docs8 <- tm_map(docs7, content_transformer(tolower)) 
docs8[[1]] 

docsTDM <- TermDocumentMatrix(docs8) 
docsTDM2 <- as.matrix(docsTDM) 
docsdissim <- dist(docsTDM2, method = "cosine") 

But I have not been able to get past this part:

docsdissim2 <- as.matrix(docsdissim) 
rownames(docsdissim2) <- titles 
colnames(docsdissim2) <- titles 
docsdissim2 
h <- hclust(docsdissim, method = "ward.D") 
plot(h, labels = titles, sub = "") 

I tried running "hclust" directly, and I was then able to plot, but nothing readable came out.

This is the error I am getting:

rownames(docsdissim2) <- titles 
Error in `rownames<-`(`*tmp*`, value = c("Integral", "Riemann_integral", : 
    length of 'dimnames' [1] not equal to array extent 

Also:

plot(h, labels = titles, sub = "") 
Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, : 
    invalid dendrogram input 
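
A plausible reading of this second error, offered as an assumption rather than a confirmed diagnosis, is that the tree was built over a different number of items than there are entries in `titles`, so the labels cannot be matched to the leaves. The sketch below shows a small pre-plot check in base R on made-up data; `toy`, `toy_labels` and `labels_match` are hypothetical names that do not appear in the original code.

```r
# Hypothetical data: 4 items described by 3 numeric features.
toy <- matrix(c(1, 2, 3,
                1, 2, 4,
                9, 8, 7,
                9, 9, 7), nrow = 4, byrow = TRUE)
toy_labels <- c("a", "b", "c", "d")

# Cluster the 4 rows (base R Euclidean distance, Ward linkage).
h_toy <- hclust(stats::dist(toy), method = "ward.D")

# Hypothetical helper: TRUE when the number of labels equals the number
# of clustered items (hclust stores one entry per item in $order).
labels_match <- function(h, labels) {
  length(h$order) == length(labels)
}

labels_match(h_toy, toy_labels)        # label count agrees with the tree
labels_match(h_toy, toy_labels[1:3])   # one label short
```

Running such a check before `plot(h, labels = titles, sub = "")` would show whether the number of clustered items equals `length(titles)`.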

Could anyone give me a hand completing this example?

Best regards,

Answer

I was able to solve this problem, thanks to Norbert Ryciak (the author of the tutorial).

He used an older version of "tm" (probably the latest one at the time), which is not compatible with the version I am using.

The solution is to replace "docsTDM <- TermDocumentMatrix(docs8)" with "docsTDM <- DocumentTermMatrix(docs8)".
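
One way to see why the orientation of the matrix matters: `dist` computes distances between the rows of its input, so a term-document matrix gives one row per term instead of one per article. The sketch below illustrates only the resulting shapes. It uses a made-up count matrix (`toy_dtm` and `toy_titles` are hypothetical) and base R's `stats::dist` with its default Euclidean metric, not the `proxy` cosine measure from the code in this thread.

```r
# Hypothetical toy counts: 3 documents (rows) x 5 terms (columns).
toy_titles <- c("doc_a", "doc_b", "doc_c")
toy_dtm <- matrix(
  c(2, 0, 1, 0, 3,
    1, 1, 0, 0, 2,
    0, 4, 0, 5, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(toy_titles, paste0("term", 1:5))
)

# Document-term orientation: one row per document, so 3 x 3 distances.
d_docs <- as.matrix(stats::dist(toy_dtm))
dim(d_docs)

# Term-document orientation (the transpose): one row per term, so 5 x 5.
d_terms <- as.matrix(stats::dist(t(toy_dtm)))
dim(d_terms)

# Assigning 3 titles to the 5 x 5 matrix fails with a 'dimnames' error;
# tryCatch captures the message instead of stopping the script.
res <- tryCatch(
  {
    rownames(d_terms) <- toy_titles
    "ok"
  },
  error = function(e) conditionMessage(e)
)
res
```

With the documents in rows, the dissimilarity matrix has one row and column per title, which is what the `rownames<-` and `plot` calls expect.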

So the final code is:

library(tm) 
library(stringi) 
library(proxy) 

wiki <- "https://en.wikipedia.org/wiki/" 

titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative", 
    "Limit_of_a_sequence", "Edvard_Munch", "Vincent_van_Gogh", "Jan_Matejko", 
    "Lev_Tolstoj", "Franz_Kafka", "J._R._R._Tolkien") 

articles <- character(length(titles)) 

for (i in 1:length(titles)) { 
    articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i])), col =  " ") 
    } 

docs <- Corpus(VectorSource(articles)) 

docs[[1]] 
docs2 <- tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " ")) 
docs3 <- tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " ")) 
docs4 <- tm_map(docs3, PlainTextDocument) 
docs5 <- tm_map(docs4, stripWhitespace) 
docs6 <- tm_map(docs5, removeWords, stopwords("english")) 
docs7 <- tm_map(docs6, removePunctuation) 
docs8 <- tm_map(docs7, content_transformer(tolower)) 
docs8[[1]] 

docsTDM <- DocumentTermMatrix(docs8) 
docsTDM2 <- as.matrix(docsTDM) 
docsdissim <- dist(docsTDM2, method = "cosine") 

docsdissim2 <- as.matrix(docsdissim) 
rownames(docsdissim2) <- titles 
colnames(docsdissim2) <- titles 
docsdissim2 
h <- hclust(docsdissim, method = "ward.D") 
plot(h, labels = titles, sub = "")
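
For readers without the `proxy` package, the cosine dissimilarity between document rows can be sketched in base R as one minus the cosine similarity. This is an illustration under that assumed definition, run on made-up counts, and not a reproduction of what `proxy` does internally; `cosine_dissim` and `toy_dtm` are hypothetical names.

```r
# Hypothetical helper: pairwise cosine dissimilarity between matrix rows.
cosine_dissim <- function(m) {
  norms <- sqrt(rowSums(m^2))               # Euclidean length of each row
  sim <- (m %*% t(m)) / (norms %o% norms)   # pairwise cosine similarity
  as.dist(1 - sim)                          # return a 'dist' object
}

# Hypothetical toy counts: doc_b is doc_a doubled, doc_c shares no terms.
toy_dtm <- matrix(
  c(2, 0, 1, 0,
    4, 0, 2, 0,
    0, 3, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc_a", "doc_b", "doc_c"), NULL)
)

d <- cosine_dissim(toy_dtm)

# The 'dist' object carries the row names, so hclust picks up the labels.
h_toy <- hclust(d, method = "ward.D")
h_toy$labels
```

Because the row names travel with the `dist` object, a dendrogram built this way is labelled without passing `labels` to `plot` separately.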