2017-05-16

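Invalid input 'DY「§' in 'utf8towcs' using tm and pdftools

My work was going well until I ran into a problem: some of my PDF files contain odd symbols ('DY「§'), and the text-mining step fails on them.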

I went through an earlier discussion, but none of the solutions there worked for me: R tm package invalid input in 'utf8towcs'

This is my code so far:

setwd("E:/OneDrive/Thesis/Received comments document/Consultation 50") 
getwd() 
library(tm) 
library(NLP) 
library(tidytext) 
library(dplyr) 
library(pdftools) 
files <- list.files(pattern = "pdf$") 
comments <- lapply(files, pdf_text) 
corp <- Corpus(VectorSource(comments)) 
corp <- VCorpus(VectorSource(comments));names(corp) <- files 
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation =  TRUE, 
                 stopwords = TRUE, 
                 tolower = TRUE, 
                 stemming = TRUE, 
                 removeNumbers = TRUE, 
                 bounds = list(global = c(3, Inf)))) 

Result: Error in .tolower(txt) : invalid input 'DY「§' in 'utf8towcs'

inspect(Comments.tdm[1:32,]) 

ap_td <- tidy(Comments.tdm) 
write.csv(ap_td, file = "Terms 50.csv") 

Any help is much appreciated. PS: this code works perfectly on other PDFs.

Answer


I took another look at the earlier discussion. This solution finally worked for me:

myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte")) 

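Remember to follow Fransisco's advice: "Chad's solution wasn't working for me. I had this embedded in a function, and it was giving an error about iconv needing a vector as input. So I decided to do the conversion before creating the corpus."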

My code now looks like this:

files <- list.files(pattern = "pdf$") 
comments <- lapply(files, pdf_text) 
comments <- sapply(comments, function(x) iconv(enc2utf8(x), sub = "byte")) 
corp <- Corpus(VectorSource(comments)) 

corp <- VCorpus(VectorSource(comments));names(corp) <- files 
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE, 
                 stopwords = TRUE, 
                 tolower = TRUE, 
                 stemming = TRUE, 
                 removeNumbers = TRUE, 
                 bounds = list(global = c(3, Inf)))) 

inspect(Comments.tdm[1:28,]) 

ap_td <- tidy(Comments.tdm) 
write.csv(ap_td, file = "Terms 44.csv")
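
For anyone who wants to try the conversion step on its own, here is a minimal sketch that runs without any PDFs. It is a variation on the fix above, not the exact call: it assumes the extracted text is meant to be UTF-8 and names the encodings explicitly in iconv() so the result does not depend on the locale. The list `comments` and the helper `clean_pages` are made up for illustration, and validUTF8() needs R 3.3.0 or later.

```r
# Hypothetical stand-in for the output of lapply(files, pdf_text):
# a list with one character vector of pages per PDF.
bad_page <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))  # "caf" plus a stray 0xE9 byte
comments <- list(
  c("First page of document one.", bad_page),
  c("Only page of document two.")
)

# Replace every byte that is not valid UTF-8 with its hex code in angle brackets.
# lapply() keeps one list element per document; sapply() could simplify the
# result to a matrix if every PDF happened to have the same number of pages.
clean_pages <- function(pages) {
  iconv(pages, from = "UTF-8", to = "UTF-8", sub = "byte")
}
cleaned <- lapply(comments, clean_pages)

# Check that every page is valid UTF-8 before building the corpus.
all_valid <- all(vapply(cleaned, function(p) all(validUTF8(p)), logical(1)))
stopifnot(all_valid)
```

If the check passes, `cleaned` can be passed to VectorSource() in place of `comments`. Running validUTF8() on the uncleaned pages is also a quick way to find which files contain the offending bytes.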