invalid input 'DY「§' in 'utf8towcs' using tm and pdftools

My work was going smoothly until I ran into a problem with some of my PDF files, which contain odd symbols ("DY「§").

I have gone through an earlier discussion, but none of the solutions there worked for me: R tm package invalid input in 'utf8towcs'

This is my code so far:

setwd("E:/OneDrive/Thesis/Received comments document/Consultation 50") 
getwd() 
library(tm) 
library(NLP) 
library(tidytext) 
library(dplyr) 
library(pdftools) 
files <- list.files(pattern = "pdf$") 
comments <- lapply(files, pdf_text) 
corp <- VCorpus(VectorSource(comments))
names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation =  TRUE, 
                 stopwords = TRUE, 
                 tolower = TRUE, 
                 stemming = TRUE, 
                 removeNumbers = TRUE, 
                 bounds = list(global = c(3, Inf))))

Result: Error in .tolower(txt) : invalid input 'DY「§' in 'utf8towcs'

inspect(Comments.tdm[1:32,]) 

ap_td <- tidy(Comments.tdm) 
write.csv(ap_td, file = "Terms 50.csv")

Any help is greatly appreciated. P.S. This code works perfectly on other PDFs.


2017-05-16 David van Oostveen

I had another look at the earlier discussion. This solution finally worked for me:

myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))

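Remember to follow Fransisco's advice: "Chad's solution didn't work for me. I had this embedded in a function and it gave an error about iconv needing a vector as input. So I decided to do the conversion before creating the corpus."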
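To illustrate what that conversion does, here is a minimal base-R sketch. The string `bad` is a made-up example, not text from the PDFs in question, and `from`/`to` are spelled out here as an assumption so the behaviour does not depend on the session locale (the one-liner above leaves them at their defaults).

```r
# Hypothetical input: "caf", then a lone 0xE9 byte (Latin-1 "e acute"),
# which is not valid UTF-8 on its own, then " ok".
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x20, 0x6f, 0x6b)))
Encoding(bad) <- "UTF-8"   # mark as UTF-8 without validating the bytes

# FALSE: the string contains a byte sequence that is not valid UTF-8,
# which is the kind of input tolower() can reject.
validUTF8(bad)

# Convert before building the corpus. With sub = "byte", bytes that
# cannot be converted are replaced by a "<xx>" hex escape.
cleaned <- iconv(enc2utf8(bad), from = "UTF-8", to = "UTF-8", sub = "byte")

validUTF8(cleaned)   # the cleaned string should now be valid UTF-8
tolower(cleaned)     # and case conversion should no longer fail
```

Because the invalid bytes are kept as visible escapes rather than dropped, they can still show up as odd-looking terms in the term-document matrix.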

My code now looks like this:

files <- list.files(pattern = "pdf$") 
comments <- lapply(files, pdf_text) 
# lapply (rather than sapply) keeps one list element per PDF,
# even when every file has the same number of pages
comments <- lapply(comments, function(x) iconv(enc2utf8(x), sub = "byte"))
corp <- VCorpus(VectorSource(comments))
names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE, 
                 stopwords = TRUE, 
                 tolower = TRUE, 
                 stemming = TRUE, 
                 removeNumbers = TRUE, 
                 bounds = list(global = c(3, Inf)))) 

inspect(Comments.tdm[1:28,]) 

ap_td <- tidy(Comments.tdm) 
write.csv(ap_td, file = "Terms 44.csv")
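
If only some PDFs trigger the error, it may help to find them before converting anything. The following is a sketch using `validUTF8()` from base R; the list `pages_by_file` and its file names are invented stand-ins for the output of `lapply(files, pdf_text)`.

```r
# Hypothetical stand-in for lapply(files, pdf_text): one character vector
# of page texts per file. Names and contents are made up.
pages_by_file <- list(
  "a.pdf" = c("plain text", "more text"),
  "b.pdf" = c("fine", rawToChar(as.raw(c(0x41, 0xff, 0x42))))  # 0xFF is never valid in UTF-8
)

# TRUE for every file with at least one page that is not valid UTF-8.
has_bad_bytes <- vapply(pages_by_file,
                        function(pages) any(!validUTF8(pages)),
                        logical(1))

# Names of the files that need cleaning.
names(pages_by_file)[has_bad_bytes]
```

With the real data, `names(comments) <- files` would have to be set first so that the result can be read as file names.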


2017-05-18 08:47:26
