R tm package: readPDF error in strptime(d, fmt): input string is too long

I want to use the tm package to text-mine the files on this website. I used the code below to download one of the files (abell.pdf) to my working directory and tried to store its contents:

library("tm") 
url <- "https://baltimore2006to2010acsprofiles.files.wordpress.com/2014/07/abell.pdf" 
filename <- "abell.pdf" 
download.file(url = url, destfile = filename, method = "curl") 

doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename), 
               language = "en", id = "id1")

But I get the following error and warnings:

Error in strptime(d, fmt) : input string is too long 
In addition: Warning messages: 
1: In grepl(re, lines) : input string 1 is invalid in this locale 
2: In grepl(re, lines) : input string 2 is invalid in this locale

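The PDFs aren't particularly long (5 pages, 978 KB), and I have been able to read other PDF files on my Mac OS X machine with the readPDF function. The information I need most (the 2010 Census total population) is on the first page of each PDF, so I tried trimming the PDF down to just the first page, but I get the same message.

I'm new to the tm package, so I apologize if I'm missing something obvious. Any help is greatly appreciated!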

2016-04-22 Maxwell

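From what I have read, this error is related to the way the readPDF function tries to create metadata for the file you are importing. In any case, you can change how the metadata is gathered with the "info" option. For example, I usually work around this error by modifying the command (using your code) as follows: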

doc <- readPDF(control = list(info = "-f", text = "-layout"))(elem = list(uri = filename), 
               language = "en", id = "id1")

Adding info = "-f" is the only change. This does not really "fix" the problem, but it bypasses the error. Cheers :)
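
Since only the first page is needed, another option is to read the text with the pdftools package, which does not go through readPDF's metadata handling at all. Note that this is a different technique from the workaround above, and only a minimal sketch under assumptions: pdftools is installed, abell.pdf is already in the working directory, and both the helper matching_lines and the search term "population" are hypothetical, chosen for illustration rather than taken from the PDF.

```r
filename <- "abell.pdf"  # the file downloaded in the question

# Hypothetical helper: return the lines of one page that match a pattern.
matching_lines <- function(page_text, pattern) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  trimws(lines[grepl(pattern, lines, ignore.case = TRUE)])
}

# Only run when the file and the pdftools package are both available.
if (file.exists(filename) && requireNamespace("pdftools", quietly = TRUE)) {
  # pdf_text() returns a character vector with one element per page.
  pages <- pdftools::pdf_text(filename)
  first_page <- pages[1]

  # "population" is an assumed search term; adjust it to the PDF's wording.
  print(matching_lines(first_page, "population"))

  # Optionally wrap the page in a tm corpus for further text mining.
  if (requireNamespace("tm", quietly = TRUE)) {
    corpus <- tm::VCorpus(tm::VectorSource(first_page))
  }
}
```

If the goal is only to pull one number off the first page, working with the plain character vector may be simpler than building a corpus; the corpus step is there in case the rest of the workflow relies on tm.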

2016-11-15 21:25:20 Danny
