
I have a program that lets me use R to convert a PDF file to a txt file. How can I run this script over a whole directory of PDF files, so that each one is converted to a txt file?

Here is what I have so far; the code only works for a single URL pointing to one PDF document:

# download pdftotxt from 
# ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-win-3.03.zip 
# and extract to your program files folder 

# here is a pdf for mining 
url <- "http://www.noisyroom.net/blog/RomneySpeech072912.pdf" 
dest <- tempfile(fileext = ".pdf") 
download.file(url, dest, mode = "wb") 

# set path to pdftotxt.exe and convert pdf to text 
exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe" 
system(paste(shQuote(exe), shQuote(dest)), wait = TRUE) # wait, so the txt file exists before we read it 

# get txt-file name and open it 
filetxt <- sub("\\.pdf$", ".txt", dest) 
shell.exec(filetxt) # open the txt file in the default viewer 


# do something with it, i.e. a simple word cloud 
library(tm) 
library(wordcloud) 
library(Rstem) 

txt <- readLines(filetxt) # a warning about a missing final newline can be ignored 

txt <- tolower(txt) 
txt <- gsub("\f", "", txt) # strip the form feeds pdftotext inserts at page breaks 
txt <- removeWords(txt, stopwords()) 

corpus <- Corpus(VectorSource(txt)) 
corpus <- tm_map(corpus, removePunctuation) 
tdm <- TermDocumentMatrix(corpus) 
m <- as.matrix(tdm) 
d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE)) 

# Stem words 
d$stem <- wordStem(row.names(d), language = "english") 

# and put words to column, otherwise they would be lost when aggregating 
d$word <- row.names(d) 

# remove web address (very long string): 
d <- d[nchar(row.names(d)) < 20, ] 

# aggregate frequency by word stem and 
# keep the first word for each stem 
agg_freq <- aggregate(freq ~ stem, data = d, sum) 
agg_word <- aggregate(word ~ stem, data = d, function(x) x[1]) 

d <- cbind(freq = agg_freq[, 2], agg_word) 

# sort by frequency 
d <- d[order(d$freq, decreasing = TRUE), ] 

# print wordcloud: 
wordcloud(d$word, d$freq) 

# clean up the temporary pdf and txt files 
file.remove(dest, filetxt) 

lapply and list.files? – Thomas
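
A minimal sketch of the pattern this comment hints at: list.files collects the paths, lapply runs the conversion over them. The folder "my_pdf_dir" and the wrapper function convertPDF are placeholder names, not from the original code.

# list every pdf in the folder, then apply the conversion function to each path; 
# "my_pdf_dir" and convertPDF are placeholders for your own directory and function 
pdf_files <- list.files("my_pdf_dir", pattern = "\\.pdf$", full.names = TRUE) 
lapply(pdf_files, convertPDF) 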


There are a few threads on here already. This one is close to what you're after: http://stackoverflow.com/questions/20083454/run-every-file-in-a-folder/20083517#20083517 You should turn your script into a function and pass it to 'sapply'. –


@RomanLuštrik Thanks for the tip! How would you apply this approach to a directory of files rather than a vector of URLs? – stochastiq

Answer


If you have a list (really a vector) of the files you want to process, you can turn your program into a function and apply that function to every url. Try something along the lines of:

crawlPDFs <- function(x) { 
    # x is a character string to the url on the web 
    url <- x 
    dest <- tempfile(fileext = ".pdf") 
    download.file(url, dest, mode = "wb") 

    # set path to pdftotxt.exe and convert pdf to text 
    exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe" 
    system(paste(shQuote(exe), shQuote(dest)), wait = TRUE) # wait, so the txt file exists before we read it 

    # get txt-file name and open it 
    filetxt <- sub("\\.pdf$", ".txt", dest) 
    shell.exec(filetxt) # open the txt file in the default viewer 


    # do something with it, i.e. a simple word cloud 
    library(tm) 
    library(wordcloud) 
    library(Rstem) 

    txt <- readLines(filetxt) # a warning about a missing final newline can be ignored 

    txt <- tolower(txt) 
    txt <- gsub("\f", "", txt) # strip the form feeds pdftotext inserts at page breaks 
    txt <- removeWords(txt, stopwords()) 

    corpus <- Corpus(VectorSource(txt)) 
    corpus <- tm_map(corpus, removePunctuation) 
    tdm <- TermDocumentMatrix(corpus) 
    m <- as.matrix(tdm) 
    d <- data.frame(freq = sort(rowSums(m), decreasing = TRUE)) 

    # Stem words 
    d$stem <- wordStem(row.names(d), language = "english") 

    # and put words to column, otherwise they would be lost when aggregating 
    d$word <- row.names(d) 

    # remove web address (very long string): 
    d <- d[nchar(row.names(d)) < 20, ] 

    # aggregate frequency by word stem and 
    # keep the first word for each stem 
    agg_freq <- aggregate(freq ~ stem, data = d, sum) 
    agg_word <- aggregate(word ~ stem, data = d, function(x) x[1]) 

    d <- cbind(freq = agg_freq[, 2], agg_word) 

    # sort by frequency 
    d <- d[order(d$freq, decreasing = TRUE), ] 

    # print wordcloud: 
    wordcloud(d$word, d$freq) 

    # clean up the temporary pdf and txt files 
    file.remove(dest, filetxt) 
} 

sapply(list.of.urls, FUN = crawlPDFs) 

list.of.urls can be a character vector, or a list in which every element is a character string giving the url of a pdf.
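
To address the follow-up question in the comments: for a directory of local PDFs rather than a vector of URLs, the only change needed is to skip the download step and hand each file path straight to pdftotext. A minimal sketch, assuming the files sit in a folder such as "C:/pdfs" (the folder path and the name crawlLocalPDF are illustrative, not from the original code):

crawlLocalPDF <- function(x) { 
    # x is the path to a pdf that is already on disk, so no download.file() call 
    exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe" 
    system(paste(shQuote(exe), shQuote(x)), wait = TRUE) 

    # pdftotext writes the txt file next to the pdf 
    filetxt <- sub("\\.pdf$", ".txt", x) 

    # ... continue with the same tm/wordcloud steps as in crawlPDFs ... 
    filetxt 
} 

local.pdfs <- list.files("C:/pdfs", pattern = "\\.pdf$", full.names = TRUE) 
sapply(local.pdfs, FUN = crawlLocalPDF) 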