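Converting a two-column text document to a single column for text mining
I used pdftools to extract the text from a PDF and saved the result as a txt file.
Is there an efficient way to convert a two-column txt file into a single-column file?
Here is an example of what I have: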
Alice was beginning to get very      into the book her sister was reading,
tired of sitting by her sister       but it had no pictures or conversations
on the bank, and of having nothing   in it, `and what is the use of a book,'
to do: once or twice she had peeped  thought Alice `without pictures or conversation?`
which I would like to turn into:
Alice was beginning to get very tired of sitting by her sister on the bank, and
of having nothing to do: once or twice she had peeped into the book her sister was
reading, but it had no pictures or conversations in it, `and what is the use of a
book,' thought Alice `without pictures or conversation?'
rather than keeping the two-column layout. Based on "Extract Text from Two-Column PDF with R", I slightly modified the function there and got:
library(readr)

# Collapse runs of whitespace to one character and strip leading/trailing whitespace.
trim = function (x) gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", x, perl=TRUE)

QTD_COLUMNS = 2

read_text = function(text) {
  result = ''
  # Get the indices of every " " on the page.
  lstops = gregexpr(pattern =" ",text)
  # Put the indices of the most frequent " " into a vector.
  stops = as.integer(names(sort(table(unlist(lstops)),decreasing=TRUE)[1:2]))
  # Slice based on the specified number of columns (this can be improved).
  for(i in seq(1, QTD_COLUMNS, by=1))
  {
    temp_result = sapply(text, function(x){
      start = 1
      stop =stops[i]
      if(i > 1)
        start = stops[i-1] + 1
      if(i == QTD_COLUMNS) # last column, read until the end.
        stop = nchar(x)+1
      substr(x, start=start, stop=stop)
    }, USE.NAMES=FALSE)
    temp_result = trim(temp_result)
    result = append(result, temp_result)
  }
  result
}

txt = read_lines("alice_in_wonderland.txt")
result = ''
for (i in 1:length(txt)) {
  page = txt[i]
  t1 = unlist(strsplit(page, "\n"))
  # Pad every line to the width of the longest one.
  maxSize = max(nchar(t1))
  t1 = paste0(t1,strrep(" ", maxSize-nchar(t1)))
  result = append(result,read_text(t1))
}
result
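However, I had no luck with some files. I would like to know whether there is a more general or better regular expression to achieve this result.
Many thanks in advance!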
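One more general possibility than a regular expression is to look for the gutter directly: pad the lines of a page to equal width, find the character positions that are blank on every line, and cut at the widest such run that does not touch a page edge. The following is only a minimal sketch under that assumption. `split_two_columns` and its `min_blank` argument are hypothetical names, not part of pdftools or readr, and pages whose columns are not separated by a clean vertical gap would need a different approach.

```r
# Hypothetical helper: split the lines of ONE page at its widest interior gap.
# min_blank is the fraction of lines that must be blank at a position
# for that position to count as part of the gutter (1 = every line).
split_two_columns <- function(lines, min_blank = 1) {
  width <- max(nchar(lines))
  if (width == 0) return(character(0))
  # Pad all lines to the same width so character positions line up.
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))
  # One row per line, one column per character position.
  chars <- do.call(rbind, strsplit(padded, ""))
  blank <- colMeans(chars == " ") >= min_blank
  # Find runs of blank positions and keep only those not touching an edge.
  runs <- rle(blank)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  interior <- runs$values & starts > 1 & ends < width
  if (!any(interior)) return(trimws(lines))  # no gutter found: leave as is
  best <- which(interior)[which.max(runs$lengths[interior])]
  cut_at <- ends[best]
  # Left column first, then right column; drop lines that end up empty.
  out <- trimws(c(substr(padded, 1, cut_at), substr(padded, cut_at + 1, width)))
  out[nzchar(out)]
}

# The example from the question, rebuilt with the left column padded to 37 characters.
left <- c("Alice was beginning to get very",
          "tired of sitting by her sister",
          "on the bank, and of having nothing",
          "to do: once or twice she had peeped")
right <- c("into the book her sister was reading,",
           "but it had no pictures or conversations",
           "in it, `and what is the use of a book,'",
           "thought Alice `without pictures or conversation?'")
page <- sprintf("%-37s%s", left, right)

single <- split_two_columns(page)
text <- paste(single, collapse = " ")
print(text)
```

Lowering `min_blank` (for example to 0.95) tolerates a few lines that run across the gutter, such as headings, at the cost of cutting those lines in the middle.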
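I would be tempted to find a non-PDF option. If you want to use that particular story, there is a plain-text version here: http://www.gutenberg.org/files/11/11-0.txt. Otherwise, look for another PDF-to-text conversion tool that converts to one-column output. – neilfws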
Looks like a fixed-width file - if the two columns always have a constant width, `dat <- read.fwf(file, widths = c(37,48), stringsAsFactors = FALSE)` would give you a good start. – thelatemail
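To illustrate the fixed-width idea, here is a small sketch that writes the example from the question to a temporary file, reads it with `read.fwf`, and stacks the two columns. It assumes the left column is exactly 37 characters wide on every line. The second width is computed from the longest line rather than hard-coded, which is my own assumption, and `strip.white` and `colClasses` are passed through to `read.table`.

```r
# Rebuild the two-column example with the left column padded to 37 characters.
left <- c("Alice was beginning to get very",
          "tired of sitting by her sister",
          "on the bank, and of having nothing",
          "to do: once or twice she had peeped")
right <- c("into the book her sister was reading,",
           "but it had no pictures or conversations",
           "in it, `and what is the use of a book,'",
           "thought Alice `without pictures or conversation?'")
lines <- sprintf("%-37s%s", left, right)

path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Assumed layout: 37 characters for the left column, the rest for the right one.
widths <- c(37, max(nchar(lines)) - 37)
dat <- read.fwf(path, widths = widths, stringsAsFactors = FALSE,
                colClasses = "character", strip.white = TRUE)

# Read down the left column first, then down the right column.
single <- c(dat$V1, dat$V2)
text <- paste(single, collapse = " ")
print(text)
```

This only works when the column boundary sits at the same character position on every page, so the widths would have to be checked per document.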
[Saved my sanity](https://www.nu42.com/2014/09/scraping-pdf-documents-without-losing.html) by realizing that `pdftohtml` has a very useful XML output mode. –