2017-06-01 49 views
1

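Convert a two-column text document to a single column for text mining

I extracted the text from a PDF using pdftools and saved the result as a txt file.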

Is there an efficient way to convert a txt file with two columns into a file with one column?

Here is an example of what I have (first block) and what I want instead (second block):

Alice was beginning to get very  into the book her sister was reading, 
tired of sitting by her sister  but it had no pictures or conversations 
on the bank, and of having nothing in it, `and what is the use of a book,' 
to do: once or twice she had peeped thought Alice `without pictures or conversation?` 

Alice was beginning to get very tired of sitting by her sister on the bank, and 
of having nothing to do: once or twice she had peeped into the book her sister was 
reading, but it had no pictures or conversations in it, `and what is the use of a 
book,' thought Alice `without pictures or conversation?' 

Based on Extract Text from Two-Column PDF with R, I modified the function slightly to get:

library(readr)

# Collapse runs of whitespace to one character and strip leading/trailing whitespace.
trim = function(x) gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", x, perl = TRUE)

QTD_COLUMNS = 2

read_text = function(text) {
    result = ''
    # Get the positions of every " " on the page.
    lstops = gregexpr(pattern = " ", text)
    # Keep the most frequent space positions in a vector.
    stops = as.integer(names(sort(table(unlist(lstops)), decreasing = TRUE)[1:2]))
    # Slice each line according to the specified number of columns (this can be improved).
    for (i in seq(1, QTD_COLUMNS, by = 1)) {
        temp_result = sapply(text, function(x) {
            start = 1
            stop = stops[i]
            if (i > 1)
                start = stops[i - 1] + 1
            if (i == QTD_COLUMNS) # last column: read until the end
                stop = nchar(x) + 1
            substr(x, start = start, stop = stop)
        }, USE.NAMES = FALSE)
        temp_result = trim(temp_result)
        result = append(result, temp_result)
    }
    result
}

txt = read_lines("alice_in_wonderland.txt")

result = ''

for (i in 1:length(txt)) {
    page = txt[i]
    t1 = unlist(strsplit(page, "\n"))
    maxSize = max(nchar(t1))
    # Pad every line to the same width.
    t1 = paste0(t1, strrep(" ", maxSize - nchar(t1)))
    result = append(result, read_text(t1))
}

result

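However, I have had no luck with some files. I wonder whether there is a more general or better regular expression to achieve the result.

Many thanks in advance!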

+0

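I would be tempted to find a non-PDF option. If you want to use that particular story, here is a plain-text version: http://www.gutenberg.org/files/11/11-0.txt. Otherwise, look for another PDF-to-text conversion tool that produces one-column output. – neilfws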

+1

This looks like a fixed-width file. If the two columns always have constant widths, `dat <- read.fwf(file, widths = c(37,48), stringsAsFactors = FALSE)` will give you a good start. – thelatemail

+1

It [saved my sanity](https://www.nu42.com/2014/09/scraping-pdf-documents-without-losing.html) to realize that `pdftohtml` has a very useful XML output mode. –

Answers

0

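When the left column has a fixed width, we can split each line into its first 37 characters and the remainder, and append these to the strings for the left and right columns. For example, with a regex: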

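use warnings;
use strict;

my $file = 'two_column.txt';
open my $fh, '<', $file or die "Can't open $file: $!";

my ($left_col, $right_col);

while (<$fh>)
{
    # Split the line into its first 37 characters and the remainder.
    my ($left, $right) = /(.{37})(.*)/;

    # Condense the left part's trailing whitespace to a single space.
    $left =~ s/\s*$/ /;

    $left_col .= $left;
    $right_col .= $right;
}
close $fh;

# Left column first, then the right column.
print $left_col, $right_col, "\n";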

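This prints the whole text. Or, to join the columns, use `my $text = $left_col . $right_col;`

The pattern `(.{37})` matches any character (`.`) exactly 37 times (`{37}`) and captures the match with `()`; `(.*)` captures everything that remains. The regex returns both captures, which are assigned to `$left` and `$right`. Trailing whitespace in `$left` is condensed to a single space. Both are then appended (`.=`) to their column strings.

Or, from the command line: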

perl -wne' 
    ($l, $r) = /(.{37})(.*)/; $l =~ s/\s*$/ /; $cL .= $l; $cR .= $r; 
    }{ print $cL,$cR,"\n" 
' two_column.txt 

where `}{` starts the END block, which runs before exiting (after all lines have been processed).
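For readers working in R, as in the question, the same fixed-width split can be sketched with `substr()`. This is only an illustration, not part of the answer above: the two sample lines are made up for the demo, and the left-column width of 37 is the same assumption the Perl code makes. Unlike the regex, `substr()` does not fail on lines shorter than 37 characters.

```r
# Hypothetical sample: left column padded to 37 characters, as the answer assumes.
left  <- c("Alice was beginning to get very",
           "tired of sitting by her sister")
right <- c("into the book her sister was reading,",
           "but it had no pictures or conversations")
lines <- paste0(sprintf("%-37s", left), right)

left_width <- 37  # assumed fixed width of the left column

# Same split as /(.{37})(.*)/: first 37 characters, then the remainder.
left_col  <- trimws(substr(lines, 1, left_width))
right_col <- trimws(substr(lines, left_width + 1, nchar(lines)))

# Read the left column top to bottom, then the right column.
text <- paste(c(left_col, right_col), collapse = " ")
text <- gsub("\\s+", " ", text)
print(text)
```

Joining the pieces with a single space and then collapsing whitespace plays the role of the trailing-space handling in the Perl version.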

+0

@pachamaltese I edited a bit for clarity and added a few statements. – zdim

0

This looks like a fixed-width file, provided the two columns always have constant widths:

dat <- read.fwf(textConnection(txt), widths=c(37,48), stringsAsFactors=FALSE) 
gsub("\\s+", " ", paste(unlist(dat), collapse=" ")) 

This puts everything into one long string:

[1] "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?"
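When the width of the left column is not known in advance, one possible heuristic is to estimate it from the text: look for the widest run of character positions that are blank on every line and treat it as the gutter between the columns. The sketch below is an assumption-laden illustration, not something proposed in the thread. `find_left_width` is a hypothetical helper, the sample lines are rebuilt here with the left column padded to 37 characters, and the approach will fail on pages where a line (a heading, say) spans both columns or where all lines share leading indentation wider than the gutter.

```r
# Hypothetical helper: guess where the left column ends by finding the widest
# run of positions that contain a space on every non-empty line.
find_left_width <- function(lines) {
  lines <- sub("\\s+$", "", lines)               # drop trailing whitespace
  lines <- lines[nzchar(lines)]                  # ignore empty lines
  width <- max(nchar(lines))
  padded <- sprintf(paste0("%-", width, "s"), lines)
  chars <- do.call(rbind, strsplit(padded, ""))  # one row per line, one column per position
  blank <- colSums(chars != " ") == 0            # TRUE where every line has a space
  runs <- rle(blank)
  ends <- cumsum(runs$lengths)
  candidates <- which(runs$values)
  if (length(candidates) == 0) stop("no blank column shared by all lines")
  best <- candidates[which.max(runs$lengths[candidates])]
  ends[best]                                     # last blank position of the widest run
}

# Sample input, assuming the left column is padded to 37 characters.
left  <- c("Alice was beginning to get very",
           "tired of sitting by her sister",
           "on the bank, and of having nothing",
           "to do: once or twice she had peeped")
right <- c("into the book her sister was reading,",
           "but it had no pictures or conversations",
           "in it, `and what is the use of a book,'",
           "thought Alice `without pictures or conversation?'")
lines <- paste0(sprintf("%-37s", left), right)

w <- find_left_width(lines)
left_col  <- trimws(substr(lines, 1, w))
right_col <- trimws(substr(lines, w + 1, nchar(lines)))
text <- gsub("\\s+", " ", paste(c(left_col, right_col), collapse = " "))
print(text)
```

The estimated width could equally be passed to `read.fwf()` as the first entry of `widths`. With only a few lines per page, spaces can line up by coincidence, so the estimate is more trustworthy on full pages.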