Doing OCR with R

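I have been trying to do OCR within R (reading data from a PDF in which the data is a scanned image). I have been following this post: http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/

It is a very good post.

Effectively there are 3 steps:

Convert the PDF to PPM (an image format)
Convert the PPM to TIF, ready for tesseract (using ImageMagick for the conversion)
Convert the TIF to a text file

The working code for the above 3 steps, as per the linked post: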

lapply(myfiles, function(i){ 
    # convert pdf to ppm (an image format), just pages 1-10 of the PDF 
    # but you can change that easily, just remove or edit the 
    # -f 1 -l 10 bit in the line below 
    shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook"))) 
    # convert ppm to tif ready for tesseract 
    shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif"))) 
    # convert tif to text file 
    shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng"))) 
    # delete tif file 
    file.remove(paste0(i, ".tif")) 
    })

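The first two steps are fine. (They take a long time for a 4-page PDF, but I will look at scalability later; first I want to see whether this works at all.)

When I run this, the first two steps work fine.

While running the third step, i.e.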

shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))

I get this error:

Error: evaluation nested too deeply: infinite recursion/options(expressions=)?

Or tesseract crashes.

Any workaround or root cause analysis would be appreciated.

2015-08-13 r_analytics

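Can you provide the contents of 'myfiles'? – bdecaf

@bdecaf - Unfortunately I can't, due to data security concerns. Basically it is a company financial statement (scanned images) inside a PDF (4 pages). This single PDF is what is in myfiles. I don't think that is the problem; it seems to be more of a tesseract issue. –

@r_analytics Did you find a solution to your problem? –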

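The newly released tesseract package may be worth a try. It lets you run the whole process within R, without calling shell.

Following the procedure in the help documentation of the tesseract package, your function would look something like this: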

library(pdftools)
library(tesseract)

lapply(myfiles, function(i){
    # render page 1 of the PDF to a TIFF and perform tesseract OCR on the image
    tiff_file <- paste0(i, ".tiff")

    # render the PDF page to a bitmap; numeric = TRUE gives values in [0, 1],
    # which is what tiff::writeTIFF expects
    bitmap <- pdf_render_page(i, page = 1, dpi = 300, numeric = TRUE)
    tiff::writeTIFF(bitmap, tiff_file)
    # perform OCR on the .tiff file
    out <- ocr(tiff_file)
    # delete tiff file
    file.remove(tiff_file)
    # return the recognised text
    out
})
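The function above renders a single page per PDF, while the question mentions a 4-page document. A minimal sketch of running OCR over every page follows, assuming the pdftools and tesseract packages are installed. `page_files` and `ocr_pdf` are hypothetical helper names introduced here for illustration; they are not part of either package.

```r
# Hypothetical helper: build one temporary image file name per page.
page_files <- function(pdf, n_pages, ext = "tiff") {
  base <- tools::file_path_sans_ext(basename(pdf))
  file.path(tempdir(), sprintf("%s_%d.%s", base, seq_len(n_pages), ext))
}

# Hypothetical helper: render every page of a PDF and OCR each image.
# Returns a character vector with one element per page.
ocr_pdf <- function(pdf, dpi = 300, language = "eng") {
  n <- pdftools::pdf_info(pdf)$pages
  imgs <- pdftools::pdf_convert(pdf, format = "tiff", dpi = dpi,
                                filenames = page_files(pdf, n))
  # remove the intermediate images once OCR has finished
  on.exit(file.remove(imgs), add = TRUE)
  eng <- tesseract::tesseract(language)
  vapply(imgs, tesseract::ocr, character(1), engine = eng, USE.NAMES = FALSE)
}
```

It could then be used as `lapply(myfiles, ocr_pdf)`. The dpi of 300 is only a starting point; a higher value may help on poor scans at the cost of speed.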

2016-11-22 15:13:20

Using tesseract, I created a sample script that works. It even works for scanned PDFs too.

library(tesseract) 
library(pdftools) 

# Render the PDF to TIFF images (one file per page)

img_file <- pdftools::pdf_convert("F:/gowtham/A/B/invoice.pdf", format = 'tiff', dpi = 400) 

# Extract text from the TIFF images
text <- ocr(img_file) 
write.table(text, "F:/gowtham/A/B/mydata.txt")

I am new to R and programming. Please correct me if this is wrong. Hope this helps you.
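One small note on the script above: `write.table` treats the OCR result as a table, so the output file gets a header, row names, and quoted strings. If a plain text file is wanted, something like the sketch below may be closer to it. `save_ocr_text` is a hypothetical helper name, not part of tesseract or pdftools.

```r
# Hypothetical helper: write OCR output (one string per page) as plain text.
save_ocr_text <- function(text, path, page_sep = "\n") {
  # join the pages into one string, then write it without quotes or row names
  writeLines(paste(text, collapse = page_sep), con = path)
  invisible(path)
}
```

With the script above this would be called as `save_ocr_text(text, "F:/gowtham/A/B/mydata.txt")`.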

2017-11-23 16:51:16 Lakshmana
