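pyocr with tesseract runs out of memory
I wrote a script that uses tesseract and pyocr to batch-convert scanned PDFs to text. The code is below. The problem is that when processing many files (say 20+), at some point the script runs out of memory and fails with an OSError. For now I have made it so that, after a manual restart, it picks up smoothly where it crashed, but these manual restarts are tedious. Since pyocr is basically a black box to me, I tried wrapping the script in other Python scripts that restart it when it crashes, but they all seem to run into the same error, and the memory is only released once every related script has terminated.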
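The only other solution I can think of is a completely external wrapper that checks whether the script is running and, if it is not and there are still unprocessed files, restarts it.
But maybe there is a better solution? Or maybe I wrote bad code that could be improved to avoid these memory crashes? (Other than that, I know it is lame, but it works well enough :)).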
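For what it is worth, the external-wrapper idea could be sketched roughly as follows: each PDF is handled by a fresh Python process, so the operating system reclaims all of that process's memory when it exits, whatever pyocr or wand did inside it. This is only a sketch under assumptions: `run_in_child`, `ocr_all` and the worker script name `ocr_one.py` are hypothetical names, and the worker is assumed to print the recognized text to stdout and exit with a non-zero code on failure.

```python
import subprocess
import sys


def run_in_child(command, retries=2, timeout=None):
    """Run `command` in a separate process and retry if it fails.

    Returns (True, stdout) on the first successful run, otherwise (False, '').
    """
    for _ in range(retries + 1):
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            continue  # treat a hang like a crash and try again
        if result.returncode == 0:
            return True, result.stdout
    return False, ''


def ocr_all(file_names, worker_script='ocr_one.py'):
    """OCR each file in its own interpreter.

    `worker_script` is a hypothetical script that OCRs the single file
    given as its first argument and prints the text.
    """
    results = {}
    for name in file_names:
        ok, text = run_in_child([sys.executable, worker_script, name])
        if ok:
            results[name] = text
        else:
            print('{f} failed'.format(f=name))
    return results
```

Because a failed file only costs one child process, the loop in `ocr_all` keeps going instead of needing a manual restart. The original script follows.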
from io import BytesIO
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
import os
import os.path
import ast
def daemon_ocr(tool, img, lang):
    # Open the JPEG blob with PIL and close it again as soon as OCR is done.
    with PI.open(BytesIO(img)) as page:
        txt = tool.image_to_string(
            page,
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
    return txt
def daemon_wrap(image_pdf, tool, lang, iteration):
    print(iteration)
    req_image = []
    final_text = ''
    # convert() and Image(image=...) create new wand images; closing them
    # explicitly releases the ImageMagick memory they hold.
    with image_pdf.convert('jpeg') as image_jpeg:
        for img in image_jpeg.sequence:
            with Image(image=img) as img_page:
                req_image.append(img_page.make_blob('jpeg'))
    for img in req_image:
        txt = daemon_ocr(tool, img, lang)
        final_text += txt + '\n '
    # If none of the expected words was recognized, the scan is probably
    # rotated: turn it by 90 degrees and try again (up to 5 attempts).
    keywords = ('работ', 'фактура', 'Аренда', 'Сумма', 'аренде', 'товара')
    if not any(word in final_text for word in keywords):
        if iteration < 5:
            iteration += 1
            # rotate() changes the image in place, so the same object is passed on.
            image_pdf.rotate(90)
            final_text = daemon_wrap(image_pdf, tool, lang, iteration)
    return final_text
def daemon_pyocr(food):
    tool = pyocr.get_available_tools()[0]
    lang = tool.get_available_languages()[0]
    iteration = 1
    # Close the rendered PDF when done so its memory is released before the next file.
    with Image(filename=food, resolution=300) as image_pdf:
        final_text = daemon_wrap(image_pdf, tool, lang, iteration)
    return final_text
files = [f for f in os.listdir('.') if os.path.isfile(f)]
output = {}
print(files)
path = os.path.dirname(os.path.abspath(__file__))
# Append to an existing output file so that a restart continues where it stopped.
if os.path.exists('{p}/output'.format(p=path)):
    text_file = open("output", "a")
    first = False
else:
    text_file = open("output", "w")
    first = True
for f in files:
    if f != 'ocr.py' and f != 'output':
        try:
            output[f] = daemon_pyocr(f)
            print('{f} done'.format(f=f))
            # Write the dict entry without the surrounding braces.
            if first:
                text_file.write(str(output)[1:-1])
                first = False
            else:
                text_file.write(', {d}'.format(d=str(output)[1:-1]))
            output = {}
            # Move the processed file away (the done/ directory must already exist).
            os.rename('{p}/{f}'.format(p=path, f=f), "{p}/done/{f}".format(p=path, f=f))
        except OSError:
            print('{f} failed: not enough memory.'.format(f=f))
text_file.close()
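The "pick up where it crashed" part could also be made a little more robust than appending `str(output)[1:-1]` fragments. A minimal sketch, assuming one JSON object per line in the output file: every result is written and closed immediately, and on restart the files already present in the log are skipped. The names `load_done`, `append_result` and `process_pending` are hypothetical, and `ocr` stands for any function that maps a file name to text, such as `daemon_pyocr` above.

```python
import json
import os


def load_done(log_path):
    """Return the set of file names already recorded in the log."""
    done = set()
    if os.path.exists(log_path):
        with open(log_path, encoding='utf-8') as log:
            for line in log:
                line = line.strip()
                if not line:
                    continue
                try:
                    done.add(json.loads(line)['file'])
                except ValueError:
                    # A half-written last line left by a crash is ignored.
                    continue
    return done


def append_result(log_path, file_name, text):
    """Append one result as a single JSON line and close the file right away."""
    record = {'file': file_name, 'text': text}
    with open(log_path, 'a', encoding='utf-8') as log:
        log.write(json.dumps(record, ensure_ascii=False) + '\n')


def process_pending(file_names, log_path, ocr):
    """Run `ocr` on every file that is not in the log yet."""
    done = load_done(log_path)
    for name in file_names:
        if name in done:
            continue
        append_result(log_path, name, ocr(name))
```

With this layout the processed files do not have to be moved to a `done` directory for the script to know what is left, and the output can be read back line by line with `json.loads`.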
Thanks for sharing. The task I needed this for is already behind me, but I will definitely try the solution and get it working.