Textract cannot read a JpegImageFile (StringIO object)

I assumed that an in-memory file-like object would behave just like a file. However, I could not get Textract to "read" a StringIO object:

<StringIO.StringIO instance at 0x05039EB8>

The program works fine, though, if I save the JPEG to disk and read it back in the normal way.

The JPEG files are being extracted from a PDF, following Ned Batchelder's excellent blog post Extracting JPGs from PDFs. The relevant code is below:

import StringIO
from PIL import Image
import textract

# jpg holds the raw JPEG bytes extracted from the PDF
type(jpg)            # --> str (on 2.7)
buff = StringIO.StringIO()
buff.write(jpg)
buff.seek(0)
type(buff)           # --> instance
print buff           # --> <StringIO.StringIO instance at 0x05039EB8>
dt = Image.open(buff)
print dt             # --> <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2630x597 at 0x58C2A90>
text = textract.process(dt)

That last line fails: Textract cannot read a JpegImageFile. If I instead do

text=textract.process(buff.getvalue())

I get an error: must be encoded string without NULL bytes, not str
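
A likely explanation, assuming textract.process treats its first argument as a file path: JPEG data contains NUL bytes, and Python refuses to use such a string as a path. The sketch below reproduces that behaviour with the built-in open(). The helper name try_open_as_path and the sample bytes are made up for illustration, and this says nothing definitive about Textract's internals.

```python
def try_open_as_path(value):
    """Try to open `value` as a file path and report what happens."""
    try:
        with open(value, "rb"):
            return "opened"
    except (TypeError, ValueError) as exc:
        # A path containing NUL is rejected before the file system is touched:
        # Python 2 raises TypeError, Python 3 raises ValueError.
        return "rejected: " + type(exc).__name__
    except (IOError, OSError) as exc:
        return "not found: " + type(exc).__name__


# Stand-in for JPEG data; the JFIF header of a real file contains NUL bytes.
fake_jpeg = "\xff\xd8\x00\x10JFIF\x00"
print(try_open_as_path(fake_jpeg))
```

On Python 2.7 the TypeError raised here carries the same "must be encoded string without NULL bytes" wording as the error above.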

How can I get Textract to read from an in-memory file or stream?

2017-04-04 Pradeep

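I found a workaround. An in-memory file is not the way to go with this legacy code; instead, route the extracted JPEG through a temporary file on disk, using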

tempfile.NamedTemporaryFile

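Writing the data stream to a tempfile and then running textract.process on it is a bit tedious, but I could not figure out how to pass a byte stream to textract through BytesIO/StringIO. According to the Textract documentation, it expects a file, so it is given the temporary file's name. Updated workaround snippet: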

import tempfile
import textract

pdf = open('file name', "rb").read()

# JPEG start-of-image and end-of-image markers
startmark = "\xff\xd8"
startfix = 0
endmark = "\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    # Find the next PDF stream object
    istream = pdf.find("stream", i)
    if istream < 0:
        break
    # A JPEG stream has the start marker shortly after "stream"
    istart = pdf.find(startmark, istream, istream+20)
    if istart < 0:
        i = istream+20
        continue
    iend = pdf.find("endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend-20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print "JPG %d from %d to %d" % (njpg, istart, iend)
    jpg = pdf[istart:iend]

    njpg += 1
    i = iend

# Write the JPEG bytes to a real file and pass its path to textract.
# Note: placed after the loop, this handles only the last JPEG found.
temp = tempfile.NamedTemporaryFile(delete=False, suffix='.jpg')
temp.write(jpg)
temp.close()
text = textract.process(temp.name)
print text
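
The tempfile step can also be wrapped in a small helper that removes the file afterwards, since delete=False otherwise leaves it on disk. This is only a sketch: process_bytes_via_tempfile is a made-up name, and the processor argument stands for any function that takes a file path, such as textract.process. delete=False is kept because on Windows a NamedTemporaryFile cannot be opened a second time by name while it is still open.

```python
import os
import tempfile


def process_bytes_via_tempfile(data, processor, suffix=".jpg"):
    """Write `data` to a temporary file, call processor(path), then clean up."""
    # delete=False lets us close the file first and reopen it by name,
    # which is what Windows requires.
    temp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        temp.write(data)
        temp.close()
        return processor(temp.name)
    finally:
        temp.close()          # harmless if the file is already closed
        os.remove(temp.name)  # delete=False means we must remove it ourselves


# Hypothetical usage with the variables from the snippet above:
# text = process_bytes_via_tempfile(jpg, textract.process)
```

Calling it inside the extraction loop would handle every JPEG in the PDF instead of only the last one.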

Info: Python 2.7 on Win7; default encoding forced to UTF-8 with

import sys
reload(sys)
sys.setdefaultencoding('UTF8')
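
A possible alternative to changing the interpreter-wide default is to decode the result explicitly. The sketch below assumes the text comes back as UTF-8 encoded bytes. to_text is a made-up helper name and the sample string is invented.

```python
def to_text(raw, encoding="utf-8"):
    """Decode bytes explicitly; pass through values that are already text."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


# Invented sample standing in for the bytes returned by the extraction step
sample = u"caf\u00e9".encode("utf-8")
print(repr(to_text(sample)))
```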

Hope this helps someone, as textract is really a great piece of code. The PDF-to-JPEG extraction code is courtesy of Ned Batchelder, Extracting JPGs from PDFs (2007).

2017-04-10 08:05:16 Pradeep
