UnicodeEncodeError when reading a PDF file using pyPdf

Guys, I posted an earlier question about the pyPdf Python tool. Please don't mark this as a duplicate, because I now get the error shown below.

import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespace
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store the result in a TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]))
f.close()

# or print contents to the standard out stream
print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")

With the first PDF file I get this error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). With this PDF, http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf, I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)

How can I solve this?

2010-10-04 Hulk

Are you sure you executed exactly the code above? u"\xe7".encode("ascii", "xmlcharrefreplace") correctly returns "&#231;". With "xmlcharrefreplace", it should not fail for valid Unicode characters. – AndiDog 2010-10-04 15:48:34

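I tried it myself and got the same result. Ignore my comment; I had not noticed that you are writing the output to a file. That is where the problem lies:

convertPdf2String returns a Unicode string, but file.write can only write bytes, so the call to f.write tries to convert the Unicode string automatically using the ASCII encoding. Since the PDF obviously contains non-ASCII characters, that conversion fails. The string has to be encoded explicitly, so the call should look like this: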

f.write(convertPdf2String(sys.argv[1]).encode("utf-8")) 
# or 
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
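To see the difference between the implicit and the explicit conversion without a PDF at hand, here is a minimal sketch. The sample string is made up and only stands in for whatever extractText() might return; nothing here calls pyPdf.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text with non-ASCII characters (not taken from the PDF).
text = u"fa\xe7ade \u0939\u093f\u0928\u094d\u0926\u0940"

# Strict ASCII encoding is what the implicit conversion amounts to.
# It raises UnicodeEncodeError as soon as a non-ASCII character appears.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Explicit UTF-8 keeps every character and can be decoded back losslessly.
utf8_bytes = text.encode("utf-8")

# ASCII with "xmlcharrefreplace" replaces each non-ASCII character
# with a numeric character reference such as &#231;.
xml_bytes = text.encode("ascii", "xmlcharrefreplace")
```

UTF-8 is probably the better choice for a plain text file, while the character references are mainly useful if the output ends up in HTML or XML.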

Edit:

Working source code, with only one line changed.

# Execute with "Hindi_Book.pdf" in the same directory 
import sys 
import pyPdf 

def convertPdf2String(path): 
    content = "" 
    # load PDF file 
    pdf = pyPdf.PdfFileReader(file(path, "rb")) 
    # iterate pages 
    for i in range(0, pdf.getNumPages()): 
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces 
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split()) 
    return content 

# convert contents of a PDF file and store the result in a TXT file
f = open('a.txt','w+') 
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace")) 
f.close() 

# or print contents to the standard out stream 
print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
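As an alternative to calling .encode() at every write, the file can be opened with an explicit encoding so that Unicode strings are written directly. This is only a sketch and not part of the code tested above: write_text is a hypothetical helper, and the sample string is a stand-in for the result of convertPdf2String.

```python
import io
import os
import tempfile

def write_text(path, text):
    # Hypothetical helper: io.open encodes Unicode text on write,
    # so no manual .encode() call is needed.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)

# Made-up sample text standing in for the extracted PDF contents.
sample = u"fa\xe7ade \u0939\u093f\u0928\u094d\u0926\u0940"

# Write to a temporary directory and read the file back to check the round trip.
out_path = os.path.join(tempfile.mkdtemp(), "a.txt")
write_text(out_path, sample)

with io.open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

In the script above this would amount to replacing the open/write/close lines with something like write_text('a.txt', convertPdf2String(sys.argv[1])).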

2010-10-04 16:18:20 AndiDog

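@AndiDog: I had tried both of them initially and could not get them to work. My original goal was only to read the PDF contents from the command line, and I don't want to use xpdf for that. – Hulk 2010-10-04 19:03:54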

@Hulk: I tested what I wrote in my answer on the same PDF file. Are you saying it doesn't work for you? – AndiDog 2010-10-04 20:13:51

@AndiDog: It is still the same error. I tried with both statements. – Hulk 2010-10-05 05:18:08
