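Guys, I already posted a question earlier about the pyPdf Python tool. Please don't mark this as a duplicate, because I get the following UnicodeEncodeError when reading a PDF file using pyPdf: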
import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]))
f.close()
# or print contents to the standard out stream
print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
Here are the errors. With the first PDF file I get: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
With this PDF, http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)
How can I resolve this?
Are you sure you just executed the code above? `u"\xe7".encode("ascii", "xmlcharrefreplace")` correctly returns "&#231;". With "xmlcharrefreplace", it should not fail for valid Unicode characters. – AndiDog 2010-10-04 15:48:34
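One possible reading of the tracebacks, offered as an assumption rather than a confirmed diagnosis: in Python 2, calling `write()` with a unicode string on a file opened with plain `open()` implicitly encodes it as ASCII, so the error may come from the `f.write(...)` line rather than from the `print` line. A minimal sketch of writing with an explicit encoding instead, where `write_text_utf8` is a hypothetical helper (not part of pyPdf) and `sample` is a made-up string standing in for the extracted text:

```python
# -*- coding: utf-8 -*-
import io


def write_text_utf8(path, text):
    """Write a unicode string to `path`, encoding it explicitly as UTF-8.

    Hypothetical helper for illustration; not part of pyPdf.
    """
    # io.open accepts an encoding argument on both Python 2 and Python 3,
    # so no implicit ASCII encoding takes place on write.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Made-up sample standing in for the result of convertPdf2String(...)
sample = u"\xe7 \u0939"

# Explicit encoding with xmlcharrefreplace replaces each non-ASCII
# character with a numeric character reference instead of raising.
escaped = sample.encode("ascii", "xmlcharrefreplace")
```

In the script above, this would be used as `write_text_utf8("a.txt", convertPdf2String(sys.argv[1]))` in place of the `open` / `f.write` / `f.close` lines. UTF-8 is only one choice here; any encoding that covers the characters in the PDF would do.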