Python: special characters giving me problems (from PDFminer)

I am using PDFminer's pdf2text to reduce a PDF to text. Unfortunately, the output contains special characters. Let me show you from my console:

>>> a = pdf_to_text("ap.pdf")

Here is a sample of it, a small excerpt:

>>> a[5000:5500]
'f one architect. Decades ...... but to re\xef\xac\x82ect\none set of design ideas, than to have one that contains many\ngood but independent and uncoordinated ideas.\n1 Joshua Bloch, \xe2\x80\x9cHow to Design a Good API and Why It Matters\xe2\x80\x9d, G......=-3733'

I understood that I had to encode it:

>>> a[5000:5500].encode('utf-8')
Traceback (most recent call last): 
    File "<interactive input>", line 1, in <module> 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 237: ordinal not in range(128)

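I searched around and tried several things, notably "Replace special characters in python". The input comes from PDFminer, so it is hard to control (AFAIK). What is the way to produce proper plaintext from this output?

What am I doing wrong?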

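--A quick fix: change PDFminer's codec to ascii, but it is not a lasting solution--

--Abandoned the quick fix in favour of the answer: changing the codec removes information--

As mentioned by Maxim: http://en.wikipedia.org/wiki/Windows-1251

--A relevant topic--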


2011-07-29 aitchnyu

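Thanks for this question! I'm a beginner in Python. Could you possibly post some demo code showing how to use Pdfminer so that this error does not occur? Thanks –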

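This problem often occurs when non-ASCII text is stored in str objects. What you are trying to do is encode in utf-8 a string that is already encoded in some encoding (it contains characters with codes above 0x7f).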
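The hidden step can be made visible. In Python 2, calling .encode() on a byte str first decodes it with the default ascii codec, and that implicit decode is what raises the UnicodeDecodeError in the question. The sketch below performs the same decode explicitly. It is written in Python 3 syntax (where bytes has no .encode method), and the variable names and the short sample, cut from the excerpt in the question, are only illustrative.

```python
# A short illustrative slice of the bytes shown in the question.
raw = b're\xef\xac\x82ect'

# Python 2 runs this ASCII decode implicitly before encoding a byte str;
# here it is done explicitly so the failure is visible.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Prints the codec error, which names byte 0xef and its position.
print(error)
```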

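To encode such a string in utf-8, you must first decode it. Assuming the original text encoding is cp1251 (replace it with your actual encoding), something like the following will do the trick: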

u = s.decode('cp1251') # decode from cp1251 byte (str) string to unicode string 
s = u.encode('utf-8') # re-encode unicode string to utf-8 byte (str) string

Basically, the code above does what the iconv --from-code=CP1251 --to-code=UTF-8 command does, that is, it converts a string from one encoding to another.
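The same decode-then-encode step can be wrapped in a small helper. This is only a sketch in Python 3 syntax: the function name transcode and the sample word are made up for illustration, and cp1251 is an assumed source encoding, just as in the snippet above.

```python
def transcode(data, source_encoding, target_encoding='utf-8'):
    """Decode bytes from source_encoding, then re-encode them as target_encoding."""
    text = data.decode(source_encoding)   # bytes -> unicode text
    return text.encode(target_encoding)   # unicode text -> bytes

# Hypothetical input: the Russian word "Privet" stored as cp1251 bytes.
word = '\u041f\u0440\u0438\u0432\u0435\u0442'
cp1251_bytes = word.encode('cp1251')

utf8_bytes = transcode(cp1251_bytes, 'cp1251')
print(utf8_bytes.decode('utf-8') == word)
```

As with iconv, the result is only as good as the source encoding you name; a wrong guess usually decodes without an error but produces the wrong characters.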


2011-07-29 13:06:53

Yes, this works nearly perfectly! I got some artifacts, such as "......Ð²ÐÑÑš提供Custo ..", but that was a PDF made by amateurs for maximum flashiness. Cleaner PDFs are parsed cleanly. – aitchnyu

Right, you do need to know your input encoding. –

I would definitely write it as 'a.decode('cp1250').encode('utf-8')'. –
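A note on the artifacts mentioned above: the byte sequences visible in the question's excerpt, \xef\xac\x82 and \xe2\x80\x9c, are valid UTF-8 for the "fl" ligature (U+FB02) and the left double quotation mark (U+201C). The sketch below compares decoding a slice of that excerpt as utf-8 and as cp1251. It is in Python 3 syntax, it assumes the rest of the PDFminer output is UTF-8 as well (the thread does not confirm this), and the NFKC normalization step is an optional extra that is not part of the answer.

```python
import unicodedata

# A slice of the bytes from the question, shortened for illustration.
sample = b're\xef\xac\x82ect \xe2\x80\x9cHow to Design a Good API\xe2\x80\x9d'

as_utf8 = sample.decode('utf-8')      # ligature and curly quotes come out intact
as_cp1251 = sample.decode('cp1251')   # same bytes read as Cyrillic: garbage

# Optional: NFKC folds compatibility characters such as the "fl" ligature
# into plain letters; the curly quotes are left unchanged.
plain = unicodedata.normalize('NFKC', as_utf8)

print(plain)
print(as_cp1251 == as_utf8)
```

If utf-8 decodes the whole output without raising, it is probably the better choice for this input than cp1251.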
