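Getting a Unicode error I don't understand

Here is my code. I'm sure it looks horrible, but it all works as it should. The only problem I'm having is with the last line, which gives a Unicode error I don't understand.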

import pyPdf
import os
import csv
import codecs     # used by UnicodeWriter
import cStringIO  # used by UnicodeWriter

class UnicodeWriter: 
    """ 
    A CSV writer which will write rows to CSV file "f", 
    which is encoded in the given encoding. 
    """ 

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


PDFWriter = csv.writer(open('/home/nick/TAM_work/text/text.doc', 'a'), delimiter=' ', quotechar='|', quoting=csv.QUOTE_ALL)

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

for word in os.listdir("/home/nick/TAM_work/TAM_pdfs"):
    print getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word)

PDFWriter.writerow ([getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word)])

When I run it, everything works fine until it reaches this:

Traceback (most recent call last): 
    File "Saving_fuction_added.py", line 52, in <module> 
    PDFWriter.writerow ([getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word)]) 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 81: ordinal not in range(128)

I would love some help. Thanks, guys.

Matt

2010-01-07 Matt

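Do you have non-ASCII filenames? I'm confused because the stack trace is so short. It seems to indicate that the error is inside the list comprehension (TAM_pdfs + word) rather than inside the writerow() function?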

I thought so at first too, but wouldn't it have failed earlier in that case? – danben

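I tried changing my .doc to .csv and adding try: x = unicode(value, "ascii") except UnicodeError: value = unicode(value, "utf-8") else: # value was valid ASCII data pass but that did not work. Maybe I am looking at this completely the wrong way? I just need to get the text I extracted into a csv file. ([/home/nick/TAM_work/TAM_pdfs/" + word).encode("ascii", "ignore")]) into the for loop, fixed again – Matt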
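To make the trade-off behind .encode("ascii", "ignore") concrete, here is a minimal sketch. It assumes Python 3 (the code in the question is Python 2), and the sample string and both helper names are made up for illustration. It suggests that the "ignore" handler avoids the exception only by silently dropping the trademark sign, whereas encoding to UTF-8 keeps it.

```python
# Hypothetical sample text; "\u2122" is the trademark sign named in the traceback.
SAMPLE = "Acme\u2122 widgets"


def to_ascii_lossy(text):
    """Encode to ASCII, silently dropping every character ASCII cannot represent."""
    return text.encode("ascii", "ignore")


def to_utf8(text):
    """Encode to UTF-8, which can represent every Unicode character."""
    return text.encode("utf-8")


lossy = to_ascii_lossy(SAMPLE)
full = to_utf8(SAMPLE)

print(lossy)                 # the trademark sign is gone
print(full.decode("utf-8"))  # the round trip restores the original text
```

If losing characters is not acceptable, encoding the text as UTF-8 before writing it is probably the safer choice.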

Below is the code that answers the question. But now it only writes the last file.

import pyPdf
import os
import csv
import codecs     # used by UnicodeWriter
import cStringIO  # used by UnicodeWriter

class UnicodeWriter: 
    """ 
    A CSV writer which will write rows to CSV file "f", 
    which is encoded in the given encoding. 
    """ 

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)


PDFWriter = csv.writer(open('/home/nick/TAM_work/text/text.doc', 'a'), delimiter=' ', quotechar='|', quoting=csv.QUOTE_ALL) 

def getPDFContent(path): 
    content = "" 
    # Load PDF into pyPDF 
    pdf = pyPdf.PdfFileReader(file(path, "rb")) 
    # Iterate pages 
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace 
    content = " ".join(content.replace(u"\xa0", " ").strip().split()) 
    return content 

for word in os.listdir("/home/nick/TAM_work/TAM_pdfs"): 
    print getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word) 

PDFWriter.writerow ([getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word).encode("ascii", "ignore")])
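One likely reason only the last file gets written is that the writerow call sits after the for loop, so it runs once with whatever value word held at the end. The sketch below shows the loop shape with the write inside it. It assumes Python 3, and write_rows and fake_extract are hypothetical stand-ins (fake_extract replaces getPDFContent so the example runs without pyPdf or any files on disk).

```python
import csv
import io


def write_rows(names, extract, out):
    """Write one CSV row per name; the writerow call is inside the loop."""
    writer = csv.writer(out, delimiter=" ", quotechar="|", quoting=csv.QUOTE_ALL)
    for name in names:
        # Writing here, once per iteration, produces a row for every file.
        writer.writerow([extract(name)])


def fake_extract(name):
    """Hypothetical stand-in for getPDFContent; returns dummy text."""
    return "text of " + name


buf = io.StringIO()
write_rows(["a.pdf", "b.pdf"], fake_extract, buf)
print(buf.getvalue())
```

In the real script, the same change would mean indenting the PDFWriter.writerow line so that it belongs to the body of the for loop.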

2010-01-07 04:37:55 Matt

-1

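As I understand it, you are putting a large number into a small variable, and that throws the exception.

Let me introduce a C# tool that works very well with Unicode; you could use it in your case: http://unicode.codeplex.com

I think I would recommend changing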

for i in range(0, pdf.getNumPages()):

If pdf.getNumPages() is above 128, just control for it.

2010-01-07 12:31:02

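-1 The OP's exception is a UnicodeEncodeError, which can only vaguely be described as "a large number in a small variable", and it has nothing to do with the number of pages in the PDF file. As for your undisclosed "tool", you would have to convince Python users that it offers something over and above Python's standard Unicode facilities. But please don't take these remarks as an invitation to spam further; quite the contrary.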
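To illustrate the point about the exception, here is a small sketch that triggers the same kind of error without any PDF involved. It assumes Python 3, and try_ascii is a hypothetical helper. In Python 2 the csv module does not handle unicode objects itself, so a unicode string passed to writerow ends up being encoded with the default ASCII codec, which is most likely where the original traceback comes from.

```python
def try_ascii(text):
    """Return the UnicodeEncodeError raised by ASCII-encoding text, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


# 81 plain characters followed by the trademark sign, mirroring "position 81".
err = try_ascii("x" * 81 + "\u2122")

print(err.encoding)  # which codec failed
print(err.start)     # index of the offending character
print(err.reason)    # why the codec rejected it
```

The "128" in the message refers to the range of code points ASCII can represent (0 to 127), not to a page count or a variable size.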
