2010-01-07 18 views
1

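Here is my code. I'm sure it looks horrible, but it all works as it should. The only problem I have is with the last line, which raises a Unicode error that I don't understand.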

import codecs
import cStringIO
import csv
import os
import pyPdf

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

PDFWriter = csv.writer(open('/home/nick/TAM_work/text/text.doc', 'a'), delimiter=' ', quotechar='|', quoting=csv.QUOTE_ALL)

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

for word in os.listdir("/home/nick/TAM_work/TAM_pdfs"):
    print getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word)

PDFWriter.writerow ([getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word)])

When I run it, everything works fine until it reaches this:

Traceback (most recent call last): 
    File "Saving_fuction_added.py", line 52, in <module> 
    PDFWriter.writerow ([getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word)]) 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 81: ordinal not in range(128) 

I'd appreciate any help. Thanks, everyone.

Matt

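Do you have non-ASCII filenames? I'm confused because the stack trace is so short: it seems to indicate that the error is inside the list expression (TAM_pdfs + word) rather than inside the writerow() function? –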

I thought so at first too, but wouldn't it have failed before then? – danben

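I tried changing my .doc to .csv and adding try: x = unicode(value, "ascii") except UnicodeError: value = unicode(value, "utf-8") else: # value was valid ASCII data pass but this did not work. Maybe I'm looking at this completely the wrong way? I just need to get the text I extract into a CSV file. (["/home/nick/TAM_work/TAM_pdfs/" + word).encode("ascii", "ignore")]) moved into the for loop, fixed again – Matt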
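
A note on the try/except quoted in the comment above: unicode(value, "ascii") decodes bytes into text, while the traceback is a UnicodeEncodeError, which is raised when text is turned into bytes. The u'\u2122' in the message suggests the extracted text is already a unicode object, so the step that seems to be missing is an explicit encode before the value reaches csv.writer, which in Python 2 does not accept non-ASCII unicode input. A minimal sketch of that distinction follows; the sample string and the to_utf8_bytes helper are hypothetical and not part of the original script.

```python
# Hypothetical sample standing in for text extracted from a PDF.
text = u"TAM\u2122 report"


def to_utf8_bytes(value):
    """Return UTF-8 bytes for text; pass existing byte strings through."""
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


encoded = to_utf8_bytes(text)        # text -> bytes (encoding)
decoded = encoded.decode("utf-8")    # bytes -> text (decoding)
```

Under Python 2, passing [to_utf8_bytes(content)] to writerow should keep the trademark sign in the output rather than dropping it.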

Answers

1

Here is the code that answers the question. But now it only writes the last file.

import codecs
import cStringIO
import csv
import os
import pyPdf

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

PDFWriter = csv.writer(open('/home/nick/TAM_work/text/text.doc', 'a'), delimiter=' ', quotechar='|', quoting=csv.QUOTE_ALL)

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

for word in os.listdir("/home/nick/TAM_work/TAM_pdfs"):
    print getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word)

# This call runs once, after the loop, so only the last file is written.
# Characters that ASCII cannot represent are dropped by "ignore".
PDFWriter.writerow ([getPDFContent("/home/nick/TAM_work/TAM_pdfs/" + word).encode("ascii", "ignore")])

. – Matt
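
The "only writes the last file" symptom is consistent with the writerow call sitting after the for loop, where word still holds the last filename. A sketch of the loop with the write moved inside it is below. It assumes Python 3 (the code above is Python 2), where csv.writer accepts text directly. extract_text is a hypothetical stand-in for getPDFContent so the example runs without pyPdf, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def extract_text(name):
    # Hypothetical stand-in for getPDFContent(); the returned text
    # contains a non-ASCII character (U+2122, the trademark sign).
    return u"%s: TAM\u2122 report" % name


def write_rows(names, stream):
    writer = csv.writer(stream, delimiter=" ", quotechar="|",
                        quoting=csv.QUOTE_ALL)
    for name in names:
        # The write happens inside the loop, once per file.
        writer.writerow([extract_text(name)])


buffer = io.StringIO(newline="")
write_rows(["a.pdf", "b.pdf"], buffer)
output = buffer.getvalue()
```

With a real file, open(path, "a", newline="", encoding="utf-8") would take the place of the buffer, so that non-ASCII characters are written as UTF-8 instead of being discarded.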

-1

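As I understand it, you are putting a very large number into a small variable, and that is what throws the exception.

Let me introduce a C# tool that works very well with Unicode and that you could use in your case: http://unicode.codeplex.com

I think I would recommend changing

for i in range(0, pdf.getNumPages()):

pdf.getNumPages() is greater than 128; just limit it.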

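-1 The OP's exception is a UnicodeEncodeError, which can only vaguely be described as "a big number in a small variable", and it has nothing to do with the number of pages in the PDF file. As for your undisclosed "tool", you would have to convince Python users that it offers something over and above Python's standard unicode facilities. But please don't take these remarks as an invitation to further spamming; quite the contrary. –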
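
To illustrate the comment above: "ordinal not in range(128)" refers to the code point of a character, not to a page count. U+2122 (the trademark sign) has code point 8482, which ASCII cannot represent. Below is a small reproduction with a hypothetical sample string, showing the error and three ways of encoding around it.

```python
# Hypothetical sample; U+2122 is the trademark sign from the traceback.
text = u"Trade\u2122"

try:
    text.encode("ascii")              # strict ASCII encoding fails
    error = None
except UnicodeEncodeError as exc:
    error = exc                       # exc.start is the offending index

code_point = ord(text[-1])            # numeric value of the character
dropped = text.encode("ascii", "ignore")    # silently removes the character
replaced = text.encode("ascii", "replace")  # substitutes a question mark
utf8 = text.encode("utf-8")                 # keeps it, as three bytes
```

The "ignore" handler is what the accepted fix uses. It avoids the exception at the cost of losing every non-ASCII character, so encoding to UTF-8 may be preferable if the output file can hold it.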