2013-04-12 26 views

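How do I correct files with mixed encodings? Given a file with mixed encodings (e.g. utf-8 and latin-1), how can Emacs be configured to "project" all of its characters into a single encoding (e.g. utf-8) when saving the file?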

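I wrote the following function to automate some of the cleaning, but I suppose I could improve it if I found, somewhere, the information that maps a symbol such as "é" in one encoding to "é" in utf-8 (or perhaps someone has already written such a function).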

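(defun jyby/cleanToUTF ()
  "Delete a fixed list of stray characters from the whole buffer."
  (interactive)
  (save-excursion
    ;; Start each search at the top of the buffer, not at point.
    (dolist (stray '("अ" "आ" "ॆ"))
      (goto-char (point-min))
      (while (search-forward stray nil t)
        (replace-match "" t t)))))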

    (global-unset-key [f11]) 
    (global-set-key [f11] 'jyby/cleanToUTF) 

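I have many files "corrupted" with mixed encodings (caused by copy-pasting from a browser with a broken font configuration), which produce the error below. I can clean them by hand, searching for each problematic symbol and replacing it with "" or with the appropriate character, or, more quickly, by specifying "utf-8-unix" as the encoding (in which case the same message is shown the next time I edit and save the file). This has become a problem because, in any such corrupted file, every accented character is replaced on each save by a sequence that doubles in size, which ends up doubling the size of the file. I am using GNU Emacs 24.2.1.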

These default coding systems were tried to encode text 
in the buffer `test_accents.org': 
(utf-8-unix (30 . 4194182) (33 . 4194182) (34 . 4194182) (37 
. 4194182) (40 . 4194181) (41 . 4194182) (42 . 4194182) (45 
. 4194182) (48 . 4194182) (49 . 4194182) (52 . 4194182)) 
However, each of them encountered characters it couldn't encode: 
utf-8-unix cannot encode these:   ... 

Click on a character (or switch to this window by `C-x o' 
and select the characters by RET) to jump to the place it appears, 
where `C-u C-x =' will give information about it. 

Select one of the safe coding systems listed below, 
or cancel the writing with C-g and edit the buffer 
to remove or modify the problematic characters, 
or specify any other coding system (and risk losing 
the problematic characters). 

raw-text emacs-mule no-conversion 
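
The (position . character) pairs listed in a message like this one can also be collected from Lisp. Codes such as 4194182 (#x3FFF86) are how Emacs represents a raw byte internally, here the byte #x86. The following is a minimal sketch, not a tested solution: the function name `my/unencodable-chars` is made up for illustration, and it relies on the built-in `unencodable-char-position`.

```elisp
(defun my/unencodable-chars (&optional coding)
  "Return an alist (CHAR . COUNT) of characters CODING cannot encode.
Scans the whole current buffer; CODING defaults to `utf-8'.
Hypothetical helper, written for illustration only."
  (let ((coding (or coding 'utf-8))
        (pos (point-min))
        (counts '()))
    ;; `unencodable-char-position' returns the position of the next
    ;; offending character at or after POS, or nil when none is left.
    (while (setq pos (unencodable-char-position pos (point-max) coding))
      (let* ((ch (char-after pos))
             (cell (assq ch counts)))
        (if cell
            (setcdr cell (1+ (cdr cell)))
          (push (cons ch 1) counts)))
      (setq pos (1+ pos)))
    (nreverse counts)))
```

Evaluating `(my/unencodable-chars)` in the damaged buffer should give the distinct offending characters with their counts, which can then drive a search-and-replace loop.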

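But is there a way to convert it automatically? At the moment I select each offending character by hand and run a search-and-replace to remove it from the whole document. I plan to write a Lisp function to automate this, but I don't know how to obtain the list of offending characters automatically (I was also hoping to do something smarter, such as é -> e, or smarter still, restoring the accented characters in UTF-8...) – Jeremy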
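
For the "é -> e" idea mentioned in this comment, one possible approach is to decompose each character with Unicode NFD normalization and then drop the combining marks. This is only a sketch under stated assumptions: `my/strip-accents` is a made-up name, the `ucs-normalize` library must be available (it ships with recent Emacs versions), and only the combining diacritical marks U+0300 to U+036F are removed. It works on text that is already correctly decoded, so it does not repair raw bytes by itself.

```elisp
(require 'ucs-normalize)

(defun my/strip-accents (string)
  "Return STRING with combining diacritical marks removed.
Hypothetical helper: decomposes STRING to NFD, then deletes every
character in the range U+0300..U+036F."
  (replace-regexp-in-string
   "[\u0300-\u036f]" ""
   (ucs-normalize-NFD-string string)))
```

Characters without a canonical decomposition (for example "ø") are left unchanged by this sketch.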

Answers

1

Here is something that may get you started:

;; `defclass'/`defmethod' come from EIEIO; `incf', `decf' and `setf' from cl
;; (on Emacs 24.3 and later, cl-lib offers the same macros with a cl- prefix).
(require 'eieio)
(require 'cl)

(put 'eof-error 'error-conditions '(error eof-error))
(put 'eof-error 'error-message "End of stream")
(put 'bad-byte 'error-conditions '(error bad-byte))
(put 'bad-byte 'error-message "Not a UTF-8 byte")

(defclass stream ()
  ((bytes :initarg :bytes :accessor bytes-of)
   (position :initform 0 :accessor position-of)))

;; Note: the argument order is (BYTE BIT), the reverse of Common Lisp's logbitp.
(defun logbitp (byte bit) (not (zerop (logand byte (ash 1 bit)))))

(defmethod read-byte ((this stream) &optional eof-error eof)
  (with-slots (bytes position) this
    (if (< position (length bytes))
        (prog1 (aref bytes position) (incf position))
      (if eof-error (signal eof-error (list position)) eof))))

(defmethod unread-byte ((this stream))
  (when (> (position-of this) 0) (decf (position-of this))))

(defun read-utf8-char (stream)
  (let* ((lead (read-byte stream 'eof-error))
         (byte lead))
    (if (not (logbitp byte 7)) byte        ; 0xxxxxxx: ASCII
      (let ((numbytes
             (cond
              ((not (logbitp byte 6)) nil) ; 10xxxxxx: stray continuation byte
              ((not (logbitp byte 5))      ; 110xxxxx: 1 continuation byte
               (setf byte (logand #2r11111 byte)) 1)
              ((not (logbitp byte 4))      ; 1110xxxx: 2 continuation bytes
               (setf byte (logand #2r1111 byte)) 2)
              ((not (logbitp byte 3))      ; 11110xxx: 3 continuation bytes
               (setf byte (logand #2r111 byte)) 3))))
        (if (null numbytes)
            (signal 'bad-byte (list lead))
          (dotimes (b numbytes byte)
            (let ((next-byte (read-byte stream 'eof-error)))
              (if (and (logbitp next-byte 7) (not (logbitp next-byte 6)))
                  (setf byte (logior (ash byte 6) (logand next-byte #2r111111)))
                ;; Not a continuation byte: push it back, report the lead byte.
                (unread-byte stream)
                (signal 'bad-byte (list lead))))))))))

(defun load-corrupt-file (file)
  "Return the contents of FILE decoded as UTF-8, reporting each invalid byte."
  (interactive "fFile to load: ")
  (let ((stream (make-instance
                 'stream
                 :bytes (with-temp-buffer
                          ;; Unibyte buffer + literal insert: raw bytes only.
                          (set-buffer-multibyte nil)
                          (insert-file-contents-literally file)
                          (buffer-string))))
        (chars '())
        next-char)
    (while (setq next-char
                 (condition-case err
                     (read-utf8-char stream)
                   ;; Placeholder fix: substitute U+FFFD; replace with your own.
                   (bad-byte (message "Fix this byte %d" (cadr err)) #xFFFD)
                   (eof-error nil)))
      (push next-char chars))
    (concat (nreverse chars))))

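What this code does: it loads a file without any conversion and tries to read it as if it were encoded in UTF-8. As soon as it meets a byte that does not seem to belong to UTF-8, it signals an error, which you need to handle in some way (that is where the "Fix this byte" message is). But you will have to invent how to fix it...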
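
One possible way to "invent the fix", sketched under the assumption that every byte which is not valid UTF-8 was meant as Latin-1. When Emacs decodes a file as UTF-8, each invalid byte stays in the buffer as a raw-byte character in the range #x3FFF80 to #x3FFFFF (#x3FFF00 plus the byte value), so those characters can be re-decoded in place. The name `my/raw-bytes-to-chars` and the `iso-8859-1` default are assumptions made for illustration; another fallback such as `windows-1252` may suit your files better.

```elisp
(defun my/raw-bytes-to-chars (&optional fallback)
  "Replace raw bytes in the current buffer by decoding them with FALLBACK.
FALLBACK is a coding system and defaults to `iso-8859-1'.
Return the number of bytes replaced.  Hypothetical helper."
  (interactive)
  (let ((fallback (or fallback 'iso-8859-1))
        (count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (let ((ch (char-after)))
          (if (>= ch #x3FFF80)          ; raw byte #x80..#xFF
              (let ((byte (- ch #x3FFF00)))
                (delete-char 1)
                ;; Decode the single byte with the fallback coding system.
                (insert (decode-coding-string (unibyte-string byte) fallback))
                (setq count (1+ count)))
            (forward-char 1)))))
    count))
```

After running it in a buffer that was opened as UTF-8, saving as utf-8 should no longer hit the raw bytes. Whether the recovered characters are the intended ones depends entirely on the fallback guess, so it is worth checking the result before overwriting the file.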


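Looks interesting, thanks! But when I put it in a buffer and evaluate it, I get "eval-region: Symbol's function definition is void: defclass". Google doesn't make sense of that, and I don't understand EIEIO, which seems to be related. I am using GNU Emacs 23.3.1 on Ubuntu – nealmcb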

2

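I have struggled with this in Emacs many times. When I have a messy file open, e.g. in raw-text-unix mode, and save it as utf-8, Emacs complains even about text that is already clean utf-8. I have not found a way to make it complain only about the non-utf-8 parts.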

I have just found a reasonable semi-automated way, using recode:

f=mixed-file 
# read from stdin so recode writes to stdout instead of rewriting $f in place
recode -f ..utf-8 < "$f" > /tmp/recode.out 
diff "$f" /tmp/recode.out | cat -vt 

# manually fix lines of text that can't be converted to utf-8 in $f, 
# and re-run recode and diff until the output diff is empty. 

A useful tool along the way is http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=342+200+224&mode=obytes

Then I reopen the file in Emacs, and it is recognized as clean unicode.