2014-03-19

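How to replace invalid UTF-8 characters with spaces

I have a fixed-width file that contains some non-UTF-8 characters, and I want to replace those characters with spaces.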

I tried running iconv -f utf8 -t utf8 -c $file, but all it does is delete the non-UTF-8 characters. I found no way to make iconv replace them with spaces instead.

I would like a Korn shell script or a Perl script that replaces every non-UTF-8 character with a space.

I found this Perl script, which prints the lines that contain non-UTF-8 characters, but I don't know enough Perl to make it replace them with spaces.

perl -l -ne '/ 
    ^([\000-\177]     # 1-byte pattern 
    |[\300-\337][\200-\277]  # 2-byte pattern 
    |[\340-\357][\200-\277]{2} # 3-byte pattern 
    |[\360-\367][\200-\277]{3} # 4-byte pattern 
    |[\370-\373][\200-\277]{4} # 5-byte pattern 
    |[\374-\375][\200-\277]{5} # 6-byte pattern 
    )*$ /x or print' FILE.dat 

Environment: AIX

Answer

Perl's Encode module can do this.

#!/usr/bin/perl 

use strict; 
use warnings; 

use Encode qw(encode decode); 

while (<>) { 
    # decode the utf-8 bytes and make them into characters 
    # and turn anything that's invalid into U+FFFD 
    my $string = decode("utf-8", $_); 

    # change any U+FFFD into spaces 
    $string =~ s/\x{fffd}/ /g; 

    # turn it back into utf-8 bytes and print it back out again 
    print encode("utf-8", $string); 
} 

Or a shorter command-line version:

perl -pe 'use Encode; $_ = Encode::decode("utf-8",$_); s/\x{fffd}/ /g; $_ = Encode::encode("utf-8", $_)'