Normalizing UTF-8 filenames stored in JSON using Perl

I have two JSON files that come from different operating systems.

Both files are encoded in UTF-8 and contain UTF-8 encoded filenames.

One file comes from OS X and contains the filename in NFD form (od -bc):

0000160 166 145 164 154 141 314 201 057 110 157 165 163 145 040 155 145 
      v e t l a ́ ** / H o u s e  m e

The second file contains the same filename, but in NFC form:

000760 166 145 164 154 303 241 057 110 157 165 163 145 040 155 145 163 
      v e t l á ** / H o u s e  m e s

As I have learned, this is called "different normalization", and there is a CPAN module for handling it: Unicode::Normalize.
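To make the difference concrete, here is a minimal sketch using only Unicode::Normalize. The sample strings mirror the two od dumps above; the variable names and the codepoints helper are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical sample strings, mirroring the two od dumps above.
my $precomposed = "vetl\x{E1}";     # NFC: U+00E1 LATIN SMALL LETTER A WITH ACUTE
my $decomposed  = "vetla\x{301}";   # NFD: "a" followed by U+0301 COMBINING ACUTE ACCENT

# Render a string as a list of code points for inspection.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord $_) } split //, $string;
}

print "NFC form: ", codepoints($precomposed), "\n";
print "NFD form: ", codepoints($decomposed),  "\n";

# As plain strings the two differ, so they would be different hash keys ...
my $same_raw = $precomposed eq $decomposed ? 1 : 0;

# ... but once both are brought to one form they compare equal.
my $same_nfd = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;
my $same_nfc = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

print "raw equal: $same_raw, NFD equal: $same_nfd, NFC equal: $same_nfc\n";
```

Which form is chosen does not matter for the comparison, as long as both sides use the same one.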

I am reading both files with the following:

my $json1 = decode_json read_file($file1, {binmode => ':raw'}) or die "..." ; 
my $json2 = decode_json read_file($file2, {binmode => ':raw'}) or die "..." ;

read_file is from File::Slurp and decode_json is from JSON::XS.

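After the JSON is read into Perl structures, the filenames from one JSON file end up as hash keys and the filenames from the second file end up as values. I need to find where a key from the first hash is equivalent to a value from the second hash, so I need to make sure they are "binary" identical.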

I tried the following:

grep 'House' file1.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc

and

grep 'House' file2.json | perl -CSAD -MUnicode::Normalize -nlE 'print NFD($_)' | od -bc

Both produce what looks to me like identical output.

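Now the questions:

How can I simply read both JSON files so that both $hashrefs end up with the same normalization?

Or do I need to run something like this over the hash after decode_json?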

while(my($k,$v) = each(%$json1)) { 
    $copy->{ NFD($k) } = NFD($v); 
}
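The loop above only handles a flat hash. If the decoded JSON is nested, the same idea could be sketched as a recursive copy. The name normalize_deep is made up for this illustration and is not part of any module; NFC is used here, but NFD would work the same way.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# normalize_deep is a hypothetical helper, not part of any module.
# It returns a normalized copy of a decoded JSON structure: hash keys
# and string values go through NFC, arrays and hashes are walked
# recursively. Note that plain numbers come back as strings.
sub normalize_deep {
    my ($data) = @_;
    if (ref $data eq 'HASH') {
        return { map { ( NFC($_) => normalize_deep($data->{$_}) ) } keys %$data };
    }
    if (ref $data eq 'ARRAY') {
        return [ map { normalize_deep($_) } @$data ];
    }
    # Leave undef and other references (e.g. JSON booleans) untouched.
    return $data if !defined $data || ref $data;
    return NFC($data);
}

# Sample nested data in NFD, standing in for a decoded JSON file.
my $in  = { "vetla\x{301}" => [ "a\x{301}", { k => "e\x{301}" } ] };
my $out = normalize_deep($in);

print exists $out->{"vetl\x{E1}"} ? "key normalized\n" : "key NOT normalized\n";
```

This still builds a second copy of the data, which is exactly the cost the question hopes to avoid.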

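In short:

How can I read different JSON files so that the "internal" Perl $hrefs have the same normalization? Ideally in some nicer way than explicitly running NFD on every key and value and creating another (big) NFD-normalized copy of the hashes.

Any hints or suggestions, please...

Because my English is very bad, here is a simulation of the problem: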

use 5.014; 
use warnings; 

use utf8; 
use feature qw(unicode_strings); 
use charnames qw(:full); 
use open qw(:std :utf8); 
use Encode qw(encode decode); 
use Unicode::Normalize qw(NFD NFC); 

use File::Slurp; 
use Data::Dumper; 
use JSON::XS; 

# Create two files that contain different "normalizations"
my($nfc, $nfd);
$nfc->{ NFC('key') } = NFC('vál'); 
$nfd->{ NFD('vál') } = 'something'; 

#save as NFC - this comes from "FreeBSD" 
my $jnfc = JSON::XS->new->encode($nfc); 
open my $fd, ">:utf8", "nfc.json" or die("nfc"); 
print $fd $jnfc; 
close $fd; 

#save as NFD - this comes from "OS X" 
my $jnfd = JSON::XS->new->encode($nfd); 
open $fd, ">:utf8", "nfd.json" or die("nfd"); 
print $fd $jnfd; 
close $fd; 

#now read them 
my $jc = decode_json read_file("nfc.json", { binmode => ':raw' }) or die "No file" ; 
my $jd = decode_json read_file("nfd.json", { binmode => ':raw' }) or die "No file" ; 

say $jd->{ $jc->{key} } // "NO FOUND"; #wanted to print "something" 

my $jc2; 
# is there a better way to do this?
while(my($k,$v) = each(%$jc)) { 
    $jc2->{ NFD($k) } = NFD($v); 
} 
say $jd->{ $jc2->{key} } // "NO FOUND"; #OK

Source

2013-07-02 kobame

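While searching for a solution to your question I found out that this software is C*RP :) See: https://stackoverflow.com/a/17448888/632407

Anyway, I found a solution for your particular problem, which is how to read JSON containing filenames regardless of their normalization:

Instead of your: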

#now read them 
my $jc = decode_json read_file("nfc.json", { binmode => ':raw' }) or die "No file" ; 
my $jd = decode_json read_file("nfd.json", { binmode => ':raw' }) or die "No file" ;

use this:

#now read them 
my $jc = get_json_from_utf8_file('nfc.json') ; 
my $jd = get_json_from_utf8_file('nfd.json') ; 
... 

sub get_json_from_utf8_file {
    my $file = shift;
    # Read inside out:
    #   read_file    the file contains UTF-8 encoded text, so read it as characters
    #   NFC          convert to precomposed normalization, regardless of the source
    #   encode       decode_json wants a UTF-8 encoded binary string, so encode it
    #   decode_json  parse the JSON into a Perl structure
    return decode_json(
        encode( 'utf8',
            NFC( scalar read_file( $file, { binmode => ':utf8' } ) ) ) );
}

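This should (at least I hope) ensure that, whatever decomposition the JSON content uses, NFC converts it to the precomposed version, and JSON::XS then parses it into the same internal Perl structure.

So your example prints:

something

without traversing $json.

(The idea comes from Joseph Myers and Nemo.)

Maybe some more skilled programmers will give more hints.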
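A possible variation on the same pipeline, as a sketch rather than a tested replacement: a JSON object created without the utf8 option expects a character string, so the re-encoding step can be dropped. The helper name decode_normalized_json and the sample data are made up for this example, and JSON::PP is used only because it ships with Perl; JSON::XS offers the same calls.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);
use JSON::PP;    # core module; JSON::XS provides the same interface

# decode_normalized_json is a hypothetical helper name.
# Input: the raw UTF-8 bytes of a JSON document.
# Output: a Perl structure whose strings are all in NFC.
sub decode_normalized_json {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);    # bytes -> characters
    # Without ->utf8 the decoder expects a character string,
    # so no re-encoding is needed after NFC.
    return JSON::PP->new->decode( NFC($text) );
}

# Simulated file contents: the same name as a key (NFD) and as a value (NFC).
my $nfd_bytes = encode('UTF-8', qq({"vetla\x{301}":"something"}));
my $nfc_bytes = encode('UTF-8', qq({"key":"vetl\x{E1}"}));

my $jd = decode_normalized_json($nfd_bytes);
my $jc = decode_normalized_json($nfc_bytes);

my $found = $jd->{ $jc->{key} } // 'NOT FOUND';
print "$found\n";    # prints "something"
```

For real files, the bytes would come from read_file with binmode => ':raw', as in the question.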

Source

2013-07-03 15:43:18 jm666

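I found exactly this solution myself in the meantime, thanks. – kobame

Even though right now it may only matter that a few filenames are converted to the same normalization for comparison, other unexpected problems could show up almost anywhere if your JSON data has differing normalizations.

So my suggestion is to normalize the entire input from both sources as the first step, before doing any parsing (i.e., while reading the file and before decode_json). This should not corrupt any of your JSON structure, because the structure is delimited by ASCII characters. Your existing Perl code should then be able to blindly assume that all UTF-8 characters have the same normalization.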

my $rawdata1 = read_file($file1, {binmode => ':raw'}) or die "..."; 
my $rawdata2 = read_file($file2, {binmode => ':raw'}) or die "..."; 

my $json1 = decode_json NFD($rawdata1); 
my $json2 = decode_json NFD($rawdata2);

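To make this process slightly faster (it should already be plenty fast, since the module uses fast XS routines), you can find out whether one of the two data files is already in a certain normalization form, then leave that file unchanged and convert the other file into that form.

For example: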

my $rawdata1 = read_file($file1, {binmode => ':raw'}) or die "..."; 
my $rawdata2 = read_file($file2, {binmode => ':raw'}) or die "..."; 

if (checkNFD($rawdata1)) { 
    # then you know $file1 is already in Normalization Form D 
    # (i.e., it was formed by canonical decomposition). 
    # so you only need to convert $file2 into NFD 
    $rawdata2 = NFD($rawdata2); 
} 
my $json1 = decode_json $rawdata1; 
my $json2 = decode_json $rawdata2;

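Of course, you will naturally want to experiment during development to see whether one or the other of the input files is already in a normalized form. In the final version of your code you will then no longer need a conditional statement, and can simply convert the other input file into the same normalized form.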
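A small sketch of that check, assuming the input has already been decoded to characters (the comments further down report that normalizing the raw bytes fails). checkNFD is exported by Unicode::Normalize on request; the sample strings stand in for the file contents and are not taken from the real files.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD checkNFD);

# Sample inputs standing in for the decoded contents of the two files.
my $text1 = "vetla\x{301}/House";    # already decomposed
my $text2 = "vetl\x{E1}/House";      # precomposed

# checkNFD returns true when the string is already in NFD.
my $first_is_nfd  = checkNFD($text1) ? 1 : 0;
my $second_is_nfd = checkNFD($text2) ? 1 : 0;

print "file 1 in NFD: $first_is_nfd, file 2 in NFD: $second_is_nfd\n";

# Convert only the input that needs it.
$text2 = NFD($text2) unless $second_is_nfd;

print "equal after conversion: ", ($text1 eq $text2 ? 1 : 0), "\n";
```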

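Also note that it is recommended to produce output in NFC form (if your program produces any output that will be stored and used later). See here, for example: http://www.perl.com/pub/2012/05/perlunicookbook-unicode-normalization.html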
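One way to follow that advice when writing JSON back out, sketched under assumptions: the data and variable names are invented for the example, and JSON::PP stands in for JSON::XS only because it is a core module with the same interface.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC);
use JSON::PP;    # core module; JSON::XS provides the same interface

# Sample data whose key is in NFD.
my $data = { "vetla\x{301}" => 'something' };

# Without ->utf8, encode returns a character string that can be
# normalized as a whole; canonical only makes the key order stable.
my $json_text = NFC( JSON::PP->new->canonical->encode($data) );

# Encode to UTF-8 bytes as the last step, ready to be written
# through a :raw file handle.
my $json_bytes = encode('UTF-8', $json_text);

printf "%d characters, %d bytes\n", length $json_text, length $json_bytes;
```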

Source

2013-07-02 19:23:22

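Unfortunately, decode_json NFD(read_file("nfd.json", {binmode => ':raw'})) does not work. I get the error message "Wide character in subroutine entry at read2.pl line 18.", complaining about decode_json. – kobame

Yes, problems like this are tedious, and simple solutions that ought to work often don't. Let me look into it more closely and see if I can diagnose it for you (I am trying to do SO and tutor at a tutoring center at the same time right now). This problem is just like the ones I deal with every week: lots of unexpected quirks show that the world is not yet perfect when it comes to Unicode, although when I point out that reality I just get a bunch of downvotes, like here :( http://stackoverflow.com/questions/16594636/using-html-symbol-entities-instead-of-the-actual-symbol/16594962#16594962 –

I suggest you use NFC as your normalization and then pass the result to decode_json. Try it. –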
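The "Wide character" error reported above is consistent with normalizing raw bytes instead of decoded characters. The sketch below uses invented sample data to show the effect and one way around it: decode first, normalize, then encode again before handing the result to decode_json.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

# Raw UTF-8 bytes, as read with binmode => ':raw' (sample data).
my $bytes = encode('UTF-8', "vetl\x{E1}");    # ends in 0xC3 0xA1

# Normalizing the bytes treats 0xC3 as the character U+00C3 (A with tilde),
# which NFD decomposes into "A" + U+0303, a code point above 255.
my $wrong    = NFD($bytes);
my $has_wide = $wrong =~ /[^\x00-\xFF]/ ? 1 : 0;

# Decoding first lets NFD see the real character U+00E1.
my $chars = NFD( decode('UTF-8', $bytes) );   # "vetla" + U+0301
my $right = encode('UTF-8', $chars);          # bytes again, safe for decode_json

# Octal dump in the style of od -b.
my $octal = join ' ', map { sprintf('%03o', ord $_) } split //, $right;

print "wide character after NFD on raw bytes: $has_wide\n";
print "bytes after decode/NFD/encode: $octal\n";
```

The octal dump of the corrected result matches the NFD bytes shown in the od output of the question (166 145 164 154 141 314 201).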

-1

Instead of traversing the data structure manually, let a module handle that for you.

Source

2013-07-02 20:22:22 daxim

Hm. I can't offer you a better "programming" solution. But why not simply run

perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' <freebsd.json>bsdok.json 
perl -CSDA -MUnicode::Normalize -0777 -nle 'print NFD($_)' < osx.json  >osxok.json

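so that your script can read and use both files, because they are then both in the same normalization? So instead of searching for some programming solution inside your script, solve the problem before the data enters the script. (The second command is unnecessary.) A simple conversion at the file level is surely easier than traversing data structures...

Source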

2013-07-02 20:54:19 Nemo
