Unwanted characters in Hadoop job output

I wrote a simple program to collect some statistics about bigrams in some data. I print the statistics to a custom file:

Path file = new Path(context.getConfiguration().get("mapred.output.dir") + "/bigram.txt"); 
FSDataOutputStream out = file.getFileSystem(context.getConfiguration()).create(file);

My code has the following lines:

Text.writeString(out, "total number of unique bigrams: " + uniqBigramCount + "\n"); 
Text.writeString(out, "total number of bigrams: " + totalBigramCount + "\n"); 
Text.writeString(out, "number of bigrams that appear only once: " + onceBigramCount + "\n");

I get the following output in vim/gedit:

'total number of unique bigrams: 424462 
!total number of bigrams: 1578220 
0number of bigrams that appear only once: 296139

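Besides the unwanted characters at the beginning of each line, there are also some non-printing characters. What is the reason behind this?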

Source

2012-07-25 abhinavkulkarni

How are you viewing it? – 2012-07-25 05:18:20

@Thomas Jungblut: vim/gedit – abhinavkulkarni 2012-07-25 05:28:12

I believe this is the length of the string (which is written in front of it) that is causing some binary skew. – 2012-07-25 07:14:15

As @ThomasJungblut says, the writeString method writes two values on every call: the length of the string (as a vint) and the string's bytes:

/** Write a UTF8 encoded string to out 
*/ 
public static int writeString(DataOutput out, String s) throws IOException { 
    ByteBuffer bytes = encode(s); 
    int length = bytes.limit(); 
    WritableUtils.writeVInt(out, length); 
    out.write(bytes.array(), 0, length); 
    return length; 
}
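The length prefix accounts for the exact characters in the question. Here is a minimal sketch with no Hadoop dependency, assuming that a vint for a length of 127 or less is stored as a single byte. It computes the UTF-8 length of each line and shows that length as a character. The class and method names (LengthPrefixDemo, utf8Length, prefixAsChar) are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class LengthPrefixDemo {
    // UTF-8 byte length of s: the value writeString emits before the string bytes.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Assumption: for lengths up to 127 the vint is a single byte,
    // so a text editor renders that byte as one ASCII character.
    static char prefixAsChar(String s) {
        int length = utf8Length(s);
        if (length > 127) {
            throw new IllegalArgumentException("length needs more than one byte: " + length);
        }
        return (char) length;
    }

    public static void main(String[] args) {
        String[] lines = {
            "total number of unique bigrams: 424462\n",
            "total number of bigrams: 1578220\n",
            "number of bigrams that appear only once: 296139\n"
        };
        for (String line : lines) {
            System.out.printf("length=%d prefix=%c%n", utf8Length(line), prefixAsChar(line));
        }
        // prints:
        // length=39 prefix='
        // length=33 prefix=!
        // length=48 prefix=0
    }
}
```

The three prefixes match the ', ! and 0 seen at the start of the lines in vim/gedit.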

If you just want to print text to the file (i.e. human-readable output), I suggest wrapping the out variable in a PrintStream and using its println or printf methods:

PrintStream ps = new PrintStream(out); 
ps.printf("total number of unique bigrams: %d\n", uniqBigramCount); 
ps.printf("total number of bigrams: %d\n", totalBigramCount); 
ps.printf("number of bigrams that appear only once: %d\n", onceBigramCount); 
ps.close();
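To check that a PrintStream writes only the text itself, here is a small sketch that uses a ByteArrayOutputStream as a stand-in for the Hadoop output stream. The class name PrintStreamDemo and the render helper are hypothetical. The explicit "UTF-8" argument is an addition of mine: new PrintStream(out) uses the platform default charset, which may or may not be what you want.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintStreamDemo {
    // Writes one statistics line through a PrintStream and returns the raw bytes.
    static byte[] render(long uniqBigramCount) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream(); // stand-in for FSDataOutputStream
        try {
            PrintStream ps = new PrintStream(buffer, false, "UTF-8");
            ps.printf("total number of unique bigrams: %d\n", uniqBigramCount);
            ps.close(); // flushes the stream and closes the underlying buffer
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = render(424462L);
        // The first byte is the 't' of "total", not a length prefix.
        System.out.println((char) bytes[0]);
        System.out.println(bytes.length);
    }
}
```

With the same count as in the question, the output is 39 bytes long and starts with t, so nothing is written in front of the text.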

Source

2012-07-25 10:35:15

@Thomas Jungblut and Chris: Thanks for your answers. Chris's suggestion worked. – abhinavkulkarni 2012-07-25 22:36:40
