2012-07-25 31 views
0

Unwanted characters in Hadoop job output

I wrote a simple program to collect statistics about bigrams in some data. I print the statistics to a custom file.

Path file = new Path(context.getConfiguration().get("mapred.output.dir") + "/bigram.txt"); 
FSDataOutputStream out = file.getFileSystem(context.getConfiguration()).create(file); 

My code then writes to it with the following lines:

Text.writeString(out, "total number of unique bigrams: " + uniqBigramCount + "\n"); 
Text.writeString(out, "total number of bigrams: " + totalBigramCount + "\n"); 
Text.writeString(out, "number of bigrams that appear only once: " + onceBigramCount + "\n"); 

I get the following output when I open the file in vim/gedit:

'total number of unique bigrams: 424462 
!total number of bigrams: 1578220 
0number of bigrams that appear only once: 296139 

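Besides the unwanted characters at the beginning of each line, there are also some non-printing characters. What is the reason behind this?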

+0

How are you viewing the file? – 2012-07-25 05:18:20

+0

@Thomas Jungblut: vim/gedit – abhinavkulkarni 2012-07-25 05:28:12

+2

I believe it is the length of the string (written in front of it) that produces the binary junk. – 2012-07-25 07:14:15

Answers

1

As @ThomasJungblut said, the writeString method writes out two values on every call: the length of the string (as a VInt) and then the string's bytes:

/** Write a UTF8 encoded string to out 
*/ 
public static int writeString(DataOutput out, String s) throws IOException { 
    ByteBuffer bytes = encode(s); 
    int length = bytes.limit(); 
    WritableUtils.writeVInt(out, length); 
    out.write(bytes.array(), 0, length); 
    return length; 
} 
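
The stray characters at the start of each line in the question are consistent with this length prefix. For lengths up to 127, Hadoop's variable-length integer encoding fits in a single byte, so a text editor displays that byte as an ordinary character. Below is a minimal, Hadoop-free sketch of that single-byte case only. The class and method names are hypothetical, and the code imitates rather than calls WritableUtils.writeVInt.

```java
import java.nio.charset.StandardCharsets;

class LengthPrefixDemo {
    // Returns the byte a one-byte VInt length prefix would hold, as a char.
    // Assumption of this sketch: the UTF-8 length is at most 127, so the
    // prefix is a single byte. Longer strings need a multi-byte prefix.
    static char prefixChar(String s) {
        int length = s.getBytes(StandardCharsets.UTF_8).length;
        if (length > 127) {
            throw new IllegalArgumentException("needs a multi-byte VInt: " + length);
        }
        return (char) length;
    }

    public static void main(String[] args) {
        // The three strings from the question, including the trailing newline.
        String[] lines = {
            "total number of unique bigrams: 424462\n",
            "total number of bigrams: 1578220\n",
            "number of bigrams that appear only once: 296139\n"
        };
        for (String line : lines) {
            char prefix = prefixChar(line);
            System.out.println((int) prefix + " -> " + prefix);
        }
    }
}
```

The three strings are 39, 33 and 48 bytes long, which are the ASCII codes of `'`, `!` and `0`, matching the characters seen in vim/gedit.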

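If you just want to print text output to the file (i.e. something human readable), then I suggest you wrap the out variable in a PrintStream and use its println or printf methods: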

PrintStream ps = new PrintStream(out); 
ps.printf("total number of unique bigrams: %d\n", uniqBigramCount); 
ps.printf("total number of bigrams: %d\n", totalBigramCount); 
ps.printf("number of bigrams that appear only once: %d\n", onceBigramCount); 
ps.close(); 
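
To check that this approach writes only the characters themselves, here is a small self-contained sketch. It uses a ByteArrayOutputStream as a stand-in for the HDFS output stream (an assumption made so it runs without Hadoop), and the class and method names are hypothetical. It also names the charset explicitly, which the snippet above does not do, so the output does not depend on the JVM default encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintStreamDemo {
    // Writes one formatted line through a PrintStream and returns the raw bytes.
    static byte[] writeLine(long count) {
        // Stand-in for the FSDataOutputStream used in the answer.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            // autoFlush = false; the charset is given by name.
            PrintStream ps = new PrintStream(sink, false, "UTF-8");
            ps.printf("total number of bigrams: %d\n", count);
            ps.close(); // flushes, then closes the underlying stream
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this is not expected.
            throw new IllegalStateException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = writeLine(1578220L);
        // The first byte is the first character of the text, not a length.
        System.out.println("first byte: " + (char) bytes[0]);
        System.out.println("byte count: " + bytes.length);
    }
}
```

Closing the PrintStream also closes the stream it wraps, so the wrapped stream needs no separate close() call.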
+0

@Thomas Jungblut and Chris: Thanks for your answers. Chris's suggestion worked. – abhinavkulkarni 2012-07-25 22:36:40
