How do I sort numerically in Hadoop's shuffle/sort phase?

My data looks like this; the first field is a number:

3 ... 
1 ... 
2 ... 
11 ...

I want to sort these lines by the first field numerically rather than alphabetically, so that after sorting they look like this:

1 ... 
2 ... 
3 ... 
11 ...

But Hadoop keeps giving me this:

1 ... 
11 ... 
2 ... 
3 ...

How can I fix this?

Source

2012-11-11 Alcott

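Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.

-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator should be added to the streaming command (with no spaces around the "=").
You need to specify the type of sort you want using mapred.text.key.comparator.options. Some useful options are -n (numeric sort) and -r (reverse sort).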

Example:

Create an identity mapper and reducer with the following code.

This is mapper.py and reducer.py (both files have the same contents):

#!/usr/bin/env python
import sys

# Identity step: echo every input line unchanged (minus surrounding whitespace).
for line in sys.stdin:
    print(line.strip())

This is the input.txt

This is the streaming command:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-n \
    -input /user/input.txt \
    -output /user/output.txt \
    -file ~/mapper.py \
    -mapper ~/mapper.py \
    -file ~/reducer.py \
    -reducer ~/reducer.py

With this command you will get the required output.
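Before submitting the job, the expected ordering can be checked locally. The following is a minimal sketch in plain Python, not Hadoop code: the function names and the sample records are made up for illustration, and it assumes the streaming default in which the key is the text before the first tab of each line.

```python
def identity_map(lines):
    """Mimic mapper.py: pass every line through, stripped of whitespace."""
    return [line.strip() for line in lines]


def numeric_shuffle_sort(records, sep="\t"):
    """Stand-in for the shuffle/sort phase with the -n option:
    order records by the integer value of the key (the text before sep)."""
    return sorted(records, key=lambda rec: int(rec.split(sep, 1)[0]))


def identity_reduce(records):
    """Mimic reducer.py: pass the sorted records through unchanged."""
    return list(records)


# Hypothetical sample input with the same keys as in the question.
sample = ["3\tc\n", "1\ta\n", "2\tb\n", "11\td\n"]
result = identity_reduce(numeric_shuffle_sort(identity_map(sample)))
print("\n".join(result))
```

The keys come out as 1, 2, 3, 11, which is the order the streaming job should produce when the numeric comparator option is in effect.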

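Notes:

I have used a simple single-key input. If you have multiple key fields and/or partitions, you will have to adjust mapred.text.key.comparator.options accordingly. Since I do not know your use case, my example is limited to this simple case.
The identity mapper is needed because an MR job needs at least one mapper to run.
The identity reducer is needed because the shuffle/sort phase does not run in a pure map-only job.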
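For keys with several fields, the comparator options follow the style of Unix sort's -k flag; for instance, -k2,2nr asks for the second key field, compared numerically, in reverse. The sketch below only emulates that ordering in plain Python so the effect is easy to see. The helper name, the "." separator and the sample keys are assumptions for illustration, and the order of tied keys in a real job may differ.

```python
def sort_by_key_field(keys, field, numeric=False, reverse=False, sep="."):
    """Order keys by a single 1-based field, loosely like -k<field>,<field>[n][r]."""
    def field_value(key):
        value = key.split(sep)[field - 1]
        return int(value) if numeric else value

    return sorted(keys, key=field_value, reverse=reverse)


# Hypothetical keys made of four dot-separated fields.
keys = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]

# Roughly what -k2,2nr asks for: second field, numeric, descending.
ordered = sort_by_key_field(keys, field=2, numeric=True, reverse=True)
for key in ordered:
    print(key)
```

The keys whose second field is 14 come first, then those with 12, then 11.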

Source

2012-11-12 11:32:27

Thank you very much for the code example. – Alcott

Is it possible to change the sort order? – masu

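Hadoop's default comparator compares your keys according to the Writable type you use (more precisely, WritableComparable). If you are working with IntWritable or LongWritable, it sorts them numerically.

I assume you are using Text in your example, so you end up with its natural sort order, which is lexicographic: "11" sorts before "2".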
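The difference between the two orderings can be reproduced outside Hadoop. This is a small plain-Python illustration, not Hadoop code; it assumes Text keys are compared by their UTF-8 bytes and that integer keys are compared by value.

```python
# Keys from the question, as the strings a Text key would hold.
keys = ["3", "1", "2", "11"]

# Text-style ordering: compare the UTF-8 bytes of each key.
bytewise = sorted(keys, key=lambda k: k.encode("utf-8"))

# IntWritable-style ordering: compare the parsed integer values.
numeric = sorted(keys, key=int)

print(bytewise)  # ['1', '11', '2', '3']
print(numeric)   # ['1', '2', '3', '11']
```

The first list matches the output the question complains about; the second is the order being asked for.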

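In special cases, however, you can also write your own comparator.
For example (for testing purposes only), here is a quick sample of how to change the sort order of Text keys. It treats them as integers and produces a numerical sort order: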

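import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

// Sorts Text keys by their integer value instead of by byte order.
public class MyComparator extends WritableComparator {

    public MyComparator() {
        super(Text.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            // A serialized Text starts with a vint length prefix; skip it.
            int n1 = WritableUtils.decodeVIntSize(b1[s1]);
            int n2 = WritableUtils.decodeVIntSize(b2[s2]);
            String v1 = Text.decode(b1, s1 + n1, l1 - n1);
            String v2 = Text.decode(b2, s2 + n2, l2 - n2);

            // Throws NumberFormatException if a key is not an integer.
            int v1Int = Integer.parseInt(v1.trim());
            int v2Int = Integer.parseInt(v2.trim());

            return Integer.compare(v1Int, v2Int);
        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
    }
}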

In your job runner class, set:

Job job = new Job(); 
... 
job.setSortComparatorClass(MyComparator.class);

Source

2012-11-11 16:47:41

Thanks, but I don't write 'java'. – Alcott

@Alcott: For 'Hadoop-streaming', please refer to: http://hadoop.apache.org/docs/r1.0.4/streaming.html#Hadoop+Comparator+Class –
