How can I map a group of text lines as a whole to a node?

Suppose I have a plain text file with the following data:

DataSetOne
content
content
content

DataSetTwo
content
content
content
content

...and so on...

What I want to do is count how many content lines each data set has. For example, the result should be

<DataSetOne, 3>, <DataSetTwo, 4>
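
To make the expected result concrete, here is a minimal single-machine sketch in plain Java with no Hadoop involved. The class `DataSetLineCounter` and its `count` method are hypothetical names used only for illustration, and the sketch assumes that a header line starts with "DataSet" and that every other non-blank line is content.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: counts content lines per data set.
final class DataSetLineCounter {

    // Assumption: a header line starts with "DataSet";
    // every other non-blank line belongs to the most recent header.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String current = null; // header of the data set being read
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank separator line
            }
            if (line.startsWith("DataSet")) {
                current = line;
                counts.putIfAbsent(current, 0); // register the data set
            } else if (current != null) {
                counts.merge(current, 1, Integer::sum); // one more content line
            }
        }
        return counts;
    }
}
```

Calling `DataSetLineCounter.count` on the lines of the sample file above would give 3 for DataSetOne and 4 for DataSetTwo. The open question is how to get the same grouping when the file is processed by several nodes.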

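I am a beginner with Hadoop, and I don't know whether there is a way to map a chunk of data as a whole to a node. For example, send all of DataSetOne to node 1 and all of DataSetTwo to node 2.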

Can anyone give me an idea of how to achieve this?


2011-01-13 Terminal User

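First of all, your data sets are split across multiple map tasks if they are in separate files or if they exceed the configured block size. So if you have one 128 MB data set and your block size is 64 MB, Hadoop will split the file into 2 blocks and set up one mapper for each block, 2 mappers in total.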
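
The arithmetic behind that example is a ceiling division. The sketch below is only an estimate under that assumption: `SplitEstimate` is a hypothetical helper, not a Hadoop class, and the number of splits Hadoop really creates also depends on the input format and its settings.

```java
// Hypothetical helper for illustration; Hadoop computes its input splits internally.
final class SplitEstimate {

    // Ceiling division: how many block-sized pieces are needed to cover the file.
    static long blocks(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }
}
```

With a 128 MB file and a 64 MB block size, `SplitEstimate.blocks(128L << 20, 64L << 20)` evaluates to 2, which matches the two mappers described above.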
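This is like the wordcount example in the Hadoop tutorial. As David said, you need to map the input to key/value pairs and then reduce them. I would implement it like this: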

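import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumes a header line starts with "DataSet" and that no data set straddles an input split.
public class DataSetCount {
    public static class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text groupId = new Text(); // header of the data set being read

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) return;          // skip blank separator lines
            if (line.startsWith("DataSet")) {
                groupId.set(line);               // header line: start a new group
            } else {
                context.write(groupId, ONE);     // content line: one count for the group
            }
        }
    }

    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int size = 0;
            for (IntWritable v : values) {
                size += v.get();                 // add up the per-line counts
            }
            context.write(key, new IntWritable(size));
        }
    }
}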

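As David also said, you can use combiners. Combiners are simple reducers that run on the map output before the reduce phase to save resources. They can be set in the job configuration.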
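
To show what a combiner saves, here is a plain-Java simulation that does not use the Hadoop API. All names in it (`CombinerSketch`, `combine`, `reduce`) are hypothetical. In a real job the combiner class is registered on the `Job` object with `setCombinerClass`, and a summing reducer can usually serve as the combiner because addition gives the same total no matter how the records are grouped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation, no Hadoop API involved; all names are hypothetical.
final class CombinerSketch {

    // Combiner step: collapse one mapper's (key, 1) records
    // into a single partial count per key before anything is shuffled.
    static Map<String, Integer> combine(List<String> mapperKeys) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        for (String key : mapperKeys) {
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce step: add up the partial counts that arrive from all mappers.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, count) -> totals.merge(key, count, Integer::sum));
        }
        return totals;
    }
}
```

If one mapper emits five records for two data sets, `combine` turns them into two partial counts, so only two records instead of five would have to travel to the reducer, and `reduce` still produces the same totals.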


2011-01-15 18:06:00