How to pass multiple input format files to a map-reduce job?

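I am writing a map-reduce program to query a Cassandra column family. I only need to read a subset of rows (selected by row key) from one column family, and I already have the set of row keys for the rows I have to read. How can I pass this "row key set" to the map-reduce job so that it outputs only that subset of rows from the Cassandra column family?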

Summary:

    class GetRows
    {
        public Set<String> getRowKeys()
        {
            // logic ...
            return rowKeys; // a Set<String>
        }
    }


    class MapReduceCassandra
    {
        // input format: ColumnFamilyInputFormat
        // ...
        // also needs the set of input keys. How do I get it here?
    }

Can anyone suggest the best way to invoke MapReduce from a Java application, and how to pass the set of keys to the MapReduce job?


2014-02-20 Anudeep

Invoking MapReduce from Java

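To do this, you can use the classes from the org.apache.hadoop.mapreduce package in your Java application (a very similar approach works with the older mapred API; just check the API docs):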

Job job = Job.getInstance(new Configuration()); 
// configure job: set input and output types and directories, etc. 

job.setJarByClass(MapReduceCassandra.class); 
job.submit();

Passing data to the MapReduce job

If your set of row keys is really small, you can serialize it to a string and pass it as a configuration parameter:

job.getConfiguration().set("CassandraRows", getRowsKeysSerialized()); // TODO: implement serializer 

//... 

job.submit();

On the job side, you can then access the parameter through the context object:

public void map(
    IntWritable key, // your key type 
    Text value,  // your value type 
    Context context 
) 
{ 
    // ... 

    String rowsSerialized = context.getConfiguration().get("CassandraRows"); 
    String[] rows = deserializeRows(rowsSerialized); // TODO: implement deserializer 

    //... 
}
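
The two TODO helpers above are left open. As a minimal sketch, assuming the row keys are non-empty strings, they could be implemented with the standard library alone. The class name RowKeyCodec and its method names are hypothetical, not part of Hadoop or Cassandra. Each key is Base64-encoded before joining, so a comma inside a key cannot be confused with the separator:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical helper (illustrative names, not a Hadoop or Cassandra API).
final class RowKeyCodec {
    private RowKeyCodec() {}

    // Turns the key set into one comma-separated string of Base64 tokens.
    // Assumption: keys are non-empty strings.
    static String serialize(Set<String> rowKeys) {
        StringJoiner joiner = new StringJoiner(",");
        for (String key : rowKeys) {
            byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
            joiner.add(Base64.getEncoder().encodeToString(bytes));
        }
        return joiner.toString();
    }

    // Reverses serialize(); a null or empty input yields an empty set.
    static Set<String> deserialize(String serialized) {
        Set<String> rowKeys = new LinkedHashSet<>();
        if (serialized == null || serialized.isEmpty()) {
            return rowKeys;
        }
        for (String token : serialized.split(",")) {
            byte[] bytes = Base64.getDecoder().decode(token);
            rowKeys.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return rowKeys;
    }
}
```

With this sketch, the driver would call RowKeyCodec.serialize(...) when setting "CassandraRows", and the mapper would call RowKeyCodec.deserialize(...) on the value read from the configuration. It returns a Set rather than the String[] used above, which makes membership checks cheaper.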

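However, if your set is potentially unbounded, passing it as a parameter is a bad idea. Instead, you should write the keys to a file and make use of the distributed cache. You then just add this line to the section above, before you submit the job: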

job.addCacheFile(new Path(pathToCassandraKeySetFile).toUri()); 

//... 

job.submit();

Inside the job, you can then access this file through the context object:

public void map(
    IntWritable key, // your key type 
    Text value,  // your value type 
    Context context 
) 
{ 
    // ... 

    URI[] cacheFiles = context.getCacheFiles(); 

    // find, open and read your file here 

    // ... 
}
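
The "find, open and read your file" step is left open above. A minimal sketch of the reading part, assuming the key file is UTF-8 text with one row key per line and that the cached copy is reachable as a local file on the task node (how cache files are localized depends on your Hadoop version and configuration). The class name RowKeyFile is hypothetical:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path; // note: java.nio Path, not org.apache.hadoop.fs.Path
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (illustrative names, not a Hadoop or Cassandra API).
final class RowKeyFile {
    private RowKeyFile() {}

    // Reads one row key per line, skipping blank lines and trimming whitespace.
    // Assumption: the file is UTF-8 and small enough to hold in memory.
    static Set<String> load(Path localFile) throws IOException {
        Set<String> rowKeys = new HashSet<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.trim();
                if (!key.isEmpty()) {
                    rowKeys.add(key);
                }
            }
        }
        return rowKeys;
    }
}
```

It is probably better to load the set once in the mapper's setup(Context) method and keep it in a field, rather than re-reading the file on every map() call. map() can then skip any row whose key is not in the set.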

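Note: all of this applies to the new API (org.apache.hadoop.mapreduce). If you are using org.apache.hadoop.mapred, the approach is very similar, but some of the relevant methods are called on different objects.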


2014-02-20 21:21:07

Thanks, Daniel S. It worked. – Anudeep
