MapReduce with the filename as key and the file contents as value, for many small files

I have looked at "FileInputFormat where filename is KEY and text contents are VALUE", "How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?" and "Getting Filename/FileData as key/value input for Map when running a Hadoop MapReduce Job", but I am having trouble getting started. I have not done anything with Hadoop before, so I am wary of heading down the wrong path when someone else might spot my mistake.

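I have a directory containing something like 100K small HTML files, and I want to build an inverted index with Amazon Elastic MapReduce, implemented in Java. Once I have the file contents, I know what I want my map and reduce functions to do.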
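For readers who want to see what the (filename, contents) pairs feed into, here is a minimal in-memory sketch of an inverted index in plain Java, with no Hadoop involved. The class name InvertedIndexSketch, the sample filenames and the regex-based tag stripping are all assumptions made for illustration; the asker's real map and reduce functions are not shown in the question and may differ.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // "Map" step: for one file, emit a (term, filename) pair per token.
    static List<Map.Entry<String, String>> map(String filename, String contents) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        // Crude tag stripping (assumption); real HTML needs a proper parser.
        String text = contents.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        for (String term : text.split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(term, filename));
            }
        }
        return out;
    }

    // "Shuffle + reduce" step: group pairs by term and collect the
    // distinct filenames for each term in sorted order.
    static Map<String, SortedSet<String>> buildIndex(Map<String, String> files) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (Map.Entry<String, String> pair : map(file.getKey(), file.getValue())) {
                index.computeIfAbsent(pair.getKey(), k -> new TreeSet<>())
                     .add(pair.getValue());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical filenames standing in for the URL-derived names.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("example.com_a.html", "<p>Hadoop map</p>");
        files.put("example.com_b.html", "<p>Hadoop reduce</p>");
        System.out.println(buildIndex(files));
    }
}
```

In a real job the grouping in buildIndex is done by the framework's shuffle, so only the per-file tokenising and the per-term collection would live in the mapper and reducer.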

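After looking here, my understanding is that I need to subclass FileInputFormat and override isSplitable. However, my filenames are derived from the URLs the HTML came from, so I want to keep them. What do I need to do to replace NullWritable with Text? Any other suggestions?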

2015-12-07 kcmgrew

You should use WholeFileInputFormat to pass each whole file to your mapper:

// conf is an org.apache.hadoop.mapred.JobConf (the old "mapred" API)
conf.setInputFormat(WholeFileInputFormat.class);   // one record per input file
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
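As far as I know, WholeFileInputFormat is not a class that ships with Hadoop, so you would have to write it yourself. Below is a minimal sketch against the old org.apache.hadoop.mapred API (to match conf.setInputFormat above) that uses a Text key holding the file path instead of NullWritable. The class names and the choice of BytesWritable for the value are assumptions; treat it as a starting point rather than a tested implementation.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical input format: key = file path, value = raw file bytes.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // never split, so each file becomes exactly one record
    }

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

// Reads the single file behind a split as one (path, bytes) record.
class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
    private final FileSplit split;
    private final Configuration conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, Configuration conf) {
        this.split = split;
        this.conf = conf;
    }

    @Override
    public Text createKey() { return new Text(); }

    @Override
    public BytesWritable createValue() { return new BytesWritable(); }

    @Override
    public boolean next(Text key, BytesWritable value) throws IOException {
        if (processed) {
            return false; // the one record was already returned
        }
        // The whole file is held in memory; fine for small files only.
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            key.set(file.toString());               // filename as the key
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public long getPos() { return processed ? split.getLength() : 0; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { /* stream is closed in next() */ }
}
```

In the mapper the HTML can then be recovered with something like new String(value.getBytes(), 0, value.getLength(), StandardCharsets.UTF_8), assuming the pages are UTF-8. If only the last path component matters, file.getName() can be used for the key instead of file.toString().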

2015-12-07 08:47:07
