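How to convert a .txt file to Hadoop's sequence file format

To use map-reduce jobs in Hadoop effectively, I need my data stored in Hadoop's sequence file format. However, the data is currently in plain, flat .txt format. Can anyone suggest a way to convert a .txt file to a sequence file?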

2011-03-21 Abhishek Pathak

The simplest answer is an "identity" job that reads the text files and writes SequenceFile output.

In Java it looks like this:

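import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// The class name is illustrative; the original answer showed only main().
public class ConvertTextToSequenceFile {

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {

        Configuration conf = new Configuration();
        // On older releases that lack Job.getInstance, use: new Job(conf)
        Job job = Job.getInstance(conf);
        job.setJobName("Convert Text");
        job.setJarByClass(ConvertTextToSequenceFile.class);

        // The base Mapper and Reducer classes pass records through unchanged
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // increase if you need sorting or a special number of files
        job.setNumReduceTasks(0);

        // TextInputFormat yields (byte offset, line) records
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setInputFormatClass(TextInputFormat.class);

        TextInputFormat.addInputPath(job, new Path("/lol"));
        SequenceFileOutputFormat.setOutputPath(job, new Path("/lolz"));

        // submit and wait for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}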

2011-03-21 18:58:20

So if I have 100 .txt files, this will give me 100 .seq files, right? What if I want one big .seq file? – dranxo 2012-08-03 23:00:35

My guess: job.setNumReduceTasks(1); – dranxo 2012-08-03 23:07:15

@rcompton Exactly. – 2012-08-04 08:16:01

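It depends on the format of the TXT files. Is it one line per record? If so, you can simply use TextInputFormat, which creates one record for each line. In your mapper you can parse that line and use it however you choose.

If it is not one line per record, you may need to write your own InputFormat implementation. See this tutorial for more information.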
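
To make the one-record-per-line case concrete, here is a minimal plain-Java sketch (no Hadoop dependency) of the shape of the records TextInputFormat hands to a mapper: the key is the byte offset at which a line starts and the value is the line itself. The names LineRecordSketch and toRecords are made up for illustration, and the sketch assumes UTF-8 text with \n line endings. It only imitates the shape of the records, not Hadoop's actual split handling.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: imitates the (byte offset, line)
// records that TextInputFormat produces for a text file.
public class LineRecordSketch {

    // Returns a map from the starting byte offset of each line to the line text.
    // Assumes UTF-8 content and '\n' as the only line terminator.
    public static Map<Long, String> toRecords(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        int start = 0;
        while (start < content.length()) {
            int end = content.indexOf('\n', start);
            boolean hasNewline = end >= 0;
            if (!hasNewline) {
                end = content.length();
            }
            String line = content.substring(start, end);
            records.put(offset, line);
            // Advance by the encoded length of the line plus the newline byte, if any
            offset += line.getBytes(StandardCharsets.UTF_8).length + (hasNewline ? 1 : 0);
            start = end + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        toRecords("first line\nsecond line\nthird")
            .forEach((key, value) -> System.out.println(key + "\t" + value));
    }
}
```

With an identity job, pairs of this shape are what end up in the sequence file, as LongWritable keys and Text values.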

2011-03-21 13:27:48 bajafresh4life

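If your data is not in HDFS yet, you need to upload it to HDFS. There are two options:

i) Run hadoop fs -put (or hdfs dfs -put) on your .txt file; once it is in HDFS, you can convert it to a seq file.
ii) Take the text file as input on your HDFS client box and convert it to a SeqFile with the sequence file API, by creating a SequenceFile.Writer and appending (key, value) pairs to it.

If you don't care about the key, you can use the line number as the key and the complete line of text as the value.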
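
The line-number-as-key idea can be sketched in plain Java as follows. This is only an illustration under assumptions: LineNumberKeys, emit and the BiConsumer sink are hypothetical names, and the sink stands in for the place where a real converter would set an IntWritable and a Text and call SequenceFile.Writer.append(key, value). Line numbers start at 1 here, which is an arbitrary choice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.BiConsumer;

// Hypothetical helper, not part of Hadoop: pairs each line with its line number.
public class LineNumberKeys {

    // Reads the source line by line and passes (line number, line text) to the sink.
    // Returns the number of records emitted.
    public static int emit(Reader source, BiConsumer<Integer, String> sink) {
        int lineNumber = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                // In a real converter this is where writer.append(key, value) would go
                sink.accept(lineNumber, line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lineNumber;
    }

    public static void main(String[] args) {
        int count = emit(new StringReader("alpha\nbeta\ngamma"),
            (key, value) -> System.out.println(key + "\t" + value));
        System.out.println(count + " records");
    }
}
```

For a file on the client box, the Reader could come from java.nio.file.Files.newBufferedReader. Keeping the sink separate from the reading loop means the same loop works whether the records are printed, collected, or appended to a sequence file.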

2011-03-31 19:03:46 user656189

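I need to use the first option. How do I do that? – zohar 2012-01-23 15:00:11

You can also create an intermediate table, LOAD DATA the CSV contents straight into it, then create a second table stored as sequencefile (partitioned, clustered, etc.) and populate it with an insert ... select from the intermediate table. You can also set options such as compression, for example: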

set hive.exec.compress.output = true; 
set io.seqfile.compression.type = BLOCK; 
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec; 

create table... stored as sequencefile; 

insert overwrite table ... select * from ...;

The MR framework takes care of the heavy lifting for you, so you are spared the trouble of writing Java code.

2012-01-17 15:39:59 Mario

import java.io.IOException; 
import java.net.URI; 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.IOUtils; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.SequenceFile; 
import org.apache.hadoop.io.Text; 

//White, Tom (2012-05-10). Hadoop: The Definitive Guide (Kindle Locations 5375-5384). OReilly Media - A. Kindle Edition. 

public class SequenceFileWriteDemo {

    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        // args[0] is the URI of the sequence file to create
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            // This createWriter overload is deprecated in newer Hadoop releases
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                // Print the current position in the file, then append the record
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

2012-08-30 18:17:29

Nice simple example! – user249654 2013-05-23 06:32:02

What is uri here? – 2014-02-07 04:41:07

If you have Mahout installed, it has a tool called seqdirectory that can do this.

2014-06-02 20:41:26
