A Java code snippet to create a compressed version of an existing HDFS file.
Hastily assembled in a text editor from a Java application I wrote some time ago, hence untested; expect a few typos and gaps.
// Java I/O streams (used by the read/write loop below)
import java.io.InputStream;
import java.io.OutputStream;
// HDFS API
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileStatus;
// native Hadoop compression libraries
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.io.compress.Lz4Codec;
..............
// Hadoop "Configuration" (and its derivatives for HDFS, HBase etc.) constructors try to auto-magically
// find their config files by searching CLASSPATH for directories, and searching each dir for hard-coded
// name "core-site.xml", plus "hdfs-site.xml" and/or "hbase-site.xml" etc.
// WARNING - if these config files are not found, the "Configuration" silently reverts to hard-coded
// defaults, resulting in bizarre error messages later; so let's run some explicit checks here
Configuration cnfHadoop = new Configuration() ;
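/*
// alternative to relying on CLASSPATH discovery: load the XML config files explicitly
// (the directory below is an assumption - point it at wherever your cluster config lives)
cnfHadoop.addResource(new Path("/etc/hadoop/conf/core-site.xml")) ;
cnfHadoop.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml")) ;
*/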
String propDefaultFs =cnfHadoop.get("fs.defaultFS") ;
if (propDefaultFs ==null || ! propDefaultFs.startsWith("hdfs://"))
{ throw new IllegalArgumentException(
"HDFS configuration is missing - no proper \"core-site.xml\" found, please add\n"
+"directory /etc/hadoop/conf/ (or custom dir with custom XML conf files) in CLASSPATH"
) ;
}
/*
// for a Kerberised cluster, either you already have a valid TGT in the default
// ticket cache (via "kinit"), or you have to authenticate by code
UserGroupInformation.setConfiguration(cnfHadoop) ;
UserGroupInformation.loginUserFromKeytab("[email protected]", "/some/path/to/user.keytab") ;
*/
FileSystem fsCluster =FileSystem.get(cnfHadoop) ;
Path source = new Path("/some/hdfs/path/to/XXX.har") ;
Path target = new Path("/some/hdfs/path/to/XXX.har.gz") ;
// alternative: "BZip2Codec" for better compression (but higher CPU cost)
// alternative: "SnappyCodec" or "Lz4Codec" for lower compression (but much lower CPU cost)
CompressionCodecFactory codecBootstrap = new CompressionCodecFactory(cnfHadoop) ;
CompressionCodec codecHadoop =codecBootstrap.getCodecByClassName(GzipCodec.class.getName()) ;
Compressor compressorHadoop =codecHadoop.createCompressor() ;
byte[] buffer = new byte[16*1024*1024] ;
int bufUsedCapacity ;
InputStream sourceStream =fsCluster.open(source) ;
OutputStream targetStream =codecHadoop.createOutputStream(fsCluster.create(target, true), compressorHadoop) ;
while ((bufUsedCapacity =sourceStream.read(buffer)) >0)
{ targetStream.write(buffer, 0, bufUsedCapacity) ; }
targetStream.close() ;
sourceStream.close() ;
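// Optional sanity check (a sketch, same untested caveat as the rest of this snippet):
// the factory can resolve the codec back from the ".gz" suffix and hand out a matching
// decompressing stream, so we can count how many bytes come back out
CompressionCodec codecCheck =codecBootstrap.getCodec(target) ;
InputStream checkStream =codecCheck.createInputStream(fsCluster.open(target)) ;
long totalBytes =0 ;
int n ;
while ((n =checkStream.read(buffer)) >0)
{ totalBytes +=n ; }
checkStream.close() ;
System.out.println("decompressed back to " +totalBytes +" bytes") ;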
..............
What kind of content is inside the HARs - CSV, JSON, unstructured text (e.g. logs), binary files? Have you considered un-archiving each HAR, compressing each file inside, then re-archiving? If not binary, have you considered merging the contents of each HAR (or of several HARs) into a single GZipped (or BZipped) file with an MR or Spark job? If structured, have you considered merging the contents of each HAR (or of several HARs) into a GZip-compressed columnar format such as Parquet or ORC? –
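A minimal sketch of the "merge into one GZipped file with a Spark job" option (assuming Spark 2.x, plain-text content, and the placeholder paths from the answer above; untested like the rest):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("har-to-single-gzip").getOrCreate();
spark.read().textFile("har:///some/hdfs/path/to/XXX.har/*")  // glob over the archived dirs
     .coalesce(1)                    // one partition => a single output file, no splits
     .write()
     .option("compression", "gzip")
     .text("/some/hdfs/path/to/XXX-merged");
spark.stop();

For the columnar option, the analogue would be writing a structured Dataset with .option("compression", "gzip").parquet(...) instead of .text(...).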
@SamsonScharfrichter The HARs will contain flat text files or Parquet files. No XMLs, but I don't want the data split. Gzipping each file is a problem because a HAR may contain 350+ directories, each with one file inside, and I don't know how to do that. I tried compressing that single HAR file with GZip compression using Pig. It compressed successfully but created part files, and since GZip is not splittable that would again be undesirable. Finally, multiple HARs can't be merged, because each HAR needs to be gzipped separately. – philantrovert
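As for the "gzip each file inside the HAR" route (the 350+ directories case): the har:// scheme exposes an archive as a read-only filesystem that can be walked recursively. A sketch reusing cnfHadoop, fsCluster and codecHadoop from the answer above - same untested caveat, and note that entries sharing a file name would collide in the flat target directory:

// extra imports, on top of those in the answer
import org.apache.hadoop.fs.LocatedFileStatus ;
import org.apache.hadoop.fs.RemoteIterator ;
import org.apache.hadoop.io.IOUtils ;

Path harRoot = new Path("har:///some/hdfs/path/to/XXX.har") ;
FileSystem fsHar =harRoot.getFileSystem(cnfHadoop) ;
RemoteIterator<LocatedFileStatus> entries =fsHar.listFiles(harRoot, true) ;  // true => recursive
while (entries.hasNext())
{ LocatedFileStatus entry =entries.next() ;
  Path gzTarget = new Path("/some/hdfs/target/dir", entry.getPath().getName() +".gz") ;
  OutputStream out =codecHadoop.createOutputStream(fsCluster.create(gzTarget, true)) ;
  IOUtils.copyBytes(fsHar.open(entry.getPath()), out, cnfHadoop, true) ;  // true => close both streams
}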