How to read sequence files exported from HBase

I used the following command to export an HBase table and save it to HDFS:

hbase org.apache.hadoop.hbase.mapreduce.Export \
MyHbaseTable1 hdfs://nameservice1/user/ken/data/exportTable1

The output files are binary. If I read the folder with pyspark:

test1 = sc.textFile('hdfs://nameservice1/user/ken/data/exportTable1')
test1.take(5)  # an RDD has no show() method; take(5) returns the first five records

it shows:

u'SEQ\x061org.apache.hadoop.hbase.io.ImmutableBytesWritable%org.apache.hadoop.hbase.client.Result\x00\x00\x00\x00\x00\x00\ufffd-\x10A\ufffd~lUE\u025bt\ufffd\ufffd\ufffd&\x00\x00\x04\ufffd\x00\x00\x00' 
u'\x00\x00\x00\x067-2010\ufffd\t' 
u'|' 
u'\x067-2010\x12\x01r\x1a\x08clo-0101 \ufffd\ufffd\ufffd*(\x042\\6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0' 
u'u'

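From this output I can tell that:

'7-2010' in line 2 is the row key,
'r' in line 4 is the column family,
'clo-0101' in line 4 is the column name,
'6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0' is the value.

I don't know where lines 3 and 5 come from. It looks like the HBase export follows its own rules when generating the files, so if I decode them in my own way, the data may end up corrupted.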

Question:

How can I convert this file back to a readable format? For example:

7-2010, r, clo-0101, 6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0

I have tried:

test1 = sc.sequenceFile('/user/youyang/data/hbaseSnapshot1/', keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0) 
test1.take(5)

and

test1 = sc.sequenceFile('hdfs://nameservice1/user/ken/data/exportTable1' 
      , keyClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat' 
      , valueClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable' 
      , keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter' 
      , valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'
      , minSplits=None 
      , batchSize=100)

No luck. Neither attempt worked, and I got this error:

Caused by: java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.

Any suggestions? Thanks!

Source

2016-07-08 kennyut

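I ran into this problem myself recently. I solved it by moving away from sc.sequenceFile and using sc.newAPIHadoopFile instead (or just hadoopFile if you are on the old API). The Spark SequenceFile reader appears to handle only keys and values that are Writable types (this is stated in the docs).

newAPIHadoopFile uses Hadoop's deserialization logic, and you can specify which serialization types you need in the config dictionary you pass to it: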

hadoop_conf = {"io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization"} 

sc.newAPIHadoopFile(
<input_path>, 
'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat', 
keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable', 
valueClass='org.apache.hadoop.hbase.client.Result', 
keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter', 
valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter', 
conf=hadoop_conf)

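Note that the value for "io.serializations" in hadoop_conf is a comma-separated list that includes "org.apache.hadoop.hbase.mapreduce.ResultSerialization". That is the key piece of configuration you need in order to deserialize the Result. WritableSerialization is also needed, so that ImmutableBytesWritable can be deserialized.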
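Once the Result values come back as strings, they still need to be reshaped into the "rowkey, family, qualifier, value" form asked for in the question. The sketch below is one way to do that. It assumes that HBaseResultToStringConverter emits one JSON object per cell, separated by newlines, with the keys "row", "columnFamily", "qualifier" and "value"; that depends on the Spark version (older versions of the example converter return only the values), so check what your converter actually produces. The helper name cells_to_lines is hypothetical.

```python
import json


def cells_to_lines(result_str):
    """Format converter output as 'row, family, qualifier, value' lines.

    Assumption: result_str holds one JSON object per cell, separated by
    newlines, with the keys row / columnFamily / qualifier / value.
    """
    lines = []
    for cell_json in result_str.split("\n"):
        if not cell_json.strip():
            continue  # skip empty fragments (e.g. a Result with no cells)
        cell = json.loads(cell_json)
        lines.append(", ".join([
            cell["row"],
            cell["columnFamily"],
            cell["qualifier"],
            cell["value"],
        ]))
    return lines


# Hand-written sample in the assumed format (not real export output).
sample = ('{"row": "7-2010", "columnFamily": "r", '
          '"qualifier": "clo-0101", "value": "6.67|10|10"}')
print(cells_to_lines(sample))
```

On the RDD returned by newAPIHadoopFile, this could be applied with something like rdd.flatMap(lambda kv: cells_to_lines(kv[1])), since each record is a (key, value) pair and the value is the converted Result.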

You can also use sc.newAPIHadoopRDD, but then you also need to set a value for "mapreduce.input.fileinputformat.inputdir" in the config dictionary.

Source
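As a rough sketch of the newAPIHadoopRDD variant: the input path moves out of the call's arguments and into the config dictionary, while everything else stays the same as in the newAPIHadoopFile call above. The helper names build_export_conf and read_hbase_export are hypothetical, and the sketch assumes the HBase jars and the Spark examples jar (which provides the pythonconverters classes) are on the classpath.

```python
def build_export_conf(input_path):
    """Build the Hadoop config dict for reading an HBase Export directory."""
    return {
        # Both serializations are needed: Writable for the key
        # (ImmutableBytesWritable), ResultSerialization for the value.
        "io.serializations": (
            "org.apache.hadoop.io.serializer.WritableSerialization,"
            "org.apache.hadoop.hbase.mapreduce.ResultSerialization"
        ),
        # newAPIHadoopRDD takes no path argument, so the path goes here.
        "mapreduce.input.fileinputformat.inputdir": input_path,
    }


def read_hbase_export(sc, input_path):
    """Return an RDD of (row key, result string) pairs; sc is a SparkContext."""
    return sc.newAPIHadoopRDD(
        "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat",
        keyClass="org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        valueClass="org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=build_export_conf(input_path),
    )
```

With a live SparkContext this would be called as read_hbase_export(sc, 'hdfs://nameservice1/user/ken/data/exportTable1').take(5), using the export path from the question.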

2016-10-25 13:35:19
