
I want to access HBase from Spark using Java. Apart from this I have not found any example of it. The answer there says

you can also write this in Java

so I copied the following code from How to read from hbase using spark:

import org.apache.hadoop.hbase.client.{HBaseAdmin, Result} 
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor } 
import org.apache.hadoop.hbase.mapreduce.TableInputFormat 
import org.apache.hadoop.hbase.io.ImmutableBytesWritable 

import org.apache.spark._ 

object HBaseRead {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    val tableName = "table1"

    System.setProperty("user.name", "hdfs")
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    conf.set("hbase.master", "localhost:60000")
    conf.setInt("timeout", 120000)
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.set(TableInputFormat.INPUT_TABLE, tableName)

    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable(tableName)) {
      val tableDesc = new HTableDescriptor(tableName)
      admin.createTable(tableDesc)
    }

    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    println("Number of Records found : " + hBaseRDD.count())
    sc.stop()
  }
}

Can anyone give me some hints on how to find the right dependencies, objects and so on?

It looks like HBaseConfiguration is in hbase-client, but I am actually stuck at TableInputFormat.INPUT_TABLE. Shouldn't that be in the same dependency?

Is there a better way to access HBase from Spark?
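
For reference, the only HBase dependency I have in my pom so far is hbase-client (which is where HBaseConfiguration comes from); a minimal sketch of that block, with the version number only as an example:

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.3.0</version>
</dependency>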

Answers


The TableInputFormat class is in hbase-server.jar, so you need to add that dependency to your pom.xml. See HBase and non-existent TableInputFormat on the Spark user list.

<dependency> 
    <groupId>org.apache.hbase</groupId> 
    <artifactId>hbase-server</artifactId> 
    <version>1.3.0</version> 
</dependency> 

Below is sample code for reading from HBase with Spark:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
public class HBaseRead {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table");
        JavaPairRDD<ImmutableBytesWritable, Result> javaPairRdd = jsc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
        jsc.stop();
    }
}
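
If you also want to look at the cell values, a minimal sketch (placed before jsc.stop(), and additionally needing imports for org.apache.spark.api.java.JavaRDD and org.apache.hadoop.hbase.util.Bytes) could map each Result to a String; the column family cf and qualifier col1 are placeholder names:

// Count the rows and print a placeholder column (cf:col1) from the first few of them
System.out.println("Number of Records found : " + javaPairRdd.count());
JavaRDD<String> values = javaPairRdd.map(tuple -> {
    Result result = tuple._2();
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
    return value == null ? "<no value>" : Bytes.toString(value);
});
values.take(5).forEach(System.out::println);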

Thanks for the sample code, I am going to test this – monti


Yes, there is. Use Cloudera's SparkOnHbase:

<dependency> 
    <groupId>org.apache.hbase</groupId> 
    <artifactId>hbase-spark</artifactId> 
    <version>1.2.0-cdh5.7.0</version> 
</dependency> 

Then use an HBase Scan to read the data from your HBase table (or a bulk Get, if you know the keys of the rows you want to retrieve; a rough bulk-get sketch follows the scan example below).

// Assumes a JavaSparkContext named jsc and the table name tableName are already defined
Configuration conf = HBaseConfiguration.create();
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"));
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, conf);

Scan scan = new Scan();
scan.setCaching(100);

// Each record is (row key, list of (column family, qualifier, value) triples)
JavaRDD<Tuple2<byte[], List<Tuple3<byte[], byte[], byte[]>>>> hbaseRdd = hbaseContext.hbaseRDD(tableName, scan);

System.out.println("Number of Records found : " + hbaseRdd.count());
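
The bulk-get path mentioned above would look roughly like the sketch below. It reuses jsc and hbaseContext from the scan example and assumes a JavaHBaseContext.bulkGet helper with the shape (tableName, batchSize, rdd, makeGet, convertResult) as in the hbase-spark module; the table name and row keys are placeholders, so check the exact signature of your version before relying on it.

// Row keys we want to fetch (placeholder values)
JavaRDD<byte[]> rowKeys = jsc.parallelize(Arrays.asList(Bytes.toBytes("row1"), Bytes.toBytes("row2")));

// bulkGet builds a Get per key and converts each Result with the last function (here: just its row key)
JavaRDD<String> fetched = hbaseContext.bulkGet(
    TableName.valueOf("my_table"),
    2,                    // batch size per call
    rowKeys,
    key -> new Get(key),
    result -> Bytes.toString(result.getRow()));

fetched.collect().forEach(System.out::println);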

Sorry for the late answer. I just tried to add 'hbase-spark' and realized that the artifact-id is not in Maven Central, so I added 'https://repository.cloudera.com/artifactory/cloudera-repos/' as a repository in the pom. It still says 'The POM for org.apache.hbase:hbase-spark:jar:1.2.0-cdh5.7.0 is missing, no dependency information available'. Any suggestions? – monti
