Loading a large HBase table into a Spark RDD takes a long time

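I am trying to load a large HBase table into a Spark RDD so that I can run SparkSQL queries on the entities. For an entity with about 6 million rows, it takes about 35 seconds to load it into the RDD. Is that expected? Is there any way to shorten the loading process? I have been following some tips from http://hbase.apache.org/book/perf.reading.html to speed things up, e.g. scan.setCaching(cacheSize) and adding only the necessary attributes/columns to the scan. I am just wondering whether there are other ways to improve the speed.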

Here is the code snippet:

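import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setMaster("spark://url").setAppName("SparkSQLTest");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);

// HBase connection settings and the table to read
Configuration hbase_conf = HBaseConfiguration.create();
hbase_conf.set("hbase.zookeeper.quorum", "url");
hbase_conf.set("hbase.regionserver.port", "60020");
hbase_conf.set("hbase.master", "url");
hbase_conf.set(TableInputFormat.INPUT_TABLE, entityName);

// Scan only the columns that are needed, fetching cacheSize rows per RPC
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col1"));
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col2"));
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col3"));
scan.setCaching(this.cacheSize);
hbase_conf.set(TableInputFormat.SCAN, convertScanToString(scan));

// Build the RDD, cache it, and force the load with count()
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
    jsc.newAPIHadoopRDD(hbase_conf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);
logger.info("count is " + hBaseRDD.cache().count());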
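
To put the scan.setCaching(cacheSize) setting in perspective, the following is a rough sketch of how the caching value changes the number of scanner round trips for 6 million rows. It is plain arithmetic, not a measurement: the class name ScanRpcEstimate and the sample caching values are made up for illustration, and the estimate ignores region boundaries and any size-based limits that may cut a batch short.

```java
// Rough estimate of scanner round trips for a full scan.
// Assumption: each scanner RPC returns up to `caching` rows; region
// boundaries and size-based limits are ignored for simplicity.
class ScanRpcEstimate {

    // Ceiling division: number of batches needed to cover all rows.
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 6_000_000L;              // row count from the question
        int[] cachingValues = {100, 500, 1000}; // hypothetical settings
        for (int caching : cachingValues) {
            System.out.println("caching=" + caching
                + " -> about " + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

Past a certain point a larger caching value mostly trades memory for fewer round trips, so it is worth measuring rather than assuming the largest value is best.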

2014-12-04 bonnahu

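Depending on your cluster size and the size of your rows (columns and column families, and how your regions are split), it may vary, but that doesn't sound unreasonable. Think about how many rows per second that is :)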
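
Working that out: 6 million rows in 35 seconds is roughly 171,000 rows per second overall. The sketch below does the same arithmetic and also divides it across input splits, assuming TableInputFormat produces one split per region (its usual default) and that all splits are read in parallel, which depends on how many executor cores are available. The class name LoadThroughput and the region count of 8 are hypothetical, since the question does not say how many regions the table has.

```java
// Back-of-the-envelope throughput for the load described in the question.
class LoadThroughput {

    // Overall rows loaded per second.
    static double rowsPerSecond(long rows, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be > 0");
        }
        return rows / seconds;
    }

    // Rows per second each split must sustain, assuming all splits
    // are read fully in parallel (an optimistic assumption).
    static double rowsPerSecondPerSplit(long rows, double seconds, int splits) {
        if (splits <= 0) {
            throw new IllegalArgumentException("splits must be > 0");
        }
        return rowsPerSecond(rows, seconds) / splits;
    }

    public static void main(String[] args) {
        long rows = 6_000_000L;      // from the question
        double seconds = 35.0;       // from the question
        int hypotheticalRegions = 8; // assumption, not given in the question

        System.out.printf("overall: %.0f rows/s%n", rowsPerSecond(rows, seconds));
        System.out.printf("per split (%d regions): %.0f rows/s%n",
            hypotheticalRegions,
            rowsPerSecondPerSplit(rows, seconds, hypotheticalRegions));
    }
}
```

If the per-split figure looks low for your hardware, the number of regions (and therefore the read parallelism) is one of the first things to check.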

2015-10-07 21:00:48 JoeC
