
I have a problem reading, in Hive, partitioned Parquet files generated by Spark. I can create the external table in Hive, but when I try to select a few rows, Hive returns only an "OK" message and no rows.

I am able to read the partitioned Parquet files correctly in Spark, so I assume they were generated correctly. I am also able to read these files when I create an external table in Hive without partitioning.

Does anyone have a suggestion?

My environment is:

  • EMR cluster 4.1.0
  • Hive 1.0.0
  • Spark 1.5.0
  • Hue 3.7.1
  • Parquet files stored in an S3 bucket (s3://staging-dev/test/ttfourfieldspart2/year=2013/month=11)

My Spark configuration file has the following parameters (/etc/spark/conf.dist/spark-defaults.conf):

spark.master yarn 
spark.driver.extraClassPath /etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/* 
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native 
spark.executor.extraClassPath /etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/* 
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native 
spark.eventLog.enabled true 
spark.eventLog.dir hdfs:///var/log/spark/apps 
spark.history.fs.logDirectory hdfs:///var/log/spark/apps 
spark.yarn.historyServer.address ip-10-37-161-246.ec2.internal:18080 
spark.history.ui.port 18080 
spark.shuffle.service.enabled true 
spark.driver.extraJavaOptions -Dlog4j.configuration=file:///etc/spark/conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=512M -XX:OnOutOfMemoryError='kill -9 %p' 
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p' 
spark.executor.memory 4G 
spark.driver.memory 4G 
spark.dynamicAllocation.enabled true 
spark.dynamicAllocation.maxExecutors 100 
spark.dynamicAllocation.minExecutors 1 

My Hive configuration file has the following parameters (/etc/hive/conf/hive-site.xml):

<configuration> 

<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files --> 
<!-- that are implied by Hadoop setup variables.            --> 
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive --> 
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized --> 
<!-- resource).                     --> 

<!-- Hive Execution Parameters --> 


<property> 
    <name>hbase.zookeeper.quorum</name> 
    <value>ip-10-xx-xxx-xxx.ec2.internal</value> 
    <description>http://wiki.apache.org/hadoop/Hive/HBaseIntegration</description> 
</property> 

<property> 
    <name>hive.execution.engine</name> 
    <value>mr</value> 
</property> 

    <property> 
    <name>fs.defaultFS</name> 
    <value>hdfs://ip-10-xx-xxx-xxx.ec2.internal:8020</value> 
    </property> 

<property> 
    <name>hive.metastore.uris</name> 
    <value>thrift://ip-10-xx-xxx-xxx.ec2.internal:9083</value> 
    <description>JDBC connect string for a JDBC metastore</description> 
</property> 

<property> 
    <name>javax.jdo.option.ConnectionURL</name> 
    <value>jdbc:mysql://ip-10-xx-xxx-xxx.ec2.internal:3306/hive?createDatabaseIfNotExist=true</value> 
    <description>username to use against metastore database</description> 
</property> 

<property> 
    <name>javax.jdo.option.ConnectionDriverName</name> 
    <value>org.mariadb.jdbc.Driver</value> 
    <description>username to use against metastore database</description> 
</property> 

<property> 
    <name>javax.jdo.option.ConnectionUserName</name> 
    <value>hive</value> 
    <description>username to use against metastore database</description> 
</property> 

<property> 
    <name>javax.jdo.option.ConnectionPassword</name> 
    <value>1R72JFCDG5XaaDTB</value> 
    <description>password to use against metastore database</description> 
</property> 

    <property> 
    <name>datanucleus.fixedDatastore</name> 
    <value>true</value> 
    </property> 

    <property> 
    <name>mapred.reduce.tasks</name> 
    <value>-1</value> 
    </property> 

    <property> 
    <name>mapred.max.split.size</name> 
    <value>256000000</value> 
    </property> 

    <property> 
    <name>hive.metastore.connect.retries</name> 
    <value>5</value> 
    </property> 

    <property> 
    <name>hive.optimize.sort.dynamic.partition</name> 
    <value>true</value> 
    </property> 

    <property><name>hive.exec.dynamic.partition</name><value>true</value></property> 
    <property><name>hive.exec.dynamic.partition.mode</name><value>nonstrict</value></property> 
    <property><name>hive.exec.max.dynamic.partitions</name><value>10000</value></property> 
    <property><name>hive.exec.max.dynamic.partitions.pernode</name><value>500</value></property> 

</configuration> 

My Python code that reads the partitioned Parquet files:

from pyspark import * 
from pyspark.sql import * 
from pyspark.sql.types import * 
from pyspark.sql.functions import * 

# sqlContext is predefined when running in the pyspark shell
df7 = sqlContext.read.parquet('s3://staging-dev/test/ttfourfieldspart2/') 

The Parquet file schema as printed by Spark:

>>> df7.schema 
StructType(List(StructField(transactionid,StringType,true),StructField(eventts,TimestampType,true),StructField(year,IntegerType,true),StructField(month,IntegerType,true))) 

>>> df7.printSchema() 
root 
|-- transactionid: string (nullable = true) 
|-- eventts: timestamp (nullable = true) 
|-- year: integer (nullable = true) 
|-- month: integer (nullable = true) 

>>> df7.show(10) 
+--------------------+--------------------+----+-----+ 
|       transactionid|             eventts|year|month| 
+--------------------+--------------------+----+-----+ 
|f7018907-ed3d-49b...|2013-11-21 18:41:...|2013|   11| 
|f6d95a5f-d4ba-489...|2013-11-21 18:41:...|2013|   11| 
|02b2a715-6e15-4bb...|2013-11-21 18:41:...|2013|   11| 
|0e908c0f-7d63-48c...|2013-11-21 18:41:...|2013|   11| 
|f83e30f9-950a-4b9...|2013-11-21 18:41:...|2013|   11| 
|3425e4ea-b715-476...|2013-11-21 18:41:...|2013|   11| 
|a20a6aeb-da4f-4fd...|2013-11-21 18:41:...|2013|   11| 
|d2f57e6f-889b-49b...|2013-11-21 18:41:...|2013|   11| 
|46f2eda5-408e-44e...|2013-11-21 18:41:...|2013|   11| 
|36fb8b79-b2b5-493...|2013-11-21 18:41:...|2013|   11| 
+--------------------+--------------------+----+-----+ 
only showing top 10 rows 
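
For context, a partitioned layout like the year=/month= directories above is normally produced with DataFrameWriter.partitionBy, which was how these files were presumably written. A minimal sketch of the write side under that assumption (the row values and app name below are made up for illustration, not taken from the question):

from datetime import datetime 

from pyspark import SparkContext 
from pyspark.sql import SQLContext, Row 

sc = SparkContext(appName='write-partitioned-parquet') # hypothetical app name 
sqlContext = SQLContext(sc) 

# One made-up row with the same four columns as the schema shown above. 
df = sqlContext.createDataFrame([Row( 
    transactionid='00000000-0000-0000-0000-000000000000', 
    eventts=datetime(2013, 11, 21, 18, 41, 0), 
    year=2013, 
    month=11)]) 

# partitionBy pulls year/month out of the data files and encodes them 
# in year=.../month=... directory names, the layout seen in the S3 bucket. 
df.write.partitionBy('year', 'month').parquet('s3://staging-dev/test/ttfourfieldspart2/') 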

Creating the external table in Hive:

create external table if not exists t3(
    transactionid string, 
    eventts timestamp) 
partitioned by (year int, month int) 
stored as parquet 
location 's3://staging-dev/test/ttfourfieldspart2/'; 

When I try to select some rows in Hive, no rows are returned:

hive> select * from t3 limit 10; 
OK 
Time taken: 0.027 seconds 
hive> 

Answer


I finally found the problem. When you create a table in Hive for partitioned data that already exists in S3 or HDFS, you need to run a command to update the Hive metastore with the table's partition structure. Take a look here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)

The commands are: 

MSCK REPAIR TABLE table_name; 


And on Hive running in Amazon EMR you can use: 

ALTER TABLE table_name RECOVER PARTITIONS; 
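
Applied to the table from this question, the repair can also be issued from PySpark through a HiveContext, since it talks to the same metastore. A minimal sketch (running MSCK through a HiveContext is an assumption here; the commands above were run in the Hive shell):

from pyspark import SparkContext 
from pyspark.sql import HiveContext 

sc = SparkContext(appName='repair-t3-partitions') # hypothetical app name 
sqlContext = HiveContext(sc) 

# Register the year=/month= directories already sitting in S3 as 
# partitions in the Hive metastore; until this runs, the metastore 
# knows no partitions and every select comes back empty. 
sqlContext.sql('MSCK REPAIR TABLE t3') 

# The select that previously returned only "OK" should now show rows. 
sqlContext.sql('SELECT * FROM t3 LIMIT 10').show() 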

This also worked for me. Brand new table, and a select returned no data until I ran the repair... Thanks! – jhnclvr