2016-12-21
1

I am trying to run a pyspark script that accesses a Hive table on BigInsights on Cloud 4.2 Enterprise. When run on the yarn cluster, the job fails with pyspark.sql.utils.AnalysisException: u'Table not found: XXX'.

First, I create the Hive table:

[biadmin@host ~]$ hive 
hive> CREATE TABLE pokes (foo INT, bar STRING); 
OK 
Time taken: 2.147 seconds 
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes; 
Loading data to table default.pokes 
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0] 
OK 
Time taken: 0.49 seconds 
hive> 

Then I create a simple pyspark script:

[biadmin@host ~]$ cat test_pokes.py 
from pyspark import SparkContext 

sc = SparkContext() 

from pyspark.sql import HiveContext 
hc = HiveContext(sc) 

pokesRdd = hc.sql('select * from pokes') 
print(pokesRdd.collect()) 

I attempt to execute it with:

[biadmin@host ~]$ spark-submit \ 
    --master yarn-cluster \ 
    --deploy-mode cluster \ 
    --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \ 
      /usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \ 
      /usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \ 
    test_pokes.py 

However, I run into this error:

Traceback (most recent call last): 
    File "test_pokes.py", line 8, in <module> 
    pokesRdd = hc.sql('select * from pokes') 
    File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql 
    File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__ 
    File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco 
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14' 
End of LogType:stdout 

If I run spark-submit standalone, I can see that the table exists fine:

[biadmin@host ~]$ spark-submit test_pokes.py 
… 
… 
16/12/21 13:09:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 18962 bytes result sent to driver 
16/12/21 13:09:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 168 ms on localhost (1/1) 
16/12/21 13:09:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/12/21 13:09:13 INFO DAGScheduler: ResultStage 0 (collect at /home/biadmin/test_pokes.py:9) finished in 0.179 s 
16/12/21 13:09:13 INFO DAGScheduler: Job 0 finished: collect at /home/biadmin/test_pokes.py:9, took 0.236558 s 
[Row(foo=238, bar=u'val_238'), Row(foo=86, bar=u'val_86'), Row(foo=311, bar=u'val_311') 
… 
… 

See my earlier question related to this issue: hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"

This question is similar to this other one: Spark can access Hive table from pyspark but not from spark-submit. However, unlike that question, I am using HiveContext.


Update: see the final solution here: https://stackoverflow.com/a/41272260/1033422

Answers

4

This is because your spark-submit job is unable to find hive-site.xml, so it cannot connect to the Hive metastore. Please add --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml to your spark-submit command.
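
Applied to the command from the question, the full invocation would look roughly like this (a sketch; the datanucleus jars are carried over unchanged from above):

spark-submit \ 
    --master yarn-cluster \ 
    --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml \ 
    --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \ 
    test_pokes.py 

Note that the --jars value must be a single comma-separated list with no spaces or line breaks between the entries, unlike the wrapped form shown in the question.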

+0

This doesn't explain why it works in standalone mode

+0

This got me a step further. I now receive the error: 'MetaException(message:Failed to instantiate listener named: com.ibm.biginsights.bigsql.sync.BIEventListener, reason: java.lang.ClassNotFoundException: com.ibm.biginsights.bigsql.sync.BIEventListener)'

+0

Sorry, I should have explained. If you run standalone, the driver runs on your machine, so it picks up hive-site.xml from the local classpath. If you run in cluster mode, the xml file is not shipped to the container on the cluster, so you have to specify it by hand, and Spark will then put it on your classpath.
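
As an aside (a diagnostic sketch, not part of the original thread): one way to check from driver code whether a file shipped with --files actually reached the container is pyspark's SparkFiles helper:

import os
from pyspark import SparkContext, SparkFiles

sc = SparkContext()
# Files passed via --files are staged into the application's working
# directory; SparkFiles.get() resolves the expected local path on this node.
path = SparkFiles.get('hive-site.xml')
print(path, os.path.exists(path))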

2

It looks like you are affected by this bug: https://issues.apache.org/jira/browse/SPARK-15345



I had a similar problem on HDP-2.5.0.0 with Spark 1.6.2 and 2.0.0.
My goal was to create a DataFrame from a Hive SQL query, under these conditions:

  • python API,
  • cluster deploy mode,
  • YARN managing the executor JVMs (rather than a standalone Spark master instance), with the driver running on one of the executor nodes.

Initial tests gave these results:

  1. spark-submit --deploy-mode client --master local ... => WORKING
  2. spark-submit --deploy-mode client --master yarn ... => WORKING
  3. spark-submit --deploy-mode cluster --master yarn ... => NOT WORKING

In case #3, the driver, running on one of the executor nodes, could not find the database. The error, as listed above, was:

pyspark.sql.utils.AnalysisException: 'Table or view not found: `database_name`.`table_name`; line 1 pos 14' 



Fokko Driesprong's answer worked for me.
With the command listed below, the driver running on an executor node was able to access Hive tables in a database other than default:

$ /usr/hdp/current/spark2-client/bin/spark-submit \ 
--deploy-mode cluster --master yarn \ 
--files /usr/hdp/current/spark2-client/conf/hive-site.xml \ 
/path/to/python/code.py 



Here is the Python code that I have tested with Spark 1.6.2 and Spark 2.0.0. (Change SPARK_VERSION to 1 to test with Spark 1.6.2, and make sure to update the paths in the spark-submit command accordingly.)

SPARK_VERSION = 2
APP_NAME = 'spark-sql-python-test_SV,' + str(SPARK_VERSION)


def spark1():
    # Spark 1.x path: build a SparkContext, then wrap it in a HiveContext.
    from pyspark.sql import HiveContext
    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName(APP_NAME)
    sc = SparkContext(conf=conf)
    hc = HiveContext(sc)

    query = 'select * from database_name.table_name limit 5'
    df = hc.sql(query)
    printout(df)


def spark2():
    # Spark 2.x path: a SparkSession with Hive support replaces HiveContext.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName(APP_NAME).enableHiveSupport().getOrCreate()
    query = 'select * from database_name.table_name limit 5'
    df = spark.sql(query)
    printout(df)


def printout(df):
    # Exercise both distributed actions (show, count) and a local collect().
    print('\n########################################################################')
    df.show()
    print(df.count())

    df_list = df.collect()
    print(df_list)
    print(df_list[0])
    print(df_list[1])
    print('########################################################################\n')


def main():
    if SPARK_VERSION == 1:
        spark1()
    elif SPARK_VERSION == 2:
        spark2()


if __name__ == '__main__':
    main()
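
For reference, the corresponding submit command for the Spark 1.6.2 test (after setting SPARK_VERSION = 1) would presumably use the Spark 1 client paths; on HDP that is typically /usr/hdp/current/spark-client (an assumption based on the standard HDP layout, not spelled out in the answer):

$ /usr/hdp/current/spark-client/bin/spark-submit \ 
--deploy-mode cluster --master yarn \ 
--files /usr/hdp/current/spark-client/conf/hive-site.xml \ 
/path/to/python/code.py 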