I am trying to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table. When run on the yarn cluster, Spark reports: pyspark.sql.utils.AnalysisException: u'Table not found: XXX'
First, I create the Hive table:
[[email protected] ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
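As a sanity check (this query is my addition, not part of the original session), DESCRIBE FORMATTED reports which database owns the table and where its data lives, which is useful when chasing a "Table not found" error:
hive> DESCRIBE FORMATTED pokes;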
Then, I create a simple pyspark script:
[[email protected] ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print(pokesRdd.collect())
I attempt to execute it with:
[[email protected] ~]$ spark-submit \
  --master yarn-cluster \
  --deploy-mode cluster \
  --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
  test_pokes.py
However, I hit this error:
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
End of LogType:stdout
If I run spark-submit standalone, I can see that the table exists OK:
[[email protected] ~]$ spark-submit test_pokes.py
…
…
16/12/21 13:09:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 18962 bytes result sent to driver
16/12/21 13:09:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 168 ms on localhost (1/1)
16/12/21 13:09:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/21 13:09:13 INFO DAGScheduler: ResultStage 0 (collect at /home/biadmin/test_pokes.py:9) finished in 0.179 s
16/12/21 13:09:13 INFO DAGScheduler: Job 0 finished: collect at /home/biadmin/test_pokes.py:9, took 0.236558 s
[Row(foo=238, bar=u'val_238'), Row(foo=86, bar=u'val_86'), Row(foo=311, bar=u'val_311')
…
…
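To narrow down where the mismatch is, here is a small diagnostic sketch (my addition, not from the original question) that prints what the job's HiveContext can actually see. If the yarn-cluster run lists no user tables while the local run lists pokes, the cluster job is most likely talking to a freshly created local Derby metastore rather than the shared one:
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
# Print the databases and tables visible to this HiveContext; under
# yarn-cluster, a default-only/empty listing points at a local metastore.
print(hc.sql('show databases').collect())
print(hc.sql('show tables').collect())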
See my earlier question related to this one: hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
This question is similar to this other question: Spark can access Hive table from pyspark but not from spark-submit. However, unlike that question, I am using HiveContext.
Update: see here for the final solution: https://stackoverflow.com/a/41272260/1033422
That doesn't explain why it works in standalone mode –
This got me further. I now get the error: 'MetaException(message:Failed to instantiate listener named: com.ibm.biginsights.bigsql.sync.BIEventListener, reason: java.lang.ClassNotFoundException: com.ibm.biginsights.bigsql.sync.BIEventListener)' –
Sorry, I should have explained. If you run standalone, the driver runs on your machine, so it picks up 'hive-site.xml' from the local classpath. If you run in 'cluster-mode', that xml file is not shipped to the containers on the cluster, so you have to specify it by hand and Spark will put it on the classpath for you. –
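For completeness, a minimal sketch of what that comment describes (the hive-site.xml path is an assumption based on the IOP 4.2.0.0 layout above): pass the file with --files so YARN ships it into each container's working directory, which is on the classpath:
[[email protected] ~]$ spark-submit \
  --master yarn-cluster \
  --deploy-mode cluster \
  --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml \
  --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
  test_pokes.py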