Querying tables in an external Hive from Apache Spark

I am fairly new to the Hadoop ecosystem. My goal is to read Hive tables with Apache Spark and process them. Hive is running on an EC2 instance, and Spark is running on my local machine.

To build a prototype, I installed Apache Hadoop following the steps given here, and added the required environment variables. I started DFS with $HADOOP_HOME/sbin/start-dfs.sh

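I installed Apache Hive following the steps given here. I started hiveserver2 and the Hive metastore. I configured Apache Derby DB (server mode) as the Hive metastore database. I created a sample table 'web_log' and added a few rows to it using beeline.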

I added the following to Hadoop's core-site.xml

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

and the following to hdfs-site.xml

<property> 
     <name>dfs.client.use.datanode.hostname</name> 
     <value>true</value> 
</property>

I added core-site.xml, hdfs-site.xml and hive-site.xml to $SPARK_HOME/conf on my local Spark instance.

core-site.xml and hdfs-site.xml are empty, i.e.

<?xml version="1.0" encoding="UTF-8"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 
<configuration> 
</configuration>

hive-site.xml has the following content

<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://ec2-instance-external-dbs-name:9083</value>
        <description>URI for client to contact metastore server</description>
    </property>
</configuration>

I started spark-shell and executed the following command

scala> sqlContext 
res0: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@…

It seems Spark creates a HiveContext. I then ran a SQL query with the following command

scala> val df = sqlContext.sql("select * from web_log") 
df: org.apache.spark.sql.DataFrame = [viewtime: int, userid: bigint, url: string, referrer: string, ip: string]

The columns and their types match the sample table 'web_log' that I created. Now, when I execute scala> df.show, it takes some time and then throws the following error

16/11/21 18:46:17 WARN BlockReaderFactory: I/O error constructing remote block reader. 
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/ec2-instance-private-ip:50010] 
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533) 
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3101) 
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)

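It seems DFSClient is using the EC2 instance's internal IP. And AFAIK, I did not start any application on port 50010.

Do I need to install and start any other application?

How can I make sure DFSClient uses the EC2 instance's external IP or external DNS name?

Is it possible to access Hive from an external Spark instance at all?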
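
As a rough sketch for the second question: port 50010 is the default DataNode data-transfer port in Hadoop 2.x, and start-dfs.sh starts a DataNode, so the timeout looks like the HDFS client trying to reach the DataNode at the private address reported by the NameNode. `dfs.client.use.datanode.hostname` is a client-side setting, so it probably has to be visible to Spark too, not only to the Hadoop installation on EC2. The snippet assumes a Spark 1.x spark-shell, where `sc` and `sqlContext` are predefined. It sets the property in two places because which configuration the Hive table reader consults may depend on the Spark version. It also assumes that the DataNode's hostname resolves to a reachable address from the local machine and that port 50010 is open in the EC2 security group.

```scala
// Meant for spark-shell (Spark 1.x): `sc` and `sqlContext` are provided by the shell.

// Ask the HDFS client to connect to DataNodes by hostname rather than by
// the (private) IP address reported by the NameNode.
sc.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")

// Assumption: with a HiveContext the setting may also need to reach the Hive
// configuration, so it is set there as well.
sqlContext.setConf("dfs.client.use.datanode.hostname", "true")

// Re-run the query from the question.
val df = sqlContext.sql("select * from web_log")
df.show()
```

Putting the same property into the hdfs-site.xml under $SPARK_HOME/conf, which is currently empty, should have a similar effect without any code changes.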

Source

2016-11-21 sag

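Add the following snippet to the program you are running:

hiveContext.getAllConfs.mkString("\n")

This prints the configuration in effect, including the Hive metastore URIs the context is connected to, so you can check whether all the properties are correct.

If they are not what you expect and you cannot adjust them because of the limitations described in the link, you can try pointing to the correct URIs explicitly, like this: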

hiveContext.setConf("hive.metastore.uris", "thrift://METASTOREl:9083");
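
Put together, the two suggestions might look like this in spark-shell. This is a minimal sketch, assuming Spark 1.x built with Hive support. The metastore host is the placeholder from the question, and `getAllConfs` is used to list the settings known to the context.

```scala
// Meant for spark-shell (Spark 1.x built with Hive support), where `sc` is predefined.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Point the context at the remote metastore (placeholder host from the question).
hiveContext.setConf("hive.metastore.uris", "thrift://ec2-instance-external-dbs-name:9083")

// Print only the metastore-related settings known to this context.
val metastoreConfs = hiveContext.getAllConfs.filter { case (key, _) => key.startsWith("hive.metastore") }
metastoreConfs.foreach { case (key, value) => println(s"$key = $value") }

// Listing the tables forces a round trip to the metastore.
println(hiveContext.tableNames().mkString(", "))
```

If the sample table `web_log` shows up in the list, the metastore connection itself is working and any remaining error is more likely on the HDFS side.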

Source

2016-11-21 16:31:57

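It helped get rid of the error posted in the question. I am now setting hive.metastore.uris in hiveContext. But now I get this error: '''java.net.ConnectException: Call From Delhi/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)'''. It looks like Spark is trying to access HDFS via localhost. I tried setting fs.defaultFS in hiveContext, but it did not help. Please help me – sag

The above error occurs because you are unable to connect to the specified host –

But why is it trying to access localhost even though I set the metastore URI to the EC2 instance? How do I configure it to use the EC2 instance host? – sag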
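
One possible explanation, offered as a guess rather than a diagnosis: Hive records each table's full storage location in the metastore when the table is created, so a table created while `fs.defaultFS` was `hdfs://localhost:9000` would carry that URI regardless of the client's settings. A quick way to check is to print the table metadata from spark-shell and look for the Location field. The sketch assumes the `sqlContext` from the question is a HiveContext.

```scala
// Meant for spark-shell (Spark 1.x), where `sqlContext` is predefined.
// DESCRIBE FORMATTED is a Hive command; its output should include a Location line.
sqlContext.sql("describe formatted web_log").collect().foreach(println)
```

If the location does turn out to point at localhost, one thing to try would be changing `fs.defaultFS` on the EC2 side to a hostname the local machine can resolve and then recreating the table.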
