Unable to use the Spark shell

I am new to Spark. I tried the Spark shell (both the Python and Scala versions), but I hit an exception when running a simple example. Here is the input:

text = sc.textFile("README.md") 
text.count() #fails 

Here is the exception:

java.net.ConnectException: Call From marko/127.0.1.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) 
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 
    at java.lang.reflect.Constructor.newInstance(Constructor.java:408) 
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) 
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1479) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1412) 
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) 
    at com.sun.proxy.$Proxy20.getFileInfo(Unknown Source) 
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:483) 
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) 
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) 
    at com.sun.proxy.$Proxy21.getFileInfo(Unknown Source) 
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108) 
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305) 
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) 
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) 
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) 
    at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57) 
    at org.apache.hadoop.fs.Globber.glob(Globber.java:252) 
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1674) 
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259) 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) 
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) 
    at scala.Option.getOrElse(Option.scala:121) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) 
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) 
    at scala.Option.getOrElse(Option.scala:121) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) 
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:53) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) 
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) 
    at scala.Option.getOrElse(Option.scala:121) 
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911) 
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358) 
    at org.apache.spark.rdd.RDD.collect(RDD.scala:892) 
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) 
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:483) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 
    at py4j.Gateway.invoke(Gateway.java:280) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:211) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: java.net.ConnectException: Connection refused 
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) 
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) 
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) 
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531) 
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495) 
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614) 
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712) 
    at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) 
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1451) 
    ... 56 more 

I am guessing there is some configuration problem and that the client is trying to connect to an HDFS cluster, but I could not find any configuration I am supposed to change, or any parameter I am supposed to pass, before starting the shell. What should I do?

Answer


A similar question has already been answered here. Try file:///path_to_file instead.
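
For example, a minimal sketch in the PySpark shell (the absolute path below is only illustrative; substitute the actual location of your README.md):

text = sc.textFile("file:///home/marko/spark/README.md")  # explicit local-file scheme 
text.count()  # reads from the local filesystem instead of hdfs://localhost:8020 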

According to that answer,

SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.FileSystem.getDefaultUri when the path has no scheme. That method reads the "fs.defaultFS" parameter from the Hadoop configuration. If you have set the HADOOP_CONF_DIR environment variable, this parameter is usually set to "hdfs://..."; otherwise it defaults to "file://".
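
To see which default filesystem your shell actually resolved, here is a small check from the PySpark shell; sc._jsc.hadoopConfiguration() is an internal accessor to the underlying Hadoop Configuration object, so treat it as a debugging aid rather than a stable API:

# With HADOOP_CONF_DIR set this typically prints something like hdfs://localhost:8020, 
# which matches the "Connection refused" above; without it, file:///. 
print(sc._jsc.hadoopConfiguration().get("fs.defaultFS")) 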