I have a set of files. The file paths are saved in a file, e.g. "all_files.txt". Using Apache Spark, I need to run an operation on all the files and distribute the results, but PySpark throws the error Method __getnewargs__([]) does not exist.
The steps I want to perform are:
- read "all_files.txt"
- create an RDD from "all_files.txt" (each line is the path to some file), and read the contents of each of those files into a single RDD
- then perform an operation on all of the contents
This is the code I wrote for it:
def return_contents_from_file(file_name):
    return spark.read.text(file_name).rdd.map(lambda r: r[0])

def run_spark():
    file_name = 'path_to_file'
    spark = SparkSession \
        .builder \
        .appName("PythonWordCount") \
        .getOrCreate()
    # the first map is supposed to return the paths to each file,
    # the first flatMap is expected to club together the contents of all files,
    # and the second flatMap does an operation on each line of all files
    counts = spark.read.text(file_name).rdd.map(lambda r: r[0]) \
        .flatMap(return_contents_from_file) \
        .flatMap(do_operation_on_each_line_of_all_files)
This is throwing the error:
line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o25.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Can someone tell me what I am doing wrong and how I should proceed? Thanks in advance.
Thanks for your reply. But how do I parallelize the whole process? Won't map(lambda r: spark.read.text(r[0]), file_names) serialize the whole process? – UnderWood

The process of reading the files runs in parallel; the only serialized part is building the execution plan. Try it! – Mariusz