I have a set of files. The file paths are saved in a file, e.g. "all_files.txt". Using Apache Spark, I need to run an operation on all the files and distribute the results, but PySpark throws the error Method __getnewargs__([]) does not exist.
The steps I want to perform are:
- read "all_files.txt"
- create an RDD from "all_files.txt" (each line is the path to some file), and read the contents of each of those files into a single RDD
- then perform an operation on all of the contents
This is the code I wrote for it:
def return_contents_from_file(file_name):
    return spark.read.text(file_name).rdd.map(lambda r: r[0])

def run_spark():
    file_name = 'path_to_file'
    spark = SparkSession \
        .builder \
        .appName("PythonWordCount") \
        .getOrCreate()
    # the first map is supposed to return the paths to each file,
    # the first flatMap is expected to club together the contents of all files,
    # and the second flatMap does an operation on each line of all files
    counts = spark.read.text(file_name).rdd.map(lambda r: r[0]) \
        .flatMap(return_contents_from_file) \
        .flatMap(do_operation_on_each_line_of_all_files)
This is throwing the error:
line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o25.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Can someone tell me what I am doing wrong and how I should proceed? Thanks in advance.
Thanks for your reply. But how do I parallelize the whole process? Won't map(lambda r: spark.read.text(r[0]), file_names) serialize the whole process? – UnderWood

The process of reading the files runs in parallel; the only serialized part is building the execution plan. Try it! – Mariusz