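How do I add a column from one Spark DataFrame to another Spark DataFrame (using PySpark)?

I have two Spark DataFrames, and I want to add a column from one of them to the other. How can I do this in PySpark?

My code is: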

new = df.withColumn("prob", tr_df.prob)

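Here I want to take the column result2 from tr_df and add it to my DataFrame df under the name prob. I searched for a solution, but nothing worked for me, and I get this error: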

AnalysisException: u'resolved attribute(s) prob#579 missing from q1_n_words#388L,prediction#510,res1#390,q2_n_words#389L,tfidf_word_match#384,Average#379,prob#385,probability#485,Cosine#381,word_m#383,rawPrediction#461,features#438,res2#391,question1#373,Jaccard#382,test_id#372L,raw_pred#377,question2#374,q2len#376,Common#378L,result2#387,q1len#375,result1#386,Percentage#380 in operator !Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#579 AS prob#634, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, features#438, rawPrediction#461, probability#485, prediction#510];;\n!Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#579 AS prob#634, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, features#438, rawPrediction#461, probability#485, prediction#510]\n+- Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#385, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, features#438, rawPrediction#461, probability#485, UDF(rawPrediction#461) AS prediction#510]\n +- Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#385, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, features#438, rawPrediction#461, UDF(rawPrediction#461) AS probability#485]\n  +- Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, 
Jaccard#382, word_m#383, tfidf_word_match#384, prob#385, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, features#438, UDF(features#438) AS rawPrediction#461]\n   +- Project [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#385, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391, UDF(struct(q1len#375, q2len#376, cast(q1_n_words#388L as double) AS q1_n_words_double_VectorAssembler_4158baa8e5b4f3aced2b#435, cast(q2_n_words#389L as double) AS q2_n_words_double_VectorAssembler_4158baa8e5b4f3aced2b#436, cast(Common#378L as double) AS Common_double_VectorAssembler_4158baa8e5b4f3aced2b#437, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, prob#385, raw_pred#377, res1#390, res2#391)) AS features#438]\n   +- LogicalRDD [test_id#372L, question1#373, question2#374, q1len#375, q2len#376, raw_pred#377, Common#378L, Average#379, Percentage#380, Cosine#381, Jaccard#382, word_m#383, tfidf_word_match#384, prob#385, result1#386, result2#387, q1_n_words#388L, q2_n_words#389L, res1#390, res2#391]\n'

tr_df schema:

tr_df.printSchema() 
root 
|-- prob: float (nullable = true)

df schema:

df.printSchema() 
root 
|-- test_id: long (nullable = true)

Please help! Thanks in advance.

Source

2017-05-05 vishakha deshmukh

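Do you want to add the same value to every row of 'df'? Or can you join 'df' and 'tr_df' on some condition?

No, each row will contain a different value. I don't want to apply any condition.

OK, if each row has a different value, then you have to join the two DataFrames and select the columns you need. Can you provide the schemas of both DataFrames?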

As the error message clearly states, you need to set spark.sql.crossJoin.enabled = true in your Spark configuration.

You can set it like this:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf().setAppName("Test")
sparkConf.set("spark.sql.crossJoin.enabled", "true")

Then get or create the SparkSession by passing in this SparkConf:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

Once the SparkSession is created, do your join...

Source: How to enable Cartesian join in Spark 2.0?

Source

2017-05-05 10:00:14

@Sanchit Could you please provide this solution in PySpark? I did this in PySpark: 'spark.conf.set("spark.sql.crossJoin.enabled", "true")' 'n = df.join(tr_df)'. But it doesn't work for me.
