How to match a string against the field names of an RDD

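In PySpark 2.0.1, I need to check whether a particular name (say, client) appears among the column names of my RDD, and generate an error message if the field client is not currently present in my data frame. Can you please suggest some syntax, along the lines of the snippet below?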

field = 'client'
if field not in df.schema.fields:
    print('field: ', field, "is not available")

Source

2017-10-17 Python Learner

Is it an RDD or a dataframe? – desertnaut

Actually I am using a PySpark dataframe. I called df.columns and got an error message saying that the RDD object has no attribute 'columns'. –

So this means that, despite what you attempted, 'df' is not a dataframe but an RDD. This matters, because an RDD has no 'schema' attribute: https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD – desertnaut

For RDDs:

spark.version 
# u'2.2.0' 

# make some dummy data: 
rdd = sc.parallelize([[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']]) # first element is the header 
header = rdd.first() 
header 
# [u'mailid', u'age', u'address'] 

field = 'client' 
if field not in header: 
    print('field: '+ field + " is not available") 
# field: client is not available
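Since the question asks for an error message rather than just a printed line, the same membership test can be wrapped in a small helper that raises. This is only a sketch: `missing_fields` and `require_fields` are hypothetical names, not part of PySpark, and the helper works on any plain list of names, such as the `header` obtained above.

```python
def missing_fields(required, available):
    """Return the names in `required` that are absent from `available`."""
    available = set(available)
    return [name for name in required if name not in available]


def require_fields(required, available):
    """Raise ValueError if any required field name is missing."""
    missing = missing_fields(required, available)
    if missing:
        raise ValueError("field(s) not available: " + ", ".join(missing))


# Plain list standing in for the header row, e.g. the result of rdd.first()
header = [u'mailid', u'age', u'address']

print(missing_fields(['client', 'age'], header))
# ['client']
```

Calling `require_fields(['client'], header)` would then raise a `ValueError` naming the missing field, which may be preferable when the job should stop instead of continuing with incomplete data.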

For dataframes:

# using the rdd defined above 
# remove first line from data and use it as header: 
df = rdd.filter(lambda row : row != header).toDF(header) 
df.show() 
# +------+---+-------+ 
# |mailid|age|address| 
# +------+---+-------+ 
# | satya| 23| Mumbai| 
# |   abc| 27|    Goa|
# +------+---+-------+ 

header_df = df.schema.names 
header_df 
# ['mailid', 'age', 'address'] 

field = 'client' 
if field not in header_df: 
    print('field: '+ field + " is not available") 
# field: client is not available
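If it is not certain in advance whether the object is an RDD or a dataframe (the confusion discussed in the comments), one possible approach is to look up the names by duck typing. This is a rough sketch, not an official API: `field_names` is a hypothetical helper, and for RDDs it assumes, as in the dummy data above, that the first element is the header row.

```python
def field_names(data):
    """Return the column names of a dataframe, or the header row of an RDD.

    Assumption: an RDD carries its header as its first element,
    as in the dummy data used in this answer.
    """
    if hasattr(data, "columns"):
        # Dataframes expose their column names through .columns
        return list(data.columns)
    # RDDs have no .columns attribute, so fall back to the first row
    return list(data.first())


def has_field(data, field):
    """True if `field` is among the names returned by field_names()."""
    return field in field_names(data)
```

With the objects defined above, `has_field(rdd, 'client')` and `has_field(df, 'client')` should both return `False`, while `has_field(df, 'age')` should return `True`.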

Source

2017-10-17 16:13:07 desertnaut
