2016-12-12 111 views
0

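Modifying pandas code for a PySpark DataFrame

I have the code snippet below, which is used to create a chart. I want to modify it to work in PySpark, but I'm not sure how to proceed. The problem is that I can't iterate over a column in PySpark, and my attempts to turn it into a function have not succeeded.

Background: the DataFrame has a column named City, which is just the name of the city as a string.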

cities = [i.City for i in df.select('City').distinct().collect()] 

stack = [] 

for city in cities: 
    df = sqlContext.sql( 'SELECT `Complaint Type`, COUNT(*) as `counts` ' 
          'FROM c311 ' 
          'WHERE City = "{}" COLLATE NOCASE ' 
          'GROUP BY `Complaint Type` ' 
          'ORDER BY counts DESC'.format(city)) 

    stack.append(Bar(x=df['Complaint Type'], y=df.counts, name=city.capitalize())) 

My goal is to then send this to toPandas() and plot it locally. However, I keep running into the error "Column is not iterable". How can I solve this in PySpark?

Answer

1

You can:

from pyspark.sql.functions import upper, col 

pdf = df.withColumn("city", upper(col("city"))) \ 
    .groupBy("Complaint Type").pivot("city").count() \ 
    .toPandas() 

(or group by city and pivot by type), and work with the result from there.
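As a rough sketch of what "working with it from there" might look like: once the pivoted result is a local pandas DataFrame, its columns can be iterated freely, so the per-city loop from the question can be rebuilt without touching Spark Columns. The sample data and the helper name `city_traces` are made up for illustration, and the layout (one row per complaint type, one column per upper-cased city, NaN where a combination never occurs) is an assumption about what the pivot returns.

```python
import pandas as pd

# Hypothetical stand-in for the result of .toPandas(): one row per complaint
# type, one column per upper-cased city, NaN where a pair never occurs.
pdf = pd.DataFrame({
    "Complaint Type": ["Noise", "Heating", "Graffiti"],
    "BROOKLYN": [5, 12, None],
    "QUEENS": [7, None, 2],
})


def city_traces(pdf, type_col="Complaint Type"):
    """Return one dict of Bar keyword arguments per city column."""
    traces = []
    for city in pdf.columns.drop(type_col):
        # Keep only complaint types seen in this city, largest count first,
        # mirroring the ORDER BY counts DESC in the question's query.
        counts = (
            pdf[[type_col, city]]
            .dropna(subset=[city])
            .sort_values(city, ascending=False)
        )
        traces.append({
            "x": counts[type_col].tolist(),
            "y": counts[city].astype(int).tolist(),
            "name": city.capitalize(),
        })
    return traces


traces = city_traces(pdf)
# With plotly installed, the stack from the question could be rebuilt as:
# stack = [Bar(**t) for t in traces]
```

This runs a single Spark job for all cities instead of one query per city, and everything after toPandas() is ordinary pandas, where iteration is not a problem.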