Zeppelin: Scala DataFrame to Python
If I have a DataFrame in a Scala paragraph, can I share it with Python and use it there? (As far as I know, pyspark uses py4j.)
I tried this:
Scala paragraph:
x.printSchema
z.put("xtable", x)
Python paragraph:
%pyspark
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
the_data = z.get("xtable")
print the_data
sns.set()
g = sns.PairGrid(data=the_data,
                 x_vars=dependent_var,
                 y_vars=sensor_measure_columns_names + operational_settings_columns_names,
                 hue="UnitNumber", size=3, aspect=2.5)
g = g.map(plt.plot, alpha=0.5)
g = g.set(xlim=(300,0))
g = g.add_legend()
Error:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark.py", line 222, in <module>
eval(compiledCode)
File "<string>", line 15, in <module>
File "/usr/local/lib/python2.7/dist-packages/seaborn/axisgrid.py", line 1223, in __init__
hue_names = utils.categorical_order(data[hue], hue_order)
TypeError: 'JavaObject' object has no attribute '__getitem__'
Solution: register the DataFrame as a temporary view in Scala:
// registerTempTable in Spark 1.x
df.createTempView("df")
and read it in Python with SQLContext.table:
%pyspark
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import StringIO

def show(p):
    # Render the figure as inline SVG through Zeppelin's %html display.
    img = StringIO.StringIO()
    p.savefig(img, format='svg')
    print "%html <div style='width:600px'>" + img.getvalue() + "</div>"

# Look up the view registered from Scala; the name must match createTempView("df").
df = sqlContext.table("df")
df.printSchema()

# seaborn expects a pandas DataFrame, so collect the Spark DataFrame to the driver.
pdf = df.toPandas()
g = sns.pairplot(data=pdf,
                 x_vars=["setting1", "setting2"],
                 y_vars=["s4", "s3",
                         "s9", "s8",
                         "s13", "s6"],
                 hue="id", aspect=2)
show(g)
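The show helper in this answer relies on the Python 2 StringIO module. If the %pyspark interpreter happens to run Python 3, a roughly equivalent helper might look like the sketch below. The name figure_to_zeppelin_html and the width_px parameter are made up for illustration; the only assumption about the figure is that it exposes savefig, as matplotlib figures and seaborn grids do.

```python
import io


def figure_to_zeppelin_html(fig, width_px=600):
    """Return a Zeppelin %html string that embeds `fig` as inline SVG.

    Hypothetical helper: `fig` is anything with a matplotlib-style
    savefig(file, format=...) method.
    """
    buf = io.BytesIO()
    # Write the SVG into an in-memory binary buffer instead of a file on disk.
    fig.savefig(buf, format="svg")
    svg = buf.getvalue().decode("utf-8")
    return "%html <div style='width:{}px'>{}</div>".format(width_px, svg)
```

In a paragraph this would be used as print(figure_to_zeppelin_html(g)); Zeppelin treats output that starts with %html as markup to render.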
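With Spark 1.6.0 or earlier, you need to explicitly declare a new SQLContext for each language you use, because the bug [SPARK-13180](https://issues.apache.org/jira/browse/SPARK-13180) prevents the HiveContext that Zeppelin creates at startup from working properly. In that case, the only way I found to share a DataFrame between Python and Scala is to put the DataFrame reference itself into the Zeppelin context from Scala and recover it in Python with DataFrame(z.get("df"), sqlContext).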
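A minimal sketch of the DataFrame(z.get("df"), sqlContext) approach from the comment above, assuming the Scala side has already run z.put("df", df). The helper name restore_shared_dataframe and its wrapper parameter are hypothetical; wrapper exists only so the function can be exercised without Spark, and the None check assumes a missing key comes back from the Zeppelin context as None.

```python
def restore_shared_dataframe(z, sql_context, name="df", wrapper=None):
    """Wrap a JVM DataFrame stored in the Zeppelin context as a PySpark DataFrame.

    Hypothetical helper. `z` is the Zeppelin context and `sql_context` the
    SQLContext that Zeppelin provides in a %pyspark paragraph.
    """
    if wrapper is None:
        # Imported lazily so the helper can be defined without Spark installed.
        from pyspark.sql import DataFrame as wrapper
    # z.get hands back a py4j JavaObject that points at the Scala DataFrame.
    java_df = z.get(name)
    if java_df is None:
        raise KeyError("nothing stored in the Zeppelin context under %r" % name)
    # The PySpark DataFrame class wraps the JVM object plus a SQL context.
    return wrapper(java_df, sql_context)
```

In a %pyspark paragraph this might be called as pdf = restore_shared_dataframe(z, sqlContext, "df").toPandas(), after which pdf can be passed to seaborn.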
Once the temp table is created, it can be accessed from '%sql' as well. – Junaid