如何通過多值列過濾JSON數據

在Spark SQL的幫助下，我試圖過濾掉屬於特定組類別的所有業務項目。如何通過多值列過濾JSON數據

的數據是從JSON文件加載：

businessJSON = os.path.join(targetDir, 'yelp_academic_dataset_business.json') 
businessDF = sqlContext.read.json(businessJSON)

文件的架構如下：

businessDF.printSchema() 

root 
    |-- business_id: string (nullable = true) 
    |-- categories: array (nullable = true) 
    | |-- element: string (containsNull = true) 
    .. 
    |-- type: string (nullable = true)

我試圖提取連接到餐飲業的所有業務：

restaurants = businessDF[businessDF.categories.inSet("Restaurants")]

但它不工作，因爲據我所知，預期的列類型應該是一個字符串，b在我的情況下，這是數組。關於它告訴我一個例外：

Py4JJavaError: An error occurred while calling o1589.filter. 
: org.apache.spark.sql.AnalysisException: invalid cast from string to array<string>;

能否請您提出任何其他的方式來獲得我想要什麼？

來源

2015-09-07 Igor Sokolov

@亞切克，郭先生我不知道你的問題的修正是完全正確的。其實我沒有嘗試過使用inSet方法，但是找到了如何濾除基於多值字段的所有項目的方法。 –

UDF如何？

from pyspark.sql.functions import udf, col, lit 
from pyspark.sql.types import BooleanType 

contains = udf(lambda xs, val: val in xs, BooleanType()) 
df = sqlContext.createDataFrame([Row(categories=["foo", "bar"])]) 

df.select(contains(df.categories, lit("foo"))).show() 
## +----------------------------------+ 
## |PythonUDF#<lambda>(categories,foo)| 
## +----------------------------------+ 
## |        true| 
## +----------------------------------+ 

df.select(contains(df.categories, lit("foobar"))).show() 
## +-------------------------------------+ 
## |PythonUDF#<lambda>(categories,foobar)| 
## +-------------------------------------+ 
## |        false| 
## +-------------------------------------+

來源

2015-09-07 18:28:48 zero323

謝謝。這樣可行。 –

如何通過多值列過濾JSON數據

回答

相關問題