2016-09-26 39 views
1

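SparkR DataFrame: filter by column using a regular expression

I have a SparkR DataFrame called Tweets with a column named bodyText.

What I want to do is filter the DataFrame by a regular-expression condition on bodyText. For example, keep only the tweets that have "rally" or "protest" in bodyText.

What I have tried so far is: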

subset(twitter_df, grepl("(?<=\\b)rally", twitter_df$bodyText, ignore.case = TRUE)) 
filter(twitter_df, grepl("(?<=\\b)rally", twitter_df$bodyText, ignore.case = TRUE)) 

But in both cases I get this error:

Error in as.character.default(x) : no method for coercing this S4 class to a vector Calls: main ... .local -> [ -> grepl -> as.character -> as.character.default

Answers

0

You can convert the Spark DataFrame to an RDD, apply the filter, and convert back:

# set up a reproducible sample 
df <- data.frame(id=c(1:4), bodyText=c("rally","protest","text1","text2")) 
# convert the local data.frame to a Spark DataFrame 
twitter_df <- as.DataFrame(df) 
head(twitter_df) 


# convert to rdd 
twitter_df.rdd <- SparkR:::toRDD(twitter_df) 
# filter rdd 
twitter_df.rdd.filtered <- SparkR:::filterRDD(twitter_df.rdd, function(s) { grepl("(?<=\\b)rally", s$bodyText, ignore.case = TRUE, perl = TRUE) }) 
# convert to Spark data frame 
twitter_df.filtered <- as.DataFrame(twitter_df.rdd.filtered) 
head(twitter_df.filtered) 

Note that the perl argument is set to TRUE; otherwise the expression used is invalid.
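To illustrate the point about perl, here is a minimal base-R sketch that runs on a plain character vector rather than a Spark DataFrame. The sample strings and the variable names body_text, has_rally and has_either are made up for illustration. The second pattern is only one assumed way of matching either keyword from the question.

```r
# Hypothetical sample strings standing in for the bodyText column.
body_text <- c("Rally downtown", "protest at noon", "generally fine", "text2")

# Look-behind syntax such as (?<=...) belongs to the PCRE engine,
# so perl = TRUE is passed explicitly.
has_rally <- grepl("(?<=\\b)rally", body_text, ignore.case = TRUE, perl = TRUE)

# A word boundary followed by either keyword; "generally" is not matched
# because "rally" there does not start at a word boundary.
has_either <- grepl("\\b(rally|protest)", body_text, ignore.case = TRUE, perl = TRUE)

print(has_rally)
print(has_either)
```

Inside the function passed to filterRDD, grepl receives ordinary R values for each row, which is why it works there but not on a Spark Column.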

+0

Works, thanks a lot. I only had to add sqlContext to as.DataFrame: 'as.DataFrame(sqlContext,df)'

0

If you use Spark SQL in SparkR, it can be as simple as:

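df <- data.frame(id=c(1:4), bodyText=c("rally","protest","text1","text2")) 
# a temp view can only be created from a Spark DataFrame, so convert the local data.frame first 
twitter_df <- as.DataFrame(df) 

createOrReplaceTempView(twitter_df, "tweets") 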
rallys <- head(sql("SELECT * FROM tweets WHERE bodyText rlike 'rally'")) 

print(rallys)