
By default, spark_read_jdbc() reads the entire database table into Spark. I use the following syntax to create these connections. How can I use a predicate when reading from a JDBC connection?

library(sparklyr)
library(dplyr)

config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "mysql-connector-java-5.1.43/mysql-connector-java-5.1.43-bin.jar"

sc <- spark_connect(master         = "local",
                    version        = "1.6.0",
                    hadoop_version = 2.4,
                    config         = config)

db_tbl <- sc %>%
    spark_read_jdbc(sc      = .,
                    name    = "table_name",
                    options = list(url      = "jdbc:mysql://localhost:3306/schema_name",
                                   user     = "root",
                                   password = "password",
                                   dbtable  = "table_name"))

However, I now have a table in the MySQL database where I would rather read only a subset of it into Spark.

How do I get spark_read_jdbc to accept a predicate? I have tried adding a predicate to the options list without success:

db_tbl <- sc %>%
    spark_read_jdbc(sc      = .,
                    name    = "table_name",
                    options = list(url        = "jdbc:mysql://localhost:3306/schema_name",
                                   user       = "root",
                                   password   = "password",
                                   dbtable    = "table_name",
                                   predicates = "field > 1"))

Answer


You can use a query in place of dbtable:

db_tbl <- sc %>%
    spark_read_jdbc(sc      = .,
                    name    = "table_name",
                    options = list(url      = "jdbc:mysql://localhost:3306/schema_name",
                                   user     = "root",
                                   password = "password",
                                   dbtable  = "(SELECT * FROM table_name WHERE field > 1) AS my_query"))
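If the predicate needs to vary at run time, the same subquery trick can be built up as a string first. A minimal sketch, where the column name field and the cutoff value are placeholders from the examples above:

# Sketch only: `cutoff` and the column `field` are illustrative placeholders
cutoff <- 1
query  <- sprintf("(SELECT * FROM table_name WHERE field > %d) AS my_query", cutoff)

db_tbl <- sc %>%
    spark_read_jdbc(sc      = .,
                    name    = "table_name",
                    options = list(url      = "jdbc:mysql://localhost:3306/schema_name",
                                   user     = "root",
                                   password = "password",
                                   dbtable  = query))

Note that MySQL requires the derived table to have an alias (the "AS my_query" part), and the WHERE clause is evaluated by MySQL itself, so only the matching rows are transferred to Spark.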

But for a simple condition like this, Spark should push the filter down automatically:

db_tbl %>% filter(field > 1) 

Just make sure to set:

memory = FALSE 

in spark_read_jdbc.
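For reference, a minimal sketch of that pushdown approach, assuming the same connection, table, and column names as above: with memory = FALSE the table is not cached when it is registered, so the dplyr filter can still be pushed down into the query Spark sends to MySQL.

# memory = FALSE avoids eagerly caching the full table in Spark
db_tbl <- sc %>%
    spark_read_jdbc(sc      = .,
                    name    = "table_name",
                    memory  = FALSE,
                    options = list(url      = "jdbc:mysql://localhost:3306/schema_name",
                                   user     = "root",
                                   password = "password",
                                   dbtable  = "table_name"))

# The filter is lazy; Spark's JDBC source can push the condition down to MySQL
db_subset <- db_tbl %>% filter(field > 1)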