
Pyspark "repeat" function error: Column is not callable

I am trying to do a simple calculation inside a select statement, like this:

from pyspark.sql.functions import lit, length, repeat
from pyspark.sql.types import IntegerType

for d in dataframes:
    d = d.select(
        'request_timestamp',
        'shard_id',
        'account_id',
        # computing the repeat count from a column raises the TypeError below
        repeat(lit('1'), (13 - length('account_id').cast(IntegerType()))).alias('custom'))
    d.show()

The repeat function raises the following error:

--------------------------------------------------------------------------- 
TypeError         Traceback (most recent call last) 
<ipython-input-13-07f1c7fd01f2> in <module>() 
    56  'account_id', 
    57 #  length('account_id').alias('len')) 
---> 58  repeat(lit('1'), length('account_id').cast(IntegerType())).alias('padding')) 
    59  d.show() 

/databricks/spark/python/pyspark/sql/functions.py in repeat(col, n) 
    1419  """ 
    1420  sc = SparkContext._active_spark_context 
-> 1421  return Column(sc._jvm.functions.repeat(_to_java_column(col), n)) 
    1422 
    1423 

/databricks/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args) 
    1122 
    1123  def __call__(self, *args): 
-> 1124   args_command, temp_args = self._build_args(*args) 
    1125 
    1126   command = proto.CALL_COMMAND_NAME +\ 

/databricks/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in _build_args(self, *args) 
    1086  def _build_args(self, *args): 
    1087   if self.converters is not None and len(self.converters) > 0: 
-> 1088    (new_args, temp_args) = self._get_args(args) 
    1089   else: 
    1090    new_args = args 

/databricks/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in _get_args(self, args) 
    1073     for converter in self.gateway_client.converters: 
    1074      if converter.can_convert(arg): 
-> 1075       temp_arg = converter.convert(arg, self.gateway_client) 
    1076       temp_args.append(temp_arg) 
    1077       new_args.append(temp_arg) 

/databricks/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_collections.py in convert(self, object, gateway_client) 
    510   HashMap = JavaClass("java.util.HashMap", gateway_client) 
    511   java_map = HashMap() 
--> 512   for key in object.keys(): 
    513    java_map[key] = object[key] 
    514   return java_map 

TypeError: 'Column' object is not callable 

I know this can easily be done with a udf, but I would like to understand why it does not work even after I cast the length expression to Integer.
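For reference, here is a minimal sketch of the udf route mentioned above, assuming the goal is a string of '1' characters that pads account_id out to 13 characters (the column names and width follow the select above):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Build the padding string in plain Python; Spark only sees the returned value.
pad_ones = udf(lambda account_id: '1' * (13 - len(account_id)), StringType())

d = d.select('request_timestamp', 'shard_id', 'account_id',
             pad_ones('account_id').alias('custom'))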

Answer


Spark's repeat (the pyspark.sql.functions version) only accepts a plain Python integer as its second argument. Passing a Column instead hands it to py4j's collection converters, which mistake the Column for a dict (a Column answers any attribute access, including .keys) and then fail when they try to call .keys() on it, hence 'Column' object is not callable. Hive's repeat, available through Spark SQL, can take a computed expression, so you can use expr like this:

from pyspark.sql.functions import expr

df.select(expr("repeat('1', 13 - length(account_id))").alias('custom'))
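For example, applied to a toy DataFrame (assuming an existing SparkSession named spark; the data values are made up for illustration):

from pyspark.sql.functions import expr

df = spark.createDataFrame([('12345',), ('1234567890',)], ['account_id'])
df.select(
    'account_id',
    expr("repeat('1', 13 - length(account_id))").alias('custom')
).show()

# '12345' (length 5) yields '11111111' (8 ones);
# '1234567890' (length 10) yields '111'.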