Insert an array into Elasticsearch via PySpark

I have many cases like this:

Example DataFrame:

from pyspark.sql.types import * 
schema = StructType([ # schema 
    StructField("id", StringType(), True), 
    StructField("email", ArrayType(StringType()), True)]) 
df = spark.createDataFrame([{"id": "id1"}, 
          {"id": "id2", "email": None}, 
          {"id": "id3","email": ["[email protected]"]}, 
          {"id": "id4", "email": ["[email protected]", "[email protected]"]}], 
          schema=schema) 
df.show(truncate=False) 
+---+------------------------------------+ 
|id |email        | 
+---+------------------------------------+ 
|id1|null        | 
|id2|null        | 
|id3|[[email protected]]     | 
|id4|[[email protected], [email protected]]| 
+---+------------------------------------+

I want to insert this data into Elasticsearch, so based on my research I transformed it into the indexing format:

def parseTest(r):
    # Leave out the email field when it is null
    if r['email'] is None:
        return r['id'], {"id": r['id']}
    else:
        return r['id'], {"id": r['id'], "email": r['email']}
df2 = df.rdd.map(lambda row: parseTest(row))
df2.top(4) 
[('id4', {'email': ['[email protected]', '[email protected]'], 'id': 'id4'}), 
('id3', {'email': ['[email protected]'], 'id': 'id3'}), 
('id2', {'id': 'id2'}), 
('id1', {'id': 'id1'})]
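
The same idea extends beyond the `email` field. The sketch below is a hypothetical helper (`to_es_pair` is not from the original post) that drops every field whose value is None before building the (id, document) pair. It works on a plain dict, such as the result of `row.asDict()`, and the email address is a placeholder.

```python
def to_es_pair(record, id_field="id"):
    """Return (id, document), omitting fields whose value is None.

    `record` is a plain dict, e.g. the result of Row.asDict().
    The helper name and the id_field parameter are illustrative assumptions.
    """
    doc = {key: value for key, value in record.items() if value is not None}
    return record[id_field], doc


# Plain-Python check that mirrors two of the sample rows
rows = [
    {"id": "id2", "email": None},
    {"id": "id3", "email": ["a@example.com"]},
]
pairs = [to_es_pair(r) for r in rows]
```

With Spark available, this could presumably replace parseTest as `df.rdd.map(lambda row: to_es_pair(row.asDict()))`.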

Then I try to insert it:

es_conf = {"es.nodes" : "node1.com,node2.com", 
      "es.resource": "index/type"} 
df2.saveAsNewAPIHadoopFile(
    path='-', 
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat", 
    keyClass="org.apache.hadoop.io.NullWritable", 
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", 
    conf=es_conf)

And I get this:

org.apache.spark.SparkException: Data of type java.util.ArrayList cannot be used

Spark v 2.1.0 
ES v 2.4.4

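Without the email field it works fine. I found some suggested solutions that use es.output.json: true and json.dumps, but they appeared to be for version 5, so I tried on another cluster that runs ES v5: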

import json

df3 = df2.map(json.dumps)
df3.top(4) 
['["id4", {"email": ["[email protected]", "[email protected]"], "id": "id4"}]', 
'["id3", {"email": ["[email protected]"], "id": "id3"}]', 
'["id2", {"id": "id2"}]', 
'["id1", {"id": "id1"}]'] 
es_conf2 = {"es.nodes" : "anothernode1.com,anothernode2.com", 
      "es.output.json": "true", 
      "es.resource": "index/type"} 
df3.saveAsNewAPIHadoopFile(
    path='-', 
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat", 
    keyClass="org.apache.hadoop.io.NullWritable", 
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", 
    conf=es_conf2)

Then I get:

RDD element of type java.lang.String cannot be used

Spark v 2.1.0 
ES v 5.2.0
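
One thing visible in the df3 output is that json.dumps was applied to the whole (id, document) tuple, so each RDD element is a single JSON array string rather than a key/value pair. A commonly suggested shape, sketched here under assumptions and not verified against the clusters described, is to serialize only the document and keep a (key, value) pair. In elasticsearch-hadoop the setting for input that is already JSON is `es.input.json`, while `es.output.json` concerns reads. The helper name `to_json_pair`, the placeholder key, and the email address are hypothetical.

```python
import json


def to_json_pair(pair):
    """Turn (id, document) into (key, json_string).

    The key is a throwaway placeholder: with keyClass NullWritable the
    key is expected to be ignored (assumption, not verified here).
    """
    _doc_id, doc = pair
    return ("key", json.dumps(doc))


sample = ("id3", {"id": "id3", "email": ["a@example.com"]})
key, value = to_json_pair(sample)

# Hypothetical configuration for writing pre-serialized JSON documents
es_conf_json = {
    "es.nodes": "anothernode1.com,anothernode2.com",
    "es.resource": "index/type",
    "es.input.json": "true",
}

# With Spark available, the write would look roughly like:
# df2.map(to_json_pair).saveAsNewAPIHadoopFile(
#     path='-',
#     outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=es_conf_json)
```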

feelsbadman

2017-02-08 dtj

I found another way to do the same job, using the write method of the DataFrame object.

So, following on from the first part:

from pyspark.sql.types import * 
schema = StructType([ # schema 
    StructField("id", StringType(), True), 
    StructField("email", ArrayType(StringType()), True)]) 
df = spark.createDataFrame([{"id": "id1"}, 
          {"id": "id2", "email": None}, 
          {"id": "id3","email": ["[email protected]"]}, 
          {"id": "id4", "email": ["[email protected]", "[email protected]"]}], 
          schema=schema) 
df.show(truncate=False) 
+---+------------------------------------+ 
|id |email        | 
+---+------------------------------------+ 
|id1|null        | 
|id2|null        | 
|id3|[[email protected]]     | 
|id4|[[email protected], [email protected]]| 
+---+------------------------------------+

You just need:

# Parentheses avoid backslash continuations, which break on trailing whitespace
(df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "node1.com,node2.com")
    .option("es.resource", "index/type")
    .option("es.mapping.id", "id")
    .save())

There is no need to convert to an RDD or modify the DataFrame in any way.
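
If the same settings are reused across several writes, they can be collected in a dict and passed in one call through `DataFrameWriter.options`. This is a minimal sketch; the helper `es_write_options` is hypothetical and only assembles the option names already used in this answer.

```python
def es_write_options(nodes, resource, id_field=None):
    """Build the connector options used in the write call.

    `nodes` is a list of host names, `resource` is "index/type".
    `id_field`, when given, names the column used as the document id.
    """
    opts = {"es.nodes": ",".join(nodes), "es.resource": resource}
    if id_field is not None:
        opts["es.mapping.id"] = id_field
    return opts


opts = es_write_options(["node1.com", "node2.com"], "index/type", id_field="id")

# With a DataFrame `df` and the elasticsearch-hadoop connector on the classpath:
# df.write.format("org.elasticsearch.spark.sql").options(**opts).save()
```

Keeping the options in a dict makes it easier to switch clusters or indices without touching the write call itself.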

2017-02-10 16:50:52 dtj
