2016-12-24

I am trying to write to Elasticsearch from an RDD (PySpark, Python 3.5). I can write the body of the JSON document correctly, but instead of taking my _id, Elasticsearch generates its own. Unable to set _id on elasticsearch-hadoop.

My code:

class Article:
    def __init__(self, title, text, text2):
        self.id_ = title
        self.text = text
        self.text2 = text2

if __name__ == '__main__':

    pt = _sc.parallelize([Article("rt", "ted", "ted2"), Article("rt2", "ted2", "ted22")])
    save = pt.map(lambda item:
        (item.id_,
         {
             'text': item.text,
             'text2': item.text2
         }))

    es_write_conf = {
        "es.nodes": "localhost",
        "es.port": "9200",
        "es.resource": 'db/table1'
    }
    save.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_write_conf)

Program trace: (link to image)

Answer


This is configured in the index mapping settings; you can find the details in the official user guide.
Example:

curl -XPOST localhost:9200/test -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
    },
    "mappings" : {
        "test1" : {
            "_id" : { "path" : "mainkey" },
            "_source" : { "enabled" : false },
            "properties" : {
                "mainkey" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'
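Note that the `_id` `path` mapping shown above was deprecated in Elasticsearch 1.5 and removed in 2.0. On the elasticsearch-hadoop side, the documented way to control the document id is the `es.mapping.id` write option, which names a field inside each document to use as `_id`. A minimal sketch of that approach follows; the field name `doc_id` is an arbitrary choice, and the Spark wiring (shown in comments) has not been run against a live cluster here:

```python
# Sketch: let elasticsearch-hadoop take the _id from a document field via
# "es.mapping.id", instead of an index-level "_id" path mapping.

class Article:
    def __init__(self, title, text, text2):
        self.id_ = title
        self.text = text
        self.text2 = text2

def to_es_pair(article):
    """Convert an Article into the (key, value) pair EsOutputFormat expects.

    The key is ignored (NullWritable); the connector reads the _id from the
    'doc_id' field of the value map, as configured below.
    """
    return (None, {
        "doc_id": article.id_,   # consumed by the connector as _id
        "text": article.text,
        "text2": article.text2,
    })

es_write_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "db/table1",
    "es.mapping.id": "doc_id",   # use this document field as the _id
}

# With a live SparkContext `sc`, the write would look like:
# rdd = sc.parallelize([Article("rt", "ted", "ted2"), Article("rt2", "ted2", "ted22")])
# rdd.map(to_es_pair).saveAsNewAPIHadoopFile(
#     path='-',
#     outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=es_write_conf)
```

This keeps the index mapping untouched and works the same across Elasticsearch versions, since the id is chosen by the connector at write time rather than by the index.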