Python Elasticsearch bulk index data types

I am using the code below to create an Elasticsearch index and load data into it.

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch('localhost:9200')
index_name = 'wordcloud_data'
with open('./csv-data/' + index_name + '.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index=index_name, doc_type='my-type')

print("done")

My CSV data looks like this:

date,word_data,word_count 
2017-06-17,luxury vehicle,11 
2017-06-17,signifies acceptance,17 
2017-06-17,agency imposed,16 
2017-06-17,customer appreciation,11

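The data loads fine, but the data types are not accurate. How do I force word_count to be an integer rather than text? Note that it worked out the date type by itself. Is there a way to have the int type detected automatically, or to set it by passing some parameter?

Also, how do I increase ignore_above, or remove it for certain fields, so that there is basically no limit on the number of characters? This is the mapping that was generated: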

{
  "wordcloud_data" : {
    "mappings" : {
      "my-type" : {
        "properties" : {
          "date" : {
            "type" : "date"
          },
          "word_count" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "word_data" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

Source

2017-06-22 Naresh MG

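You need to create a mapping that describes the field types.

With the elasticsearch-py client, this can be done with the es.indices.put_mapping or es.indices.create methods, by passing a JSON document that describes the mapping, as shown in this SO answer. It would look something like this: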

es.indices.put_mapping(
    index="wordcloud_data",
    doc_type="my-type",
    body={
        "properties": {
            "date": {"type": "date"},
            "word_data": {"type": "text"},
            "word_count": {"type": "integer"}
        }
    }
)
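The question also asks about ignore_above, and an existing field's type (word_count is already mapped as text) cannot be changed in place, so the mapping usually has to be supplied when the index is created. The sketch below only builds the request body as a plain dict. The helper name build_index_body and the value 1024 are illustrative assumptions, and the typed layout ("my-type" under "mappings") is assumed to match the mapping output shown in the question.

```python
def build_index_body(doc_type="my-type", ignore_above=None):
    """Return a create-index body with explicit field types.

    ignore_above=None leaves the setting off the keyword sub-field,
    so Elasticsearch applies its own default instead of the 256 that
    dynamic mapping writes.
    """
    keyword = {"type": "keyword"}
    if ignore_above is not None:
        keyword["ignore_above"] = ignore_above
    properties = {
        "date": {"type": "date"},
        "word_data": {"type": "text", "fields": {"keyword": keyword}},
        "word_count": {"type": "integer"},
    }
    # Typed layout, assumed to match the mapping shown in the question.
    return {"mappings": {doc_type: {"properties": properties}}}


body = build_index_body(ignore_above=1024)
# With a live cluster the body would be passed at index creation,
# after dropping the old index (this deletes its data), e.g.:
# es.indices.delete(index="wordcloud_data", ignore=404)
# es.indices.create(index="wordcloud_data", body=body)
```

Leaving the keyword sub-field out entirely is another option if exact-match queries and aggregations on word_data are not needed.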

However, I would suggest taking a look at the elasticsearch-dsl package, which provides a much nicer declarative API to describe things. It would be something along these lines (untested):

import csv

from elasticsearch_dsl import DocType, Date, Integer, Text
from elasticsearch_dsl.connections import connections
from elasticsearch.helpers import bulk

index_name = "wordcloud_data"  # same index name as in the question

connections.create_connection(hosts=["localhost"])

class WordCloud(DocType):
    word_data = Text()
    word_count = Integer()
    date = Date()

    class Meta:
        index = "wordcloud_data"
        doc_type = "my-type"  # If you need it to be called so

# Create the index and its mapping in Elasticsearch
WordCloud.init()
with open("./csv-data/%s.csv" % index_name) as f:
    reader = csv.DictReader(f)
    bulk(
        connections.get_connection(),
        # to_dict(True) includes the metadata (_index, _type) that bulk needs
        (WordCloud(**row).to_dict(True) for row in reader)
    )

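Please note that I have not tried the code I posted; I just wrote it. I don't have an ES server at hand to test with. There may be some small errors or typos (please point them out if there are), but the general idea should be correct.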
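On the question of automatic detection: dynamic mapping has a numeric_detection setting that is off by default, whereas date detection is on by default, which would explain why the date column was recognized and word_count was not. The following is a minimal sketch of a create-index body that turns it on. The helper name is an illustrative assumption, and the typed layout is assumed to match the mapping output in the question.

```python
def dynamic_detection_body(doc_type="my-type"):
    """Create-index body that lets dynamic mapping detect numeric strings."""
    return {
        "mappings": {
            doc_type: {
                # Off by default; date detection is on by default.
                "numeric_detection": True,
            }
        }
    }


body = dynamic_detection_body()
# With a live cluster, before loading any data:
# es.indices.create(index="wordcloud_data", body=body)
```

With this setting a string such as "11" is mapped as long rather than integer, so an explicit mapping is still the way to go if the exact type matters.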

Source

2017-06-22 10:28:31 drdaeman

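Thanks, I will try this and let you know. –

I just changed the order to match the order in the file... not sure whether it matters; everything seems to work fine from what I can tell. Here is one of the documents. Is an integer type supposed to be stored with double quotes, as in "word_count": "12"? ---- "_id": "AVzS4_2-UW5hFY6GiWVj", "_score": 1.0, "_source": { "word_date": "2017-06-17T00:00:00", "word_count": "12", "word_data": "mobile phone" } –

@NareshMG No, it should store and return a number, not a string (strings are accepted on input, but are coerced to the type the mapping defines). This may be because you need to remove the existing data *completely* (delete the index) and recreate it. Just defining a mapping will not update data that already exists. If you did not do this, you will have data of mixed types in your database. – drdaeman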
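A related point on the quoted "12": _source holds the document body as it was sent, and csv.DictReader yields every value as a string, so a quoted number in _source can appear even when the field itself is indexed as an integer. If numbers are wanted in _source as well, one option is to convert the rows before they reach helpers.bulk. The sketch below assumes the column names from the question's CSV; the helper name typed_rows is illustrative.

```python
import csv
import io


def typed_rows(reader):
    """Yield CSV rows with word_count converted from str to int."""
    for row in reader:
        row = dict(row)  # copy so the reader's row is left untouched
        row["word_count"] = int(row["word_count"])
        yield row


# Two rows from the question's CSV, read from memory for the example.
sample = io.StringIO(
    "date,word_data,word_count\n"
    "2017-06-17,luxury vehicle,11\n"
    "2017-06-17,signifies acceptance,17\n"
)
rows = list(typed_rows(csv.DictReader(sample)))
```

With a real file, typed_rows(csv.DictReader(f)) could be passed to helpers.bulk in place of the bare reader.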
