Elasticsearch memory issue - ES process consuming all RAM

2015-02-06

We are having a problem with our production Elasticsearch cluster where Elasticsearch appears to consume all of the RAM on every server over time. Each box has 128GB of RAM, so we run two instances per machine, each with a 30GB JVM heap. The remaining 68GB is left for the OS and Lucene. We restarted every server last week, and each Elasticsearch process was sitting at 24% memory usage. It has now been about a week and memory consumption has climbed to roughly 40% per Elasticsearch instance. I have attached our configuration files in the hope that someone can help figure out why Elasticsearch is going beyond the limits we set for memory utilization.

We are currently running ES 1.3.2, but will be upgrading to the next version, 1.4.2, next week.

Here is the output of top (extra fields removed for clarity) from right after the restart:

PID USER  %MEM TIME+ 
2178 elastics 24.1 1:03.49 
2197 elastics 24.3 1:07.32 

And here it is today:

PID USER  %MEM TIME+ 
2178 elastics 40.5 2927:50 
2197 elastics 40.1 3000:44 

elasticsearch-0.yml:

cluster.name: PROD 
node.name: "PROD6-0" 
node.master: true 
node.data: true 
node.rack: PROD6 
cluster.routing.allocation.awareness.force.rack.values: PROD4,PROD5,PROD6,PROD7,PROD8,PROD9,PROD10,PROD11,PROD12 
cluster.routing.allocation.awareness.attributes: rack 
node.max_local_storage_nodes: 2 
path.data: /es_data1 
path.logs: /var/log/elasticsearch 
bootstrap.mlockall: true 
transport.tcp.port: 9300 
http.port: 9200 
http.max_content_length: 400mb 
gateway.recover_after_nodes: 17 
gateway.recover_after_time: 1m 
gateway.expected_nodes: 18 
cluster.routing.allocation.node_concurrent_recoveries: 20 
indices.recovery.max_bytes_per_sec: 200mb 
discovery.zen.minimum_master_nodes: 10 
discovery.zen.ping.timeout: 3s 
discovery.zen.ping.multicast.enabled: false 
discovery.zen.ping.unicast.hosts: XXX 
index.search.slowlog.threshold.query.warn: 10s 
index.search.slowlog.threshold.query.info: 5s 
index.search.slowlog.threshold.query.debug: 2s 
index.search.slowlog.threshold.fetch.warn: 1s 
index.search.slowlog.threshold.fetch.info: 800ms 
index.search.slowlog.threshold.fetch.debug: 500ms 
index.indexing.slowlog.threshold.index.warn: 10s 
index.indexing.slowlog.threshold.index.info: 5s 
index.indexing.slowlog.threshold.index.debug: 2s 
monitor.jvm.gc.young.warn: 1000ms 
monitor.jvm.gc.young.info: 700ms 
monitor.jvm.gc.young.debug: 400ms 
monitor.jvm.gc.old.warn: 10s 
monitor.jvm.gc.old.info: 5s 
monitor.jvm.gc.old.debug: 2s 
action.auto_create_index: .marvel-* 
action.disable_delete_all_indices: true 
indices.cache.filter.size: 10% 
index.refresh_interval: -1 
threadpool.search.type: fixed 
threadpool.search.size: 48 
threadpool.search.queue_size: 10000000 
cluster.routing.allocation.cluster_concurrent_rebalance: 6 
indices.store.throttle.type: none 
index.reclaim_deletes_weight: 4.0 
index.merge.policy.max_merge_at_once: 5 
index.merge.policy.segments_per_tier: 5 
marvel.agent.exporter.es.hosts: ["1.1.1.1:9200","1.1.1.1:9200"] 
marvel.agent.enabled: true 
marvel.agent.interval: 30s 
script.disable_dynamic: false 

And here is /etc/sysconfig/elasticsearch-0:

# Directory where the Elasticsearch binary distribution resides 
ES_HOME=/usr/share/elasticsearch 
# Heap Size (defaults to 256m min, 1g max) 
ES_HEAP_SIZE=30g 
# Heap new generation 
#ES_HEAP_NEWSIZE= 
# max direct memory 
#ES_DIRECT_SIZE= 
# Additional Java OPTS 
#ES_JAVA_OPTS= 
# Maximum number of open files 
MAX_OPEN_FILES=65535 
# Maximum amount of locked memory 
MAX_LOCKED_MEMORY=unlimited 
# Maximum number of VMA (Virtual Memory Areas) a process can own 
MAX_MAP_COUNT=262144 
# Elasticsearch log directory 
LOG_DIR=/var/log/elasticsearch 
# Elasticsearch data directory 
DATA_DIR=/es_data1 
# Elasticsearch work directory 
WORK_DIR=/tmp/elasticsearch 
# Elasticsearch conf directory 
CONF_DIR=/etc/elasticsearch 
# Elasticsearch configuration file (elasticsearch.yml) 
CONF_FILE=/etc/elasticsearch/elasticsearch-0.yml 
# User to run as, change this to a specific elasticsearch user if possible 
# Also make sure, this user can write into the log directories in case you change them 
# This setting only works for the init script, but has to be configured separately for systemd startup 
ES_USER=elasticsearch 
# Configure restart on package upgrade (true, every other setting will lead to not restarting) 
#RESTART_ON_UPGRADE=true 
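
For reference, one quick way to confirm that the 30g heap and mlockall actually took effect on the running nodes is to query the nodes info API; the host and port below are just the defaults from the config above:

# Shows whether memory locking (mlockall) succeeded on each node
curl -s 'localhost:9200/_nodes/process?pretty'

# Shows the heap_max each JVM actually started with
curl -s 'localhost:9200/_nodes/jvm?pretty'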

Please let me know if there is any other data I can provide. Thanks in advance for your help.

  total  used  free  shared buffers  cached 
Mem:  129022  119372  9650   0  219  46819 
-/+ buffers/cache:  72333  56689 
Swap:  28603   0  28603 
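
Reading the free -m output above: the -/+ buffers/cache line says roughly 72GB is actually used by applications and about 56GB would be free once the OS page cache is reclaimed. To separate how much of each ES process is JVM heap versus everything else, something along these lines can help (the PIDs are the ones from top above, the host is a placeholder):

# Resident set size (KB) of the two ES processes
ps -o pid,rss,cmd -p 2178,2197

# Heap usage as reported by Elasticsearch itself
curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep heap_used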

It's not clear yet. Is your system actually running out of memory? – 2015-02-06 20:32:08


It has, and that is what prompted us to restart the nodes in the cluster. We were throwing OOM errors and direct memory errors across the board... Right now they are up to 80% and we have not hit errors yet, but I would like to know how to keep these two processes from consuming 100% of memory. – KLD 2015-02-06 21:15:13


Can you run free -m and add the result to your question? And do you mean a JVM OOM exception, or that the Linux OOM killer was invoked? – 2015-02-06 23:37:33

Answer


What you are seeing is not the heap blowing out; the heap will always be capped by the limit you set in the configuration. free -m and top report OS-level usage, so the usage you see there is most likely the OS caching FS calls (the page cache).

That will not cause a Java OOM.

If you are hitting Java OOMs, which relate directly to running out of Java heap space, then something else is at play. Your logs may provide some information.
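
A minimal way to check this from the Elasticsearch side (host is a placeholder): if heap_used stays under the 30g cap while the process RSS keeps growing, the growth is off-heap (OS page cache, direct buffers, and the like), not the heap. The second request shows what is occupying the heap itself:

# Per-node JVM heap usage - should stay within the configured 30g
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'

# Per-node index memory consumers (fielddata, filter cache, segment memory)
curl -s 'localhost:9200/_nodes/stats/indices?pretty' | egrep 'fielddata|filter_cache|segments|memory_in_bytes'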


I completely agree with you that the heap should stay within what I set. The question is why the OS is caching so many FS calls. Is there any way to bound this so it gets flushed and does not start killing my cluster? The confusing part is that, according to top, this memory is being consumed by the Elasticsearch processes... Is there any way to tell the ES process as a whole to stop at a certain point, and not just the heap? – KLD 2015-02-13 02:27:06
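
For what it's worth, the OS page cache is reclaimable memory and cannot really be capped per process. What can be bounded in 1.x are the usual on-heap growth points, via elasticsearch.yml settings such as the ones below; the percentages are illustrative, not recommendations:

# Cap fielddata, which is unbounded by default in 1.x
indices.fielddata.cache.size: 30%
# Circuit breakers (1.4+): reject requests that would push heap past these fractions
indices.breaker.fielddata.limit: 60%
indices.breaker.total.limit: 70%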