How to Fix Elasticsearch Out of Memory: A Complete Troubleshooting Guide

If you’re staring at a red cluster status and a java.lang.OutOfMemoryError in your logs, you already know the panic that sets in. Elasticsearch is the backbone of your search infrastructure, and when it runs out of memory, everything downstream grinds to a halt.

I’ve spent years managing Elasticsearch clusters in production—from small three-node setups to massive 50+ node deployments. In this guide, I’ll walk you through exactly how to fix Elasticsearch out of memory errors, covering root cause analysis, step-by-step solutions, and prevention strategies that actually work in 2026.


Understanding the Out of Memory Error

Before jumping into fixes, you need to understand what “out of memory” actually means in the Elasticsearch context. There are two distinct types of memory exhaustion, and the solutions are completely different.

JVM Heap Memory vs. Off-Heap Memory

Elasticsearch uses the Java Virtual Machine (JVM), which manages memory in two primary areas:

  • JVM Heap: Used for query execution, aggregations, indexing buffers, and node-level bookkeeping. This is controlled by the -Xms and -Xmx flags.
  • Off-Heap Memory: Direct buffers used for network and file I/O, plus the operating system’s page cache, which Lucene relies on (through memory-mapped files) to read index segments quickly. Neither is counted against -Xmx.

The error messages you’ll see differ based on which memory area is exhausted:

Heap exhaustion errors:

java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: GC overhead limit exceeded

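Circuit breaker errors (these also concern the heap: Elasticsearch rejects a request that would push estimated heap usage past a limit):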

CircuitBreakingException[[parent] Data too large, data for [...] 
would be larger than limit of [...]]

The 50% Rule

The most common mistake I see is allocating too much RAM to the JVM heap. The golden rule is:

JVM heap should be no more than 50% of available physical RAM, and should never exceed about 31 GB.

Why? Because Lucene relies heavily on the operating system’s file system cache to perform fast searches. If you give all your RAM to the JVM, Lucene has to read segments from disk, which is orders of magnitude slower.

The 31 GB limit exists because of JVM’s “compressed oops” feature. Below ~32 GB, the JVM can use compressed object pointers, which significantly reduces memory overhead. Above this threshold, pointers expand, and you actually lose usable memory.
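
The rule above reduces to a small calculation. The following is a minimal sketch of it; the function name recommended_heap_gb and the constant are illustrative assumptions based on this guide's rule of thumb, not part of any Elasticsearch API.

```python
COMPRESSED_OOPS_CAP_GB = 31  # stay under the ~32 GB compressed-oops threshold


def recommended_heap_gb(physical_ram_gb: float) -> float:
    """Return half of physical RAM, capped at 31 GB (hypothetical helper)."""
    if physical_ram_gb <= 0:
        raise ValueError("physical_ram_gb must be positive")
    return min(physical_ram_gb / 2, COMPRESSED_OOPS_CAP_GB)


if __name__ == "__main__":
    for ram in (8, 16, 32, 64, 128):
        heap = recommended_heap_gb(ram)
        print(f"{ram} GB RAM: {heap:g} GB heap, {ram - heap:g} GB left for the file cache")
```

Whatever is left after the heap is what the operating system can use for the file system cache, which is why the cap matters more as machines get larger.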


Step 1: Diagnose the Problem

Check Current JVM Heap Settings

First, determine your current heap configuration:

# Check via Elasticsearch API (-s hides curl's progress meter when piping)
curl -s -X GET "localhost:9200/_nodes/stats/jvm?pretty" | grep -E "heap_(used|max)_in_bytes"

# Or check the jvm.options file
grep -E "^-Xm[sx]" /etc/elasticsearch/jvm.options

Identify Which Nodes Are Struggling

# Get heap usage per node
curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,disk.used_percent"

# Get detailed JVM stats (GC collection counts and times are reported under jvm.gc)
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

Look for nodes where heap.percent is consistently above 75%. That’s a red flag.
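
If you poll that endpoint from a script, the check can be automated. The sketch below parses the text table returned by _cat/nodes with the v flag and flags nodes above a threshold. The function name and the sample output are made up for illustration, and it assumes node names contain no spaces.

```python
def nodes_over_heap_threshold(cat_nodes_text: str, threshold: int = 75) -> list[str]:
    """Return names of nodes whose heap.percent exceeds the threshold."""
    lines = [line for line in cat_nodes_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_col = header.index("name")
    heap_col = header.index("heap.percent")
    flagged = []
    for line in lines[1:]:
        cols = line.split()
        if int(cols[heap_col]) > threshold:
            flagged.append(cols[name_col])
    return flagged


# Made-up sample output, for illustration only
SAMPLE = """\
name   heap.percent ram.percent disk.used_percent
node-1 62 91 48.10
node-2 83 95 71.32
node-3 79 93 66.05
"""

if __name__ == "__main__":
    print(nodes_over_heap_threshold(SAMPLE))
```

A single reading above the threshold is not conclusive, since heap usage rises and falls with garbage collection; run the check repeatedly and act on nodes that stay flagged.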

Analyze Garbage Collection Logs

Enable GC logging if it’s not already on (it should be by default in Elasticsearch 7+):

# Check if GC logs exist
ls -la /var/log/elasticsearch/*gc*.log

# Tail the GC log to watch for problems (gc.log is the default name on package installs)
tail -f /var/log/elasticsearch/gc.log

Watch for these warning signs:
  • Frequent Full GC events (should be rare)
  • Long GC pause times (>1 second is concerning)
  • A pattern where GC runs but heap doesn’t decrease significantly (sustained heap pressure or a possible memory leak)

Check Circuit Breaker Trips

Elasticsearch has built-in circuit breakers that prevent OOM errors by failing requests before memory is exhausted:

# Check circuit breaker statistics
curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"

Look at the tripped count. If it’s climbing, queries are being rejected to prevent OOM.
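
To see which breaker is tripping, it helps to total the counters across nodes. This is a minimal sketch that assumes you have already parsed the JSON response from the command above; the function name and the trimmed sample response are made up for illustration.

```python
import json
from collections import Counter


def tripped_by_breaker(breaker_stats: dict) -> Counter:
    """Sum the tripped counter per breaker name across all nodes."""
    totals = Counter()
    for node in breaker_stats.get("nodes", {}).values():
        for name, breaker in node.get("breakers", {}).items():
            totals[name] += breaker.get("tripped", 0)
    return totals


# Trimmed, made-up response, for illustration only
SAMPLE_JSON = """
{"nodes": {
  "a1": {"breakers": {"parent": {"tripped": 4}, "fielddata": {"tripped": 0}}},
  "b2": {"breakers": {"parent": {"tripped": 1}, "fielddata": {"tripped": 2}}}
}}
"""

if __name__ == "__main__":
    totals = tripped_by_breaker(json.loads(SAMPLE_JSON))
    print({name: count for name, count in totals.items() if count > 0})
```

The counters are cumulative since node start, so take two snapshots a few minutes apart and compare them to tell whether trips are still happening.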


Step 2: Fix JVM Heap Configuration (Most Common Solution)

The Standard Fix

A common cause of OOM errors is simply a mis-sized heap: too small for the workload, or so large that it starves the file system cache. Here’s how to fix it properly.

For Elasticsearch 7.x and 8.x:

Edit the jvm.options file:

sudo nano /etc/elasticsearch/jvm.options

Set both minimum and maximum heap to the same value:

# For a machine with 64GB RAM, allocate 31GB to heap
-Xms31g
-Xmx31g

# If using Docker, set via environment variable
# ES_JAVA_OPTS="-Xms31g -Xmx31g"

For newer installations using jvm.options.d:

# Create a custom override file
echo "-Xms16g" | sudo tee /etc/elasticsearch/jvm.options.d/heap.options
echo "-Xmx16g" | sudo tee -a /etc/elasticsearch/jvm.options.d/heap.options

Sizing Your Heap Correctly

Physical RAM    Recommended Heap    File Cache
8 GB            4 GB                4 GB
16 GB           8 GB                8 GB
32 GB           16 GB               16 GB
64 GB           31 GB               33 GB
128 GB          31 GB               97 GB

Restart After Changes

# Restart Elasticsearch
sudo systemctl restart elasticsearch

# Verify the new settings took effect
curl -X GET "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_in_bytes,nodes.*.jvm.mem.heap_max_in_bytes"

Step 3: Address Query-Induced Memory Pressure

If your heap settings are correct but you’re still hitting OOM, the problem is likely your queries.

Problem: Large Aggregations

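Aggregations are memory-intensive because they build buckets in memory. A cardinality aggregation is cheap on its own, but its cost is paid per bucket, so it multiplies when the aggregation is nested under a terms or date_histogram aggregation that produces many buckets. A high precision_threshold makes every one of those buckets more expensive.

Bad (expensive once nested under many buckets):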

{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 40000
      }
    }
  }
}

Better:

{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 1000
      }
    }
  }
}

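The precision_threshold controls the accuracy of the HyperLogLog++ algorithm. Memory use is roughly precision_threshold × 8 bytes per bucket, so lower values use less memory at the cost of some accuracy. The default is 3000 and the maximum is 40000.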
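
To put numbers on that trade-off, Elastic's documentation gives the cost as about c × 8 bytes for a threshold of c. The sketch below applies that estimate; the function name and the parent_buckets parameter are illustrative assumptions, and the result is an approximation rather than a measurement.

```python
BYTES_PER_THRESHOLD_UNIT = 8      # documented estimate: about c * 8 bytes
MAX_PRECISION_THRESHOLD = 40000   # larger values behave like 40000


def cardinality_agg_bytes(precision_threshold: int, parent_buckets: int = 1) -> int:
    """Estimate heap used by a cardinality aggregation across parent buckets."""
    effective = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return effective * BYTES_PER_THRESHOLD_UNIT * parent_buckets


if __name__ == "__main__":
    for threshold in (1000, 40000):
        single = cardinality_agg_bytes(threshold)
        nested = cardinality_agg_bytes(threshold, parent_buckets=10_000)
        print(f"threshold={threshold}: {single} bytes alone, {nested} bytes under 10,000 buckets")
```

A single top-level aggregation stays in the kilobyte range either way; the threshold starts to matter when the aggregation sits under thousands of parent buckets.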

Problem: Deep Pagination

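Using from + size for deep pagination forces every shard to collect and sort its top from + size hits, and the coordinating node then has to merge all of them just to return one page.

Bad (rejected by default because from + size exceeds index.max_result_window, which defaults to 10,000; raising that limit invites OOM on large datasets):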

{
  "from": 100000,
  "size": 10,
  "query": { "match_all": {} }
}
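
A rough way to see why this request is expensive is to count the sort entries that have to be held for it. The sketch below gives an upper bound, assuming every shard has at least from + size matching documents; the function name and the shard count of 5 are assumptions for illustration.

```python
def deep_page_cost(from_: int, size: int, shards: int) -> tuple[int, int]:
    """Return (entries collected per shard, entries merged on the coordinating node)."""
    if from_ < 0 or size < 0 or shards < 1:
        raise ValueError("from_ and size must be >= 0 and shards >= 1")
    per_shard = from_ + size
    return per_shard, per_shard * shards


if __name__ == "__main__":
    per_shard, coordinating = deep_page_cost(from_=100_000, size=10, shards=5)
    print(f"{per_shard} entries per shard, {coordinating} on the coordinating node, for 10 results")
```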

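Better – Use search_after. The tiebreaker should be a unique field backed by doc values; event_id below is a hypothetical keyword field standing in for one. Elastic’s documentation advises against sorting on _id, because that loads fielddata into the heap:

{
  "size": 10,
  "query": { "match_all": {} },
  "sort": [
    { "timestamp": "asc" },
    { "event_id": "asc" }
  ],
  "search_after": [1708982400000, "last_event_id"]
}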
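
In client code, search_after becomes a loop that feeds the sort values of the last hit into the next request. The following is a minimal sketch of that loop. The search argument is a hypothetical callable that takes a request body and returns the parsed response, so the function is not tied to any particular client library, and fake_search is an in-memory stand-in used only to demonstrate it.

```python
from typing import Callable, Iterator


def iterate_hits(search: Callable[[dict], dict], body: dict) -> Iterator[dict]:
    """Yield every hit by following search_after from page to page.

    `body` must contain "size" and a "sort" that ends in a unique tiebreaker.
    """
    request = dict(body)  # shallow copy so the caller's body is not modified
    while True:
        hits = search(request)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The sort values of the last hit become the cursor for the next page
        request["search_after"] = hits[-1]["sort"]


# In-memory stand-in for a cluster, for illustration only
DOCS = [{"_id": str(i), "sort": [i]} for i in range(25)]


def fake_search(request: dict) -> dict:
    after = request.get("search_after", [-1])[0]
    page = [d for d in DOCS if d["sort"][0] > after][: request["size"]]
    return {"hits": {"hits": page}}


if __name__ == "__main__":
    total = sum(1 for _ in iterate_hits(fake_search, {"size": 10, "sort": [{"n": "asc"}]}))
    print(total)
```

Only one page of hits is held at a time, which is the property that keeps memory flat no matter how deep the pagination goes.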

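Or use the Scroll API for batch processing (Elastic now recommends search_after with a point in time for new code, and open scroll contexts hold resources on the data nodes, so clear them when you are done):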

# Initial scroll request
curl -X POST "localhost:9200/my_index/_search?scroll=1m&pretty" -H 'Content-Type: application/json' -d'
{
  "size": 1000,
  "query": { "match_all": {} }
}
'

Problem: Large Bulk Requests

Bulk indexing requests that are too large can spike heap usage:

# Bad: One massive bulk request
curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary @huge_file.json

# Better: Split into smaller chunks (5-15MB each)
# Keep the line count even so each action line stays with its document line
split -l 10000 huge_file.json chunk_
for f in chunk_*; do
  curl -s -X POST "localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' --data-binary "@$f"
  sleep 1
done
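
Splitting by line count only approximates the size target, because documents vary in length. If you build bulk bodies in code, you can split by bytes instead. The sketch below assumes every action is an index or create action, so each action line is followed by exactly one document line; the function name and the 10 MB default are illustrative choices.

```python
from typing import Iterable, Iterator


def chunk_bulk_lines(lines: Iterable[str], max_bytes: int = 10 * 1024 * 1024) -> Iterator[str]:
    """Group (action, document) line pairs into bulk bodies of at most max_bytes.

    A single pair larger than max_bytes is still emitted, on its own.
    """
    batch: list[str] = []
    batch_bytes = 0
    it = iter(lines)
    for action in it:
        document = next(it, None)
        if document is None:
            raise ValueError("action line without a document line")
        pair = action.rstrip("\n") + "\n" + document.rstrip("\n") + "\n"
        pair_bytes = len(pair.encode("utf-8"))
        if batch and batch_bytes + pair_bytes > max_bytes:
            yield "".join(batch)
            batch, batch_bytes = [], 0
        batch.append(pair)
        batch_bytes += pair_bytes
    if batch:
        yield "".join(batch)


if __name__ == "__main__":
    sample = ['{"index":{}}', '{"n":1}', '{"index":{}}', '{"n":2}', '{"index":{}}', '{"n":3}']
    for body in chunk_bulk_lines(sample, max_bytes=45):
        print(repr(body))
```

Each yielded string ends with a newline, which the bulk API requires, and can be sent as the body of one _bulk request.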

Step 4: Tune Circuit Breakers

Circuit breakers are your safety net. Tuning them can prevent OOM crashes at the cost of rejecting some requests.

Check Current Settings

curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.breaker" | python3 -m json.tool

Adjust Circuit Breaker Limits

# Lower the parent breaker limit so requests are rejected earlier (default 95% with real-memory tracking, 70% without)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.breaker.total.limit": "70%"
  }
}
'

# Adjust fielddata breaker (default 40%)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.breaker.fielddata.limit": "40%",
    "indices.breaker.fielddata.overhead": 1.03
  }
}
'

# Adjust request breaker (default 60%)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.breaker.request.limit": "50%"
  }
}
'
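
Breaker limits are percentages of the JVM heap, so the same setting means very different byte budgets on different nodes. The sketch below converts the percentages used above into bytes; the function name is hypothetical and the 16 GB heap is only an example value.

```python
def breaker_limit_bytes(heap_bytes: int, limit_percent: float) -> int:
    """Convert a breaker limit given as a percentage of heap into bytes."""
    if not 0 < limit_percent <= 100:
        raise ValueError("limit_percent must be in (0, 100]")
    return int(heap_bytes * limit_percent / 100)


if __name__ == "__main__":
    heap = 16 * 1024**3  # example: a 16 GB heap
    for name, percent in (("parent", 70), ("fielddata", 40), ("request", 50)):
        gib = breaker_limit_bytes(heap, percent) / 1024**3
        print(f"{name}: {percent}% of heap = {gib:.1f} GiB")
```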

Monitor Field Data Cache

# Check fielddata usage by field
curl -X GET "localhost:9200/_stats/fielddata?fields=*&pretty"

# Clear fielddata cache if needed (emergency only)
curl -X POST "localhost:9200/_cache/clear?fielddata=true"

If specific fields are consuming excessive fielddata memory, you’ve likely mapped a text field for aggregation. Fix the mapping:

// Instead of this (forces fielddata):
{
  "properties": {
    "category": { "type": "text", "fielddata": true }
  }
}

// Do this (uses doc_values, stored on disk):
{
  "properties": {
    "category": { "type": "keyword" },
    "category_text": { "type": "text" }
  }
}
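
On a cluster with many indices, finding the offending fields by eye is tedious. The sketch below walks the properties section of a mapping and lists text fields that have fielddata enabled. The function name and the sample mapping are made up for illustration; it assumes you pass in the properties object taken from a mapping response.

```python
def fields_with_fielddata(properties: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of text fields that have fielddata enabled."""
    found = []
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == "text" and spec.get("fielddata") is True:
            found.append(path)
        # Recurse into object/nested sub-properties and into multi-fields
        for key in ("properties", "fields"):
            if key in spec:
                found.extend(fields_with_fielddata(spec[key], prefix=f"{path}."))
    return found


# Made-up mapping, for illustration only
SAMPLE_PROPERTIES = {
    "category": {"type": "text", "fielddata": True},
    "title": {"type": "keyword"},
    "user": {
        "properties": {
            "bio": {"type": "text"},
            "tags": {"type": "text", "fielddata": True},
        }
    },
}

if __name__ == "__main__":
    print(fields_with_fielddata(SAMPLE_PROPERTIES))
```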

Step 5: Optimize Index and Shard Configuration

Reduce Shard Count

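Every shard has heap overhead for its segment metadata and mappings, no matter how little data it holds. Elastic’s older rule of thumb was to stay below 20 shards per GB of heap, which works out to a budget of roughly 50 MB of heap per shard. Having too many small shards is a common cause of OOM.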

# Check your current shard-to-data ratio
curl -X GET "localhost:9200/_cat/indices?v&h=index,docs.count,store.size,pri"

Target: 30-50 GB per shard for time-based indices.
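
Working backwards from that target gives a primary shard count for a given index size. This is a minimal sketch; the function name is hypothetical and the 40 GB default is simply the midpoint of the range above, not an official figure.

```python
import math


def primary_shards_for(index_size_gb: float, target_shard_gb: float = 40) -> int:
    """Return how many primary shards keep each shard near the target size."""
    if index_size_gb < 0 or target_shard_gb <= 0:
        raise ValueError("index_size_gb must be >= 0 and target_shard_gb > 0")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


if __name__ == "__main__":
    for size in (5, 120, 400):
        print(f"{size} GB index: {primary_shards_for(size)} primary shard(s)")
```

For a daily index that only ever reaches a few gigabytes, the answer is one shard, which is usually a sign that a longer rollover period would serve better than daily indices.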

If you have many small indices, consider shrinking them:

# Prepare index for shrinking (make it read-only, single copy)
curl -X PUT "localhost:9200/small_index/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.routing.allocation.require._name": "node-1",
    "index.blocks.write": true,
    "index.number_of_replicas": 0
  }
}'

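# Shrink to 1 shard, clearing the temporary allocation and write-block settings on the target
curl -X POST "localhost:9200/small_index/_shrink/large_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index.codec": "best_compression",
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}'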

Force Merge Old Indices

Segments consume file handles and cache. Force-merging read-only indices reduces overhead:

# Force merge old indices to a single segment
curl -X POST "localhost:9200/old_logs-*/_forcemerge?max_num_segments=1"

Delete Unnecessary Indices

# Set up ILM (Index Lifecycle Management) to auto-delete old data
curl -X PUT "localhost:9200/_ilm/policy/logs_cleanup" -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}'

Step 6: Address Off-Heap Memory Issues

If you see java.lang.OutOfMemoryError: Direct buffer memory in your logs, the problem is off-heap.

Increase Direct Memory

# In jvm.options
-XX:MaxDirectMemorySize=2g

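On most systems, the default is fine, because Elasticsearch already derives this value from the heap size at startup. Treat an explicit override as a last resort for high-throughput environments, and remember that whatever you grant here comes out of the memory left for the file system cache.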

Check MMap Counts

Lucene uses memory-mapped files (mmap) for reading segments:

# Check the current limit on memory-mapped areas per process
cat /proc/sys/vm/max_map_count

# Elasticsearch recommends at least 262144
# Increase it permanently:
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

Check Swappiness

Swap will destroy Elasticsearch performance:

# Check current swappiness
cat /proc/sys/vm/swappiness

# Set to 1 (or 0 on dedicated nodes)
sudo sysctl vm.swappiness=1

# Make permanent
echo "vm.swappiness=1" | sudo tee -a /etc/sysctl.conf

# Or disable swap entirely on dedicated ES nodes
sudo swapoff -a
# Comment out swap in /etc/fstab

Step 7: Tune JVM Garbage Collection Settings

Elasticsearch 7+ uses the G1 garbage collector by default when it runs on its bundled JDK, and the defaults are generally good. But if you’re on a large heap (16GB+), you may need to tune it.

# In jvm.options or jvm.options.d/gc.options

## G1GC Configuration (default in ES 8.x)
-XX:+UseG1GC

## G1 Tuning parameters
-XX:MaxGCPauseMillis=200
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

## Disable biased locking (only meaningful on JDK 14 and older; the flag is deprecated from JDK 15, so omit it on newer JDKs)
-XX:-UseBiasedLocking

For very large heaps (20GB+), you might also consider:

## Increase G1 region size for large heaps
-XX:G1HeapRegionSize=32m

## Tune concurrent GC threads
-XX:ConcGCThreads=4
-XX:ParallelGCThreads=8

Step 8: Scale Horizontally

If you’ve optimized everything and still hit OOM, you need more capacity.

Add Data Nodes

# Install Elasticsearch on a new node, then configure:
# elasticsearch.yml
cluster.name: my-cluster
node.name: node-4
node.roles: [data]
network.host: 0.0.0.0
discovery.seed_hosts: ["node-1", "node-2", "node-3"]
# Do not set cluster.initial_master_nodes here: it is only for bootstrapping a brand-new cluster

Use Hot-Warm-Cold Architecture

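Separate nodes by hardware profile. The custom-attribute approach below works on any version; from 7.10 onward, the built-in data tier roles (data_hot, data_warm, data_cold) are the preferred way to do this: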

# Hot node (fast SSD, more CPU)
node.attr.data: hot

# Warm node (larger HDD, less CPU)
node.attr.data: warm

# Cold node (cheap storage)
node.attr.data: cold

Then route indices appropriately:

curl -X PUT "localhost:9200/logs-2026.01/_settings" -H 'Content-Type: application/json' -d'
{
  "index.routing.allocation.require.data": "hot"
}'

Step 9: Handle Specific Edge Cases

Memory Leak in Old Plugin Versions

Some older plugin versions have memory leaks. Check your installed plugins:

# List installed plugins
/usr/share/elasticsearch/bin/elasticsearch-plugin list

# Check for known issues - update plugins regularly
/usr/share/elasticsearch/bin/elasticsearch-plugin remove old-plugin-name
/usr/share/elasticsearch/bin/elasticsearch-plugin install new-plugin-name

Mapping Explosions

A mapping with thousands of fields or dynamic mapping creating
