Garbage Collection in Elasticsearch and the G1GC


We have been using Elasticsearch as the storage and analysis tool for our centralized logging, in addition to various full-text search needs in different applications. The cluster used for centralized logs handles a heavy load, indexing nearly 1.5 billion documents a day and peaking at about 50,000 documents per second. The same cluster bears the load of searches that power alerts and feed data to Kibana dashboards. Hence the cluster performs a large number of aggregation queries in addition to document indexing.
With heavy loads the cluster started having performance issues, which led us to investigate.

The Investigation

For our centralized logging, the Elasticsearch cluster is made up of the following:

    • Elasticsearch version: 6.2.2
    • No. of indices: 375
    • No. of shards including replicas: 1500
    • No. of nodes: 4, each with a 24-core CPU, 64GB RAM and a 2TB SSD
    • Java version: 1.8.0_151
    • JVM heap allocated on each node: 28GB

There are no dedicated master nodes. All 4 nodes act as both master and data nodes.

The first obvious step was to look at the cluster logs to see if anything unexpected was happening. We immediately noticed many instances of GC overhead in the logs.

[2018-09-19T06:33:50,938][INFO ][o.e.m.j.JvmGcMonitorService] [node-93] [gc][61156] overhead, spent [310ms] collecting in the last [1s]

This generally does not cause much of an issue as long as the pauses are within limits and not too frequent. But over the day, a closer look at the logs revealed the dreaded “stop the world” pauses happening frequently, with the GC unable to free up any memory.

[2018-09-19T06:37:07,882][WARN ][o.e.m.j.JvmGcMonitorService] [node-93] [gc][old][61182][190] duration [59s], collections [1]/[59.9s], total [59s]/[3.2m], memory [25.7gb]->[25.6gb]/[27.8gb], all_pools {[young] [115.3mb]->[10.3mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [25.4gb]->[25.3gb]/[26.1gb]}

The above log line describes a full GC that took 59.9 seconds, in which the JVM was only able to reduce heap usage from 25.7GB to 25.6GB out of a total of 27.8GB. The GC did shrink the young gen and the survivor space, but those form a small part of the overall heap, so effectively very little memory was reclaimed. All of this pointed the problem towards the heap memory and garbage collection.

Garbage Collection In Elasticsearch

In Elasticsearch the default garbage collector is Concurrent Mark Sweep (CMS). Elasticsearch ships with very good default garbage collector settings; in fact, the documentation suggests against tinkering with them. The CMS is a big improvement over the older parallel GC. As mentioned in the documentation:

“The CMS collector attempts to reduce pause times due to major collections by using separate garbage collector threads to trace the reachable objects concurrently with the execution of the application threads. During each major collection cycle, the CMS collector pauses all the application threads for a brief period at the beginning of the collection and again toward the middle of the collection. The second pause tends to be the longer of the two pauses. Multiple threads are used to do the collection work during both pauses. The remainder of the collection (including most of the tracing of live objects and sweeping of unreachable objects) is done with one or more garbage collector threads that run concurrently with the application. Minor collections can interleave with an ongoing major cycle, and are done in a manner similar to the parallel collector (in particular, the application threads are stopped during minor collections).“

The default GC settings in Elasticsearch are as below:

-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

-XX:+UseConcMarkSweepGC This flag tells the JVM to use the CMS GC.
-XX:CMSInitiatingOccupancyFraction=75 This number specifies the percentage occupancy of the old gen space after which a concurrent collection is triggered. Tuning this parameter correctly allows the JVM to complete the collection before the old generation is full, thus avoiding a full GC.
-XX:+UseCMSInitiatingOccupancyOnly This tells the JVM to avoid using its self-determined trigger and depend only on the value of CMSInitiatingOccupancyFraction for initiating a concurrent collection, thereby making the GC more predictable.
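As a rough worked example, the occupancy fraction translates into a trigger threshold on the old gen. The old gen size used here is taken from our GC log above; your heap layout will differ.

```python
# Sketch: the old-gen occupancy at which CMS kicks off a concurrent cycle.
# old_gen_gb is illustrative, taken from the GC log above (26.1GB old gen).
old_gen_gb = 26.1
occupancy_fraction = 75  # -XX:CMSInitiatingOccupancyFraction=75

trigger_gb = old_gen_gb * occupancy_fraction / 100
print(f"concurrent cycle starts at ~{trigger_gb:.1f}GB of old gen used")
```

With a 75% threshold the collector starts well before the roughly 26GB old gen fills up, leaving headroom for the concurrent phase to finish before a full GC becomes unavoidable.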

The CMS uses a model of contiguous memory regions, and when the memory allocated to the JVM is large, the old gen tends to grow much larger before a GC of the old gen is triggered. Collecting it can then take so long that the GC is unable to clear the old gen before it fills up, a situation referred to as a concurrent mode failure. This causes the JVM to trigger a full GC, resulting in the dreaded stop-the-world pauses, where the JVM stops all application threads while it does the collection. This can happen because of:

    • A high rate of object creation, driven by high transaction volume, causing many objects to overflow from the young gen into the old gen.
    • A small heap size, leaving less space for long-lived objects.

The graph below presents the GC duration times for one node in our Elasticsearch cluster with the default CMS GC. [courtesy: http://gceasy.io]

CMS GC duration

As the graph shows, there were a lot of full GCs happening, and they were taking 7-8 seconds to complete. This definitely had an impact on the cluster performance.

Garbage First Garbage Collector (G1GC)

One of the issues with CMS is the need to collect the whole old gen at once whenever the GC is triggered. This stems from the contiguous memory regions in CMS. Hence the performance of CMS degrades as heap size increases.
The G1GC follows a model of non-contiguous memory regions. It divides the whole heap into smaller regions, by default targeting 2048 of them. The region size is fixed at JVM startup and is a power of 2 between 1MB and 32MB. Each region can serve as young gen or old gen. The advantage of having more memory regions is that the GC can decide to analyze only part of the heap, choosing the regions with the most garbage. The number of regions to collect is also driven by the target pause time. In short, the GC runs over a smaller set of memory regions, avoiding collection of the whole old gen at once and thereby reducing GC pause times. More details about G1GC can be found here. The G1GC tends to perform better with larger heap sizes, generally greater than 10GB.
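The sizing rule can be sketched as follows. This is a simplification (HotSpot actually works from the average of the initial and maximum heap sizes), but it captures the idea:

```python
# Sketch of how G1 picks its default region size: heap size divided by a
# target of 2048 regions, rounded down to a power of 2, clamped to 1MB-32MB.
def g1_default_region_size_mb(heap_gb):
    target = heap_gb * 1024 / 2048  # MB per region if there were exactly 2048
    size = 1
    while size * 2 <= target and size < 32:
        size *= 2
    return size

print(g1_default_region_size_mb(28))  # our 28GB heap -> 8 (MB)
```

For our 28GB heap this yields 8MB regions, which matches what we observed in the GC logs.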

G1GC on Elasticsearch

For our use case, each node in the Elasticsearch cluster was allocated a heap of 28GB. This definitely suggests using the G1GC instead of the default CMS. As seen above, the GC pause duration was also quite high with CMS. But the biggest deterrent was the information on the Lucene Java bugs page:

http://wiki.apache.org/lucene-java/JavaBugs
Do not, under any circumstances, run Lucene with the G1 garbage collector. Lucene’s test suite fails with the G1 garbage collector on a regular basis, including bugs that cause index corruption.

The Java bug associated with the above statement has been marked resolved, and with performance degrading on our cluster we decided to try the G1GC. The main motivation was to get low garbage collection pauses and avoid long stop-the-world pauses. We took a snapshot of the indices (as a precaution) and changed the GC settings as below.

-XX:+UseG1GC Enables the G1GC.
-XX:MaxGCPauseMillis=300 Sets the target maximum pause time, in milliseconds.
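In Elasticsearch 6.x these flags live in config/jvm.options; the switch amounts to commenting out the CMS lines and adding the G1 ones. A sketch (your file layout may differ):

```
## Comment out the default CMS settings
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly

## Enable G1GC with a pause-time target
-XX:+UseG1GC
-XX:MaxGCPauseMillis=300
```

Note that the CMS lines must actually be removed or commented out; leaving both collectors enabled prevents the JVM from starting.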
The other GC parameters were left at their defaults. With the G1GC we saw improvements in GC durations immediately. The graph below presents the GC duration times using the G1GC.

G1GC duration

With G1GC the average GC duration dropped from 7-8 seconds to about half a second. This was a huge improvement.

CPU Usage in G1GC

The improvement in GC duration did not come free of cost. As expected, with G1GC there was an increase in the CPU load average. A comparison of CPU load averages while using CMS and G1GC is shown below.

cms cpu usage

CPU load average with CMS (created using zabbix)

g1gc cpu usage

CPU load average with G1GC

The CPU load average of the node went up from around 6 to 8. This was within limits and did not cause any significant performance hit.

Humongous Objects and Allocations

In G1GC the memory region size plays a very important role. As stated earlier, the default region size is decided by the JVM, which tries to achieve 2048 regions while keeping the size a power of 2 in the range of 1MB to 32MB. This translates into the following region sizes per allocated heap.

heap region size

Default region sizes as per allocated Heap

Any object whose size is more than 50% of the region size is considered a humongous object. Humongous objects are allocated into a contiguous set of regions referred to as humongous regions. In such regions, the space between the end of the last humongous object and the end of the region remains unused, and this unused space can leave the heap more fragmented. Frequent humongous allocations can therefore lead to GC performance issues.
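The threshold rule itself is simple; a minimal sketch, with the region size as a parameter (pick yours from the table above):

```python
# Sketch: in G1, an allocation larger than half a region is "humongous"
# and is placed in dedicated contiguous humongous regions.
def is_humongous(object_size_mb, region_size_mb):
    return object_size_mb > region_size_mb / 2

# With an 8MB region size, a 5MB buffer is humongous, a 3MB one is not.
print(is_humongous(5, 8))  # True
print(is_humongous(3, 8))  # False
```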
In our Elasticsearch cluster, each node was allocated 28GB of heap, which translates into a default region size of 8MB. The cluster was handling frequent nested aggregation queries, so humongous object allocation was expected. From the GC logs the following was noticed:

[2018-08-25T06:55:20.824+0000][125881][gc,heap ] GC(67) Humongous regions: 743->596

This log line indicates a collection in the humongous regions: humongous objects initially occupied 743 regions, which reduced to 596 after the GC. From the GC logs it was determined that most of the time the humongous objects occupied 500 to 900 regions. The table below displays the causes of GC when the region size was 8MB.

g1gc causes

GC causes when region size was 8MB

It can be seen that 3.7% of GC time was spent on humongous allocations. To see the impact, we increased the region size to 16MB, using the setting:
-XX:G1HeapRegionSize=16m This overrides the default region size in G1GC to 16MB.
Logs related to GC of humongous regions:

[2018-09-03T09:47:07.248+0000][29623][gc,heap ] GC(368) Humongous regions: 298->254

It was seen that there was a drop in the number of humongous regions, now in the range of 250-350. This was expected with the increase in region size. The table below details the causes of GC when the region size was 16MB.

g1gc causes

GC causes when region size was 16MB

There was a slight improvement in the time spent on GC because of humongous objects when we increased the region size from 8MB to 16MB. We did not increase the region size further, as that would reduce the number of regions and could negatively impact the GC.
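The humongous-region counts quoted above can be pulled straight out of the GC logs with a small script; a rough sketch of that extraction, assuming log lines shaped like the ones shown earlier:

```python
import re

# Sketch: count how many humongous regions each G1 collection frees, by
# scanning unified-GC-log lines like:
#   [2018-08-25T06:55:20.824+0000][125881][gc,heap ] GC(67) Humongous regions: 743->596
PATTERN = re.compile(r"Humongous regions: (\d+)->(\d+)")

def humongous_deltas(log_lines):
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            before, after = int(match.group(1)), int(match.group(2))
            yield before, after, before - after

log = ["[2018-08-25T06:55:20.824+0000][125881][gc,heap ] GC(67) Humongous regions: 743->596"]
for before, after, freed in humongous_deltas(log):
    print(f"{before} -> {after} humongous regions ({freed} freed)")
```

Tracking these deltas over time is a cheap way to see whether a region-size change actually reduces humongous allocation pressure.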

Conclusion

The G1GC definitely provides an improvement in terms of GC pauses and overall cluster performance in our use case. The issues related to G1GC on Lucene seem to have been resolved, or at least they did not come up in our scenario. CMS still performs very well with smaller heaps. The decision to use G1GC, especially in Elasticsearch, should be based on the application's needs in terms of throughput and the cluster configuration, and G1GC should be configured accordingly. One thing to remember is that G1GC does not eliminate stop-the-world pauses; it just tries to keep them short. It is helpful to use tools to analyse your GC logs and figure out how the JVM is handling memory.