While running our Apache Cassandra (version 2.1.11) with the default configuration of 8GB sized heap (it’s calculated automatically by the cassandra-env.sh script at startup) I’ve encountered massive GC overhead and great latency peaks during queries (both Reads and Writes), when I looked at the metrics I saw that the heap of all of the nodes was filling up really quickly because of the massive wrtie throughput (about several million writes simultaneously),
- In the “cassandra-env.sh” file add the follwing lines:
- You’ll need to comment out all of the CMS specific flags that will probably be there by default. search for these and be careful for others:
- -XX:+UseParNewGC
- -XX:+UseConcMarkSweepGC
- -XX:+CMSParallelRemarkEnabled
- -XX:SurvivorRatio=8
- -XX:MaxTenuringThreshold=1
- -XX:CMSInitiatingOccupancyFraction=75
- -XX:+UseCMSInitiatingOccupancyOnly
- -XX:CMSWaitDuration=10000
- -XX:+CMSClassUnloadingEnabled
just restart the node and you’ll be up and running!
What Is Garbage Collection?
I will focus mainly on the JVM implementations after Java 7 update that has the G1 algorithm.
- Default GC (Serial Mark & Sweep, Serial NewGC)
- CMS - Concurrent Mark and Sweep.
- G1 - Garbage First (Oracle JDK 7 update 4 and later releases)
- Default GC (Serial Mark & Sweep, Serial NewGC)
Shows highest throughput as long as no Full GC is triggered. If your system has to run for a limited amount of time (say 12 hours) and you are willing to invest into a very careful programming style regarding allocation; keep large datasets Off-Heap, DefaultGC can be the best choice. Of course there are applications which are ok with some long GC pauses here and there. - CMS does best in pause-free low latency operation as long you are willing to throw memory at it. Throughput is pretty good. Unfortunately it does not compact the heap, so fragmentation can be an issue over time. CMS is still way behind commercial low-latency solutions such as Azul's Zing VM.
- G1 excels in robustness, memory efficiency with acceptable throughput. While CMS and DefaultGC react to OldSpace overflow with Full-Stop-GC of several seconds up to minutes (depends on Heap size and Object graph complexity), G1 is more robust in handling those situations. Taking into account the benchmark represents a worst case scenario in allocation rate and programming style, the results are encouraging.
- Initial Mark (Stop the World Event) - Objects in old generation are “marked” as reachable including those who may be reachable from the young generation. Pause times are typically short in duration relative to minor collection pause times.
- Concurrent Marking - Traverse the tenured generation object graph for reachable objects concurrently while Java application threads are executing. Starts scanning from marked objects and transitively marks all objects reachable from the roots. The mutators are executing during the concurrent phases 2, 3, and 5 and any objects allocated in the CMS generation during these phases (including promoted objects) are immediately marked as live.
- Remark (Stop the World Event) - Finds objects that were missed by the concurrent mark phase due to updates by Java application threads to objects after the concurrent collector had finished tracing that object.
- Concurrent Sweep - Collects the objects identified as unreachable during marking phases. The collection of a dead object adds the space for the object to a free list for later allocation. Coalescing of dead objects may occur at this point. Note that live objects are not moved.
- Resetting - Prepare for next concurrent collection by clearing data structures.
You can see the illustration of the process in the next diagrams:
- Local variables
- Active threads
- Static fields
- JNI references
- Others (will be discussed later)
- Marking - is walking through all reachable objects and keeping a ledger in native memory about all such objects
- Sweeping - is making sure the memory addresses occupied by non-reachable objects can be reused by the next allocations.
The not-so-good news is that the application threads need to be stopped for the collection to happen as you cannot really count references if they keep changing all the time. Such a situation when the application is temporarily stopped so that the JVM can indulge in housekeeping activities is called a Stop The World pause. They may happen for many reasons, but garbage collection is by far the most popular one.
- The heap is a single memory space split into regions.
- Young generation memory is composed of a set of non-contiguous regions. This makes it easy to resize when needed.
- Young generation garbage collections, or young GCs, are stop the world events. All application threads are stopped for the operation.
- The young GC is done in parallel using multiple threads.
- Live objects are copied to new survivor or old generation regions.
- Initial Mark (Stop the World Event) - This is a stop the world event. With G1, it is piggybacked on a normal young GC. Mark survivor regions (root regions) which may have references to objects in old generation.
- Root Region Scanning - Scan survivor regions for references into the old generation. This happens while the application continues to run. The phase must be completed before a young GC can occur.
- Concurrent Marking - Find live objects over the entire heap. This happens while the application is running. This phase can be interrupted by young generation garbage collections.
- Remark (Stop the World Event) - Completes the marking of live object in the heap. Uses an algorithm called snapshot-at-the-beginning (SATB) which is much faster than what was used in the CMS collector.
- Cleanup (Stop the World Event and Concurrent) - Performs accounting on live objects and completely free regions. (Stop the world) Scrubs the Remembered Sets. (Stop the world) Reset the empty regions and return them to the free list. (Concurrent)
- Copying (Stop the World Event) - These are the stop the world pauses to evacuate or copy live objects to new unused regions. This can be done with young generation regions which are logged as [GC pause (young)]. Or both young and old generation regions which are logged as [GC Pause (mixed)].
- Initial Marking Phase - Initial marking of live object is piggybacked on a young generation garbage collection. In the logs this is noted as GC pause (young)(inital-mark).
- Concurrent Marking Phase - If empty regions are found (as denoted by the "X"), they are removed immediately in the Remark phase. Also, "accounting" information that determines liveness is calculated.
- Remark Phase - Empty regions are removed and reclaimed. Region liveness is now calculated for all regions.
- Copying/Cleanup Phase - G1 selects the regions with the lowest "liveness", those regions which can be collected the fastest. Then those regions are collected at the same time as a young GC. This is denoted in the logs as [GC pause (mixed)]. So both young and old generations are collected at the same time.
- After Copying/Cleanup Phase - The regions selected have been collected and compacted into the dark blue region and the dark green region shown in the diagram.
- Concurrent Marking Phase
- Liveness information is calculated concurrently while the application is running.
- This liveness information identifies which regions will be best to reclaim during an evacuation pause.
- There is no sweeping phase like in CMS.
- Uses the Snapshot-at-the-Beginning (SATB) algorithm which is much faster then what was used with CMS.
- Completely empty regions are reclaimed.
- Young generation and old generation are reclaimed at the same time.
- Old generation regions are selected based on their liveness.
- “Java Garbage Collection handbook” by Plumbr.
- “Garbage First Garbage Collector Tuning” by Monica Beckwith.
- “G1GC v.s. CMS for Cassandra 2.0” by Al Tobey (My new Tuning guru!)
- Do not Set Young Generation Size
- Explicitly setting young generation size via -Xmn meddles with the default behavior of the G1 collector.
- G1 will no longer respect the pause time target for collections. So in essence, setting the young generation size disables the pause time goal.
- G1 is no longer able to expand and contract the young generation space as needed. Since the size is fixed, no changes can be made to the size.
Instead of using average response time (ART) as a metric to set the XX:MaxGCPauseMillis=<N>, consider setting value that will meet the goal 90% of the time or more. This means 90% of users making a request will not experience a response time higher than the goal. Remember, the pause time is a goal and is not guaranteed to always be met.
A promotion failure that happens when a JVM runs out of heap regions during the GC for either survivors and promoted objects. The heap can't expand because it is already at max. This is indicated in the GC logs when using -XX:+PrintGCDetails by to-space overflow. This is expensive!
- GC still has to continue so space has to be freed up.
- Unsuccessfully copied objects have to be tenured in place.
- Any updates to RSets of regions in the CSet have to be regenerated.
- All of these steps are expensive.
- Increase heap size
- Increase the -XX:G1ReservePercent=n, the default is 10.
- G1 creates a false ceiling by trying to leave the reserve memory free in case more 'to-space' is desired.
- -verbosegc (which is equivalent to -XX:+PrintGC) sets the detail level of the log to fine.
Sample Output
[GC pause (G1 Evacuation Pause) (mixed) 66M->21M(236M), 0.1625268 secs]
- -XX:+PrintGCDetails sets the detail level to finer. The options shows the following information:
- Average, Min, and Max time are displayed for each phase.
- Root Scan, RSet Updating (with processed buffers information), RSet Scan, Object Copy, Termination (with number of attempts).
- Also shows “other” time such as time spent choosing CSet, reference processing, reference enqueuing and freeing CSet.
- Shows the Eden, Survivors and Total Heap occupancies.
[Eden: 818M(818M)->0B(714M) Survivors: 0B->104M Heap: 836M(4096M)->409M(4096M)]
- -XX:+UnlockExperimentalVMOptions -XX:G1LogLevel=finest sets the detail level to its finest. Like finer but includes individual worker thread information.
[Ext Root Scanning (ms): 2.1 2.4 2.0 0.0
Avg: 1.6 Min: 0.0 Max: 2.4 Diff: 2.3]
[Update RS (ms): 0.4 0.2 0.4 0.0
Avg: 0.2 Min: 0.0 Max: 0.4 Diff: 0.4]
[Processed Buffers : 5 1 10 0
Sum: 16, Avg: 4, Min: 0, Max: 10, Diff: 10]
- -XX:+PrintGCTimeStamps - Shows the elapsed time since the JVM started.
Sample Output
1.729: [GC pause (young) 46M->35M(1332M), 0.0310029 secs] - -XX:+PrintGCDateStamps - Adds a time of day prefix to each entry.
2012-05-02T11:16:32.057+0200: [GC pause (young) 46M->35M(1332M), 0.0317225 secs]