Thursday, May 21, 2009

Terracotta's Distributed Garbage Collector

The purpose of this blog is to explain the problem Terracotta's Distributed Garbage Collector (DGC) solves. I'll then go through some steps which can help tune DGC so that it keeps up with the application's garbage creation rate.

Let's first understand what garbage is in Terracotta's world. The architecture Terracotta follows has an array of Terracotta servers, called L2s, and a number of application processes (user apps, called L1s) which share data with each other and are connected to the L2s. If any L1 changes anything in a shared object, that delta change is propagated to all the other L1s through the array of L2s. Whenever a new object is created in the shared object graph, it is created on one of the L2s, and if any other L1 needs it, it will fault that object in from that particular L2. So as your application proceeds, it creates a lot of objects in the shared object graph, spread across all the L2s.
Now any L1 can set its reference to any of those objects to null, leading to a shared object graph in which a number of objects are no longer reachable from the roots, thus creating garbage. The JVM GC will collect those objects in the L1s, but it is Terracotta's job to collect them from the L2s.

So, to cut a long story short, we have an object graph spread across a number of servers, with a number of those objects being garbage. Below is a pictorial view where each rectangle represents one L2.

As you can see here, objects 6, 7, 9 and 11 are garbage, and Terracotta's DGC has to identify and delete them from each L2.

Terracotta's DGC is a concurrent mark-and-sweep garbage collector. The way DGC works can be explained in the steps below:

1. The server array has an active coordinator which coordinates among all L2s for the whole DGC cycle.
2. The coordinator sends an init message to each L2 to start the DGC cycle.
3. Each L2 starts the DGC cycle by assuming every object it has is a garbage candidate.
4. The coordinator then sends a collect message which tells each L2 to remove the objects reachable from the roots from its garbage candidates.
5. For each object reachable from any of the roots, the L2 removes it from the GC candidates, fetches the objects it references, and then executes one of the following steps to remove all objects reachable from it:
  • If the referenced object resides on the same server, it executes step 5 for that object.
  • If the referenced object resides on another server, the current server sends a message with the referenced object id to that server, which then executes step 5 for that object.
6. The next step is rescue, in which objects that became reachable from a root during the mark phase are rescued from the GC candidates. This step pauses the application so that no references are changed or created.
7. After the mark stage is complete, objects on disk are swept by the Berkeley DB cleaner. In Active-Passive mode, the active notifies the passive of the resulting garbage so that the passive deletes those objects as well.
8. One of the challenges of this implementation was synchronizing each step. As the object graph is distributed and no server knows the whole graph, determining when the traversal is over is complex. The way it's done is as follows:
  • A ticker token message is passed around in a circle, starting from the coordinator, to each active server.
  • Upon receiving the ticker token, each server marks it clean if it has nothing left to process for the traversal, and dirty if it is still processing.
  • When the ticker comes back to the coordinator clean, it means that no server has anything left to process, and hence the step is considered complete.
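The ticker-token idea can be sketched in a few lines. The following is a minimal, single-process simulation of the protocol, not Terracotta's actual code; all class and method names here are illustrative, and the real system passes the token over the network while servers traverse concurrently.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical simulation of the ticker-token completion protocol:
// the token circulates around the ring of servers, and the traversal
// step is declared complete only when a full round comes back clean.
public class TickerTokenDemo {

    // Each "server" holds a queue of object ids still to be traversed.
    static class Server {
        final Queue<Integer> pendingTraversal = new ArrayDeque<>();
        boolean isDirty() { return !pendingTraversal.isEmpty(); }
        void doSomeWork() {
            if (!pendingTraversal.isEmpty()) pendingTraversal.poll();
        }
    }

    // Pass the token around the ring until a round comes back clean.
    // Returns the number of rounds the token made.
    static int runUntilComplete(List<Server> ring) {
        int rounds = 0;
        while (true) {
            rounds++;
            boolean clean = true;               // token starts clean at the coordinator
            for (Server s : ring) {
                s.doSomeWork();                 // servers keep processing between hops
                if (s.isDirty()) clean = false; // a busy server dirties the token
            }
            if (clean) return rounds;           // clean round => step is complete
        }
    }

    public static void main(String[] args) {
        Server a = new Server(), b = new Server(), c = new Server();
        a.pendingTraversal.addAll(List.of(1, 2, 3));
        c.pendingTraversal.add(9);
        int rounds = runUntilComplete(List.of(a, b, c));
        System.out.println("Traversal declared complete after " + rounds + " rounds");
    }
}
```

The key property this captures is that a single clean hop is not enough: completion is only declared when the token has seen every server idle within one full round.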
The distribution of the object graph makes the implementation complex for failure cases. For a cycle to complete, each active server group has to be available for the duration of the DGC cycle. If any one of them gets disconnected, that particular cycle has to be canceled. Following are the conditions to start and complete a DGC cycle:
  1. A DGC cycle will not start if another cycle is already running and is not yet in the delete stage.
  2. A DGC cycle will not start (DGC will be in a disabled state) if any passive of any group is syncing with its active.
  3. A DGC cycle will get canceled if any group becomes unavailable during the cycle and the cycle is not yet in the delete stage.
  4. If a failover happens after the mark phase has completed, the cycle is not canceled, and all the groups which are up will proceed with the delete stage and delete their garbage.
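These start/cancel conditions boil down to a small guard around the cycle state. Below is a hypothetical sketch of that logic; the names are illustrative and not Terracotta's actual API.

```java
// Hypothetical encoding of the DGC start/cancel conditions listed above.
public class DGCStartGuard {
    enum CycleState { NOT_RUNNING, MARKING, DELETING }

    static boolean canStartCycle(CycleState current, boolean anyPassiveSyncing) {
        // Condition 1: don't start while another cycle is running
        // and has not yet reached the delete stage.
        if (current == CycleState.MARKING) return false;
        // Condition 2: DGC is disabled while any passive in any group
        // is syncing with its active.
        if (anyPassiveSyncing) return false;
        return true;
    }

    // Conditions 3 and 4: a running cycle is canceled when a group drops
    // out, unless the mark phase has already completed (delete stage).
    static boolean shouldCancel(CycleState current, boolean groupUnavailable) {
        return groupUnavailable && current == CycleState.MARKING;
    }

    public static void main(String[] args) {
        System.out.println(canStartCycle(CycleState.NOT_RUNNING, false)); // true
        System.out.println(shouldCancel(CycleState.DELETING, true));      // false: deletes proceed
    }
}
```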
The implementation has a lot of synchronization challenges, and the algorithm has performed tremendously well. We have seen a DGC cycle complete in less than 40 seconds for around 8 million objects with 4 active server groups.

Things to tune for DGC to perform better

The most significant bottleneck for a DGC cycle is the disk. Whenever objects of the graph are swapped out of memory onto disk, both their lookup and their deletion become expensive. That can cause a DGC cycle to take a long time to complete, which in turn can make DGC fall behind the garbage creation rate, which in turn increases the likelihood of objects being on disk rather than in memory, making DGC slower and slower.
Whenever DGC is performing badly, the first thing you should look at is the "Cache Miss Rate" in the dev console under each active server's runtime statistics. If that is significantly high, then DGC will be faulting objects from disk into memory.
You can overcome this with two approaches:
1. Increase the heap size, making more objects fit in memory and reducing the chances of faulting from disk.
2. Make the DGC interval shorter, deleting garbage more frequently, which helps keep objects in memory rather than on disk.
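For the second approach, the DGC interval is set per server in tc-config.xml. Below is a sketch of the relevant fragment, assuming the usual DSO layout; double-check the element names and the interval unit (seconds) against your Terracotta version's config reference.

```xml
<server name="server1">
  <dso>
    <garbage-collection>
      <enabled>true</enabled>
      <verbose>true</verbose>
      <!-- run a DGC cycle every 30 minutes instead of the default -->
      <interval>1800</interval>
    </garbage-collection>
  </dso>
</server>
```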

If the delete stage is taking a long time, then you may also want to tune the Berkeley DB properties.
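As a starting point, the standard Berkeley DB JE knobs that affect cleaning and cache behavior look like the fragment below. These are generic JE property names, not Terracotta-specific ones; check how your Terracotta version exposes JE properties (and the current defaults) before changing them.

```properties
# je.properties - hypothetical tuning sketch
# Fraction of the JVM heap JE may use for its cache.
je.maxMemoryPercent=60
# Minimum log utilization the cleaner maintains; higher values
# make the cleaner reclaim deleted records more aggressively.
je.cleaner.minUtilization=50
# More cleaner threads can speed up sweeping after large deletes.
je.cleaner.threads=2
```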