Designing a Distributed Cache with a Globally Aware Eviction Policy using Caffeine, Infinispan & RabbitMQ

Author: S.O.S

I’m using ScyllaDB / Cassandra as a global, distributed data store and Caffeine, Infinispan & Hazelcast as a local, in-memory cache.

I have an application running on 1,000 nodes. When a node requests data from the global, distributed data store the data is cached locally using Caffeine / Infinispan. This way the application doesn’t have to request the same data over and over again - thereby reducing the load on the distributed data store.
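With Caffeine, this read-through pattern is a `LoadingCache` whose loader hits the global store only on a miss. A minimal sketch, with the ScyllaDB/Cassandra read stubbed out by a hypothetical `loadFromGlobalStore` method (the size bound and expiry values are illustrative assumptions, not recommendations):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

public class ReadPathSketch {
    static AtomicInteger storeReads = new AtomicInteger();

    // Hypothetical stand-in for a ScyllaDB/Cassandra read.
    static String loadFromGlobalStore(String id) {
        storeReads.incrementAndGet();
        return "value-for-" + id;
    }

    static LoadingCache<String, String> cache = Caffeine.newBuilder()
            .maximumSize(100_000)                     // bound the per-node memory footprint
            .expireAfterWrite(Duration.ofMinutes(10)) // safety net against stale entries
            .build(ReadPathSketch::loadFromGlobalStore);

    public static void main(String[] args) {
        cache.get("user:42"); // miss: falls through to the global store
        cache.get("user:42"); // hit: served from local memory
        System.out.println("store reads: " + storeReads.get()); // prints "store reads: 1"
    }
}
```

The `expireAfterWrite` bound also caps how long a stale entry can survive if an invalidation message is ever lost.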

When an individual node updates a given piece of data on the global data store it is relatively easy for that node to evict the corresponding data from its cache as it can simply signal to the local cache to evict/invalidate the data.
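That local write path is a single `invalidate` call in Caffeine. A self-contained sketch, with the global store stubbed out by a plain map (hypothetical; a real write would go to ScyllaDB/Cassandra):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.ConcurrentHashMap;

public class WritePathSketch {
    // Hypothetical stand-in for the ScyllaDB/Cassandra store.
    static ConcurrentHashMap<String, String> globalStore = new ConcurrentHashMap<>();
    static Cache<String, String> localCache = Caffeine.newBuilder().maximumSize(10_000).build();

    static void update(String id, String value) {
        globalStore.put(id, value); // 1. write to the global store
        localCache.invalidate(id);  // 2. drop the now-stale local copy
    }

    public static void main(String[] args) {
        localCache.put("k", "old");
        update("k", "new");
        System.out.println(localCache.getIfPresent("k")); // prints "null" after invalidation
    }
}
```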

The problem is that any one of the 1,000 nodes can hold the same data and any node can update any piece of data at any time. If Node 539, for example, updates a specific piece of data and Node 877 holds a copy of that data in its local cache, I need Node 877 to evict the data from its local cache and retrieve the data in real time the next time it is needed. Of course, the same data can be cached on dozens of nodes and all of them would need to be made aware of the update made by Node 539 and evict the data accordingly.

What is the best way to design a system like this?

While I don’t want to reinvent the wheel, I couldn’t find any existing solution capable of achieving this, so I devised my own plan:

My plan is to use a distributed messaging system such as RabbitMQ (or perhaps Kafka) where each of the 1,000 nodes subscribes to a topic which contains the data IDs that need to be evicted. Whenever a node updates a particular piece of data, it writes the “data ID” to the "eviction" topic. Every one of the 1,000 nodes subscribes to the "eviction" topic and evicts the data linked to the data ID in real time, if it holds that data in memory.
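A pure-JDK model of that topology, to make the flow concrete: a listener list stands in for the RabbitMQ fanout exchange and its per-node queues, and a plain map stands in for each node's Caffeine cache (all names here are illustrative, not a real broker API):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class EvictionBusSketch {
    // Stand-in for the broker: every node is subscribed to the "eviction" topic.
    static List<Node> subscribers = new CopyOnWriteArrayList<>();

    // Publishing a data ID fans it out to every node, like a fanout exchange.
    static void publishEviction(String dataId) {
        for (Node n : subscribers) n.onEviction(dataId);
    }

    static class Node {
        final String name;
        final Map<String, String> cache = new ConcurrentHashMap<>(); // stand-in for Caffeine
        Node(String name) { this.name = name; subscribers.add(this); }

        void onEviction(String dataId) {
            // Evict only if this node actually holds the entry.
            if (cache.remove(dataId) != null)
                System.out.println(name + " evicted " + dataId);
        }

        void update(String dataId, String value) {
            // 1. Write to the global store (omitted here). 2. Broadcast the ID.
            publishEviction(dataId);
        }
    }

    public static void main(String[] args) {
        Node n539 = new Node("node-539");
        Node n877 = new Node("node-877");
        n877.cache.put("order:17", "stale");
        n539.update("order:17", "fresh");                       // node-877 evicts its copy
        System.out.println(n877.cache.containsKey("order:17")); // prints "false"
    }
}
```

This also makes the cost model visible: every update is delivered to all N nodes, and each node does a local lookup that, most of the time, finds nothing to evict.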

However, I have several concerns with this design.

First, it seems extremely inefficient. Every time any one of the 1,000 nodes updates a piece of data, the “data ID” will have to be propagated to all 1,000 nodes, since we don’t know which (if any) nodes hold a specific piece of data. Additionally, it is likely that most of the data being updated is not cached on any node, making the broadcasts even less efficient. It might even be cheaper just to read all the data from the global store in real time, every time.

Is there a more elegant/efficient design to achieve the desired goal?

I’m using Java.