Remus Project: Full Memory Mirroring!
Imagine that you have a cluster with two machines side by side in an active/standby configuration. Let’s say you have your data replicated, and the systems are basically identical except for the IP address and hostname. You can use Heartbeat to share an IP address such that if the primary fails, the secondary takes over. You can also perform the equivalent using “live migration” features in a Xen or VMware hypervisor. The problem with these sorts of fail-overs is that any active TCP/IP sessions end up getting broken, and new connections must be established between clients and the application.
Okay, here’s something that fixes that problem: the Remus Project. The approach is brilliant. At regular intervals it ships the memory pages that have changed from one host to the other. Memory reads do not need to be replicated, only writes, and repeated writes to the same location don’t all need to be replicated, only the most recent one. The primary node simply holds back its outgoing TCP/IP packets (output buffering) until after it has confirmed that the standby node has received the replicated memory data. Very very clever.
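To make the idea concrete, here is a toy sketch of the checkpoint cycle described above. It is not Remus code: the `Primary` and `Backup` classes, their method names, and the dictionary standing in for memory are all made up for illustration. It only shows the three ideas: track writes rather than reads, keep only the latest write per location, and hold outgoing packets until the backup acknowledges the checkpoint.

```python
class Backup:
    """Hypothetical standby node that stores replicated memory."""

    def __init__(self):
        self.memory = {}

    def receive(self, pages):
        # Apply the shipped pages and acknowledge receipt.
        self.memory.update(pages)
        return True


class Primary:
    """Hypothetical active node with dirty tracking and output buffering."""

    def __init__(self):
        self.memory = {}
        self.dirty = {}   # page -> latest contents since the last checkpoint
        self.outbox = []  # outgoing packets held back until the next ack

    def read(self, page):
        # Reads never mark anything dirty, so they are never replicated.
        return self.memory.get(page)

    def write(self, page, data):
        self.memory[page] = data
        # A later write to the same page replaces the earlier entry,
        # so only the most recent contents are shipped.
        self.dirty[page] = data

    def send(self, packet):
        # Output buffering: nothing leaves the host yet.
        self.outbox.append(packet)

    def checkpoint(self, backup):
        # Ship the dirty pages; release buffered output only after the ack.
        if not backup.receive(dict(self.dirty)):
            return []
        self.dirty.clear()
        released, self.outbox = self.outbox, []
        return released


primary, backup = Primary(), Backup()
primary.write(1, "a")
primary.write(1, "b")   # overwrites the pending copy of page 1
primary.write(2, "c")
primary.send("reply-to-client")
released = primary.checkpoint(backup)
```

After the checkpoint, `backup.memory` holds only the final contents of each page, and the client reply is released only once the backup has them. That ordering is why a fail-over never exposes state the backup does not already have.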
Here are the key features listed on the Remus web site:
- The backup VM is an exact copy of the primary VM. When failure happens, it continues running on the backup host as if failure had never occurred.
- The backup is completely up-to-date. Even active TCP sessions are maintained without interruption.
- Protection is transparent. Existing guests can be protected without modifying them in any way.
Okay, I’ve been running HA systems in multiple geographies now for about a decade. I’ve experimented with lots and lots of clustering and replication technology. Most of the time when I hear about something new, I cringe and wonder if it’s just another thing that’s using the same old tricks I’ve been using for years, or if it’s something truly innovative and truly open source. Before you go making comments that VMware has this feature or that feature, relax. This post is not about VMware. It’s about open source Xen.
Now, you might already be wondering if this would work if you separated the two nodes to run in separate locations. The short answer is maybe. You would still need a very clever network configuration to re-route your traffic dynamically to the new location. For those of us that do operate our own Autonomous Systems, that may seem possible with a BGP route update. But here’s the bummer… The additional latency it would introduce would bring your performance to a screeching halt. You could probably afford to have about 25ms of average latency between two locations and get away with it. The cut-over would still be better than nothing, but you’d better have a rock solid network in there, and you’d better be ready to pump lots of bandwidth over it. Plan for 100Mb/sec if you checkpoint every 100ms.
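For a rough feel of what those numbers mean, here is a back-of-the-envelope calculation. It assumes the 100Mb/sec figure is the capacity of the replication link and that memory is shipped in 4 KiB pages; both are my assumptions for illustration, and the function name is made up.

```python
def checkpoint_budget(link_mbit_per_s, interval_ms, page_bytes=4096):
    """Return (bytes, whole pages) that fit through the link in one interval.

    Assumes the link is used only for replication and ignores protocol
    overhead, so treat the result as an upper bound.
    """
    bits = link_mbit_per_s * 1_000_000 * interval_ms / 1000
    byte_budget = bits / 8
    return byte_budget, int(byte_budget // page_bytes)


budget_bytes, budget_pages = checkpoint_budget(100, 100)
print(budget_bytes, budget_pages)
```

Under those assumptions, a 100Mb/sec link with a 100ms checkpoint interval leaves room for about 1.25 MB, or roughly 305 pages, per checkpoint. A guest that dirties memory faster than that will fall behind.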
This would be great for a read-heavy application like a web cache, or some memcached applications. People ask on the memcached mailing list all the time how they can set up replication and HA. The answer is always “it’s a cache… not a database.” Well, for those of you that want to do HA for a memcached system, give Remus a try.
Let’s not stop there. Imagine you have a SIP call control platform or Trixbox system, and you don’t want to lose all your active calls in the event of a system crash. Pretty much any mission-critical application that supports long-running connections over TCP/IP could benefit.
Remus has been around for some time, so why am I so excited now? It’s now part of Xen! You don’t need to do anything special on the master or slave node to use it! Whoot! Now I’m impressed. Anyone out there have experience running it? I’d love to hear your thoughts.
When I saw the press release about Remus I immediately thought back to the mainframe-style clustering techniques employed by Tandem & IBM in decades past. Most of these concepts were closely tied to customized hardware designed to fail without impacting a running program. It’s wonderful to think that we may soon see something close to this level of availability on clusters of commodity hardware utilizing open-source Xen & Remus.
Perhaps what’s really needed now is software designed to run in tiny amounts of memory!
Yes! Seems to me that networks are reasonably fast, but we have a speed of light and switching latency problem. Transmitting signals over fiber optic networks still causes a lot of delays, so spreading these sorts of HA systems far apart is not going to work well until we get around that constraint. Latency and throughput are inversely proportional. I have a whole blog post on that coming soon.
It’s a shame that memory has become so cheap recently… leading server software developers to de-emphasize efforts to keep memory utilization low in server applications. If we started that up again, clustering this way would totally rock.