I have always admired Google :). The company delivers kick-ass technology, but it also delivers a kick-ass experience. Take Google Docs. Instant changes, no “Oh crap, I forgot to Ctrl-S, I am screwed.” I’m typing this in a Google doc. While it’s not the only reason (or even the main reason), I do believe that an important reason for Google’s quality is that Googlers use their own product, and they’d never let an embarrassment go out.
Contrast this with enterprise software. I have built database technologies for decades—yet, for a long time, I never used them (I mean really used them, as opposed to testing or demoing them). This is true for tons of enterprise software technologies: the developers who build them and the developers who use them are not one and the same.
What about open source software like Cassandra? Isn’t it built out of need, and aren’t the developers who build it the developers who use it? Only partially. Apps use databases, but database developers build databases, whether it’s DB2 or Cassandra. The result: neat architecture, pretty bad user experience.
Technologies that work well are technologies that are built and used by the same developers. No ifs, ands, or buts about it.
That brings us to what I am doing now. I am immersed in microservices and API management. There are some things our developers will never have a good feel for. SAML 2.0 for connecting to SAP R/3? How could a developer in a 300-person startup have that intuition?
We have to call upon general product management, security, and enterprise software knowledge to build something that we hope is functional and usable, and then have a good feedback loop. The developer building the technology and the developer using the technology are different.
But microservices and APIs are different. We produce software for other folks to deploy APIs on. How do we do it? With small teams that aspire to the right set of microservices principles. With clean APIs. Each service needs to scale up and down.
The deployment of these services can have errors, so we have to implement the right blue-green deployment infrastructure. Teams are independent, so they need clean APIs, with documentation as contracts between them. Our developers employ microservices and API best practices. The software they build is a reflection of what they need, day in and day out.
Whenever you evaluate software, ask the vendor: “Do your developers use it?” It will help you make better decisions.
Image: The Noun Project/Kevin Augustine LO
Amazon Web Services suffered another outage Sunday morning. The cause this time seems to have been the DynamoDB service, Amazon’s highly scalable managed NoSQL data store. Lots of other services suffered from “increased error rates” at the same time—apparently either DynamoDB underpins more of AWS than we realized, or it’s all based on some other layer that Amazon doesn’t talk about publicly.
We’ve seen a pattern play out in the last few AWS outages that should make us all think carefully about how we use the cloud.
The problem: what if you can’t make changes?
One version of this pattern is best exemplified by the following list of status messages from the AWS Status Dashboard for the EC2 service:

- 3:12 AM PDT: We are investigating increased error rates and latencies for the EC2 APIs in the US-EAST region.
- 7:43 AM PDT: Error rates for impacted APIs are beginning to recover and new instance launches are starting to succeed.
In other words, early Sunday morning, it was hard to launch a new EC2 instance in the US-East region, and this condition persisted for four-and-a-half hours. Past EC2 outages have also affected this aspect of the service. Applications that weren’t dependent on DynamoDB Sunday morning kept running along fine—that is, just as long as they didn’t need to start new instances, or auto-scale, or do anything else that relies on the EC2 API.
The second version of the pattern occurs when thousands of ops teams detect a problem in AWS and respond by launching new instances somewhere else, such as in another region. We have seen this in the past, when the time to launch a new instance can go from a few seconds to much longer.
For instance, imagine that an East Coast AWS outage affected more than just a few services; say a whole Availability Zone catches fire or is taken offline by an earthquake. Thousands of ops teams around the world start scrambling to launch new instances in other regions, or on other clouds. As huge as AWS is, that kind of load is going to start to play tricks on the management systems, and those instances might take a while to launch.
Do you want to be the last team in line when that happens?
The importance of geographic redundancy
So what does this mean to us as we try to design robust systems that run in the public cloud? The first lesson is the one that you can read about all over the place. That’s because Amazon did a smart and far-thinking thing here—they divided up their world into “regions” and “availability zones.”
AZs are close together, but physically isolated. As a result, an AWS region is actually a super-data-center that includes multiple redundant physical facilities. It’s arguably a lot more redundant than anything any of us would build on our own.
Regions, however, are not just physically isolated, but logically isolated. Not only do they run entirely separate sets of the AWS software, but they are connected only via the public Internet. As a result, a software problem in one region cannot affect another one.
That means that, as architects, one thing we should do is design our systems to take advantage of multiple regions. Of course, it’s not as easy as the blogosphere makes it out to be.
For instance, at Apigee we have been running our service across regions since 2012, primarily by extending our Cassandra and ZooKeeper clusters across regions. We were able to do this because our software is built around technology, like Cassandra, that inherently supports geographical replication on an unreliable network. (We also had to set up redundant VPN tunnels and do some careful planning.)
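As a rough illustration of what cross-region Cassandra looks like (the keyspace and data center names here are hypothetical, not our actual configuration), Cassandra’s NetworkTopologyStrategy lets a single keyspace keep replicas in data centers located in different regions:

```sql
-- Hypothetical keyspace replicated across two AWS regions.
-- 'us-east' and 'us-west' are data center names as defined in
-- cassandra-rackdc.properties; three replicas are kept in each.
CREATE KEYSPACE example_ks
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'us-west': 3
  };
```

With reads and writes issued at LOCAL_QUORUM consistency, each region can keep serving traffic even when the inter-region link is slow or down, and replication catches up later.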
If we had based our product on a traditional DBMS or some other type of technology, however, we probably would not have been able to achieve the SLAs that our customers demand.
But unless you were a big DynamoDB user, you didn’t even need to use multi-region availability in order to stay alive last weekend. In fact, although we run across regions, we didn’t need to use that capability to stay alive—that part of our service kept doing its normal job, which is to reduce latency.
What you do need, however, is to be a bit conservative.
The importance of systems management
When we design systems, we tend to go through all the things that can go down. “What happens if a disk fails?” “What happens if a server fails?” “What happens if we need to upgrade the database server?”
Then we get to management, and someone asks a question like, “What happens when the systems management server goes down?” Too often, the answer is, “It’s just management. Everything else will keep running if it’s down. Don’t worry about it.”
This, of course, is precisely the wrong answer.
The infinite scalability and API-driven nature of the cloud lead to all sorts of interesting possibilities. Auto scaling is easy to turn on. You never need to update the software on a server because it’s easier to launch a new server. And you certainly don’t need to have servers, already running, that you can use in case of a failure.
But every one of those possibilities assumes that you can make management changes to your cloud at all times. In reality, you are most likely to need those changes when something is going wrong in the public cloud—and, as Sunday showed, that is precisely when the management APIs may be unavailable.
The importance of conservatism
So although it’s not an exciting and ground-breaking way to use the cloud, conservatism has its virtues, at least in this case.
- Provision more capacity than you need
- Keep that capacity up and running and actively serving traffic at all times
- Spread that capacity across multiple data centers (AZs for Amazon customers)
- If you can replicate across regions, and it makes sense cost-wise, then do it
- Don’t assume that you can scale up in minutes, let alone seconds
- Don’t assume that you can replace instances in minutes, or even hours
- If your disaster recovery plan involves another cloud, expect it to be slow when you need it most
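The “provision more capacity than you need” and “spread it across data centers” rules can be sketched as simple arithmetic: size each AZ so that losing one entire AZ still leaves enough always-on capacity to serve peak traffic. The numbers below are hypothetical, not anyone’s actual sizing.

```python
import math

def instances_needed(peak_rps, rps_per_instance, az_count):
    """Size each AZ so that losing one whole AZ still leaves
    enough always-running capacity to serve peak traffic."""
    surviving_azs = az_count - 1
    per_az = math.ceil(peak_rps / rps_per_instance / surviving_azs)
    return per_az * az_count

# 9,000 RPS peak, 300 RPS per instance, spread over 3 AZs:
# each surviving pair of AZs must carry the full peak, so each AZ
# needs 15 instances -> 45 total, all running and serving traffic.
print(instances_needed(9000, 300, 3))  # -> 45
```

Note that this deliberately assumes you cannot launch replacements quickly during an incident: the headroom is paid for up front instead.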
By the way, this isn’t just an AWS thing. Do you build your own management infrastructure? Do you depend on Puppet or Chef or Ansible or SaltStack or a single Git server, or even GitHub itself? What happens when that’s down? Maybe it’s time to think about that.
Image: Kirill Ulitin/The Noun Project
We recently did some testing on Apache Usergrid, the open-source Backend as a Service, and found that it can reach 10,000 transactions per second, and can scale horizontally. This means the upcoming Usergrid 2 release is potentially the most scalable open-source BaaS available. Here's the story of how we got there.
What is Usergrid?
Apache Usergrid is a software stack that enables you to run a BaaS that can store, index, and query JSON objects. It also enables you to manage assets and provide authentication, push notifications, and a host of other features useful to developers—especially those working on mobile apps.
The project recently graduated from the Apache Incubator and is now a top-level project of the Apache Software Foundation (ASF). Usergrid also forms the foundation of Apigee’s API BaaS product, which has been in production for three years now.
What’s new in Usergrid 2?
Usergrid 1 used Cassandra for all persistence, indexing, query, and graph relationships. The index and query engine wasn’t performing well, however, and was quite complex and difficult to maintain. Usergrid 2 provides a completely new persistence, index, and query engine, with the index and query features now provided by Elasticsearch. This enabled us to delete lots of code and trade up to a great search engine. Additionally, separating key-value persistence from index/query allows us to scale each concern separately.
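A conceptual sketch of that split, using in-memory dictionaries as stand-ins for Cassandra and Elasticsearch (all names here are illustrative, not Usergrid’s actual internals):

```python
# Toy model of the Usergrid 2 split: a key-value store holds the
# authoritative JSON entities, while a separate index answers
# queries. Each side can then be scaled independently.

kv_store = {}    # stand-in for Cassandra: uuid -> entity
index = {}       # stand-in for Elasticsearch: (field, value) -> uuids

def put_entity(uid, entity):
    kv_store[uid] = entity                       # persist first
    for field, value in entity.items():          # then index each field
        index.setdefault((field, value), set()).add(uid)

def get_by_uuid(uid):
    return kv_store[uid]                         # pure KV read; no index involved

def query(field, value):
    # Queries resolve uuids in the index, then read entities from KV.
    return [kv_store[u] for u in index.get((field, value), ())]

put_entity("e1", {"type": "dog", "name": "rex"})
put_entity("e2", {"type": "dog", "name": "fido"})
print(get_by_uuid("e1")["name"])   # -> rex
print(len(query("type", "dog")))   # -> 2
```

The point of the separation is visible even in the toy: a read by UUID never touches the index, so key-value capacity and query capacity can be provisioned (and tuned) independently.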
As the architecture of Usergrid changed drastically, we needed a new baseline performance benchmark to make sure the system scaled as well as, if not better than, it did before.
Our testing framework and approach
The Usergrid team has invested a lot of time building repeatable test cases using the Gatling load-testing framework. Performance is a high priority for us and we need a way to validate performance metrics for every release candidate.
As Usergrid is open source, so are our Usergrid-specific Gatling scenarios, which you can find here: https://github.com/apache/usergrid/tree/two-dot-o-dev/stack/loadtests
Usergrid application benchmark
One of our goals was to prove that we had the ability to scale more requests per second with more hardware, so we started small and worked our way up.
As the first in our new series of Usergrid benchmarks, we wanted to start with a trivial use case to establish a solid baseline for the application. All testing scenarios use the HTTP API and test the concurrency and performance of the requests. We inserted a few million entities that we could later read from the system. The test case itself was simple: each entity has a UUID (universally unique identifier) property, and for all the entities we had inserted, we randomly read them back by their UUID.
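The scenario boils down to picking a random UUID from the preloaded set and issuing a GET against Usergrid’s REST path for that entity. A minimal sketch of the request construction (the org, app, and collection names and the base URL are hypothetical placeholders):

```python
# Build the Usergrid read-by-UUID request path:
#   /{org}/{app}/{collection}/{uuid}
# "myorg", "myapp", and "things" are made-up names for illustration.
import random
import uuid

BASE = "http://localhost:8080/myorg/myapp/things"

# Stand-in for the few million preloaded entities.
preloaded = [str(uuid.uuid4()) for _ in range(5)]

def random_read_url(uuids):
    """Pick one preloaded entity at random, as the load test does."""
    return f"{BASE}/{random.choice(uuids)}"

url = random_read_url(preloaded)
print(url.rsplit("/", 1)[1] in preloaded)  # -> True
```

In the real Gatling scenarios this URL construction runs inside thousands of concurrent virtual users, which is what exercises the concurrency of the HTTP API rather than just single-request latency.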
First, we tried scaling the Usergrid application by its configuration. We configured a higher number of connections to use for Cassandra and a higher number of threads for Tomcat to use. This actually yielded higher latencies and system resource usage for roughly the same throughput; we saw better throughput when less concurrency was allowed. This made sense, but we needed more, and immediately added more Usergrid servers to verify horizontal scalability. What would it take to get to 10,000 RPS?
We started increasing the number of concurrent clients and adding more Usergrid servers. Once we got to 10 Usergrid servers against our cluster of 6 Cassandra nodes, we noticed that our throughput increase was flattening and latencies were increasing. The Usergrid servers were fine on memory usage, and CPU usage was starting to increase slowly.
It was time to see if Cassandra was keeping up. As we scaled up the load we found Cassandra read operation latencies were also increasing. Shouldn't Cassandra handle more, though? We observed that a single Usergrid read by UUID was translating to about 10 read operations to Cassandra. Optimization #1: reduce the number of read operations from Cassandra on our most trivial use case. Given what we knew, we still decided to test up to a peak of 10,000 RPS in the current state.
The cluster was scaled horizontally (more nodes) until we needed to scale Cassandra vertically (bigger nodes) due to high CPU usage. We stopped at 10,268 requests per second with the following:
- 35 c3.xlarge Usergrid servers running Tomcat
- 9 c3.4xlarge Cassandra nodes handling roughly 100,000 operations/second
By this point, numerous opportunities for improvement had been identified in the codebase, and we had already executed on some. We fully expect to reach the same throughput with much less infrastructure in the coming weeks. In fact, we've already reached ~7,800 RPS with only 15 Usergrid servers since that benchmarking.
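Some back-of-the-envelope arithmetic on those two data points shows why we're optimistic about per-server efficiency:

```python
# Per-server throughput from the two runs described above.
before = 10268 / 35   # peak run: 35 Usergrid servers
after = 7800 / 15     # post-optimization run: 15 servers

print(round(before), round(after))  # -> 293 520
```

Roughly 293 RPS per server at the 10,000 RPS peak versus roughly 520 RPS per server after the early optimizations: close to double the throughput per machine, before the remaining identified fixes have even landed.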
As part of this testing, not only did we identify code optimizations that we can quickly make for huge performance gains, we also learned more about tuning our infrastructure to handle high concurrency. Having this baseline gives us the motivation to continually improve performance of the Usergrid application, reducing the cost of operating a BaaS platform at huge scale.
This post is just the start of our performance series. Stay tuned, as we’ll be publishing more results in the future for the following Usergrid scenarios:
- Query performance - this includes complex graph and geo-location queries
- Write performance - performance of directly writing entities as well as completing indexing
- Push notification performance - this is a combination of query and write performance
Image: Lara/The Noun Project
The following components are used by the Usergrid application, with the associated versions used for our benchmarking:
- Tomcat 7.0.62 where the Usergrid WAR file is deployed
- Cassandra 2.0.15 with Astyanax client
- Elasticsearch 1.4.4 (not utilized in these tests)
As part of benchmarking, we wanted to ensure that all configurations and deployment scenarios exactly matched how we would run a production cluster. Here are the main configurations recommended for production use of Usergrid: