Application Development

Apache Cassandra, developed by Avinash Lakshman and Prashant Malik to solve the Inbox-search problem at Facebook, was published as free software under the Apache License 2.0 in July 2008. As a scalable datastore with no single point of failure, Cassandra is well suited to applications that need high availability. It supports multi-datacenter replication and offers massive, linear scalability, so nodes can be added to any Cassandra cluster in any datacenter. According to the project website, the largest known Cassandra setup involves over 300TB of data on over 400 machines.

After ten years of development, driven in part by contributions from IBM, Twitter and Rackspace, Cassandra is now used by Netflix, eBay, Twitter, Reddit and many others, and is one of the most popular NoSQL databases in use today. To find out more about the impact Cassandra has had on the development community, we speak to former Apache Cassandra project chair Jonathan Ellis, currently SVP and CTO at DataStax; Aaron Morton, CEO at Cassandra consultancy The Last Pickle; and open source consultant Carlos Rolo.

How has Cassandra impacted the community?

Jonathan Ellis, SVP and CTO, DataStax, first came into contact with Cassandra at the end of 2008, when he was hired by Rackspace to build a next-generation, scalable database. He explains that Cassandra was one of a number of options at the time that offered ‘NoSQL’, but argues that SQL itself wasn’t the problem: “SQL is a quite reasonable language for getting data in and out of a server.”

The introduction of Cassandra Query Language (CQL) with Cassandra 1.1 in 2012 was one of the most important steps for the community, according to Ellis, because it meant developers had an API portable across languages and suitable for a REPL. “We were the first to introduce this,” Ellis explains, “with almost universal adoption of a similar approach by other NoSQL databases.” The only notable holdout today is Amazon’s DynamoDB, and Ellis doesn’t believe that will last – “I predict that it won’t be long before they follow suit as well.”

For Ellis, the biggest contribution Cassandra has made is that app developers, whether Cassandra users or not, have realized that you don’t need ACID for most common tasks. “Cassandra defaults to eventually consistent operations (where ‘eventually’ is typically single-digit or even sub-millisecond latencies), and allows users to opt in to lightweight transactions when linearizable consistency is called for.”
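To make the trade-off Ellis describes more concrete, here is a minimal Python sketch of the replica arithmetic commonly used to reason about tunable consistency: a read is guaranteed to touch at least one replica that acknowledged the latest write when the number of replicas read plus the number written exceeds the replication factor. The function names are hypothetical and chosen for illustration; this does not call Cassandra, and it does not model lightweight transactions, which are a separate mechanism for linearizable operations.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of the replicas (hypothetical helper)."""
    return replication_factor // 2 + 1


def read_overlaps_write(read_replicas: int, write_replicas: int,
                        replication_factor: int) -> bool:
    """True when every read set must share at least one replica with
    every write set, i.e. R + W > RF."""
    return read_replicas + write_replicas > replication_factor


rf = 3  # assumed replication factor for the example

# Reading and writing a single replica each: fast, but only eventually consistent.
one_one = read_overlaps_write(1, 1, rf)

# Reading and writing a majority each: the read and write sets always overlap.
quorum_quorum = read_overlaps_write(quorum(rf), quorum(rf), rf)

print(one_one, quorum_quorum)
```

With a replication factor of 3, single-replica reads and writes do not overlap, while majority reads and writes do. The cost of the stronger setting is that each operation waits on more replicas.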

Aaron Morton, who was working at Weta Digital when he first came across Cassandra in 2009, says, “[Cassandra] was the first time I felt I could contribute to the code of a database and get involved in an early stage project.” Two years later Morton left Weta to found his own Cassandra consulting firm, The Last Pickle. Of Cassandra, Morton says, “DBAs are now concerned with the speed of light when storing data around the globe rather than the speed of disks. That's a big change and in large part is due to the success of Apache Cassandra.”

Open source consultant Carlos Rolo may have joined the party a little later, first coming into contact with Cassandra in early 2011, but for Rolo the impact of Cassandra is simple: it brought distributed databases to everyone.

How does Cassandra compare to others in the NoSQL space?

Ellis believes Cassandra is uniquely suited for a hybrid world, and indeed, is “the best option if you are building a cloud application that needs real-time responses, always-on reliability, and scalable performance.” However, he notes that this comes with some sacrifices in terms of ease of use: “like other cloud databases, Cassandra emphasizes denormalization over query-time joins, which is a hard concept for RDBMS developers to wrap their minds around.” Ultimately, it depends on how many users you’re building for: if it’s just a few hundred or a few thousand, you likely don’t need Cassandra.
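The denormalization Ellis mentions can be illustrated with a small in-memory Python sketch. It is an analogy only, not Cassandra code: the dictionaries stand in for tables, and the names (users, posts_by_author, add_post, titles_by) are hypothetical. Instead of joining posts to users when a query runs, the author name is copied into each row at write time, and rows are grouped by the key the query will ask for.

```python
from collections import defaultdict

# Stand-in for a users table: user_id -> name.
users = {1: "alice", 2: "bob"}

# Stand-in for a query-specific table, grouped by author name.
posts_by_author = defaultdict(list)


def add_post(user_id: int, title: str) -> None:
    """Write path: resolve the author once and store it with the post."""
    author = users[user_id]
    posts_by_author[author].append({"author": author, "title": title})


def titles_by(author: str) -> list:
    """Read path: a single lookup by key, with no join at query time."""
    return [row["title"] for row in posts_by_author.get(author, [])]


add_post(1, "Hello")
add_post(2, "Hi")
add_post(1, "Again")

print(titles_by("alice"))
```

The design choice is to do more work, and store duplicate data, on the write path so that each read is one lookup. The trade-off is that a change to a user's name has to be applied to every copied row.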

For Rolo, the ease of scaling and of distributing data geographically are major benefits that are often overlooked. Cassandra is also proven at large scale, which other NoSQL databases have yet to demonstrate, and the tooling ecosystem it previously lacked in comparison to other databases is starting to appear. But ease of operation is a problem, says Rolo: “Cassandra deployments tend to get painful to manage if something is set wrong.”

Morton considers one of the major benefits to be the CQL API, which he describes as “stable, well documented, and easy for new users to pick up”. Further, Cassandra can run in almost any environment and provides great observability, and its stability means losing a server is rarely a problem. He describes Cassandra as “battle proven technology”, explaining that there’s a great deal of institutional knowledge in the Apache project. “Not every idea works out, and we've been through a few features that did not set the world on fire. Knowing what not to do is as important as knowing what to change.”

What’s been the biggest challenge with Cassandra?

All three of the experts we spoke to agreed that one of the biggest challenges was changing the way people think about data. Ellis clarifies: “Most developers are exposed to the relational model and third normal form in college. That’s still true today, ten years into the age of NoSQL. Once people get it, it’s like the light turns on, but it can be a challenge to get to that point, because in a distributed world, the rules aren’t just different, they’re upside down.”

For Rolo, over 80% of the challenges he faces with Cassandra result from companies/people trying to port their relational models over.

On the technical front, Morton explains, “the whole system has been re-written since version 1.0 which was basically the initial Facebook design. Starting with the networking protocol, moving to the creation of the CQL API, and finally re-writing the storage engine itself.” He credits the contributors who worked on these changes with setting the stage for Cassandra’s ongoing success.

Both Ellis and Rolo are concerned about a Cassandra skills shortage. Ellis believes the problem has worsened, citing job search engine Indeed as showing 5000+ jobs looking for Cassandra experience today versus around 800 in 2014. However, statistics for permanent job vacancies with a requirement for Apache Cassandra skills from IT Jobs Watch show a year-on-year decrease in the number of permanent jobs citing Apache Cassandra. Rolo believes the difference is between the development side and administration side: “a lot of development teams have already Cassandra skills. On the administration side I still think that is an issue, but I might be biased on this one!”

The next 10 years

Looking ahead to the next ten years, Ellis believes an up-and-coming area in the data space is Graph. “Of course, we’ve had graph databases for years, but they’ve never quite found a killer app, and I think a lot of that is due to early graph databases being limited in scale in a lot of the same ways relational databases were. My prediction is, you’re going to see improvements in both the fundamental graph technologies and in integration of graph with other data models like Cassandra’s tabular or document models.”

Morton thinks simplicity is key when looking at the future of Cassandra, explaining that reducing barriers to entry for not just Cassandra, but other distributed databases as well, makes it easier for new people to get involved, bringing new ideas that will help push the technology further. Ultimately, Morton believes, “Almost all databases will be distributed databases, just like almost all mobile phones are smart phones, and Cassandra will continue to be a large part of the larger ecosystem.”

For Rolo, the increase in Cassandra skills, together with improvements to Cassandra itself, means it’s not going anywhere. His plan? “Keep rocking with Cassandra and the ecosystem surrounding it!”