Cassandra.Link

6/4/2020

How to Choose the Right NoSQL Database for Your Application? - DATAVERSITY

by John Doe

Click to learn more about author Akshay Pore.

This is the second part of an ongoing series on NoSQL Databases; the first part was NoSQL Data Architecture & Data Governance: Everything You Need to Know. In that first part, I explained the different NoSQL Database types and provided a few use cases suited to each type. But that is not sufficient when you are planning a new application and need to choose a database for your use case. Arriving at a decision is even more difficult given the variety of database vendors in the market today.

In this article, I will outline a framework for performing a fit analysis to help you choose the right NoSQL Database for your application. The fit analysis comprises four stages, which will help you narrow down the choices and arrive at a decision. But first, make sure that at least the high-level requirements, access paths and query patterns have been elicited, analyzed and finalized for your application. This is critical because NoSQL Database types are designed for specific use cases, and a NoSQL data model is designed around your application’s access paths and query patterns.

RDBMS or NoSQL?

The first stage is to determine whether you really need a NoSQL Database or whether an RDBMS already in use in your organization will suffice. Understanding ACID vs BASE properties will help you make this decision.

As you may know, an RDBMS is characterized by the ACID properties:

  • Atomic: Each task in a transaction succeeds or the entire transaction is rolled back.
  • Consistent: A transaction maintains a valid state for the database before and after its completion and cannot leave the database in an inconsistent state.
  • Isolated: A transaction not yet committed must not interfere with another transaction and must remain isolated.
  • Durable: Committed transactions persist in the database and can be recovered in case of database failure.
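To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the `transfer` helper and the balances are hypothetical and exist only for illustration. The second transfer fails halfway through, and the database rolls back the part that had already been applied.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst in one transaction; return True if committed."""
    try:
        # `with conn` commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # If this debit violates the CHECK constraint, the credit above is undone.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])

ok = transfer(conn, "alice", "bob", 30)       # succeeds and commits
failed = transfer(conn, "alice", "bob", 500)  # debit fails, whole transfer rolls back
```

After both calls, the balances reflect only the first transfer: the credit to `bob` from the failed attempt does not survive.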

While these characteristics seem like obvious requirements for most applications, enforcing them makes horizontal scaling, high availability, performance and fault tolerance harder to achieve.

The alternative to ACID is BASE, which is what NoSQL databases follow:

  • Basically Available: The system is guaranteed to be available in the event of failure.
  • Soft State: The state of the data could change without application interactions due to eventual consistency.
  • Eventual Consistency: The system will be eventually consistent after the application input. The data will be replicated to different nodes and will eventually reach a consistent state. But the consistency is not guaranteed at a transaction level.

BASE systems allow horizontal scaling, fault tolerance and high availability at the cost of consistency. So, if your application requires high availability and scalability, a NoSQL Database built on BASE properties might be suitable.
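The soft state and eventual consistency described above can be sketched with a toy model. Everything here is an assumption made for illustration: the `Replica` class, the integer version counter, and the last-writer-wins rule are not taken from any particular database, and real systems use mechanisms such as timestamps or vector clocks.

```python
class Replica:
    """One node's local copy of the data: key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        # Last-writer-wins (an assumed policy): keep the entry with the higher version.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Anti-entropy pass: exchange entries so both replicas hold the newest versions."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], version=1)  # the application writes to one replica
stale = r2.read("cart")                # the other replica has not seen the write yet
sync(r1, r2)                           # background replication catches up
fresh = r2.read("cart")                # now both replicas agree
```

The read on `r2` changes from nothing to the written value without the application touching `r2`, which is the soft state the list refers to.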

Other Factors to Consider While Choosing Between NoSQL and RDBMS

Choose NoSQL if you have or need:

  1. Semi-structured or Unstructured data / flexible schema
  2. Limited pre-defined access paths and query patterns
  3. No complex queries, stored procedures, or views
  4. High velocity transactions
  5. Large volume of data (in Terabyte range) requiring quick and cheap scalability
  6. Distributed computing and storage
  7. No Data Warehouse, Analytics or BI use cases

Choose an RDBMS if you have or need:

  1. Consistent data/ACID transactions
  2. Complex dynamic queries requiring stored procedures or views
  3. Option to migrate to another database without significant change to the existing application’s access paths or logic
  4. Data Warehouse, Analytics or BI use cases

Based on the above considerations, if your application aligns better with the BASE properties of NoSQL and the other selection factors above, we can proceed to stage 2 and narrow the NoSQL choices through the CAP theorem.
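One way to keep the stage 1 discussion honest is to tally which factors actually apply. The sketch below is a plain unweighted count. The factor names are hypothetical shorthand for the two lists above, and in practice a single hard requirement such as ACID transactions may outweigh every other factor.

```python
# Hypothetical shorthand labels for the factors listed in stage 1.
NOSQL_FACTORS = {
    "flexible_schema",
    "predefined_access_paths",
    "no_complex_queries",
    "high_velocity_transactions",
    "terabyte_scale",
    "distributed_storage",
    "no_analytics",
}
RDBMS_FACTORS = {
    "acid_transactions",
    "complex_dynamic_queries",
    "database_portability",
    "analytics_or_bi",
}


def stage_one_tally(requirements):
    """Count how many stated requirements fall on each side of the stage 1 lists."""
    reqs = set(requirements)
    unknown = reqs - NOSQL_FACTORS - RDBMS_FACTORS
    if unknown:
        raise ValueError(f"unknown requirement(s): {sorted(unknown)}")
    return {
        "nosql": len(reqs & NOSQL_FACTORS),
        "rdbms": len(reqs & RDBMS_FACTORS),
    }


tally = stage_one_tally(["flexible_schema", "terabyte_scale", "acid_transactions"])
```

A lopsided tally suggests a direction; a close one is a signal to revisit the requirements before moving on.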

Narrow the NoSQL Choices Through CAP Theorem

The CAP Theorem captures the tradeoff between ACID and BASE. It states that a distributed system can provide only two of the following three guarantees at the same time: Consistency, Availability, and Partition Tolerance. One of them will not be supported.

  • Consistency: All nodes in the cluster have consistent data and a read request returns the most recent write from any node.
  • Availability: A non-failing node must always respond to requests in a reasonable time.
  • Partition Tolerance: The system continues to operate during network or node failures.

As per the CAP theorem, we must choose CA, AP or CP characteristics for a given system. This offers a way to categorize databases and provides guidance on which database will be a good fit for your application.

  • Consistent and Available System: If your application requires high consistency and availability with no partition tolerance, a CA system is a good fit. Most traditional RDBMSs are CA systems, but we ruled them out of our fit analysis in stage 1. A Graph Database such as Neo4j is also a CA system and will be analyzed in stage 3 of the fit analysis.
  • Consistent and Partition Tolerant System: If your application requires high consistency and partition tolerance, a CP system is a good fit. CP systems are not able to guarantee availability, as the system returns an error until the partitioned state is resolved. Redis (K:V), MongoDB (Doc Store) and HBase (Col Oriented) are examples.
  • Available and Partition Tolerant System: If your application requires high availability and partition tolerance, an AP system is a good fit. AP systems are not able to guarantee consistency, as writes/updates can be made on either side of the partition. Such systems usually provide GDHA (Geographically Dispersed High Availability), where data is bi-directionally replicated across two datacenters and both are in an Active-Active configuration, i.e., the application can read from and write to either datacenter. Riak (K:V), Couchbase (Doc Store) and Cassandra (Col Oriented) are examples.

After analyzing the CAP requirements for your application, you can narrow down to a set of NoSQL databases from the selected CAP category for further consideration in stage 3.
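The difference between CP and AP behavior during a partition can be sketched with a toy three-node cluster. The `Cluster` class and its majority rule are assumptions made for illustration and do not model how any of the databases named above is implemented. In CP mode a node that cannot reach a majority rejects requests; in AP mode every node keeps answering and the two sides drift apart.

```python
class Cluster:
    """Toy cluster in which every reachable node stores the same single value."""

    def __init__(self, names, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = {name: None for name in names}
        # Each node starts out able to reach every node, itself included.
        self.reachable = {name: set(names) for name in names}

    def partition(self, *sides):
        """Split the network: nodes can only reach others on their own side."""
        for side in sides:
            for name in side:
                self.reachable[name] = set(side)

    def _check_quorum(self, via):
        majority = len(self.values) // 2 + 1
        if self.mode == "CP" and len(self.reachable[via]) < majority:
            raise RuntimeError(f"node {via}: no majority reachable, request rejected")

    def write(self, via, value):
        self._check_quorum(via)
        for name in self.reachable[via]:
            self.values[name] = value

    def read(self, via):
        self._check_quorum(via)
        return self.values[via]


cp = Cluster(["a", "b", "c"], mode="CP")
cp.partition(["a"], ["b", "c"])
cp.write("b", "v1")                  # the majority side still accepts writes
try:
    cp.write("a", "v2")              # the minority side gives up availability
    cp_rejected = False
except RuntimeError:
    cp_rejected = True

ap = Cluster(["a", "b", "c"], mode="AP")
ap.partition(["a"], ["b", "c"])
ap.write("a", "x")                   # both sides accept writes...
ap.write("b", "y")
diverged = ap.read("a") != ap.read("c")  # ...so the replicas disagree
```

Resolving the divergence after the partition heals is a separate concern that this sketch leaves out.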

Determine NoSQL Database Type

As you may have noticed in stage 2, each CAP category contains more than one NoSQL Database type (K:V/Document Store/Column Oriented/Graph). In this stage, we further analyze the application’s purpose and use case to determine which NoSQL Database type should be considered from the CAP category chosen for your application.

NoSQL Database types are designed for a specific group of use cases. I have listed some of the key use cases for each NoSQL Database type. You can use this list as a starting point for analyzing your application’s requirements.

Choose K:V Stores if you have or need:

  1. Simple schema
  2. High-velocity reads/writes with infrequent updates
  3. High performance and scalability
  4. No complex queries involving multiple keys or joins

Choose Document Stores if you have or need:

  1. Flexible schema with complex querying
  2. JSON/BSON or XML data formats
  3. Complex indexes (multikey, geospatial, full-text search etc.)
  4. High performance and balanced R:W ratio

Choose a Column-Oriented Database if you have or need:

  1. High volume of data
  2. Extreme write speeds with relatively lower-velocity reads
  3. Data extractions by columns using row keys
  4. No ad-hoc query patterns, complex indices or high level of aggregations

Choose a Graph Database if you have or need:

  1. Applications requiring traversal between data points
  2. Ability to store properties of each data point as well as the relationships between them
  3. Complex queries to determine relationships between data points
  4. Pattern detection between data points

Now you have decided the CAP category and the NoSQL type for your application. At this stage, if we perform a fit analysis based on the selected NoSQL databases shown in Fig 1, our decision matrix would look as follows:
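The figure itself is not included in this copy of the article. As a stand-in, the example databases named in stage 2 can be arranged by CAP category and database type. This is only a rearrangement of those examples, not a reconstruction of the author's matrix, and the `FIT_MATRIX` and `candidates` names are hypothetical.

```python
# (CAP category, database type) -> example databases named in stage 2.
FIT_MATRIX = {
    ("CA", "graph"): ["Neo4j"],
    ("CP", "key-value"): ["Redis"],
    ("CP", "document"): ["MongoDB"],
    ("CP", "column"): ["HBase"],
    ("AP", "key-value"): ["Riak"],
    ("AP", "document"): ["Couchbase"],
    ("AP", "column"): ["Cassandra"],
}


def candidates(cap_category, db_type):
    """Return the stage 2 examples for a CAP category and database type."""
    return FIT_MATRIX.get((cap_category.upper(), db_type.lower()), [])


shortlist = candidates("AP", "column")
no_match = candidates("CA", "document")  # no example was given for this cell
```

An empty result means the article gave no example for that combination, not that no such database exists.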

But as a last step, we also need to consider the database and technology characteristics of each NoSQL Database, along with the requirements of the application and the organization, to finalize a selection. These are detailed in stage 4.

Select NoSQL Database (Vendor)

Even after selecting a CAP category and NoSQL Database type, the fit analysis is not complete. Selection of a NoSQL Database also depends on the database technology, its configuration and available infrastructure, the proposed architecture of your application, the budget, and the skill set available at your organization.

Database considerations:

  1. Backup and recovery configurations
  2. Cluster topology: GDHA / HADR, Active-Active / Active-Passive
  3. Replication: Synchronous, Asynchronous or Quorum
  4. Read/Write concerns and Indexing strategies
  5. Concurrency control: Locks, MVCC (Multi Version Concurrency Control), Read Your Own Write (RYOW)
  6. Security, access controls and encryption at rest
  7. Available APIs and Query methods: JSON, XML, REST, Thrift, CQL, MapReduce, SPARQL, Cypher, Gremlin etc.
  8. Infrastructure: On-premise or Cloud / Dedicated or Shared
  9. Database uptime categorization (99.9% up to 99.999%)
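Two items in this list reduce to simple arithmetic, sketched below. The function names are hypothetical. The quorum check uses the standard overlap rule for quorum replication (a read is guaranteed to see the latest acknowledged write when R + W > N) and assumes the database lets you choose how many replicas acknowledge reads and writes. The downtime helper converts an uptime percentage into minutes of allowed downtime per year.

```python
def quorum_overlaps(replicas, read_acks, write_acks):
    """True if every read quorum shares at least one replica with every write quorum."""
    if not (1 <= read_acks <= replicas and 1 <= write_acks <= replicas):
        raise ValueError("acknowledgement counts must be between 1 and the replica count")
    return read_acks + write_acks > replicas


def allowed_downtime_minutes(availability_percent, days=365):
    """Minutes of downtime per `days` that still meet the availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


strong = quorum_overlaps(replicas=3, read_acks=2, write_acks=2)  # 2 + 2 > 3
weak = quorum_overlaps(replicas=3, read_acks=1, write_acks=1)    # 1 + 1 <= 3

three_nines = allowed_downtime_minutes(99.9)    # about 525.6 minutes per year
five_nines = allowed_downtime_minutes(99.999)   # about 5.3 minutes per year
```

The gap between three nines and five nines is a factor of one hundred in allowed downtime, which is why the uptime category has a large effect on topology and cost.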

Architecture/Application considerations:

  1. Application Requirements: Use cases, R:W patterns, performance expectations/SLAs, upstream/downstream systems, criticality to the business etc.
  2. Implementation Language and SDKs: C/C++, Java, Python, Node.js etc.
  3. Application Architecture: Web Application, Microservices, Mobile etc.
  4. Data Integration: Batch processing, ETL, Streaming, Message broker, ESB etc.
  5. Complementary Technologies: Spark, Storm, Kafka, ELK, Solr, Splunk etc.

Organization considerations:

  1. Budget and cost considerations
  2. Team skillset
  3. Preferred vendors / existing technology stack
  4. Motivation for NoSQL/Big Data
  5. Business / Technology leadership sponsorship & support

Once all such questions are answered, the application and data team should shortlist a couple of NoSQL Database vendors and perform a Proof of Concept to evaluate the technology and benchmark the performance in order to finalize the selection.
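For the benchmarking part of a Proof of Concept, a small latency harness is often enough to start. The sketch below is a minimal example under stated assumptions: the `benchmark` function is hypothetical, and an in-memory dict stands in for a real database client, so the numbers it produces say nothing about any actual product. In a real PoC, the lambdas would call the candidate database's driver with your application's access patterns.

```python
import statistics
import time


def benchmark(operation, iterations=1000):
    """Call operation(i) for each i and return latency statistics in milliseconds."""
    latencies = []
    for i in range(iterations):
        start = time.perf_counter()
        operation(i)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    p99_index = min(len(latencies) - 1, int(len(latencies) * 0.99))
    return {
        "count": len(latencies),
        "mean_ms": statistics.fmean(latencies),
        "p50_ms": statistics.median(latencies),
        "p99_ms": latencies[p99_index],
        "max_ms": latencies[-1],
    }


# Stand-in for a real client (an assumption): a plain in-memory dict.
store = {}
write_stats = benchmark(lambda i: store.__setitem__(f"key-{i}", i))
read_stats = benchmark(lambda i: store[f"key-{i}"])
```

Comparing tail latencies (p99 and max) across candidates under the same workload is usually more informative than comparing means.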