
Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real time data platforms.

7/2/2020

Reading time: 39 mins

Dial C* for Operator - Creating a Cassandra Cluster with Cass Operator

by John Doe

In this post we are going to take a deep dive look at provisioning a Cassandra cluster using the DataStax Kubernetes operator for Cassandra, Cass Operator. We will set up a multi-rack cluster with each rack in a different availability zone.

For the examples, I will use a nine node, regional cluster in Google Kubernetes Engine (GKE) that is spread across three zones. Here is what my Kubernetes cluster looks like:

$ kubectl get nodes --label-columns failure-domain.beta.kubernetes.io/region,failure-domain.beta.kubernetes.io/zone | awk {'print $1" "$6" "$7'} | column -t
NAME                                     REGION    ZONE
gke-cass-dev-default-pool-3cab2f1f-3swp  us-east1  us-east1-d
gke-cass-dev-default-pool-3cab2f1f-408v  us-east1  us-east1-d
gke-cass-dev-default-pool-3cab2f1f-pv6v  us-east1  us-east1-d
gke-cass-dev-default-pool-63ec3f9d-5781  us-east1  us-east1-b
gke-cass-dev-default-pool-63ec3f9d-blrh  us-east1  us-east1-b
gke-cass-dev-default-pool-63ec3f9d-g4cb  us-east1  us-east1-b
gke-cass-dev-default-pool-b1ee1c3c-5th7  us-east1  us-east1-c
gke-cass-dev-default-pool-b1ee1c3c-ht20  us-east1  us-east1-c
gke-cass-dev-default-pool-b1ee1c3c-xp2v  us-east1  us-east1-c

Without getting into too much detail, I want to quickly cover some fundamental concepts for the things we will discuss in this post. Kubernetes is made up of controllers. A controller manages the state of one or more Kubernetes resource types. The controller executes an infinite loop, continually trying to converge the desired state of resources with their actual state. The controller watches for changes of interest in the Kubernetes cluster, i.e., a resource being added, deleted, or updated. When there is a change, a key uniquely identifying the affected resource is added to a work queue. The controller eventually gets the key from the queue and begins whatever work is necessary.

Sometimes a controller has to perform potentially long-running operations like pulling an image from a remote registry. Rather than blocking until the operation completes, the controller usually requeues the key so that it can continue with other work while the operation completes in the background. When there is no more work to do for a resource, i.e. the desired state matches the actual state, the controller removes the key from the work queue.

An operator consists of one or more controllers that manage the state of one or more custom resources. Every controller has a Reconciler object that implements a reconcile loop. The reconcile loop is passed a request, which is the resource key.

A Kubernetes worker node, Kubernetes worker, or worker node is a machine that runs services necessary to run and manage pods. These services include:

  • kubelet
  • kube-proxy
  • container runtime, e.g., Docker

A Cassandra node is the Cassandra process running in a container.

A Cassandra container is the container, i.e., Docker container, in which the Cassandra node is running.

A Cassandra pod is a Kubernetes pod that includes one or more containers. One of those containers is running the Cassandra node.
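
Once the cluster we build below is running, you can see this for yourself by listing the containers in a Cassandra pod with a jsonpath query. The pod name here is just a placeholder for any Cassandra pod in your namespace:

$ kubectl -n cass-operator get pod <cassandra-pod-name> -o jsonpath='{.spec.containers[*].name}'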

First, deploy the operator by applying the manifests in cass-operator-manifests.yaml as follows:

$ kubectl create -f https://raw.githubusercontent.com/datastax/cass-operator/b96bfd77775b5ba909bd9172834b4a56ef15c319/docs/user/cass-operator-manifests.yaml
namespace/cass-operator created
serviceaccount/cass-operator created
secret/cass-operator-webhook-config created
customresourcedefinition.apiextensions.k8s.io/cassandradatacenters.cassandra.datastax.com created
clusterrole.rbac.authorization.k8s.io/cass-operator-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/cass-operator created
role.rbac.authorization.k8s.io/cass-operator created
rolebinding.rbac.authorization.k8s.io/cass-operator created
service/cassandradatacenter-webhook-service created
deployment.apps/cass-operator created
validatingwebhookconfiguration.admissionregistration.k8s.io/cassandradatacenter-webhook-registration created

Note: The operator is deployed in the cass-operator namespace.

Make sure that the operator has deployed successfully. You should see output similar to this:

$ kubectl -n cass-operator get deployments
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
cass-operator   1/1     1            1           2m8s
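
If you are scripting the install, you can block until the Deployment is available instead of polling; either of these standard kubectl commands works:

$ kubectl -n cass-operator rollout status deployment/cass-operator
$ kubectl -n cass-operator wait --for=condition=Available deployment/cass-operator --timeout=120s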

We need to create a StorageClass that is suitable for Cassandra. Place the following in a file named server-storageclass.yaml:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: server-storage
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: none
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

One thing to note here is volumeBindingMode: WaitForFirstConsumer. The default value is Immediate and should not be used. It can prevent Cassandra pods from being scheduled on a worker node. If a pod fails to run and its status reports a message like "had volume node affinity conflict", then check the volumeBindingMode of the StorageClass being used. See Topology-Aware Volume Provisioning in Kubernetes for more details.
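
If a pod does get stuck this way, two quick checks are the binding mode on the StorageClass and the scheduling message in the pod's events (the pod name below is a placeholder):

$ kubectl get storageclass server-storage -o jsonpath='{.volumeBindingMode}'
$ kubectl -n cass-operator describe pod <stuck-pod-name> | grep -i 'node affinity'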

Create the StorageClass with:

$ kubectl -n cass-operator apply -f server-storageclass.yaml
storageclass.storage.k8s.io/server-storage created

Most Kubernetes resources define spec and status properties. The spec declares the desired state of a resource which includes configuration settings provided by the user, default values expanded by the system, and other properties initialized by other internal components after resource creation. We will talk about the status in a little bit.

The manifest below declares a CassandraDatacenter custom resource. It does not include all possible properties. It includes the minimum necessary to create a multi-zone cluster.

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: multi-rack
spec:
  clusterName: multi-rack
  serverType: cassandra
  serverVersion: 3.11.6
  managementApiAuth:
    insecure: {}
  size: 9
  racks:
  - name: us-east1-b
    zone: us-east1-b
  - name: us-east1-c
    zone: us-east1-c
  - name: us-east1-d
    zone: us-east1-d    
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: server-storage
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi

This spec declares a single Cassandra datacenter. Cass Operator does support multi-DC clusters. It requires creating a separate CassandraDatacenter for each datacenter. Discussion of multi-DC clusters is outside the scope of this post.

The size property specifies the total number of Cassandra nodes in the datacenter.

racks is an array of Rack objects which consist of name and zone properties. The zone should be the name of a zone in GCP (or an availability zone in AWS, for example, if the cluster were running there). The operator will use this to pin Cassandra pods to Kubernetes workers in the zone. More on this later.
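
Once the Cassandra pods exist, you can check the zone pinning yourself. The operator labels each Cassandra pod with its cluster name, so a label selector plus -o wide shows the worker node, and therefore the zone, each pod landed on:

$ kubectl -n cass-operator get pods -l cassandra.datastax.com/cluster=multi-rack -o wide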

Put the above manifest in a file named multi-rack-cassdc.yaml and then run:

$ kubectl -n cass-operator apply -f multi-rack-cassdc.yaml
cassandradatacenter.cassandra.datastax.com/multi-rack created

This creates a CassandraDatacenter object named multi-rack in the Kubernetes API server. The API server provides a REST API with which clients, like kubectl, interact. The API server maintains state in etcd. Creating a Kubernetes resource ultimately means persisting state in etcd. When the CassandraDatacenter object is persisted, the API server notifies any clients watching for changes, namely Cass Operator. From here the operator takes over. The new object is added to the operator’s internal work queue. The job of the operator is to make sure the desired state, i.e., the spec, matches the actual state of the CassandraDatacenter.
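
The CRD we installed earlier also registers a short name, cassdc, for CassandraDatacenter, which I will use in the commands that follow. You can confirm the resource type and the new object with:

$ kubectl api-resources | grep cassandradatacenters
$ kubectl -n cass-operator get cassdc multi-rack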

Now that we have created the CassandraDatacenter, it is time to focus our attention on what Cass Operator is doing to build the Cassandra cluster.

We will look at a couple things to monitor the progress of the provisioning or scaling up of the cluster:

  • Changes in the status of the CassandraDatacenter
  • Kubernetes events emitted by the operator

We have already discussed that the spec describes a resource’s desired state. The status, on the other hand, describes the object’s current, observed state. Earlier I mentioned that the Kubernetes API server provides a REST API to clients. A Kubernetes object or resource is a REST resource. The status of a Kubernetes resource is typically implemented as a REST subresource that can only be modified by internal, system components. In the case of a CassandraDatacenter, Cass Operator manages the status property.

An event is a Kubernetes resource that is created when objects like pods change state, or when an error occurs. Like other resources, events get stored in the API server. Cass Operator generates a number of events for a CassandraDatacenter.
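
Two handy ways to see them: describe prints an Events section at the bottom, and get events with a field selector lists just the events for our CassandraDatacenter (sorted here by timestamp):

$ kubectl -n cass-operator describe cassdc multi-rack
$ kubectl -n cass-operator get events --field-selector involvedObject.name=multi-rack --sort-by=.lastTimestamp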

Understanding both the changes in a CassandraDatacenter’s status and the events emitted by the operator provides valuable insight into what is actually happening during the provisioning process. That understanding also makes it easier to resolve issues when things go wrong. This applies not only to CassandraDatacenter but to other Kubernetes resources as well.

We can watch for changes in the status with:

$ kubectl -n cass-operator get -w cassdc multi-rack -o yaml

In the following sections we will discuss each of the status updates that occur while the operator works to create the Cassandra cluster.

Here is what the status looks like initially after creating the CassandraDatacenter:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  nodeStatuses: {}

cassandraOperatorProgress can have one of two values, Ready or Updating. It will change to Ready when the operator has no more work to do for the resource. This simple detail is really important, particularly if you are performing any automation with Cass Operator. For example, I have used Cassandra operators to provision clusters for integration tests. With Cass Operator my test setup code could simply poll cassandraOperatorProgress to know when the cluster is ready.
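
For example, a minimal readiness poll might look like the following. This is just a sketch; adjust the namespace and datacenter name as needed:

until [ "$(kubectl -n cass-operator get cassdc multi-rack -o jsonpath='{.status.cassandraOperatorProgress}')" = "Ready" ]; do
  echo "Waiting for the operator to finish..."
  sleep 10
done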

conditions is an array of DatacenterCondition objects. A lot of Kubernetes resources use conditions in their statuses. Conditions represent the latest observations of an object’s state. They should minimally include type and status fields. The status field can have as its value either True, False, or Unknown. lastTransitionTime is the time the condition transitioned from one status to another. type identifies the condition. CassandraDatacenter currently has the following condition types:

  • Ready
  • Initialized
  • ReplacingNodes
  • ScalingUp
  • Updating
  • Stopped
  • Resuming
  • RollingRestart

Implementing, understanding, and using conditions are often points of confusion. It is intuitive to think of and model a resource’s state as a state machine. Reminding yourself that conditions are observations, not a state machine, will go a long way toward avoiding some of that confusion. It is worth noting there has been a lot of debate in the Kubernetes community about whether conditions should be removed. Some of the latest discussions in this ticket indicate that they will remain.

lastRollingRestart is only updated when a rolling restart is explicitly requested. As we will see, its value remains unchanged, so we will ignore it for the rest of this post.

nodeStatuses is a map that provides some details for each node. We will see it get updated as nodes are deployed.
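
If you prefer a narrower view than the full YAML, jsonpath can pull out individual status fields. For example, to print just the nodeStatuses map:

$ kubectl -n cass-operator get cassdc multi-rack -o jsonpath='{.status.nodeStatuses}'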

Cassandra Node Starting

With the next update we see that a lastServerNodeStarted property has been added to the status:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:41:24Z"
  nodeStatuses: {}

lastServerNodeStarted gets updated when a Cassandra node is starting up. The operator also adds the label cassandra.datastax.com/node-state: Starting to the Cassandra pod. The astute reader may have noted that I said lastServerNodeStarted is updated when the Cassandra node is starting up rather than when the pod is starting up. For Cass Operator, there is an important distinction between the Cassandra node and the Cassandra container. The Cassandra Container section at the end of the post goes over this in some detail.

Cassandra Node Started

In the next update lastServerNodeStarted is modified and another entry is added to nodeStatuses:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:41:50Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5

The entry is keyed by the pod name, multi-rack-multi-rack-us-east1-b-sts-2. The value consists of two fields: the node’s host ID and its IP address.

When Cass Operator determines the Cassandra node is up and running, it updates the node-state label to cassandra.datastax.com/node-state: Started. After the label update, the operator uses a label selector query to see which pods have been started and are running. When the operator finds another node running, its host ID and IP address are added to nodeStatuses.
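
You can run the same kind of label selector query yourself to see which pods are in a given state. For example, using the node-state label values mentioned above, these list the pods that are still starting and the pods that have started:

$ kubectl -n cass-operator get pods -l cassandra.datastax.com/node-state=Starting
$ kubectl -n cass-operator get pods -l cassandra.datastax.com/node-state=Started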

Remaining Nodes Started

In this section we follow the progression of the rest of the Cassandra cluster being started. lastServerNodeStarted is changed with each of these status updates in addition to nodeStatuses being updated.

multi-rack-multi-rack-us-east1-c-sts-0 is started:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:42:49Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3

Next, multi-rack-multi-rack-us-east1-d-sts-0 is started:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:43:53Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4

Next, multi-rack-multi-rack-us-east1-c-sts-2 is started:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:44:54Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-c-sts-2:
      hostID: facbbaa0-ffa7-403c-b323-e83e4cab8756
      nodeIP: 10.32.8.5
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4

Next, multi-rack-multi-rack-us-east1-d-sts-1 is started:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:45:50Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-c-sts-2:
      hostID: facbbaa0-ffa7-403c-b323-e83e4cab8756
      nodeIP: 10.32.8.5
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4
    multi-rack-multi-rack-us-east1-d-sts-1:
      hostID: 785e30ca-5772-4a57-b4bc-4bd7b3b24ebf
      nodeIP: 10.32.3.3

With five of the nine nodes started, now is a good time to point out a couple of things. First, we see that one node at a time is added to nodeStatuses, which suggests that Cass Operator starts nodes serially. That is precisely what is happening.

Secondly, there is roughly a minute between the values of lastServerNodeStarted in each status update. It is taking about a minute to start each node, which means it should take somewhere between nine and ten minutes for the cluster to be ready. These times will almost certainly vary depending on a number of factors, like the type of disks used, the machine type, etc. It is helpful, though, particularly for larger clusters, to be able to gauge how long it will take to get the entire cluster up and running.
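
If you want to watch just the fields that change between these updates rather than the full YAML, a custom-columns watch is a handy alternative (just a convenience sketch):

$ kubectl -n cass-operator get cassdc multi-rack -w -o custom-columns=PROGRESS:.status.cassandraOperatorProgress,LAST_STARTED:.status.lastServerNodeStarted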

Next, multi-rack-multi-rack-us-east1-d-sts-2 is started:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:46:51Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-c-sts-2:
      hostID: facbbaa0-ffa7-403c-b323-e83e4cab8756
      nodeIP: 10.32.8.5
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4
    multi-rack-multi-rack-us-east1-d-sts-1:
      hostID: 785e30ca-5772-4a57-b4bc-4bd7b3b24ebf
      nodeIP: 10.32.3.3
    multi-rack-multi-rack-us-east1-d-sts-2:
      hostID: 8e8733ab-6f7b-4102-946d-c855adaabe49
      nodeIP: 10.32.5.4

Next, multi-rack-multi-rack-us-east1-b-sts-0 is started:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:48:00Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-0:
      hostID: 3b1b60e0-62c6-47fb-93ff-3d164825035a
      nodeIP: 10.32.1.4
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-c-sts-2:
      hostID: facbbaa0-ffa7-403c-b323-e83e4cab8756
      nodeIP: 10.32.8.5
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4
    multi-rack-multi-rack-us-east1-d-sts-1:
      hostID: 785e30ca-5772-4a57-b4bc-4bd7b3b24ebf
      nodeIP: 10.32.3.3
    multi-rack-multi-rack-us-east1-d-sts-2:
      hostID: 8e8733ab-6f7b-4102-946d-c855adaabe49
      nodeIP: 10.32.5.4

Next, multi-rack-multi-rack-us-east1-c-sts-1 is started:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:48:57Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-0:
      hostID: 3b1b60e0-62c6-47fb-93ff-3d164825035a
      nodeIP: 10.32.1.4
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-c-sts-1:
      hostID: a55082ba-0692-4ee9-97a2-a1bb16383d31
      nodeIP: 10.32.7.6
    multi-rack-multi-rack-us-east1-c-sts-2:
      hostID: facbbaa0-ffa7-403c-b323-e83e4cab8756
      nodeIP: 10.32.8.5
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4
    multi-rack-multi-rack-us-east1-d-sts-1:
      hostID: 785e30ca-5772-4a57-b4bc-4bd7b3b24ebf
      nodeIP: 10.32.3.3
    multi-rack-multi-rack-us-east1-d-sts-2:
      hostID: 8e8733ab-6f7b-4102-946d-c855adaabe49
      nodeIP: 10.32.5.4

Finally, multi-rack-multi-rack-us-east1-b-sts-1 is started:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:48:57Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-0:
      hostID: 3b1b60e0-62c6-47fb-93ff-3d164825035a
      nodeIP: 10.32.1.4
    multi-rack-multi-rack-us-east1-b-sts-1:
      hostID: d7246bca-ae64-45ec-8533-7c3a2540b5ef
      nodeIP: 10.32.2.6
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-c-sts-1:
      hostID: a55082ba-0692-4ee9-97a2-a1bb16383d31
      nodeIP: 10.32.7.6
    multi-rack-multi-rack-us-east1-c-sts-2:
      hostID: facbbaa0-ffa7-403c-b323-e83e4cab8756
      nodeIP: 10.32.8.5
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4
    multi-rack-multi-rack-us-east1-d-sts-1:
      hostID: 785e30ca-5772-4a57-b4bc-4bd7b3b24ebf
      nodeIP: 10.32.3.3
    multi-rack-multi-rack-us-east1-d-sts-2:
      hostID: 8e8733ab-6f7b-4102-946d-c855adaabe49
      nodeIP: 10.32.5.4

Although all nine nodes are now started, the operator still has more work to do. This is evident based on the ScalingUp condition still being True and cassandraOperatorProgress still having a value of Updating.

Cassandra Super User Created

With the next update the superUserUpserted property is added to the status:

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:40:51Z"
    status: "True"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:48:57Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-0:
      hostID: 3b1b60e0-62c6-47fb-93ff-3d164825035a
      nodeIP: 10.32.1.4
    multi-rack-multi-rack-us-east1-b-sts-1:
      hostID: d7246bca-ae64-45ec-8533-7c3a2540b5ef
      nodeIP: 10.32.2.6
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-c-sts-1:
      hostID: a55082ba-0692-4ee9-97a2-a1bb16383d31
      nodeIP: 10.32.7.6
    multi-rack-multi-rack-us-east1-c-sts-2:
      hostID: facbbaa0-ffa7-403c-b323-e83e4cab8756
      nodeIP: 10.32.8.5
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4
    multi-rack-multi-rack-us-east1-d-sts-1:
      hostID: 785e30ca-5772-4a57-b4bc-4bd7b3b24ebf
      nodeIP: 10.32.3.3
    multi-rack-multi-rack-us-east1-d-sts-2:
      hostID: 8e8733ab-6f7b-4102-946d-c855adaabe49
      nodeIP: 10.32.5.4
  superUserUpserted: "2020-05-06T16:49:55Z"

superUserUpserted is the timestamp at which the operator creates a super user in Cassandra. We will explore this in a little more detail when we go through the events.

ScalingUp Transition

In this update the ScalingUp condition transitions to False. This condition changes only after all nodes have been started and after the super user has been created.

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:49:55Z"
    status: "False"
    type: ScalingUp
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:48:57Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-0:
      hostID: 3b1b60e0-62c6-47fb-93ff-3d164825035a
      nodeIP: 10.32.1.4
    multi-rack-multi-rack-us-east1-b-sts-1:
      hostID: d7246bca-ae64-45ec-8533-7c3a2540b5ef
      nodeIP: 10.32.2.6
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-c-sts-1:
      hostID: a55082ba-0692-4ee9-97a2-a1bb16383d31
      nodeIP: 10.32.7.6
    multi-rack-multi-rack-us-east1-c-sts-2:
      hostID: facbbaa0-ffa7-403c-b323-e83e4cab8756
      nodeIP: 10.32.8.5
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4
    multi-rack-multi-rack-us-east1-d-sts-1:
      hostID: 785e30ca-5772-4a57-b4bc-4bd7b3b24ebf
      nodeIP: 10.32.3.3
    multi-rack-multi-rack-us-east1-d-sts-2:
      hostID: 8e8733ab-6f7b-4102-946d-c855adaabe49
      nodeIP: 10.32.5.4
  superUserUpserted: "2020-05-06T16:49:55Z"

Add Initialized and Ready Conditions

Next, the operator adds the Initialized and Ready conditions to the status. Initialized means the CassandraDatacenter was successfully created. The transition for this condition should only happen once. Ready means the cluster can start serving client requests. The Ready condition will remain True during a rolling restart, for example, but will transition to False when all nodes are stopped. See The Cassandra Container section at the end of the post for more details on starting and stopping Cassandra nodes.

status:
  cassandraOperatorProgress: Updating
  conditions:
  - lastTransitionTime: "2020-05-06T16:49:55Z"
    status: "False"
    type: ScalingUp
  - lastTransitionTime: "2020-05-06T16:49:55Z"
    status: "True"
    type: Initialized
  - lastTransitionTime: "2020-05-06T16:49:55Z"
    status: "True"
    type: Ready
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:48:57Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-0:
      hostID: 3b1b60e0-62c6-47fb-93ff-3d164825035a
      nodeIP: 10.32.1.4
    multi-rack-multi-rack-us-east1-b-sts-1:
      hostID: d7246bca-ae64-45ec-8533-7c3a2540b5ef
      nodeIP: 10.32.2.6
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-c-sts-1:
      hostID: a55082ba-0692-4ee9-97a2-a1bb16383d31
      nodeIP: 10.32.7.6
    multi-rack-multi-rack-us-east1-c-sts-2:
      hostID: facbbaa0-ffa7-403c-b323-e83e4cab8756
      nodeIP: 10.32.8.5
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4
    multi-rack-multi-rack-us-east1-d-sts-1:
      hostID: 785e30ca-5772-4a57-b4bc-4bd7b3b24ebf
      nodeIP: 10.32.3.3
    multi-rack-multi-rack-us-east1-d-sts-2:
      hostID: 8e8733ab-6f7b-4102-946d-c855adaabe49
      nodeIP: 10.32.5.4
  superUserUpserted: "2020-05-06T16:49:55Z"
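
If you are automating against these conditions, individual entries can be read with a jsonpath filter expression. For example, this prints the current status of the Ready condition (just a convenience sketch):

$ kubectl -n cass-operator get cassdc multi-rack -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'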

In the last update, the value of cassandraOperatorProgress is changed to Ready:

status:
  cassandraOperatorProgress: Ready
  conditions:
  - lastTransitionTime: "2020-05-06T16:49:55Z"
    status: "False"
    type: ScalingUp
  - lastTransitionTime: "2020-05-06T16:49:55Z"
    status: "True"
    type: Initialized
  - lastTransitionTime: "2020-05-06T16:49:55Z"
    status: "True"
    type: Ready
  lastRollingRestart: "2020-05-06T16:40:51Z"
  lastServerNodeStarted: "2020-05-06T16:48:57Z"
  nodeStatuses:
    multi-rack-multi-rack-us-east1-b-sts-0:
      hostID: 3b1b60e0-62c6-47fb-93ff-3d164825035a
      nodeIP: 10.32.1.4
    multi-rack-multi-rack-us-east1-b-sts-1:
      hostID: d7246bca-ae64-45ec-8533-7c3a2540b5ef
      nodeIP: 10.32.2.6
    multi-rack-multi-rack-us-east1-b-sts-2:
      hostID: 62399b3b-80f0-42f2-9930-6c4f2477c9bd
      nodeIP: 10.32.0.5
    multi-rack-multi-rack-us-east1-c-sts-0:
      hostID: dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76
      nodeIP: 10.32.6.3
    multi-rack-multi-rack-us-east1-c-sts-1:
      hostID: a55082ba-0692-4ee9-97a2-a1bb16383d31
      nodeIP: 10.32.7.6
    multi-rack-multi-rack-us-east1-c-sts-2:
      hostID: facbbaa0-ffa7-403c-b323-e83e4cab8756
      nodeIP: 10.32.8.5
    multi-rack-multi-rack-us-east1-d-sts-0:
      hostID: c7e43757-92ee-4ca3-adaa-46a128045d4d
      nodeIP: 10.32.4.4
    multi-rack-multi-rack-us-east1-d-sts-1:
      hostID: 785e30ca-5772-4a57-b4bc-4bd7b3b24ebf
      nodeIP: 10.32.3.3
    multi-rack-multi-rack-us-east1-d-sts-2:
      hostID: 8e8733ab-6f7b-4102-946d-c855adaabe49
      nodeIP: 10.32.5.4
  superUserUpserted: "2020-05-06T16:49:55Z"

We now know the operator has completed its work to scale up the cluster. We also know the cluster is initialized and ready for use. Let’s verify that the desired state of the CassandraDatacenter matches the actual state. We can do this with nodetool status and by checking where the Cassandra pods were scheduled with kubectl get pods -o wide.

$ kubectl -n cass-operator exec -it multi-rack-multi-rack-us-east1-b-sts-0 -c cassandra -- nodetool status
Datacenter: multi-rack
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.32.4.4  84.43 KiB  1            4.8%              c7e43757-92ee-4ca3-adaa-46a128045d4d  us-east1-d
UN  10.32.1.4  70.2 KiB   1            7.4%              3b1b60e0-62c6-47fb-93ff-3d164825035a  us-east1-b
UN  10.32.6.3  65.36 KiB  1            32.5%             dfd6ebfb-2e2c-4f7a-92f8-9fe60fb24e76  us-east1-c
UN  10.32.3.3  103.54 KiB  1            34.0%             785e30ca-5772-4a57-b4bc-4bd7b3b24ebf  us-east1-d
UN  10.32.7.6  70.34 KiB  1            18.1%             a55082ba-0692-4ee9-97a2-a1bb16383d31  us-east1-c
UN  10.32.8.5  65.36 KiB  1            19.8%             facbbaa0-ffa7-403c-b323-e83e4cab8756  us-east1-c
UN  10.32.2.6  65.36 KiB  1            36.5%             d7246bca-ae64-45ec-8533-7c3a2540b5ef  us-east1-b
UN  10.32.0.5  65.36 KiB  1            39.9%             62399b3b-80f0-42f2-9930-6c4f2477c9bd  us-east1-b
UN  10.32.5.4  65.36 KiB  1            7.0%              8e8733ab-6f7b-4102-946d-c855adaabe49  us-east1-d

nodetool status reports nine nodes up across three racks. That looks good. Now let’s verify the pods are running where we expect them to be.

$ kubectl -n cass-operator get pods -l "cassandra.datastax.com/cluster=multi-rack" -o wide | awk {'print $1" "$7'} | column -t
NAME                                    NODE
multi-rack-multi-rack-us-east1-b-sts-0  gke-cass-dev-default-pool-63ec3f9d-5781
multi-rack-multi-rack-us-east1-b-sts-1  gke-cass-dev-default-pool-63ec3f9d-blrh
multi-rack-multi-rack-us-east1-b-sts-2  gke-cass-dev-default-pool-63ec3f9d-g4cb
multi-rack-multi-rack-us-east1-c-sts-0  gke-cass-dev-default-pool-b1ee1c3c-5th7
multi-rack-multi-rack-us-east1-c-sts-1  gke-cass-dev-default-pool-b1ee1c3c-ht20
multi-rack-multi-rack-us-east1-c-sts-2  gke-cass-dev-default-pool-b1ee1c3c-xp2v
multi-rack-multi-rack-us-east1-d-sts-0  gke-cass-dev-default-pool-3cab2f1f-3swp
multi-rack-multi-rack-us-east1-d-sts-1  gke-cass-dev-default-pool-3cab2f1f-408v
multi-rack-multi-rack-us-east1-d-sts-2  gke-cass-dev-default-pool-3cab2f1f-pv6v

Look carefully at the output, and you will see each pod is in fact running on a separate worker node. Furthermore, the pods are running on worker nodes in the expected zones.

The operator reports a number of events useful for monitoring and debugging the provisioning process. As we will see, the events provide additional insights absent from the status updates alone.

There are some nuances with events that can make working with them a bit difficult. First, events are persisted with a TTL. They expire after one hour. Secondly, events can be listed out of order. The ordering appears to be done on the client side with a sort on the Age column. We will go through the events in the order in which the operator generates them. Lastly, while working on this post, I discovered that some events can get dropped. I created this ticket to investigate the issue. Kubernetes has in place some throttling mechanisms to prevent the system from getting overwhelmed by too many events. We won’t go through every single event as there are a lot. We will however cover enough, including some that may be dropped, in order to get an overall sense of what is going on.

We can list all of the events for the CassandraDatacenter with the describe command as follows:

$ kubectl -n cass-operator describe cassdc multi-rack
Events:
  Type    Reason             Age    From           Message
  ----    ------             ----   ----           -------
  Normal  ScalingUpRack      12m    cass-operator  Scaling up rack us-east1-b
  Normal  CreatedResource    12m    cass-operator  Created service multi-rack-seed-service
  Normal  CreatedResource    12m    cass-operator  Created service multi-rack-multi-rack-all-pods-service
  Normal  CreatedResource    12m    cass-operator  Created statefulset multi-rack-multi-rack-us-east1-b-sts
  Normal  CreatedResource    12m    cass-operator  Created statefulset multi-rack-multi-rack-us-east1-c-sts
  Normal  CreatedResource    12m    cass-operator  Created statefulset multi-rack-multi-rack-us-east1-d-sts
  Normal  CreatedResource    12m    cass-operator  Created service multi-rack-multi-rack-service
  Normal  ScalingUpRack      12m    cass-operator  Scaling up rack us-east1-c
  Normal  ScalingUpRack      12m    cass-operator  Scaling up rack us-east1-d
  Normal  LabeledPodAsSeed   12m    cass-operator  Labeled pod a seed node multi-rack-multi-rack-us-east1-b-sts-2
  Normal  StartingCassandra  12m    cass-operator  Starting Cassandra for pod multi-rack-multi-rack-us-east1-b-sts-2
  Normal  StartedCassandra   11m    cass-operator  Started Cassandra for pod multi-rack-multi-rack-us-east1-b-sts-2
  Normal  StartingCassandra  11m    cass-operator  Starting Cassandra for pod multi-rack-multi-rack-us-east1-c-sts-0
  Normal  StartingCassandra  10m    cass-operator  Starting Cassandra for pod multi-rack-multi-rack-us-east1-d-sts-0
  Normal  StartedCassandra   10m    cass-operator  Started Cassandra for pod multi-rack-multi-rack-us-east1-c-sts-0
  Normal  LabeledPodAsSeed   10m    cass-operator  Labeled as seed node pod multi-rack-multi-rack-us-east1-c-sts-0
  Normal  LabeledPodAsSeed   9m44s  cass-operator  Labeled as seed node pod multi-rack-multi-rack-us-east1-d-sts-0
  Normal  StartedCassandra   9m43s  cass-operator  Started Cassandra for pod multi-rack-multi-rack-us-east1-d-sts-0
  Normal  StartingCassandra  9m43s  cass-operator  Starting Cassandra for pod multi-rack-multi-rack-us-east1-c-sts-2
  Normal  StartedCassandra   8m43s  cass-operator  Started Cassandra for pod multi-rack-multi-rack-us-east1-c-sts-2
  Normal  StartingCassandra  8m43s  cass-operator  Starting Cassandra for pod multi-rack-multi-rack-us-east1-d-sts-1
  Normal  StartedCassandra   7m47s  cass-operator  Started Cassandra for pod multi-rack-multi-rack-us-east1-d-sts-1
  Normal  StartingCassandra  7m46s  cass-operator  Starting Cassandra for pod multi-rack-multi-rack-us-east1-d-sts-2
  Normal  StartedCassandra   6m45s  cass-operator  Started Cassandra for pod multi-rack-multi-rack-us-east1-d-sts-2
  Normal  StartingCassandra  6m45s  cass-operator  Starting Cassandra for pod multi-rack-multi-rack-us-east1-b-sts-0
  Normal  LabeledPodAsSeed   5m36s  cass-operator  Labeled as seed node pod multi-rack-multi-rack-us-east1-b-sts-0

In the following sections we will go through several of these events as well as some that are missing.

The first thing that Cass Operator does during the initial reconciliation loop is create a few headless services:

  Type    Reason             Age    From           Message
  ----    ------             ----   ----           -------
  Normal  CreatedResource    10m    cass-operator  Created service multi-rack-seed-service
  Normal  CreatedResource    10m    cass-operator  Created service multi-rack-multi-rack-all-pods-service
  Normal  CreatedResource    10m    cass-operator  Created service multi-rack-multi-rack-service    

multi-rack-seed-service exposes all pods running seed nodes. This service is used by Cassandra to configure seed nodes.

multi-rack-multi-rack-all-pods-service exposes all pods that are part of the CassandraDatacenter, regardless of whether they are ready. It is used to scrape metrics with Prometheus.

multi-rack-multi-rack-service exposes ready pods. CQL clients should use this service to establish connections to the cluster.
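
You can list the services to confirm they were created (output omitted):

$ kubectl -n cass-operator get services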

Next the operator creates three StatefulSets, one for each rack:

  Type    Reason             Age    From           Message
  ----    ------             ----   ----           -------
  Normal  CreatedResource    12m    cass-operator  Created statefulset multi-rack-multi-rack-us-east1-b-sts
  Normal  CreatedResource    12m    cass-operator  Created statefulset multi-rack-multi-rack-us-east1-c-sts
  Normal  CreatedResource    12m    cass-operator  Created statefulset multi-rack-multi-rack-us-east1-d-sts

I mentioned earlier the operator will use the zone property specified for each rack to pin pods to Kubernetes workers in the respective zones. The operator uses affinity rules to accomplish this.

Let’s take a look at the spec for multi-rack-multi-rack-us-east1-c-sts to see how this is accomplished:

$ kubectl -n cass-operator get sts multi-rack-multi-rack-us-east1-c-sts -o yaml
...
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - us-east1-c
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: cassandra.datastax.com/cluster
                operator: Exists
              - key: cassandra.datastax.com/datacenter
                operator: Exists
              - key: cassandra.datastax.com/rack
                operator: Exists
            topologyKey: kubernetes.io/hostname
...            

The nodeAffinity property constrains the worker nodes on which pods in the StatefulSet can be scheduled. requiredDuringSchedulingIgnoredDuringExecution is a NodeSelector, which declares a label-based query. In this case, if a node has the label failure-domain.beta.kubernetes.io/zone with a value of us-east1-c, then pods can be scheduled on that node.

Note: failure-domain.beta.kubernetes.io/zone is one of a number of well-known labels used by the Kubernetes runtime.

I added emphasis to can be because of the podAntiAffinity property that is also declared. It constrains the worker nodes on which the pods can be scheduled based on the labels of pods already running on those nodes. Its requiredDuringSchedulingIgnoredDuringExecution property is a list of PodAffinityTerm objects whose label selectors determine which pods cannot be co-located on the same host. In short, this prevents a pod from being scheduled on any node where pods from a CassandraDatacenter are already running. In other words, no two Cassandra nodes should run on the same Kubernetes worker node.

Note: You can run multiple Cassandra pods on a single worker node by setting .spec.allowMultipleNodesPerWorker to true.
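
For example, a CassandraDatacenter spec fragment that opts into co-locating Cassandra pods might look like the following. This is just an illustrative sketch; the rest of this post keeps the default of one Cassandra pod per worker:

spec:
  size: 9
  allowMultipleNodesPerWorker: true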

Scale up the Racks

The next events involve scaling up the racks:

Events:
  Type    Reason             Age    From           Message
  ----    ------             ----   ----           -------
  Normal  ScalingUpRack      12m    cass-operator  Scaling up rack us-east1-b
  Normal  ScalingUpRack      12m    cass-operator  Scaling up rack us-east1-c
  Normal  ScalingUpRack      12m    cass-operator  Scaling up rack us-east1-d

The StatefulSets are initially created with zero replicas. They are subsequently scaled up to the desired replica count, which is three (per StatefulSet) in this case.
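
You can watch this scale-up happen at the StatefulSet level with:

$ kubectl -n cass-operator get statefulsets -w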

Label the First Seed Node Pod

After the StatefulSet controller starts creating pods, Cass Operator applies the following label to a pod to designate it as a Cassandra seed node:

cassandra.datastax.com/seed-node: "true"

At this stage in the provisioning process, no pods have the seed-node label. The following event indicates that the operator designates the pod to be a seed node:

Events:
  Type    Reason             Age    From           Message
  ----    ------             ----   ----           -------
  Normal  LabeledPodAsSeed   12m    cass-operator  Labeled pod a seed node multi-rack-multi-rack-us-east1-b-sts-2   

Note: You can use a label selector to query for all seed node pods, e.g., kubectl -n cass-operator get pods -l cassandra.datastax.com/seed-node="true".

Start the First Seed Node

Next the operator starts the first seed node:

Events:
  Type    Reason             Age    From           Message
  ----    ------             ----   ----           -------
  Normal  StartingCassandra  12m    cass-operator  Starting Cassandra for pod multi-rack-multi-rack-us-east1-b-sts-2

The operator applies the label cassandra.datastax.com/node-state: Starting to the pod. The operator then requeues the request with a short delay, allowing time for the Cassandra node to start. Requeuing the request ends the current reconciliation.

If you are familiar with Kubernetes, this step of starting the Cassandra node may seem counter-intuitive because pods/containers cannot exist in a stopped state. See The Cassandra Container section at the end of the post for more information.

In a subsequent reconciliation loop the operator finds that multi-rack-multi-rack-us-east1-b-sts-2 has been started and records the following event:

Events:
  Type    Reason             Age    From           Message
  ----    ------             ----   ----           -------
  Normal  StartedCassandra   11m    cass-operator  Started Cassandra for pod multi-rack-multi-rack-us-east1-b-sts-2

Then the cassandra.datastax.com/node-state label is updated to a value of Started to indicate the Cassandra node is now running. The event is recorded and the label is updated only when the Cassandra container’s readiness probe passes. If the readiness probe fails, the operator requeues the request, ending the current reconciliation loop.

Start One Node Per Rack

After the first node, multi-rack-multi-rack-us-east1-b-sts-2, is running, the operator makes sure there is a node per rack running. Here is the sequence of events for a given node:

Events:
  Type    Reason             Age    From           Message
  ----    ------             ----   ----           -------
  Normal  StartingCassandra  8m43s  cass-operator  Starting Cassandra for pod multi-rack-multi-rack-us-east1-d-sts-1
  Normal  StartedCassandra   7m47s  cass-operator  Started Cassandra for pod multi-rack-multi-rack-us-east1-d-sts-1
  Normal  LabeledPodAsSeed   9m44s  cass-operator  Labeled as seed node pod multi-rack-multi-rack-us-east1-d-sts-1

Let’s break down what is happening here.

  • The cassandra.datastax.com/node-state: Starting label is applied to multi-rack-multi-rack-us-east1-d-sts-1
  • Cassandra is started
  • The request is requeued
  • On a subsequent reconciliation loop when Cassandra is running (as determined by the readiness probe), two things happen
    • The cassandra.datastax.com/seed-node="true" label is applied to the pod, making it a seed node
    • The cassandra.datastax.com/node-state label is updated to a value of Started

The operator will repeat this process for another rack which does not yet have a node running.

Now is a good time to discuss how the operator determines how many seeds there should be in total for the datacenter as well as how many seeds there should be per rack.

If the datacenter consists of only one or two nodes, then there will be one or two seeds respectively. If there are more than three racks, then the number of seeds will be set to the number of racks. If neither of those conditions hold, then there will be three seeds.

The seeds per rack are calculated as follows:

seedsPerRack = totalSeedCount / numRacks
extraSeeds = totalSeedCount % numRacks

For the example cluster in this post, totalSeedCount will be three. Then seedsPerRack will be one, and extraSeeds will be zero.
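
As a quick sanity check of that arithmetic (a throwaway shell sketch, not operator code):

totalSeedCount=3; numRacks=3
echo $(( totalSeedCount / numRacks ))  # seedsPerRack = 1
echo $(( totalSeedCount % numRacks ))  # extraSeeds = 0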

Start Remaining Nodes

After we have a Cassandra node up and running in each rack, the operator proceeds to start the remaining non-seed nodes. I will skip over listing events here because they are the same as the previous ones. At this point the operator iterates over the pods without worrying about the racks. For each pod in which Cassandra is not already running, it will start Cassandra following the same process previously described.

Create a PodDisruptionBudget

After all Cassandra nodes have been started, Cass Operator creates a PodDisruptionBudget. It generates an event like this:

Events:
  Type    Reason             Age    From           Message
  ----    ------             ----   ----           -------
  Normal  CreatedResource    10m6s  cass-operator  Created PodDisruptionBudget multi-rack-pdb

Note: This is one of the dropped events.

A PodDisruptionBudget limits the number of pods that can be down from a voluntary disruption. Examples of voluntary disruptions include accidentally deleting a pod or draining a worker node for upgrade or repair.

All Cassandra pods in the CassandraDatacenter are managed by the disruption budget. When creating the PodDisruptionBudget, Cass Operator sets the .spec.minAvailable property. This specifies the number of pods that must be available after a pod eviction. Cass Operator sets this to the total number of Cassandra nodes minus one.
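
You can inspect the budget to confirm this. With our nine node cluster, minAvailable should be 8:

$ kubectl -n cass-operator get pdb multi-rack-pdb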

Create a Cassandra Super User

The final thing that Cass Operator does is to create a super user in Cassandra:

Events:
  Type    Reason             Age    From           Message
  ----    ------             ----   ----           -------
  Normal  CreatedSuperuser   10m6s  cass-operator  Created superuser

Earlier in the provisioning process, Cass Operator creates the super user credentials and stores them in a secret. The secret name can be specified by setting .spec.superuserSecretName.

The username is set to <.spec.clusterName>-superuser, which will be multi-rack-superuser in our example. The password is a random UTF-8 string of at most 55 characters.
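
Assuming the generated secret follows the same <clusterName>-superuser naming pattern and stores username and password keys (check with kubectl -n cass-operator get secrets if your version differs), the credentials can be read back like this:

$ kubectl -n cass-operator get secret multi-rack-superuser -o jsonpath='{.data.username}' | base64 -d
$ kubectl -n cass-operator get secret multi-rack-superuser -o jsonpath='{.data.password}' | base64 -d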

Note: Cass Operator disables the default super user, cassandra.

The Cassandra Container

Each Cassandra pod runs a container named cassandra. Before we look at that container, we need to talk about sidecars. The sidecar pattern is a well-known and widely used architectural pattern in Kubernetes. A pod consists of one or more containers, and the containers in a pod share the same volumes and network interfaces. Examples of sidecars include log aggregation, gRPC proxies, and backup / restore agents, to name a few. Cass Operator utilizes the sidecar pattern, but in a somewhat unconventional manner.

We can take a look at the spec of one of the Cassandra pods to learn more about the cassandra container. Because we are only focused on this one part, most of the output is omitted.

$ kubectl -n cass-operator get pod multi-rack-multi-rack-us-east1-b-sts-0 -o yaml
apiVersion: v1
kind: Pod
...
spec:
...
      containers:
      - env:
        - name: DS_LICENSE
          value: accept
        - name: DSE_AUTO_CONF_OFF
          value: all
        - name: USE_MGMT_API
          value: "true"
        - name: MGMT_API_EXPLICIT_START
          value: "true"
        - name: DSE_MGMT_EXPLICIT_START
          value: "true"
        image: datastax/cassandra-mgmtapi-3_11_6:v0.1.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v0/probes/liveness
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 1
        name: cassandra
        ports:
        - containerPort: 9042
          name: native
          protocol: TCP
        - containerPort: 8609
          name: inter-node-msg
          protocol: TCP
        - containerPort: 7000
          name: intra-node
          protocol: TCP
        - containerPort: 7001
          name: tls-intra-node
          protocol: TCP
        - containerPort: 8080
          name: mgmt-api-http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v0/probes/readiness
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 20
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /config
          name: server-config
        - mountPath: /var/log/cassandra
          name: server-logs
        - mountPath: /var/lib/cassandra
          name: server-data
...          

There are two lines in the output on which we want to focus. The first line is:

name: cassandra

This is the name of the container. There are other containers listed in the output, but we are only concerned with the cassandra one.

The second line that we are interested in is:

image: datastax/cassandra-mgmtapi-3_11_6:v0.1.0

The image property specifies the image that the cassandra container is running. This is different from the Cassandra images such as the ones found on Docker Hub. This image is for the Management API Sidecar. There have been lots of discussions on the Cassandra community mailing lists about management sidecars. In fact there is even a Cassandra Enhancement Proposal (CEP) for providing an official, community based sidecar. The Management API Sidecar, or management sidecar for short, was not designed specifically for Kubernetes.

The process started in the cassandra container is the management sidecar rather than the CassandraDaemon process. The sidecar is responsible for starting and stopping the Cassandra node. In addition to lifecycle management, the sidecar also provides configuration management, health checks, and per-node actions (the kinds of operations you would otherwise perform with nodetool).
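
You can poke at the sidecar’s HTTP interface directly using the readiness probe endpoint shown in the pod spec above. For example, run the port-forward in one terminal and the curl in another (just a quick sketch):

$ kubectl -n cass-operator port-forward multi-rack-multi-rack-us-east1-b-sts-0 8080:8080
$ curl -i http://localhost:8080/api/v0/probes/readiness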

There is plenty more to say about the management sidecar, but that is for another post.

Hopefully this post gives you a better understanding of Cass Operator and Kubernetes in general. While we covered a lot of ground, there is plenty more to discuss, like multi-DC clusters and the management sidecar. If you want to hear more about Cassandra and Kubernetes, Patrick McFadin put together a series of interviews in which he talks to early adopters in the field. Check out “Why Tomorrow’s Cassandra Deployments Will Be on Kubernetes”; it will be available for streaming as part of the DataStax Accelerate online conference: https://dtsx.io/3ex1Eop.
