Kubernetes for Absolute Beginners: A Gentle Introduction (with ClickHouse® in Mind)

This is the first article in our series on running the ClickHouse® database on Kubernetes with the Altinity® Kubernetes Operator. Before we touch ClickHouse or the operator at all, we need to be comfortable with Kubernetes itself. This article assumes you have never used Kubernetes and explains every concept from the ground up, with the things a database like ClickHouse needs kept front of mind.

By the end you will understand what Kubernetes is, the handful of building blocks you will meet again and again, and why a database needs some of them more than a typical web app does.

What is Kubernetes, and what problem does it solve

Imagine you have an application packaged as a container, and you want to run it reliably. On one laptop that is easy. Now imagine you need ten copies of it across five machines, you want them to restart automatically when they crash, you want new versions rolled out without downtime, and you want storage that survives a machine failure. Doing all of that by hand is exhausting and error prone.

Kubernetes is an open-source system that does this orchestration for you. You describe the desired state of your application in a text file, and Kubernetes works continuously to make reality match that description. If a container dies, Kubernetes starts a new one. If a machine disappears, Kubernetes reschedules its work elsewhere. This idea, where you declare what you want and the system reconciles toward it, is the single most important thing to understand about Kubernetes. Everything else builds on it.

A quick word on containers

Kubernetes runs containers, so a one-paragraph recap helps. A container packages an application together with everything it needs to run, its libraries and dependencies, into a single isolated unit. The most common tool for building and running containers is Docker. A container image is the blueprint, and a running container is a live instance of that image. ClickHouse publishes official container images, so running it in a container is straightforward. Kubernetes takes those containers and runs them across many machines for you.

The cluster: control plane and nodes

A Kubernetes cluster is a set of machines working together. These machines come in two roles.

The control plane is the brain. It holds the desired state, makes scheduling decisions, and runs the reconciliation loops that keep everything on track. When you submit a request, you are talking to the control plane's API server.

The worker nodes are the muscle. A node is a machine, virtual or physical, that actually runs your containers. A cluster usually has several nodes, and Kubernetes decides which node each piece of your workload runs on. As a beginner you will not manage the control plane directly; tools like minikube, which we cover in the next article, set it up for you.

Pods: the smallest unit you deploy

You might expect to deploy a container directly, but Kubernetes wraps one or more containers in a slightly larger unit called a pod. A pod is the smallest thing Kubernetes schedules. Most of the time a pod holds exactly one main container, for example one ClickHouse server. Containers in the same pod share a network address and can share storage, which is useful for helper containers that sit alongside the main one.
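
To make the shared-storage idea concrete, here is a minimal sketch of a pod with a main container and a helper. The pod name, container names, and the busybox image are illustrative assumptions, not anything ClickHouse-specific, and you do not need to follow every field yet.

```yaml
# Hypothetical example: all names here are made up for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: shared-demo
spec:
  volumes:
    - name: scratch            # an emptyDir volume lasts only as long as the pod
      emptyDir: {}
  containers:
    - name: writer             # the "main" container appends a timestamp every 5 seconds
      image: busybox
      command: ["sh", "-c", "while true; do date >> /data/out.log; sleep 5; done"]
      volumeMounts:
        - name: scratch
          mountPath: /data
    - name: reader             # the helper container follows the same file
      image: busybox
      command: ["sh", "-c", "touch /data/out.log && tail -f /data/out.log"]
      volumeMounts:
        - name: scratch
          mountPath: /data
```

Both containers mount the same volume at /data, so what one writes the other can read. They also share one network address, so they could reach each other over localhost.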

The important beginner insight is that pods are disposable. Kubernetes can delete a pod and create a replacement at any time, on any node. The replacement is a fresh pod with a new network address. This is fine for stateless applications, but it raises two obvious questions for a database: how does anything find the pod if its address keeps changing, and what happens to the data on disk when a pod is replaced? Kubernetes has answers for both, and they are Services and persistent storage. We will get to them.

Controllers: keeping the right number of pods running

You rarely create pods by hand. Instead you tell a higher-level object how many pods you want and what they should look like, and that object, called a controller, keeps that many healthy pods running. Two controllers matter for us.

A Deployment manages a set of identical, interchangeable pods. It is perfect for stateless applications such as a web front end, where any pod is as good as any other and they have no individual identity. Deployments make rolling updates and scaling trivial.
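
As a rough illustration, a Deployment asking for three interchangeable pods might look like the sketch below. The name web, the app: web label, and the nginx image are assumptions chosen for the example.

```yaml
# Hypothetical example of a stateless workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                  # desired state: three identical pods
  selector:
    matchLabels:
      app: web                 # the Deployment manages pods carrying this label
  template:                    # the blueprint each pod is created from
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx
```

The pods created from this template get generated names with random suffixes, which is a visible sign that none of them has an identity of its own.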

A StatefulSet is the one that matters for databases. Unlike a Deployment, a StatefulSet gives each pod a stable, predictable identity that persists across restarts. Pods are named after the StatefulSet plus an ordinal: the first pod's name ends in -0, the second in -1, and so on, and each pod keeps its own dedicated storage. When a StatefulSet pod is replaced, the new pod takes the same identity and reattaches the same storage. This is exactly what a database cluster needs, because in ClickHouse a particular replica must remain the same replica, with the same data, even after a restart. This is why the Altinity Kubernetes Operator builds ClickHouse clusters on StatefulSets rather than Deployments.
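
The sketch below shows the general shape of a StatefulSet, assuming a hypothetical name db and a headless Service called db-headless that would have to exist separately. It is only an illustration of stable identity and per-pod storage; it is not a working replicated ClickHouse cluster, and it is not necessarily how the operator lays out its own objects.

```yaml
# Hypothetical example: names, sizes, and layout are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db-headless     # headless Service that provides per-pod DNS names
  replicas: 2                  # pods will be named db-0 and db-1
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: clickhouse/clickhouse-server
          volumeMounts:
            - name: data
              mountPath: /var/lib/clickhouse   # ClickHouse's default data directory
  volumeClaimTemplates:        # one storage claim is created per pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```

With this definition the claims are named data-db-0 and data-db-1. If db-1 is deleted, its replacement is again called db-1 and picks up data-db-1, which is the behavior described above.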

Services: a stable address for moving pods

Since pods come and go with changing addresses, Kubernetes gives you a Service, which is a stable name and address that always points at the right set of pods, no matter how often they are replaced. Other applications talk to the Service, and Kubernetes routes the traffic to a healthy pod behind it.

There are a few kinds of Service you will hear about. A ClusterIP Service, the default type, is reachable only from inside the cluster. A NodePort Service opens the same port on every node so you can reach it from outside. A LoadBalancer Service asks your cloud provider for an external load balancer with a public address. There is also a special headless Service, which gives each pod its own stable DNS name rather than load-balancing across them; StatefulSets use a headless Service so that you can address an individual database replica directly. You do not need to memorize these now. Just remember that a Service is the stable front door to pods that are otherwise constantly changing.
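
Here is a hedged sketch of two of these kinds side by side, assuming the target pods carry a hypothetical app: db label. Ports 8123 and 9000 are ClickHouse's default HTTP and native protocol ports; the Service names are made up.

```yaml
# Hypothetical example: a regular ClusterIP Service and a headless one.
apiVersion: v1
kind: Service
metadata:
  name: db                     # clients inside the cluster connect to "db"
spec:                          # no "type" given, so this is a ClusterIP Service
  selector:
    app: db                    # traffic goes to pods with this label
  ports:
    - name: http
      port: 8123
    - name: native
      port: 9000
---
apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None              # "None" is what makes a Service headless
  selector:
    app: db
  ports:
    - name: native
      port: 9000
```

If a StatefulSet named db used db-headless as its serviceName, its first pod would be reachable inside the cluster at db-0.db-headless.<namespace>.svc.cluster.local, while the plain db name would spread connections across all matching pods.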

Storage: keeping data when pods disappear

By default, anything a container writes to its own filesystem vanishes when the pod is replaced. For a database that would be a disaster, so Kubernetes separates storage from pods using three connected ideas.

A PersistentVolume, or PV, is an actual piece of storage in the cluster, for example a cloud disk. A PersistentVolumeClaim, or PVC, is a request for storage of a certain size and type, made by a workload. Kubernetes binds the claim to a matching volume and attaches it to the pod. In a StatefulSet, each claim belongs to a pod's stable identity rather than to any one running instance of that pod, so the data survives when the pod is replaced; the new pod reattaches the same claim and the same data.

Tying these together is the StorageClass, which describes a kind of storage your cluster can create on demand. With a StorageClass in place, you do not have to pre-create disks by hand. When a claim asks for storage, the cluster provisions a matching volume automatically, a process called dynamic provisioning. When you later deploy ClickHouse, you will hand it a StorageClass and a size, and the storage will appear. We devote a whole later article in this series to ClickHouse storage, because getting it right is what separates a toy from a real deployment.
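
A claim that names a StorageClass and a size can be sketched as follows. The class name standard is an assumption (it is what a default minikube install typically provides); running kubectl get storageclass shows the classes your own cluster actually offers.

```yaml
# Hypothetical example: the claim name and the class name are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data
spec:
  storageClassName: standard   # which kind of storage to provision
  accessModes:
    - ReadWriteOnce            # mounted read-write by a single node at a time
  resources:
    requests:
      storage: 10Gi            # how much space to ask for
```

Nothing in this file creates a disk directly. The StorageClass's provisioner sees the claim and creates a matching PersistentVolume for it.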

Configuration: ConfigMaps and Secrets

Applications need configuration, and some of that configuration is sensitive, such as passwords. Kubernetes offers two objects for this. A ConfigMap holds plain configuration data, for example settings files. A Secret holds sensitive data such as passwords and certificates, kept separate so it can be handled more carefully. When you configure ClickHouse users later, you will store their passwords in Secrets rather than writing them in plain text, and the operator will wire them in for you.
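
The two objects look similar on the page, as this minimal sketch shows. The names, the key names, and the values are placeholders invented for the example.

```yaml
# Hypothetical example: names and values are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-settings
data:
  log_level: "information"     # plain, non-sensitive configuration
---
apiVersion: v1
kind: Secret
metadata:
  name: app-credentials
type: Opaque
stringData:                    # stringData accepts plain text; Kubernetes stores it base64-encoded
  password: "change-me"
```

Keep in mind that base64 is an encoding, not encryption. The benefit of a Secret comes from the separate access controls and handling Kubernetes applies to it, so a real password should still stay out of version control.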

Namespaces: keeping things tidy

A namespace is a way to partition a cluster into separate areas, a bit like folders. You might keep your database in one namespace and your monitoring tools in another. It keeps names from colliding and makes access control easier. We will install the operator and ClickHouse into their own namespaces so everything stays organized.
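
A namespace is itself just another small object, and other objects opt into it through their metadata. The sketch below assumes a hypothetical namespace called clickhouse-demo.

```yaml
# Hypothetical example: the namespace name is an assumption.
apiVersion: v1
kind: Namespace
metadata:
  name: clickhouse-demo
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-settings
  namespace: clickhouse-demo   # this object lives inside the namespace above
data:
  note: "the same name can exist in another namespace without clashing"
```

With kubectl, adding -n clickhouse-demo to a command such as kubectl get limits it to that namespace.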

kubectl: how you talk to the cluster

You interact with a Kubernetes cluster using a command-line tool called kubectl. You apply a YAML file that describes what you want with kubectl apply, you list things with kubectl get, and you inspect a specific object in detail with kubectl describe. Almost every step in this series runs through kubectl, and you will be fluent in the handful of commands that matter very quickly.

Here is the shape of a tiny Kubernetes manifest so the YAML feels less foreign. Every object has an apiVersion, a kind, and some metadata such as a name, and most also have a spec describing what you want:

apiVersion: v1
kind: Pod
metadata:
  name: hello
spec:
  containers:
    - name: hello
      image: busybox
      command: ["sh", "-c", "echo Hello from Kubernetes && sleep 3600"]

You will almost never write a bare pod like this in practice, but it shows the four-part structure that most Kubernetes objects share (a few, such as ConfigMaps and Secrets, carry data in place of a spec).

The operator pattern: the reason this series exists

Everything so far is generic Kubernetes. Running a single container is easy with these pieces, but running a correct, replicated ClickHouse cluster means coordinating StatefulSets, headless Services, persistent volumes, configuration files, users, and a coordination service called ClickHouse Keeper, all kept consistent as you scale and upgrade. Doing that by hand is a lot of fiddly, error-prone YAML.

This is what the operator pattern solves. An operator is a program that runs inside your cluster and extends Kubernetes with knowledge of a specific application. You give it a single, high-level description of the cluster you want, and it creates and manages all the underlying objects for you, the way an expert operator would. The Altinity Kubernetes Operator is exactly this for ClickHouse. Instead of writing dozens of manifests, you describe a ClickHouseInstallation, and the operator does the rest. We introduce it properly a few articles from now, after you have run ClickHouse the manual way and felt the pain it removes.
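
To give a feel for how compact that description can be, here is a sketch based on the operator's published examples. The installation and cluster names are made up, and the fields should be read as an approximation to check against the operator's own documentation rather than a definitive reference.

```yaml
# Hypothetical example: names are assumptions; the operator must already be installed.
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: demo
spec:
  configuration:
    clusters:
      - name: demo-cluster
        layout:
          shardsCount: 2       # how many shards the data is split across
          replicasCount: 1     # copies per shard; more than one also needs Keeper for replication
```

From a handful of lines like these, the operator is responsible for generating the StatefulSets, Services, configuration, and storage claims behind the cluster.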

Why run ClickHouse on Kubernetes at all

It is worth stating the payoff. Running ClickHouse on Kubernetes gives you self-healing pods, declarative configuration you can keep in version control, easy scaling of shards and replicas, rolling upgrades that, with replicas in place, need not take the database offline, and portability across clouds and on-premises environments. For an analytical database that often grows and changes, that operational leverage is significant. The cost is the learning curve you are climbing right now, and this series exists to make that climb gentle.

What is next

You now know the vocabulary: cluster, node, pod, Deployment, StatefulSet, Service, PersistentVolume, PersistentVolumeClaim, StorageClass, ConfigMap, Secret, namespace, and the operator pattern. In the next article we set up a real Kubernetes cluster on your own machine with minikube and k3s, so you have a place to practice everything that follows.

