All posts
MariaDB to ClickHouse® CDC with Debezium 3.5 and Kafka 4: A Complete Beginner's Guide

MariaDB to ClickHouse® CDC with Debezium 3.5 and Kafka 4: A Complete Beginner's Guide

June 20, 202613 min readKanishga Subramani
Share:

This guide shows you how to stream every insert, update, and delete from MariaDB into the ClickHouse® database in near real time, using Debezium, Apache Kafka, and Docker. By the end you will have a working Change Data Capture (CDC) pipeline and a real open dataset flowing through it.

This article is part of our CDC series. It is self-contained, so you do not need the others, but if the words Debezium and ClickHouse are new to you, the gentle overview in What is Debezium and how to offload analytics to ClickHouse is a good five-minute primer first. If you have already read our MySQL guide, much of this will look familiar, because MariaDB and MySQL share a common heritage.

No prior experience with Debezium, Kafka, or ClickHouse is assumed.

What you will build

We are building a pipeline where MariaDB records every change in its binary log, Debezium reads that log and turns each change into an event, Apache Kafka stores the events durably, and the ClickHouse Kafka Connect Sink writes them into a ClickHouse table. Change a row in MariaDB, and the same change appears in ClickHouse a moment later. Everything runs locally in Docker.

What is Change Data Capture, in plain English

Imagine your application stores data in MariaDB, and your analytics team wants to run heavy reports. Running those reports directly on the production database would slow it down for real users. Copying the whole database every night works but is always stale and wasteful.

Change Data Capture is the better way. Instead of copying everything, it watches for changes and copies only what actually changed, as it happens. It does this cheaply because every database already keeps a log of every change for crash recovery. In MariaDB this log is called the binary log, or binlog. Debezium reads the binlog directly, so it adds almost no load to MariaDB.

Why MariaDB and ClickHouse are a great pair

MariaDB is a transactional database, excellent at many small reads and writes with strong correctness. ClickHouse is a columnar database built for analytics, able to scan and aggregate billions of rows in milliseconds. Keep MariaDB as your source of truth, stream its changes into ClickHouse, and you get a reliable transactional database for the app plus a very fast analytical database for reports, always in sync.

The tools and the exact versions

Pinning specific, compatible versions is what makes a tutorial like this run. Do not assume a different tag behaves the same way.

ComponentRoleImage and version
MariaDBSource databasemariadb:11.4 (LTS)
Apache KafkaEvent log / transportapache/kafka:4.1.0 (KRaft mode, no ZooKeeper)
DebeziumMariaDB source connectorquay.io/debezium/connect:3.5
ClickHouse Kafka Connect SinkLoads events into ClickHousev1.3.7
ClickHouseAnalytics databaseclickhouse/clickhouse-server:26.3 (LTS)

Debezium 3.5 (specifically 3.5.2.Final, released 2026-06-02) is built and tested against Kafka 4.1. Debezium 3.5 ships a dedicated MariaDB connector and supports MariaDB 11.4.x and 11.7.x, so mariadb:11.4, the current Long Term Support release, is a good choice. ClickHouse 26.3 is the current Long Term Support release. Kafka 4 uses a built-in coordination system called KRaft and has no ZooKeeper; if a guide tells you to start ZooKeeper, it is out of date.

A note for readers who know MySQL: older setups used the MySQL connector against MariaDB. Debezium now has a proper MariaDB connector that uses the MariaDB driver, so we use that.

The dataset

We will use the OpenFlights airlines dataset, which lists real airlines from around the world and is published under the Open Database License, so it is free to use. We load a small slice of it into MariaDB and then watch our edits flow to ClickHouse.

Prerequisites

You need Docker and Docker Compose, plus roughly 4 GB of free memory.

Step 1: Prepare a project folder

mkdir mariadb-to-clickhouse-cdc
cd mariadb-to-clickhouse-cdc

Step 2: Download the ClickHouse Kafka Connect Sink

The Debezium connect image includes the MariaDB source connector but not the connector that writes to ClickHouse. We add that ourselves.

mkdir -p connect-plugins
cd connect-plugins
curl -L -o clickhouse-kafka-connect.zip \
  https://github.com/ClickHouse/clickhouse-kafka-connect/releases/download/v1.3.7/clickhouse-kafka-connect-v1.3.7.zip
unzip clickhouse-kafka-connect.zip
rm clickhouse-kafka-connect.zip
cd ..

Step 3: Create a database user for Debezium

Debezium needs a MariaDB user allowed to read the binary log. Create a file named init-debezium-user.sql:

CREATE USER 'debezium'@'%' IDENTIFIED BY 'dbz';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
  ON *.* TO 'debezium'@'%';
FLUSH PRIVILEGES;

The key privileges are REPLICATION SLAVE, which lets the user read the binlog, and REPLICATION CLIENT, which lets it query binlog status. In recent MariaDB versions REPLICATION CLIENT is also known as BINLOG MONITOR; the statement above is accepted either way. SELECT is needed for the initial snapshot. MariaDB runs this file automatically on first startup.

Step 4: The Docker Compose file

Create docker-compose.yml:

services:
  # The source database, configured to write a binary log Debezium can read.
  mariadb:
    image: mariadb:11.4
    command:
      - "--server-id=184054"
      - "--log-bin=mariadb-bin"
      - "--binlog-format=ROW"
      - "--binlog-row-image=FULL"
    environment:
      MARIADB_ROOT_PASSWORD: rootpw
      MARIADB_DATABASE: flights
    ports:
      - "3306:3306"
    volumes:
      - ./init-debezium-user.sql:/docker-entrypoint-initdb.d/init-debezium-user.sql:ro
 
  # A single-node Kafka 4 broker in KRaft mode (no ZooKeeper).
  kafka:
    image: apache/kafka:4.1.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
 
  # Kafka Connect (Debezium image) with the ClickHouse sink mounted in.
  connect:
    image: quay.io/debezium/connect:3.5
    depends_on:
      - kafka
      - mariadb
    ports:
      - "8083:8083"
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: cdc-connect
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
      KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE: "false"
      CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: "false"
    volumes:
      - ./connect-plugins/clickhouse-kafka-connect-v1.3.7:/kafka/connect/clickhouse-kafka-connect
 
  # The analytics database.
  clickhouse:
    image: clickhouse/clickhouse-server:26.3
    ports:
      - "8123:8123"
      - "9000:9000"
    environment:
      CLICKHOUSE_USER: default
      CLICKHOUSE_PASSWORD: clickhouse
      CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: "1"
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

The MariaDB command line turns on the binary log (--log-bin), sets a unique server-id that Debezium requires, and records the full content of every changed row (--binlog-format=ROW and --binlog-row-image=FULL), which Debezium needs to build complete events.

Step 5: Start the stack

docker compose up -d
docker compose ps

Confirm Kafka Connect can see both connector plugins:

curl -s http://localhost:8083/connector-plugins | grep -o '"class":"[^"]*"'

You should see io.debezium.connector.mariadb.MariaDbConnector and com.clickhouse.kafka.connect.ClickHouseSinkConnector. If the ClickHouse one is missing, the volume mount path does not match the version you unzipped.

Step 6: Create a table and load the dataset

Open a MariaDB shell:

docker compose exec mariadb mariadb -uroot -prootpw flights

Create an airlines table and seed it with real rows from the OpenFlights dataset:

CREATE TABLE airlines (
  airline_id INT PRIMARY KEY,
  name       VARCHAR(255) NOT NULL,
  iata       VARCHAR(3),
  icao       VARCHAR(4),
  country    VARCHAR(100),
  active     CHAR(1)
);
 
INSERT INTO airlines (airline_id, name, iata, icao, country, active) VALUES
  (10,  '40-Mile Air',       'Q5', 'MLA', 'United States', 'Y'),
  (21,  'Aigle Azur',        'ZI', 'AAF', 'France',        'Y'),
  (22,  'Aloha Airlines',    'AQ', 'AAH', 'United States', 'Y'),
  (24,  'American Airlines', 'AA', 'AAL', 'United States', 'Y'),
  (35,  'Allegiant Air',     'G4', 'AAY', 'United States', 'Y'),
  (112, 'Astraeus',          '5W', 'AEU', 'United Kingdom', 'Y'),
  (120, 'Alliance Airlines', 'QQ', 'UTY', 'Australia',     'Y'),
  (137, 'Air France',        'AF', 'AFR', 'France',        'Y');

Every captured table should have a primary key so Debezium can identify rows for updates and deletes; ours uses airline_id. Type exit to leave the shell.

Step 7: Create the target table in ClickHouse

ClickHouse is append-only at heart. It does not update or delete individual rows the way MariaDB does. To reflect updates and deletes, we use a table engine called ReplacingMergeTree, which keeps multiple versions of a row and returns only the newest version per key when asked. We add a version column and a deleted flag.

docker compose exec clickhouse clickhouse-client --password clickhouse
CREATE DATABASE IF NOT EXISTS flights;
 
CREATE TABLE flights.airlines
(
    airline_id Int32,
    name       String,
    iata       String,
    icao       String,
    country    String,
    active     String,
    _version   UInt64,
    _deleted   UInt8
)
ENGINE = ReplacingMergeTree(_version, _deleted)
ORDER BY airline_id;

ORDER BY airline_id declares the unique key and must match the MariaDB primary key. ReplacingMergeTree(_version, _deleted) keeps the row with the highest _version per airline_id and treats _deleted = 1 rows as removed.

Step 8: Register the Debezium MariaDB source connector

Create a file named mariadb-source.json:

{
  "name": "mariadb-source",
  "config": {
    "connector.class": "io.debezium.connector.mariadb.MariaDbConnector",
    "tasks.max": "1",
    "database.hostname": "mariadb",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "flights",
    "database.include.list": "flights",
    "table.include.list": "flights.airlines",
 
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-history.flights",
 
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.delete.tombstone.handling.mode": "rewrite",
    "transforms.unwrap.add.fields": "op,source.ts_ms"
  }
}

Send it to Kafka Connect:

curl -X POST -H "Content-Type: application/json" \
  --data @mariadb-source.json \
  http://localhost:8083/connectors

The settings mirror the MySQL connector, because the two databases work the same way at the binlog level. database.server.id must be unique and distinct from MariaDB's own. The two schema.history.internal.kafka lines name the schema history topic, which Debezium manages for you; it lets the connector interpret older binlog entries even after a table changes. topic.prefix of flights means events land in a Kafka topic named flights.flights.airlines (prefix, then database, then table).

The transforms block is the part that makes ClickHouse happy. ExtractNewRecordState flattens Debezium's nested event into a plain row. delete.tombstone.handling.mode set to rewrite turns a delete into a normal-looking row with an added __deleted field set to true. add.fields of op,source.ts_ms attaches the operation type and the event timestamp, which we use as the version column. Like MySQL, MariaDB orders changes by binlog file and position rather than a single number, so we use the event timestamp as a simple monotonic version. Be aware that two changes to the same row within the same millisecond cannot be told apart by the timestamp; for very hot tables you would derive the version from the binlog coordinates instead.

These option names are correct for Debezium 3.5. Check the documentation if you use a different version.

Confirm it is running:

curl -s http://localhost:8083/connectors/mariadb-source/status

The state should read RUNNING. Debezium snapshots the existing rows, then streams new changes.

Step 9: Register the ClickHouse sink connector

Create a file named clickhouse-sink.json:

{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "flights.flights.airlines",
    "hostname": "clickhouse",
    "port": "8123",
    "database": "flights",
    "username": "default",
    "password": "clickhouse",
    "ssl": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
 
    "transforms": "renameFields",
    "transforms.renameFields.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.renameFields.renames": "__source_ts_ms:_version,__deleted:_deleted"
  }
}

Send it:

curl -X POST -H "Content-Type: application/json" \
  --data @clickhouse-sink.json \
  http://localhost:8083/connectors

The sink reads from the same topic Debezium writes to and inserts into flights.airlines in ClickHouse. The ClickHouse sink matches incoming fields to columns by name, so the names must line up. The ReplaceField transformation renames Debezium's metadata fields (__source_ts_ms and __deleted) to our column names (_version and _deleted).

Step 10: See it work

Query the snapshotted rows. The FINAL keyword collapses all versions of each row down to the newest one; always use it with a ReplacingMergeTree:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT airline_id, name, country FROM flights.airlines FINAL ORDER BY airline_id"

Now make some changes in MariaDB:

docker compose exec mariadb mariadb -uroot -prootpw flights
UPDATE airlines SET country = 'USA' WHERE airline_id = 24;
INSERT INTO airlines (airline_id, name, iata, icao, country, active)
  VALUES (139, 'Air Caledonie International', 'SB', 'ACI', 'France', 'Y');
DELETE FROM airlines WHERE airline_id = 112;

Wait a couple of seconds, then query ClickHouse again, hiding deleted rows:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT airline_id, name, country FROM flights.airlines FINAL WHERE _deleted = 0 ORDER BY airline_id"

You should see American Airlines with its updated country, the new Air Caledonie row, and Astraeus gone. Now run a small analytical query, the kind ClickHouse is built for:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT country, count() FROM flights.airlines FINAL WHERE _deleted = 0 GROUP BY country ORDER BY count() DESC"

You have a working real-time CDC pipeline from MariaDB to ClickHouse.

How updates and deletes really work

When you updated American Airlines, MariaDB wrote the change to its binlog, Debezium emitted a flattened row with the new country and a newer timestamp, and the sink inserted it as a new row in ClickHouse. For a moment ClickHouse held two rows with airline_id = 24. Because the table is a ReplacingMergeTree keyed on airline_id with _version as the version, a FINAL query returns only the row with the highest version. The delete works the same way: Debezium emitted a row marked deleted with a newer timestamp, that row wins, and your query filters it out. The mental model is always: never change a row in place, append a newer version, and let FINAL give you the current state.

Production considerations

This tutorial runs a single node of everything. In production you would run one connector task per table, keep all events for a row in the same Kafka partition (use a single partition while learning, or hash on the primary key), replace plaintext with TLS and managed secrets, use the dedicated debezium user rather than root, run at least three Kafka brokers with replication factor three, and enable the ClickHouse sink's exactly-once mode when correctness is critical. For very hot tables, use a binlog-coordinate version rather than the event timestamp.

Troubleshooting

If the ClickHouse connector is missing from the plugin list, the volume mount points at the wrong folder. If the source connector fails immediately, check for a missing or duplicate database.server.id, or a debezium user lacking REPLICATION SLAVE. If rows never reach ClickHouse, confirm the topic has messages with docker compose exec kafka /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic flights.flights.airlines --from-beginning --max-messages 1. If updates appear duplicated, you forgot FINAL, or your ClickHouse ORDER BY key does not match the MariaDB primary key.

Cleaning up

docker compose down -v

References

What is next

You have now built this pipeline for MariaDB, and the binlog approach is nearly identical to MySQL. The Kafka and ClickHouse half is the same for every source. To see how a very different database fits the same architecture, read our MongoDB to ClickHouse guide, where the source stores nested documents instead of rows. You can also revisit the PostgreSQL and Oracle guides.

If you would like help designing a production-grade CDC pipeline into ClickHouse, the engineers at Quantrail Data do exactly this. Reach out through our services page and we will be glad to help.

Work with Quantrail

Expert ClickHouse services

We design, migrate, tune, and run ClickHouse for teams that own their data, from first architecture through day-two operations. Tell us what you are building and we will help.

Talk to an expert

Manage ClickHouse with CHOps

CHOps is our free, open-source ClickHouse admin tool: monitoring, query profiling, backups, visual access control, and alerting in one self-hosted interface, with zero agents on your servers.

Explore CHOps
Share: