All posts
MySQL to ClickHouse® CDC with Debezium 3.5 and Kafka 4: A Complete Beginner's Guide

MySQL to ClickHouse® CDC with Debezium 3.5 and Kafka 4: A Complete Beginner's Guide

June 16, 202616 min readMohamed Hussain S
Share:

This is Part 2 of our three-part series on streaming database changes into the ClickHouse® database with Debezium and Kafka. In Part 1 we connected PostgreSQL to ClickHouse. Here we do the same for MySQL. In Part 3 we tackle Oracle.

If you have read Part 1, large sections of this will feel familiar, and that is the point. The Kafka and ClickHouse half of the pipeline is identical. The only real difference is how Debezium reads changes out of MySQL. We will explain that difference carefully and keep this guide self-contained, so you can follow it without having read Part 1.

No prior experience with Debezium, Kafka, or ClickHouse is assumed.

What you will build

We are building a Change Data Capture (CDC) pipeline that copies every insert, update, and delete from MySQL into ClickHouse in near real time.

The flow is: MySQL records every change in its binary log. Debezium reads that binary log and turns each change into an event. Apache Kafka stores those events durably. The ClickHouse Kafka Connect Sink reads them from Kafka and writes them into a ClickHouse table. Change a row in MySQL, and the same change appears in ClickHouse a moment later.

Everything runs locally in Docker, so a laptop is enough.

What is Change Data Capture, in plain English

Suppose your application stores orders in MySQL. Customers place, edit, and cancel orders all day. Your analytics team wants to run heavy reports, but running them on the production MySQL server would slow the application down for real users.

Copying the whole database to an analytics system every night works, but the data is always stale and the full copy is wasteful. Change Data Capture is the smarter approach: instead of copying everything, it watches for changes and copies only what actually changed, as it happens. The analytics copy stays a few seconds behind production instead of a full day.

The reason CDC is so light on the database is that it does not run queries against your tables to find changes. Every database already keeps a log of every change for crash recovery. In MySQL this log is called the binary log, or binlog. Debezium reads the binlog directly, so it adds almost no load to MySQL.

Why MySQL and ClickHouse are a great pair

MySQL is built for transactions: many small reads and writes, one row at a time, with strong correctness. It powers a huge share of the web.

ClickHouse is built for analytics: scanning billions of rows and aggregating them in milliseconds. It is not meant to be your application's primary database, but it is superb for dashboards and reporting.

Keep MySQL as your source of truth, stream its changes into ClickHouse, and you get a reliable transactional database for the app plus a very fast analytical database for reports, always in sync.

The tools and the exact versions

Pinning specific, compatible versions is what makes a tutorial like this actually run. Mismatched versions are the most common reason these pipelines break, so do not assume a different tag behaves the same way.

ComponentRoleImage and version
MySQLSource databasemysql:8.4 (LTS)
Apache KafkaEvent log / transportapache/kafka:4.1.0 (KRaft mode, no ZooKeeper)
DebeziumMySQL source connectorquay.io/debezium/connect:3.5
ClickHouse Kafka Connect SinkLoads events into ClickHousev1.3.7
ClickHouseAnalytics databaseclickhouse/clickhouse-server:26.3 (LTS)

Debezium 3.5 (specifically 3.5.2.Final, released 2026-06-02) is built and tested against Kafka 4.1, which is why we pair them. Debezium 3.5 supports MySQL 8.0.x, 8.4.x, and 9.x, so mysql:8.4, the current Long Term Support release of MySQL, is a solid choice. ClickHouse 26.3 is the current Long Term Support release.

Kafka 4 uses a built-in system called KRaft and has no ZooKeeper. If a tutorial tells you to start a ZooKeeper container, it predates Kafka 4 and is out of date.

Prerequisites

You need Docker and Docker Compose, plus roughly 4 GB of free memory. That is all.

How MySQL CDC differs from PostgreSQL

If you followed Part 1, here is the one concept that changes.

PostgreSQL uses logical replication, and Debezium tracks a single ever-increasing number called the log sequence number. MySQL instead uses the binary log, which is a series of files with positions inside them. There is no single number that orders every change, so we handle versioning a little differently in ClickHouse (we will use the event timestamp, explained later).

MySQL CDC also requires one extra piece: a schema history topic. Because MySQL records data changes in the binlog but describes table structure separately, Debezium keeps its own record of your tables' structure in a dedicated Kafka topic, so it can correctly interpret older binlog entries even after you change a table. You do not have to manage this topic; you just have to name it in the configuration. PostgreSQL did not need this.

Everything else, including the entire ClickHouse side, is the same.

Step 1: Prepare a project folder

mkdir mysql-to-clickhouse-cdc
cd mysql-to-clickhouse-cdc

Step 2: Download the ClickHouse Kafka Connect Sink

The Debezium connect image includes the MySQL source connector but not the connector that writes to ClickHouse. We add that ourselves.

mkdir -p connect-plugins
cd connect-plugins
curl -L -o clickhouse-kafka-connect.zip \
  https://github.com/ClickHouse/clickhouse-kafka-connect/releases/download/v1.3.7/clickhouse-kafka-connect-v1.3.7.zip
unzip clickhouse-kafka-connect.zip
rm clickhouse-kafka-connect.zip
cd ..

You now have a connect-plugins/clickhouse-kafka-connect-v1.3.7 folder of JAR files, which we will mount into the Debezium container.

Step 3: Create a database user for Debezium

Debezium needs a MySQL user with permission to read the binary log. Create a file named init-debezium-user.sql:

CREATE USER 'debezium'@'%' IDENTIFIED BY 'dbz';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
  ON *.* TO 'debezium'@'%';
FLUSH PRIVILEGES;

The two important privileges are REPLICATION SLAVE, which lets the user read the binlog, and REPLICATION CLIENT, which lets it ask MySQL about binlog status. SELECT is needed for the initial snapshot of existing data. We will have MySQL run this file automatically on first startup.

Step 4: The Docker Compose file

Create docker-compose.yml:

services:
  # The source database, configured to write a binary log Debezium can read.
  mysql:
    image: mysql:8.4
    command:
      - "--server-id=184054"
      - "--log-bin=mysql-bin"
      - "--binlog-format=ROW"
      - "--binlog-row-image=FULL"
      - "--gtid-mode=ON"
      - "--enforce-gtid-consistency=ON"
    environment:
      MYSQL_ROOT_PASSWORD: rootpw
      MYSQL_DATABASE: shop
    ports:
      - "3306:3306"
    volumes:
      # This SQL file runs automatically the first time MySQL starts.
      - ./init-debezium-user.sql:/docker-entrypoint-initdb.d/init-debezium-user.sql:ro
 
  # A single-node Kafka 4 broker in KRaft mode (no ZooKeeper).
  kafka:
    image: apache/kafka:4.1.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
 
  # Kafka Connect (Debezium image) with the ClickHouse sink mounted in.
  connect:
    image: quay.io/debezium/connect:3.5
    depends_on:
      - kafka
      - mysql
    ports:
      - "8083:8083"
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: cdc-connect
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
      KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE: "false"
      CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: "false"
    volumes:
      - ./connect-plugins/clickhouse-kafka-connect-v1.3.7:/kafka/connect/clickhouse-kafka-connect
 
  # The analytics database.
  clickhouse:
    image: clickhouse/clickhouse-server:26.3
    ports:
      - "8123:8123"
      - "9000:9000"
    environment:
      CLICKHOUSE_USER: default
      CLICKHOUSE_PASSWORD: clickhouse
      CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: "1"
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

A note on the MySQL settings. server-id is a unique number that identifies this server in a replication setup, and Debezium requires it. log-bin turns on the binary log. binlog-format=ROW and binlog-row-image=FULL make MySQL record the full content of every changed row, which Debezium needs to produce complete events. gtid-mode=ON gives MySQL globally unique transaction identifiers, which makes Debezium more robust when it restarts. In MySQL 8.x the binary log is on by default, but setting these explicitly makes the tutorial clear and reproducible.

Step 5: Start the stack

docker compose up -d
docker compose ps

Give the images a minute to download on the first run. Confirm Kafka Connect can see both connector plugins:

curl -s http://localhost:8083/connector-plugins | grep -o '"class":"[^"]*"'

You should see io.debezium.connector.mysql.MySqlConnector and com.clickhouse.kafka.connect.ClickHouseSinkConnector. If the ClickHouse one is missing, the volume mount path does not match the version you unzipped.

Step 6: Create a table in MySQL and add data

Open a MySQL shell:

docker compose exec mysql mysql -uroot -prootpw shop

Create a customers table and insert a few rows:

CREATE TABLE customers (
  id         INT PRIMARY KEY,
  name       VARCHAR(255) NOT NULL,
  email      VARCHAR(255) NOT NULL,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
 
INSERT INTO customers (id, name, email) VALUES
  (1, 'Ada Lovelace', 'ada@example.com'),
  (2, 'Alan Turing',  'alan@example.com'),
  (3, 'Grace Hopper', 'grace@example.com');

Every table you capture should have a primary key, so Debezium can identify rows for updates and deletes. Type exit to leave the MySQL shell.

Step 7: Create the target table in ClickHouse

ClickHouse is append-only at heart. It does not update or delete individual rows the way MySQL does. To reflect updates and deletes, we use a table engine called ReplacingMergeTree, which keeps multiple versions of a row and, when asked, returns only the newest version per key.

We give it two helper columns: a version number so it knows which copy is newest, and a deleted flag so we can represent a deleted row.

Open a ClickHouse client:

docker compose exec clickhouse clickhouse-client --password clickhouse

Create the database and table:

CREATE DATABASE IF NOT EXISTS shop;
 
CREATE TABLE shop.customers
(
    id         Int32,
    name       String,
    email      String,
    updated_at String,
    -- Filled by the transformation we configure in Step 8.
    _version   UInt64,
    _deleted   UInt8
)
ENGINE = ReplacingMergeTree(_version, _deleted)
ORDER BY id;

ORDER BY id declares that id uniquely identifies a customer, and it must match the primary key in MySQL. ReplacingMergeTree(_version, _deleted) tells ClickHouse to keep the row with the highest _version for each id and to treat rows with _deleted = 1 as removed.

We store updated_at as a String here to keep the tutorial simple, because Debezium encodes timestamps in a specific numeric form. In a real deployment you would convert it to a proper DateTime64 using a materialized column or a transformation.

Step 8: Register the Debezium MySQL source connector

Create a file named mysql-source.json:

{
  "name": "mysql-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "shop",
    "database.include.list": "shop",
    "table.include.list": "shop.customers",
 
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-history.shop",
 
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.delete.tombstone.handling.mode": "rewrite",
    "transforms.unwrap.add.fields": "op,source.ts_ms"
  }
}

Send it to Kafka Connect:

curl -X POST -H "Content-Type: application/json" \
  --data @mysql-source.json \
  http://localhost:8083/connectors

Let us unpack the settings that differ from PostgreSQL. database.server.id must match a unique server id and be distinct from MySQL's own. The two schema.history.internal.kafka lines name the schema history topic we discussed earlier; Debezium manages it for you. topic.prefix of shop means events for our table land in a Kafka topic named shop.shop.customers (prefix, then database, then table).

The transforms block is the same idea as in Part 1. ExtractNewRecordState flattens Debezium's nested event into a plain row. delete.tombstone.handling.mode set to rewrite turns a delete into a normal-looking row with an added __deleted field set to true. add.fields of op,source.ts_ms attaches the operation type and the event timestamp in milliseconds. We will use that timestamp as our version column.

Why the timestamp and not a log number? Unlike PostgreSQL's single log sequence number, MySQL orders changes by binary log file plus position, which is two values rather than one. The event timestamp is a simple single number that increases over time, which is good enough for most cases. Be aware of one caveat: if two updates to the same row happen within the same millisecond, the timestamp cannot tell them apart, and the winner is not deterministic. For high-write tables where this matters, you would derive a more precise version from the binlog coordinates. We keep it simple here.

These transformation option names are correct for Debezium 3.5. If you use a different version, check that version's ExtractNewRecordState documentation rather than assuming.

Confirm the connector is running:

curl -s http://localhost:8083/connectors/mysql-source/status

The state should read RUNNING. Debezium snapshots the three existing rows, then streams new changes.

Step 9: Register the ClickHouse sink connector

Create a file named clickhouse-sink.json:

{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "shop.shop.customers",
    "hostname": "clickhouse",
    "port": "8123",
    "database": "shop",
    "username": "default",
    "password": "clickhouse",
    "ssl": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
 
    "transforms": "renameFields",
    "transforms.renameFields.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.renameFields.renames": "__source_ts_ms:_version,__deleted:_deleted"
  }
}

Send it:

curl -X POST -H "Content-Type: application/json" \
  --data @clickhouse-sink.json \
  http://localhost:8083/connectors

The sink reads from the same topic Debezium writes to, shop.shop.customers, and inserts into shop.customers in ClickHouse. The ClickHouse sink matches incoming fields to columns by name, so the field names must line up with your column names. The ReplaceField transformation renames Debezium's metadata fields (__source_ts_ms and __deleted) to the column names we created (_version and _deleted). The true and false values Debezium produces for the deleted flag become 1 and 0 in the UInt8 column.

Step 10: See it work

Query the three snapshotted rows:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT id, name, email FROM shop.customers FINAL ORDER BY id"

The FINAL keyword tells ClickHouse to collapse all versions of each row down to the newest one. Without it you might briefly see more than one version of a row, because ClickHouse merges versions on its own schedule in the background. With a ReplacingMergeTree, always use FINAL (or filter to the latest version yourself) for correct results.

Now make some changes in MySQL:

docker compose exec mysql mysql -uroot -prootpw shop
UPDATE customers SET email = 'ada@newmail.com' WHERE id = 1;
INSERT INTO customers (id, name, email) VALUES (4, 'Edsger Dijkstra', 'edsger@example.com');
DELETE FROM customers WHERE id = 2;

Wait a couple of seconds, then query ClickHouse again:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT id, name, email FROM shop.customers FINAL WHERE _deleted = 0 ORDER BY id"

You should see Ada with her new email, the new row for Edsger, and Alan gone. The WHERE _deleted = 0 hides the deleted row, which ReplacingMergeTree keeps internally (marked as deleted) so it can override the older live version.

You now have a working real-time CDC pipeline from MySQL to ClickHouse.

How updates and deletes really work

This part trips up most beginners, so it is worth slowing down.

When you updated Ada's email, MySQL wrote the change to its binlog, Debezium emitted a flattened row with the new email and a newer timestamp, and the sink inserted it as a new row in ClickHouse. For a moment ClickHouse held two rows with id = 1. Because the table is a ReplacingMergeTree keyed on id with _version as the version, a FINAL query returns only the row with the highest version, which is the new email. Later, ClickHouse merges the parts in the background and discards the old row.

The delete works the same way: Debezium emitted a row for Alan with _deleted set to true and a newer timestamp, that row wins over Alan's older live row, and your query filters out deleted rows, so Alan disappears.

The mental model: you never update or delete in place. You always append a newer version, and the table engine plus FINAL give you the correct current picture.

Production considerations

This tutorial runs a single node of everything, which is great for learning but not for production. Here is what changes.

Run one Debezium connector task per table. The MySQL connector uses a single task, and the documented pattern when you need to scale is one connector instance per table, so you can spread tables across the cluster.

Mind your Kafka partitions and ordering. If a topic has more than one partition, all events for the same MySQL row must land in the same partition, or an update could be processed before the insert it depends on. With a single partition this is automatic; with several you need hash-based routing on the primary key. Start with one partition while you learn.

Use a more precise version for busy tables. As noted, the event timestamp cannot distinguish two changes in the same millisecond. For high-write tables, derive the version from the binlog coordinates so the ordering is exact.

Replace plaintext everything. Use TLS for ClickHouse ("ssl": "true" and port 8443), real secrets management, and the dedicated debezium user we created rather than root.

Use at least three Kafka brokers with replication factor three. Our single broker has no redundancy.

Turn on exactly-once delivery when correctness is critical. The ClickHouse sink supports exactly-once semantics backed by a ClickHouse feature called KeeperMap. It is off by default.

Troubleshooting

If the ClickHouse connector does not appear in the plugin list, the volume mount points at the wrong folder. Confirm the folder name matches the version you unzipped.

If the source connector fails immediately, the most common causes are a missing or wrong database.server.id, or the debezium user lacking REPLICATION SLAVE. Check the connector status endpoint for the exact error.

If rows never arrive in ClickHouse, confirm the Kafka topic has messages: docker compose exec kafka /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic shop.shop.customers --from-beginning --max-messages 1.

If updates appear as duplicates, you forgot FINAL, or your ClickHouse ORDER BY key does not exactly match the MySQL primary key. The order key and the source key must be the same set of columns.

Cleaning up

docker compose down -v

The -v flag removes the data volumes for a clean slate.

References

What is next

You have now built the same pipeline for two databases and seen that only the source half changes. MySQL reads from the binary log and needs a schema history topic, but Kafka and ClickHouse behave identically to the PostgreSQL setup.

In Part 3 we tackle Oracle, which is the most involved of the three. Oracle uses a mechanism called LogMiner, and it requires enabling archive logging and supplemental logging before Debezium can read changes. The ClickHouse side stays the same, which is exactly why we built on a clean, well-understood foundation.

If you would like help designing a production-grade CDC pipeline into ClickHouse, including partitioning strategy, schema evolution, and monitoring, the engineers at Quantrail Data do this work every day. Reach out through our services page and we will be glad to help.

Work with Quantrail

Expert ClickHouse services

We design, migrate, tune, and run ClickHouse for teams that own their data, from first architecture through day-two operations. Tell us what you are building and we will help.

Talk to an expert

Manage ClickHouse with CHOps

CHOps is our free, open-source ClickHouse admin tool: monitoring, query profiling, backups, visual access control, and alerting in one self-hosted interface, with zero agents on your servers.

Explore CHOps
Share: