All posts
SQL Server to ClickHouse® CDC with Debezium 3.5 and Kafka 4: A Complete Beginner's Guide

SQL Server to ClickHouse® CDC with Debezium 3.5 and Kafka 4: A Complete Beginner's Guide

June 22, 202613 min readSanjeev Kumar G
Share:

This guide shows you how to stream every insert, update, and delete from Microsoft SQL Server into the ClickHouse® database in near real time, using Debezium, Apache Kafka, and Docker. SQL Server has its own built-in Change Data Capture feature that you must switch on first, and we walk through every step of that.

This article is self-contained. If Debezium and ClickHouse are brand new to you, the short overview in What is Debezium and how to offload analytics to ClickHouse is a good primer first.

No prior experience with Debezium, Kafka, or ClickHouse is assumed.

What you will build

A pipeline where SQL Server records changes in its transaction log and, with CDC enabled, into change tables, Debezium reads those changes and turns each into an event, Apache Kafka stores the events durably, and the ClickHouse Kafka Connect Sink writes them into a ClickHouse table. Change a row in SQL Server, and the same change appears in ClickHouse a moment later. Everything runs locally in Docker.

What is Change Data Capture, in plain English

Your application stores data in SQL Server, and your analytics team wants to run heavy reports. Running them on production would slow it down for real users, and copying the whole database nightly is stale and wasteful. Change Data Capture copies only what changed, as it happens, by reading the database's own change records. It adds almost no load.

How SQL Server CDC is different

The relational databases earlier in this series expose their change log to Debezium more or less directly. SQL Server is different in one important way: it has its own feature literally called Change Data Capture, and Debezium builds on top of it rather than reading the transaction log itself.

This means there is a setup step you cannot skip. You must enable CDC at the database level, and then enable it on each table you want to capture. When you do, SQL Server starts a background job, run by the SQL Server Agent, that reads the transaction log and copies changes into special change tables. Debezium then reads those change tables. Two consequences follow: the SQL Server Agent must be running, and you must explicitly turn CDC on. We handle both below.

Why SQL Server and ClickHouse are a great pair

SQL Server is a capable transactional database that runs many enterprise applications. Large analytical queries on it are expensive, both in performance and often in licensing. Streaming changes into ClickHouse, a columnar database built for analytics, lets you keep SQL Server as the system of record while doing heavy reporting on a fast, cost-effective analytical database.

The tools and the exact versions

Pinning specific, compatible versions is what makes this run. Do not assume a different tag behaves the same way.

ComponentRoleImage and version
SQL ServerSource databasemcr.microsoft.com/mssql/server:2022-latest
Apache KafkaEvent log / transportapache/kafka:4.1.0 (KRaft mode, no ZooKeeper)
DebeziumSQL Server source connectorquay.io/debezium/connect:3.5
ClickHouse Kafka Connect SinkLoads events into ClickHousev1.3.7
ClickHouseAnalytics databaseclickhouse/clickhouse-server:26.3 (LTS)

Debezium 3.5 (specifically 3.5.2.Final, released 2026-06-02) is built and tested against Kafka 4.1 and supports SQL Server 2017, 2019, and 2022. CDC is available in SQL Server Standard, Enterprise, and Developer editions; the container uses the free Developer edition by default. ClickHouse 26.3 is the current Long Term Support release. Kafka 4 uses KRaft and has no ZooKeeper.

The dataset

We will use the OpenFlights airports dataset, published under the Open Database License, so it is free to use. We load a small slice of it into SQL Server with an ingest script and then watch our edits flow to ClickHouse.

Prerequisites

You need Docker and Docker Compose, plus roughly 6 GB of free memory, since SQL Server is memory-hungry.

Step 1: Prepare a project folder

mkdir sqlserver-to-clickhouse-cdc
cd sqlserver-to-clickhouse-cdc
mkdir -p connect-plugins

Step 2: Download the ClickHouse Kafka Connect Sink

cd connect-plugins
curl -L -o clickhouse-kafka-connect.zip \
  https://github.com/ClickHouse/clickhouse-kafka-connect/releases/download/v1.3.7/clickhouse-kafka-connect-v1.3.7.zip
unzip clickhouse-kafka-connect.zip
rm clickhouse-kafka-connect.zip
cd ..

Step 3: The ingest script

This is the script that creates our database, enables CDC, creates the airports table, loads real rows from the OpenFlights dataset, and finally enables CDC on the table. Create a file named init.sql:

-- Create the database and enable Change Data Capture on it.
IF DB_ID('store') IS NULL CREATE DATABASE store;
GO
USE store;
GO
EXEC sys.sp_cdc_enable_db;
GO
 
-- Create the airports table.
CREATE TABLE dbo.airports (
    airport_id INT PRIMARY KEY,
    name       NVARCHAR(255) NOT NULL,
    city       NVARCHAR(255),
    country    NVARCHAR(100),
    iata       NVARCHAR(3),
    altitude   INT
);
GO
 
-- Load a slice of the OpenFlights airports dataset.
INSERT INTO dbo.airports (airport_id, name, city, country, iata, altitude) VALUES
  (1, 'Goroka Airport',             'Goroka',       'Papua New Guinea', 'GKA', 5282),
  (2, 'Madang Airport',             'Madang',       'Papua New Guinea', 'MAG', 20),
  (3, 'Mount Hagen Airport',        'Mount Hagen',  'Papua New Guinea', 'HGU', 5388),
  (4, 'Nadzab Airport',             'Nadzab',       'Papua New Guinea', 'LAE', 239),
  (5, 'Port Moresby Jacksons Intl', 'Port Moresby', 'Papua New Guinea', 'POM', 146);
GO
 
-- Enable CDC on the table. The SQL Server Agent must be running for this to work.
EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo',
  @source_name   = N'airports',
  @role_name     = NULL,
  @supports_net_changes = 0;
GO

The order matters. CDC must be enabled on the database before you can enable it on a table, and the table must exist before you enable CDC on it. The @role_name = NULL means any user with select rights can read the change data, which is fine for a local tutorial.

Step 4: The Docker Compose file

Create docker-compose.yml. Note the MSSQL_AGENT_ENABLED setting, which turns on the SQL Server Agent that CDC depends on:

services:
  # SQL Server with the Agent enabled, which CDC requires.
  sqlserver:
    image: mcr.microsoft.com/mssql/server:2022-latest
    environment:
      ACCEPT_EULA: "Y"
      MSSQL_PID: "Developer"
      MSSQL_SA_PASSWORD: "Str0ng!Passw0rd"
      MSSQL_AGENT_ENABLED: "true"
    ports:
      - "1433:1433"
 
  # A single-node Kafka 4 broker in KRaft mode (no ZooKeeper).
  kafka:
    image: apache/kafka:4.1.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
 
  # Kafka Connect (Debezium image) with the ClickHouse sink mounted in.
  connect:
    image: quay.io/debezium/connect:3.5
    depends_on:
      - kafka
      - sqlserver
    ports:
      - "8083:8083"
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: cdc-connect
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
      KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE: "false"
      CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: "false"
    volumes:
      - ./connect-plugins/clickhouse-kafka-connect-v1.3.7:/kafka/connect/clickhouse-kafka-connect
 
  # The analytics database.
  clickhouse:
    image: clickhouse/clickhouse-server:26.3
    ports:
      - "8123:8123"
      - "9000:9000"
    environment:
      CLICKHOUSE_USER: default
      CLICKHOUSE_PASSWORD: clickhouse
      CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: "1"
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

Step 5: Start the stack and run the ingest script

docker compose up -d

SQL Server takes a minute to become ready on first start. Once it is up, run the ingest script. The 2022 image ships the command-line client at /opt/mssql-tools18/bin/sqlcmd, and the -C flag tells it to trust the container's self-signed certificate:

cat init.sql | docker compose exec -T sqlserver \
  /opt/mssql-tools18/bin/sqlcmd -U sa -P 'Str0ng!Passw0rd' -C

If you see an error that the Agent is not running, wait a few more seconds and run it again; the Agent can take a moment to start.

Step 6: Create the target table in ClickHouse

ClickHouse is append-only at heart. To reflect updates and deletes we use a ReplacingMergeTree, which keeps versions of a row and returns the newest per key when asked, plus a version column and a deleted flag.

docker compose exec clickhouse clickhouse-client --password clickhouse
CREATE DATABASE IF NOT EXISTS store;
 
CREATE TABLE store.airports
(
    airport_id Int32,
    name       String,
    city       String,
    country    String,
    iata       String,
    altitude   Int32,
    _version   UInt64,
    _deleted   UInt8
)
ENGINE = ReplacingMergeTree(_version, _deleted)
ORDER BY airport_id;

Step 7: Register the Debezium SQL Server source connector

Create a file named sqlserver-source.json:

{
  "name": "sqlserver-source",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "tasks.max": "1",
    "database.hostname": "sqlserver",
    "database.port": "1433",
    "database.user": "sa",
    "database.password": "Str0ng!Passw0rd",
    "database.names": "store",
    "database.encrypt": "false",
    "topic.prefix": "store",
    "table.include.list": "dbo.airports",
 
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-history.store",
 
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.delete.tombstone.handling.mode": "rewrite",
    "transforms.unwrap.add.fields": "op,source.ts_ms"
  }
}

Send it to Kafka Connect:

curl -X POST -H "Content-Type: application/json" \
  --data @sqlserver-source.json \
  http://localhost:8083/connectors

A few SQL Server specifics. database.names lists the databases to capture (Debezium 3.5 uses this rather than the older singular property). database.encrypt is set to false so the connector does not insist on a trusted TLS certificate, which keeps the local setup simple; in production you would use proper certificates instead. table.include.list uses the schema and table name, dbo.airports. Events land in a Kafka topic named store.dbo.airports, which is the topic prefix, the schema, and the table.

The transforms block is the same idea as the other relational guides. ExtractNewRecordState flattens Debezium's nested event into a plain row, delete.tombstone.handling.mode set to rewrite turns a delete into a row with an added __deleted field, and add.fields of op,source.ts_ms attaches the operation type and the event timestamp. We use the timestamp as our version. SQL Server orders changes by a log sequence number that is a multi-part value rather than a single number, so the event timestamp is the simplest version to use here, with the usual caveat that two changes in the same millisecond cannot be told apart.

Step 8: Register the ClickHouse sink connector

Create a file named clickhouse-sink.json:

{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "store.dbo.airports",
    "hostname": "clickhouse",
    "port": "8123",
    "database": "store",
    "username": "default",
    "password": "clickhouse",
    "ssl": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
 
    "transforms": "renameFields",
    "transforms.renameFields.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.renameFields.renames": "__source_ts_ms:_version,__deleted:_deleted"
  }
}

Send it:

curl -X POST -H "Content-Type: application/json" \
  --data @clickhouse-sink.json \
  http://localhost:8083/connectors

Step 9: Verify the CDC pipeline

This is the part most tutorials skip, so we cover it carefully. There are four checks, from source to destination.

First, confirm both connectors are running, not failed:

curl -s http://localhost:8083/connectors/sqlserver-source/status
curl -s http://localhost:8083/connectors/clickhouse-sink/status

Both should report a state of RUNNING. If the source connector failed, the message almost always points at CDC not being enabled or the Agent not running.

Second, confirm SQL Server is actually capturing changes. CDC writes to change tables; this query shows the captured tables:

docker compose exec -T sqlserver /opt/mssql-tools18/bin/sqlcmd \
  -U sa -P 'Str0ng!Passw0rd' -C -Q "USE store; EXEC sys.sp_cdc_help_change_data_capture;"

You should see a row for dbo.airports. An empty result means CDC was not enabled on the table.

Third, confirm events reached Kafka. Read one message from the topic:

docker compose exec kafka /opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server kafka:9092 --topic store.dbo.airports \
  --from-beginning --max-messages 1

You should see a JSON message describing an airport. If the topic does not exist or has no messages, the source connector is not producing events.

Fourth, confirm the data landed in ClickHouse. The FINAL keyword collapses each row to its newest version, which you should always use with a ReplacingMergeTree:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT count() FROM store.airports FINAL"

The count should match the five rows you loaded. If all four checks pass, your pipeline is working end to end.

Step 10: See changes flow

Make some changes in SQL Server:

docker compose exec -T sqlserver /opt/mssql-tools18/bin/sqlcmd -U sa -P 'Str0ng!Passw0rd' -C -Q "
USE store;
UPDATE dbo.airports SET altitude = 5300 WHERE airport_id = 1;
INSERT INTO dbo.airports (airport_id, name, city, country, iata, altitude)
  VALUES (6, 'Wewak Intl', 'Wewak', 'Papua New Guinea', 'WWK', 19);
DELETE FROM dbo.airports WHERE airport_id = 2;
"

Wait a couple of seconds, then query ClickHouse again, hiding deleted rows:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT airport_id, name, altitude FROM store.airports FINAL WHERE _deleted = 0 ORDER BY airport_id"

You should see airport 1 with its updated altitude, the new airport 6, and airport 2 gone.

How updates and deletes really work

When you updated airport 1, SQL Server recorded the change in its CDC change table, Debezium emitted a flattened row with the new altitude and a newer timestamp, and the sink inserted it as a new row in ClickHouse. For a moment ClickHouse held two rows with airport_id = 1. Because the table is a ReplacingMergeTree keyed on airport_id with _version as the version, a FINAL query returns only the newest. The delete works the same way: Debezium emitted a row marked deleted, that row wins, and your query filters it out. The mental model is always: never change a row in place, append a newer version, and let FINAL give you the current state.

Production considerations

This tutorial runs a single node of everything. In production you would create a dedicated database user rather than using sa, use proper TLS certificates instead of disabling encryption, run one connector task per table, keep all events for a row in the same Kafka partition, run at least three Kafka brokers with replication factor three, and enable the ClickHouse sink's exactly-once mode when correctness is critical. Keep an eye on the CDC change tables, which SQL Server cleans up on a schedule; tune the retention if your connector ever falls behind.

Troubleshooting

If the source connector fails with a message about no maximum LSN, the SQL Server Agent is not running. Confirm MSSQL_AGENT_ENABLED is true and that CDC is enabled on both the database and the table.

If the connector fails with a TLS or certificate error, the encryption settings do not match. The database.encrypt of false above avoids this for local use.

If rows never reach ClickHouse, work through the four verification checks above in order to find where the chain breaks.

If updates appear duplicated, you forgot FINAL, or your ClickHouse ORDER BY key does not match the SQL Server primary key.

Cleaning up

docker compose down -v

References

What is next

You have now built this pipeline for a database with its own CDC feature. The Kafka and ClickHouse half is the same as every other source. Compare with the PostgreSQL, MySQL, MariaDB, Oracle, and MongoDB guides.

If you would like help designing a production-grade CDC pipeline into ClickHouse, the engineers at Quantrail Data do exactly this. Reach out through our services page and we will be glad to help.

Work with Quantrail

Expert ClickHouse services

We design, migrate, tune, and run ClickHouse for teams that own their data, from first architecture through day-two operations. Tell us what you are building and we will help.

Talk to an expert

Manage ClickHouse with CHOps

CHOps is our free, open-source ClickHouse admin tool: monitoring, query profiling, backups, visual access control, and alerting in one self-hosted interface, with zero agents on your servers.

Explore CHOps
Share: