Introduction
ClickHouse builds many of its heavy-duty table engines and features on the MergeTree engine. The MergeTree engine supports a PRIMARY KEY expression, but it differs from primary keys in traditional relational databases. In MergeTree, the system sorts and stores data on disk based on the primary key but does not automatically remove duplicates; they remain as-is. To remove duplicates, users must manually execute the OPTIMIZE ... FINAL ... DEDUPLICATE
statement.
The ReplacingMergeTree engine offers a useful alternative in such scenarios by automatically removing duplicates based on the ORDER BY
expression defined during table creation. Additionally, users can specify a version column, ensuring that the system retains the row with the highest version value.
Table Creation
Syntax
CREATE TABLE [database_name].table_name
(
column_name1 [data_type1],
...
)
ENGINE = ReplacingMergeTree([optional_version_column])
[ORDER BY expr]
[PRIMARY KEY expr]
[SETTINGS name=value, ...]
Example: Creating a Database and a ReplacingMergeTree Table
CREATE DATABASE mergetree;
CREATE TABLE mergetree.replacing_mergetree_example
(
`ID` UInt64
)
ENGINE = ReplacingMergeTree
PRIMARY KEY ID;
Inserting Data
INSERT INTO mergetree.replacing_mergetree_example VALUES (4), (4), (12), (12);
Optimizing the Table
Run the OPTIMIZE FINAL
statement to ensure background merges are completed.
OPTIMIZE TABLE mergetree.replacing_mergetree_example FINAL;
Verifying Deduplication
Execute a SELECT
statement to check if duplicates were removed:
SELECT * FROM mergetree.replacing_mergetree_example;
Output:
┌─ID─┐
│ 4 │
│ 12 │
└────┘
2 rows in set. Elapsed: 0.003 sec.
Version Column
Let’s explore the usage of the version column in the ReplacingMergeTree engine by creating a new table with a version column.
Creating a Table with a Version Column
CREATE TABLE mergetree.replacing_mergetree_versioned
(
`ID` UInt64,
`version` Int64
)
ENGINE = ReplacingMergeTree(version)
PRIMARY KEY ID;
Inserting Initial Data
INSERT INTO mergetree.replacing_mergetree_versioned VALUES (1,2), (2,2);
Reading the Inserted Data
SELECT * FROM mergetree.replacing_mergetree_versioned;
Output:
┌─ID─┬─version─┐
│ 1 │ 2 │
│ 2 │ 2 │
└────┴─────────┘
2 rows in set. Elapsed: 0.003 sec.
Inserting More Data with a Different Version
INSERT INTO mergetree.replacing_mergetree_versioned VALUES (1,1), (2,1);
Reading Data After Insert
SELECT * FROM mergetree.replacing_mergetree_versioned;
Output:
┌─ID─┬─version─┐
│ 1 │ 2 │
│ 2 │ 2 │
└────┴─────────┘
┌─ID─┬─version─┐
│ 1 │ 1 │
│ 2 │ 1 │
└────┴─────────┘
4 rows in set. Elapsed: 0.003 sec.
Final Deduplication Using OPTIMIZE FINAL
OPTIMIZE TABLE mergetree.replacing_mergetree_versioned FINAL;
Reading the Final Result
SELECT * FROM mergetree.replacing_mergetree_versioned;
Output:
┌─ID─┬─version─┐
│ 1 │ 2 │
│ 2 │ 2 │
└────┴─────────┘
2 rows in set. Elapsed: 0.003 sec.
Conclusion
This article explored the applications of the ReplacingMergeTree table engine. Specifically, it covered:
- Creating a table using the ReplacingMergeTree engine.
- Specifying an optional version column to control the deduplication process.
In summary, the ReplacingMergeTree engine provides an efficient way to handle duplicate data in ClickHouse by automatically retaining the most relevant rows based on the defined sorting key or version column. By leveraging this engine, users can streamline data management and ensure cleaner, more accurate datasets without manual intervention.
References
https://clickhouse.com/docs/sql-reference/statements/optimize
https://clickhouse.com/docs/guides/replacing-merge-tree
Image Courtesy: Photo by Venelin Dimitrov from Pexels