ClickHouse : Introduction to ReplacingMergeTree

Introduction

ClickHouse builds many of its heavy-duty table engines and features on the MergeTree engine. The MergeTree engine supports a PRIMARY KEY expression, but it differs from primary keys in traditional relational databases. In MergeTree, the system sorts and stores data on disk based on the primary key but does not automatically remove duplicates; they remain as-is. To remove duplicates, users must manually execute the OPTIMIZE ... FINAL ... DEDUPLICATE statement.

The ReplacingMergeTree engine offers a useful alternative in such scenarios by automatically removing duplicates based on the ORDER BY expression defined during table creation. Additionally, users can specify a version column, ensuring that the system retains the row with the highest version value.

Table Creation

Syntax

CREATE TABLE [database_name].table_name  
(
    column_name1 [data_type1],  
    ...  
)  
ENGINE = ReplacingMergeTree([optional_version_column])  
[ORDER BY expr]  
[PRIMARY KEY expr]  
[SETTINGS name=value, ...]

Example: Creating a Database and a ReplacingMergeTree Table

CREATE DATABASE mergetree;

CREATE TABLE mergetree.replacing_mergetree_example  
(  
    `ID` UInt64  
)  
ENGINE = ReplacingMergeTree  
PRIMARY KEY ID;

Inserting Data

INSERT INTO mergetree.replacing_mergetree_example VALUES (4), (4), (12), (12);

Optimizing the Table

Run the OPTIMIZE FINAL statement to ensure background merges are completed.

OPTIMIZE TABLE mergetree.replacing_mergetree_example FINAL;

Verifying Deduplication

Execute a SELECT statement to check if duplicates were removed:

SELECT * FROM mergetree.replacing_mergetree_example;

Output:

┌─ID─┐  
│  4 │  
│  12 │  
└────┘  
2 rows in set. Elapsed: 0.003 sec.

Version Column

Let’s explore the usage of the version column in the ReplacingMergeTree engine by creating a new table with a version column.

Creating a Table with a Version Column

CREATE TABLE mergetree.replacing_mergetree_versioned  
(  
    `ID` UInt64,  
    `version` Int64  
)  
ENGINE = ReplacingMergeTree(version)  
PRIMARY KEY ID;

Inserting Initial Data

INSERT INTO mergetree.replacing_mergetree_versioned VALUES (1,2), (2,2);

Reading the Inserted Data

SELECT * FROM mergetree.replacing_mergetree_versioned;

Output:

┌─ID─┬─version─┐  
│  1 │      2  │  
│  2 │      2  │  
└────┴─────────┘  
2 rows in set. Elapsed: 0.003 sec.

Inserting More Data with a Different Version

INSERT INTO mergetree.replacing_mergetree_versioned VALUES (1,1), (2,1);

Reading Data After Insert

SELECT * FROM mergetree.replacing_mergetree_versioned;

Output:

┌─ID─┬─version─┐  
│  1 │      2  │  
│  2 │      2  │  
└────┴─────────┘  

┌─ID─┬─version─┐  
│  1 │      1  │  
│  2 │      1  │  
└────┴─────────┘  
4 rows in set. Elapsed: 0.003 sec.

Final Deduplication Using OPTIMIZE FINAL

OPTIMIZE TABLE mergetree.replacing_mergetree_versioned FINAL;

Reading the Final Result

SELECT * FROM mergetree.replacing_mergetree_versioned;

Output:

┌─ID─┬─version─┐  
│  1 │      2  │  
│  2 │      2  │  
└────┴─────────┘  
2 rows in set. Elapsed: 0.003 sec.

Conclusion

This article explored the applications of the ReplacingMergeTree table engine. Specifically, it covered:

  • Creating a table using the ReplacingMergeTree engine.
  • Specifying an optional version column to control the deduplication process.

In summary, the ReplacingMergeTree engine provides an efficient way to handle duplicate data in ClickHouse by automatically retaining the most relevant rows based on the defined sorting key or version column. By leveraging this engine, users can streamline data management and ensure cleaner, more accurate datasets without manual intervention.

References

https://clickhouse.com/docs/sql-reference/statements/optimize

https://clickhouse.com/docs/guides/replacing-merge-tree

Image Courtesy: Photo by Venelin Dimitrov from Pexels