ClickHouse MergeTree: Introduction to ClickHouse Storage Engine Types

Introduction

Data is a powerful collection of information that serves multiple purposes, including large-scale data analysis and external data integration in ClickHouse. The various database engines are pivotal to these applications, providing targeted solutions for a wide range of use cases and patterns. It is imperative to understand how these engines influence your data and to identify the engine that aligns best with your architecture. The distinct variety of engines in ClickHouse demands the expertise of skilled DBAs to make decisive and informed choices.

What are Storage Engine Types in ClickHouse?

For datasets with fewer than 1 million rows, a log family engine is recommended. The MergeTree family, however, offers distinct advantages. Each engine type serves specific purposes, and the following schema illustrates ClickHouse engine architecture. This article explores the key features of MergeTree technology within the MergeTree family of table engines.

The MergeTree family is the top choice for high-load environments and large-scale data analysis. It includes key features like data partitioning, data sampling, and Time to Live (TTL). Moreover, all the MergeTree engines have their replicated counterpart confirming that the MergeTree family not only excels in these areas but also supports reliable replication solutions.

What is MergeTree?

The MergeTree engine is the primary engine in the MergeTree family and supports nearly all core functions of ClickHouse. This engine is designed for efficiently inserting large volumes of data into a table and writing it to disk after sorting based on the primary key. Additionally, it governs the rules applied during the background merging process. The MergeTree table engine is particularly well-suited for single-node server environments.

 Let’s explore how it works: 

In the following DDL, we will create a table using the MergeTree engine and examine its features:

CREATE TABLE TestTable
(
id UInt16,
created_time Date,
comment String
) 
ENGINE = MergeTree()
ORDER BY  (id, created_time)
PARTITION BY toYYYYMM(created_time)
TTL created_time + INTERVAL 3 MONTH;

Defining the mandatory ENGINE and ORDER BY parameters is sufficient to create the table, but additional/optional PARTITION BY and TTL parameters must align with our business needs.

In the MergeTree engine, primary keys do not prevent duplication; thus, multiple rows with the same primary key can be inserted. 

In the provided DDL, the primary key is (id, created_time), as these values are the sort key in the ORDER BY clause. The data is sorted by the primary key and partitioned by month based on the PARTITION BY value. Additionally, partition names will start with the format 202208_*

It’s not always advisable to store data indefinitely. As per this TTL setting, only the data from the past three months will be retained in storage. 

Now, let’s insert some data records into the testTable, ensuring that they share the same primary key.

INSERT INTO TestTable VALUES (1,'2022-08-11', 'hello');

INSERT INTO TestTable VALUES (1,'2022-08-11', 'world');

INSERT INTO TestTable VALUES (2,'2022-08-15', 'this');

INSERT INTO TestTable VALUES (2,'2022-08-15', 'is');

INSERT INTO TestTable VALUES (2,'2022-08-17', 'ChistaDATA');

The MergeTree structure is comparable to the LSM-tree, allowing ClickHouse to achieve high write throughput . When examining the default data path, it becomes evident that each data partition operates independently, with no technical connections between them until background compression is applied. This design helps minimize the amount of data that needs to be retrieved during read operations.

root@MT01:/var/lib/clickhouse/data/default/testTable# ls -lrt

total 32

-rw-r----- 1 clickhouse clickhouse    1 Aug 29 16:26 format_version.txt

drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 detached

drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_1_1_0

drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_2_2_0

drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_3_3_0

drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_4_4_0

drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_5_5_0

Eventually, the MergeTree background processes were enacted by the merge pool, leading to merging of data that belongs to the same partitions.

root@MT01:/var/lib/clickhouse/data/default/TestTable# ls -lrt

total 12

-rw-r----- 1 clickhouse clickhouse    1 Aug 29 16:26 format_version.txt

drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 detached

drwxr-x--- 2 clickhouse clickhouse 4096 Aug 29 16:26 202208_1_5_1

What about SELECT?

We’ve focused on INSERT performance, but how effective are your SELECT queries? Are you using primary keys properly?

SHOW CREATE TABLE TestTable;

To find the DDL (Data Definition Language) statement for our table, that has the information including primary keys, simply run the command mentioned above. This method is particularly useful for understanding indexes before executing complex queries.

To get the best performance, choose your PRIMARY KEY/ ORDER BY key to contains columns that are frequently used to filter the data (WHERE clause) and define them while creating the table in the order of increasing cardnality.

Conclusion

The MergeTree engine family is the cornerstone of ClickHouse’s performance and scalability, designed for efficient handling of large-scale data. Its ability to support partitioning, indexing, and automatic data sorting makes it ideal for analytical workloads. With variations like ReplacingMergeTree, SummingMergeTree, and VersionedCollapsingMergeTree, the MergeTree family offers flexibility to address specific use cases such as deduplication, aggregation, and version control. Understanding the nuances of each MergeTree variant is essential for designing a robust and optimized data architecture in ClickHouse.

References

https://chistadata.com/clickhouse-mergetree-engine-types/#:~:text=The%20MergeTree%20family%20is%20the,%2C%20data%20sampling%2C%20and%20TTL.

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family

Image Courtesy: Photo by Pixabay from Pexels