ClickHouse Projections : Turbocharge your queries

Introduction

Optimizing database performance is essential for maintaining fast and reliable data access in modern businesses. As data volumes grow, ensuring efficiency becomes increasingly challenging, making performance tuning a key responsibility for DBAs.

While different databases require specific optimization techniques, one effective strategy for addressing performance bottlenecks is conducting regular assessments. These evaluations help fine-tune database operations and improve overall efficiency.

In this article, we’ll explore projections—a powerful performance optimization feature in ClickHouse—and how they can enhance query speed and resource utilization.

What Are Projections in ClickHouse?

Projections in ClickHouse are a performance optimization feature that improve query efficiency by precomputing and storing data in an optimized format. They function similarly to materialized views but are tightly integrated with ClickHouse’s storage engine, making them more efficient and automatically maintained.

How Do Projections Work?

  • Optimized Storage Format – Projections store data in an alternative layout that speeds up queries without requiring manual duplication.
  • Automatic Query Optimization – ClickHouse automatically selects the most efficient projection without modifying the user’s query.
  • Faster Query Execution – By pre-aggregating or sorting data differently, projections reduce computation and I/O, leading to better performance.

Use Cases for Projections

  • Running queries on columns that are not part of the primary key
  • Pre-aggregating data to reduce computation and I/O
  • Optimizing queries by automatically selecting the best projection based on the least amount of data to scan

Limitations & Disk Usage Considerations

  • Increased Disk Usage – Projections create an internal hidden table, increasing disk space usage and I/O. If a projection defines a different primary key, all data from the original table is duplicated, leading to higher storage requirements.
  • Supported Only on MergeTree Tables – Projection manipulation is available only for tables using the MergeTree engine, including its replicated variants.

Projections are particularly useful for large-scale analytical workloads where queries involve frequent aggregations, sorting, and filtering. By using projections effectively, organizations can optimize their ClickHouse databases for better performance and resource efficiency.

Let us look at few examples. Consider the following table, which contains approximately 12 million data points:

CREATE TABLE projection_example
(
    `id` UInt64,
    `name` String,
    `value` String
)
ENGINE = MergeTree
ORDER BY id;

In this setup, the primary key ensures that queries filtering by the id column are optimized, as the data is sorted accordingly. However, filtering by the name or value columns requires a full table scan since they are not indexed.

As seen in the EXPLAIN output, every granule and data part is fully read when filtering on these columns, leading to less efficient query performance.

Let’s examine the query execution:

EXPLAIN json = 1, indexes = 1 
SELECT count(*) FROM projection_example WHERE name = 'tom';

Results:

"Initial Parts": 6,
"Selected Parts": 6,
"Initial Granules": 12209,
"Selected Granules": 12209

Using Projections with a Single Column

Projections in ClickHouse allow the creation of additional sorted datasets for table columns, improving query performance. A projection is essentially a stored and structured subset of the original table’s data, optimized for specific query patterns.

To optimize queries filtering by the name column, we can create a projection on it. This projection will automatically process and store newly inserted data. However, to apply projections to existing data, the MATERIALIZE action is required.

ALTER TABLE projection_example ADD PROJECTION name (SELECT name ORDER BY name);
ALTER TABLE projection_example MATERIALIZE PROJECTION name;

As a result, similar to the primary key column, the data in the name column is now sorted. This significantly improves query performance, as demonstrated by the following execution plan:

┌─explain───────────────────────────────────────────────────────────────────────────┐
│ [
  {
    "Plan": {
      "Node Type": "Expression",
      "Description": "(Projection + Before ORDER BY)",
      "Plans": [
        {
          "Node Type": "Aggregating",
          "Plans": [
            {
              "Node Type": "Expression",
              "Description": "Before GROUP BY",
              "Plans": [
                {
                  "Node Type": "ReadFromStorage",
                  "Description": "MergeTree (with Normal projection name)"
                }
              ]
            }
          ]
        }
      ]
    }
  }
] │
└───────────────────────────────────────────────────────────────────────────┘

Now, let’s compare query performance with and without projection:

Without Projection

SELECT count(*) FROM projection_example WHERE name = 'tom';
┌─count()─┐
│ 8745212 │
└─────────┘
Elapsed: 0.225 sec.  
Processed 81.77 million rows, 11.97 MB

ClickHouse must scan the entire dataset to retrieve the result.

With Projection

SELECT count(*) FROM projection_example WHERE name = 'tom';
┌─count()─┐
│ 8745212 │
└─────────┘
Elapsed: 0.087 sec.  
Processed 4.31 million rows, 4.31 MB

With a projection, query execution becomes significantly more efficient. As the dataset grows, the performance gains become even more pronounced, reducing query latency and improving overall responsiveness.

Using Projections with Multiple Columns

We can further enhance performance by creating projections on multiple columns:

ALTER TABLE projection_example ADD PROJECTION name_value (SELECT name, value ORDER BY name, value);
ALTER TABLE projection_example MATERIALIZE PROJECTION name_value;

Now, both the name and value columns can be queried together or separately. Even when querying only the value column, the projection remains beneficial by optimizing data retrieval and reducing the need for a full table scan.

Dropping a Projection

Before implementing projections, it is important to analyze query patterns carefully to ensure they provide a performance advantage. In cases where a projection negatively affects system performance—particularly in tables with frequent INSERT operations—it may be necessary to remove it.

ALTER TABLE projection_example DROP PROJECTION name;
ALTER TABLE projection_example DROP PROJECTION name_value;

Conclusion

Projections are a powerful optimization feature in ClickHouse, enhancing query performance when the primary key alone is insufficient. However, they should be used selectively, as they can slow down data ingestion in high INSERT workloads. By carefully analyzing query patterns, businesses can balance performance gains with system efficiency.

Need ClickHouse Support?
Quantrail offers expert performance tuning, query optimization, and troubleshooting for ClickHouse. Contact us for tailored solutions:

References

https://clickhouse.com/docs/sql-reference/statements/alter/projection

https://chistadata.com/clickhouse-projections-for-query-optimization/#:~:text=Projections%20can%20be%20used%20to,be%20used%20for%20specific%20queries.

Image Courtesy: Photo by Pixabay from Pexels