
Optimizing ClickHouse Queries: From Data Skipping Indexes to Projections

Reshma M

Optimizing ClickHouse queries requires a clear understanding of how ClickHouse stores data and how it decides what data to read. In this blog, we’ll walk through a real optimization journey: starting from ClickHouse’s storage model, identifying slow queries, and finally fixing them with data skipping indexes and projections, backed by real query logs and experiments.

1. Understanding How ClickHouse Stores Data

Before talking about indexes or projections, we must understand how data is physically stored, from insert to disk.

When data is inserted into ClickHouse:

  1. Data is written into parts.
  2. Each part is stored in columnar format.
  3. Each column is split into granules.
  4. Each granule contains 8192 rows (by default).

A granule is the smallest unit ClickHouse reads from disk.

ClickHouse never reads individual rows. It always reads entire granules.

2. What Is a Granule?

A granule:

  • Contains 8192 rows (configurable via the index_granularity table setting)
  • Exists per column
  • Has metadata such as min/max values (for some indexes)
  • Is either read fully or skipped fully

Example:

  • Granule 1 → rows 1–8192
  • Granule 2 → rows 8193–16384
  • Granule 3 → rows 16385–24576

If even one row in a granule might match the query, ClickHouse must read the entire granule.

So optimization = skip as many granules as possible.
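You can see granules in practice by inspecting system.parts: ClickHouse stores one "mark" per granule, so the marks column roughly corresponds to the number of granules in each part. A quick sketch, using the example table introduced below (swap in whatever database and table you actually created):

```sql
-- Approximate granule count per active part:
-- marks ~ rows / index_granularity (one mark per granule).
SELECT
    name,
    rows,
    marks,
    round(rows / marks) AS approx_rows_per_granule
FROM system.parts
WHERE database = 'uk'
  AND table = 'uk_price_paid_fact'
  AND active;
```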

3. How Data Skipping Works Internally

When a query runs, ClickHouse follows this decision flow:

  1. Check metadata (indexes, projections, primary key)
  2. Decide which granules can be skipped
  3. Read only the required granules
  4. Apply filters on actual rows

Skipping happens before disk reads, which is why it’s so powerful.

4. The Example Dataset: UK Price Paid Fact Table

We’ll use a realistic analytical fact table:

CREATE TABLE uk.uk_price_paid_fact
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8(
        'terraced' = 1,
        'semi-detached' = 2,
        'detached' = 3,
        'flat' = 4,
        'other' = 0
    ),
    is_new UInt8,
    duration Enum8(
        'freehold' = 1,
        'leasehold' = 2,
        'unknown' = 0
    ),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2);

Why ORDER BY Is Critical

  • Data is physically sorted by (postcode1, postcode2, addr1, addr2).
  • ClickHouse can efficiently skip data only when queries filter on these columns.
  • Filters on other columns, such as town, price, date, and county, are not accelerated by default.

This design choice directly impacts query performance.
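One way to see this difference is EXPLAIN indexes = 1, which reports how many parts and granules the primary key lets a query skip. A sketch (the postcode value is illustrative, and the exact plan text varies across ClickHouse versions):

```sql
-- Filter on the leading ORDER BY column: the primary index
-- can prune most granules.
EXPLAIN indexes = 1
SELECT avg(price)
FROM uk.uk_price_paid_fact
WHERE postcode1 = 'AL1';

-- Filter on a column outside ORDER BY: expect the PrimaryKey
-- section to show (almost) all granules selected.
EXPLAIN indexes = 1
SELECT avg(price)
FROM uk.uk_price_paid_fact
WHERE town = 'ST ALBANS';
```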

5. Running Analytical Queries

Let’s execute realistic analytical queries such as:

SELECT avg(price)
FROM uk.uk_price_paid_fact
WHERE town = 'ST ALBANS';

SELECT county, avg(price)
FROM uk.uk_price_paid_fact
GROUP BY county;

SELECT count()
FROM uk.uk_price_paid_fact
WHERE price > 500000;

These queries are common in dashboards and reports.

However, they all:

  • Filter or group by columns not in ORDER BY
  • Force ClickHouse to scan almost the entire table

6. Identifying Heavy Queries from Logs

SELECT
    query,
    read_rows,
    query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'          -- skip QueryStart rows, which report 0 read_rows
  AND query LIKE '%uk_price_paid_fact%'
ORDER BY read_rows DESC
LIMIT 10;

Example Screenshot Showing Heavy Queries

Observation

| Query Pattern            | Rows Read |
|--------------------------|-----------|
| WHERE town = 'ST ALBANS' | ~30M      |
| GROUP BY county          | ~30M      |
| WHERE price > 500000     | ~30M      |
| GROUP BY year            | ~30M      |

ClickHouse was not skipping granules at all: each query read the entire table of roughly 30 million rows.

7. Why ClickHouse Could Not Skip Data

The reason was clear:

  1. Filters used non-primary key columns
  2. No data skipping indexes existed
  3. GROUP BY queries re-scanned raw data every time

This tells us what optimization to apply and where.

8. When to Use a Data Skipping Index

Use a data skipping index when:

  • Column appears frequently in WHERE
  • Column is not in ORDER BY
  • Queries scan millions of rows

Example: Creating an Index

ALTER TABLE uk.uk_price_paid_fact
ADD INDEX idx_town town TYPE set(1000) GRANULARITY 4;
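Note that ADD INDEX only builds the index for parts written afterwards (and for parts produced by merges). To index the data that already exists, the index must be materialized once:

```sql
-- One-off: build idx_town for parts that existed
-- before the index was added.
ALTER TABLE uk.uk_price_paid_fact MATERIALIZE INDEX idx_town;
```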

How a SET Index Works Internally

For each granule, ClickHouse stores the set of unique values of that column.

Example granule (in reality, 8192 rows):

| Row | town      |
|-----|-----------|
| 1   | ST ALBANS |
| 2   | ST ALBANS |
| 3   | LUTON     |
| 4   | LUTON     |

Index summary for this granule:

{ "ST ALBANS", "LUTON" }

When we run a query like:

WHERE town = 'BRISTOL'

ClickHouse checks the granule’s index set:

{ "ST ALBANS", "LUTON" }

Since 'BRISTOL' is not in the set:

  • Granule skipped completely
  • Zero disk read

That’s data skipping in action.
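Once the index exists, EXPLAIN indexes = 1 is a quick way to confirm that granules really are being dropped (the plan layout varies across ClickHouse versions):

```sql
EXPLAIN indexes = 1
SELECT avg(price)
FROM uk.uk_price_paid_fact
WHERE town = 'BRISTOL';
-- Look for a Skip entry naming idx_town with a
-- "Granules: selected/total" line showing far fewer selected.
```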

9. When to Use a Projection

Use a projection when:

  • Same aggregation runs repeatedly
  • Queries use GROUP BY
  • Aggregation cost dominates execution time

Example Query

SELECT county, avg(price)
FROM uk.uk_price_paid_fact
GROUP BY county;

Creating a Projection

ALTER TABLE uk.uk_price_paid_fact
ADD PROJECTION proj_county_avg
(
    SELECT
        county,
        avg(price)
    GROUP BY county
);
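As with skip indexes, the projection is only built for newly inserted parts; existing data needs a one-off materialization:

```sql
-- One-off: build proj_county_avg for parts that existed
-- before the projection was added.
ALTER TABLE uk.uk_price_paid_fact MATERIALIZE PROJECTION proj_county_avg;
```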

What Happens Internally?

  • ClickHouse pre-aggregates data
  • Stores it as a separate internal structure
  • Skips base table granules entirely
  • Reads query results directly from the projection
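To check that the projection is actually used, inspect the query plan, or the projections column of system.query_log after running the query. A sketch (plan wording varies by version):

```sql
EXPLAIN
SELECT county, avg(price)
FROM uk.uk_price_paid_fact
GROUP BY county;
-- Expect the ReadFromMergeTree step to mention proj_county_avg.

-- Or, after running the query:
SELECT query, projections, read_rows
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%GROUP BY county%'
ORDER BY event_time DESC
LIMIT 5;
```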

10. Index vs Projection: Decision Rule

| Query Pattern                | Optimization method |
|------------------------------|---------------------|
| WHERE on non-ORDER BY column | Data skipping index |
| GROUP BY aggregation         | Projection          |
| WHERE + GROUP BY             | Both                |

  • Indexes skip irrelevant granules
  • Projections skip raw data scans altogether

11. Final Optimization Workflow

  1. Understand table storage & ORDER BY
  2. Run analytical queries
  3. Enable query logging
  4. Identify heavy queries (read_rows)
  5. Add data skipping indexes for filters
  6. Add projections for repeated aggregations
  7. Validate using logs and EXPLAIN
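For step 7, rerunning a system.query_log check after each change makes the improvement measurable: read_rows for the same query pattern should drop sharply once granules are being skipped. For example:

```sql
SELECT
    event_time,
    read_rows,
    query_duration_ms,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%uk_price_paid_fact%'
ORDER BY event_time DESC
LIMIT 10;
```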

Key Takeaway

Optimizing ClickHouse queries is about skipping granules, not rows.

Once you understand:

  • Granule storage
  • Index summaries
  • Projection-based skipping

ClickHouse optimization becomes structured and predictable.

What’s Next?

This blog focused on understanding why and when to optimize using data skipping indexes and projections.

In upcoming blogs, I’ll dive deeper into adding and tuning data skipping indexes and projections, along with before-and-after benchmarks that clearly show their performance impact.

If you want a deeper, hands-on understanding, stay tuned for the next posts in this series.
