When working with ClickHouse®, writing a query is usually straightforward. Writing an efficient query, however, requires understanding how ClickHouse® reads and filters data.
Many users assume that adding a simple WHERE clause automatically results in fast query performance. While filtering is certainly important, not all filters are equally effective.
The difference between a query that scans millions of rows and one that scans only a fraction of them often comes down to how the filtering conditions align with the table design.
In this article, we’ll explore how filtering works in ClickHouse®, common mistakes to avoid, and practical best practices for improving query performance.
Why Filtering Matters
ClickHouse® is designed to process massive analytical datasets efficiently.
Even though the engine is incredibly fast, scanning unnecessary data still consumes:
- CPU resources
- Memory
- Disk I/O
- Network bandwidth
Effective filtering helps ClickHouse® eliminate irrelevant data as early as possible, reducing the amount of work required to execute a query.
For large datasets, proper filtering can significantly reduce query execution times.
Understanding How ClickHouse® Filters Data
Unlike traditional databases that often rely heavily on row-level indexes, ClickHouse® primarily uses:
- Partitions
- Sorting keys
- Primary indexes
- Data skipping indexes
These mechanisms help the engine determine which portions of data can safely be ignored during query execution.
The goal is not to find individual rows immediately but to avoid reading large sections of irrelevant data.
Start with the Sorting Key
One of the most important factors affecting filtering performance is the table’s sorting key.
Consider a table created as follows:
CREATE TABLE events
(
timestamp DateTime,
user_id UInt64,
event_type String
)
ENGINE = MergeTree
ORDER BY (user_id, timestamp);
Now consider this query:
SELECT *
FROM events
WHERE user_id = 12345;
Because the data is physically ordered by user_id, ClickHouse® can quickly identify the relevant data ranges and skip large portions of the table.
This is far more efficient than filtering on a column that does not participate in the sorting key.
Filter Early Whenever Possible
Filtering conditions should be applied as early as possible.
Good:
SELECT *
FROM events
WHERE timestamp >= today() - 7;
Less efficient:
SELECT *
FROM
(
SELECT *
FROM events
)
WHERE timestamp >= today() - 7;
Applying filters early allows ClickHouse® to reduce the amount of data processed throughout the query pipeline.
Use Time-Based Filters
Most analytical workloads involve time-series data.
Queries that restrict the time range often perform significantly better than queries that scan the entire dataset.
Example:
SELECT count()
FROM events
WHERE timestamp >= now() - INTERVAL 1 DAY;
Instead of reading years of historical data, ClickHouse® only examines the relevant portion of the dataset.
This becomes even more effective when partitions are based on time.
Avoid Functions on Filtered Columns
A common performance mistake involves applying functions directly to columns used in filtering conditions.
Example:
SELECT *
FROM events
WHERE toDate(timestamp) = today();
Although this query works, it may prevent ClickHouse® from efficiently using its indexing structures.
A better approach is:
SELECT *
FROM events
WHERE timestamp >= today()
AND timestamp < today() + 1;
This preserves the original column values and enables more effective data skipping.
Be Careful with OR Conditions
Queries containing multiple OR conditions can sometimes reduce filtering efficiency.
Example:
SELECT *
FROM events
WHERE country = 'US'
OR country = 'UK'
OR country = 'DE';
In many situations, using IN is clearer and may produce more efficient execution plans.
Example:
SELECT *
FROM events
WHERE country IN ('US', 'UK', 'DE');
This approach also improves query readability.
Use PREWHERE for Large Tables
One of the unique features of ClickHouse® is the PREWHERE clause.
Example:
SELECT *
FROM events
PREWHERE user_id = 12345;
PREWHERE allows ClickHouse® to load filtering columns first and postpone reading other columns until after irrelevant rows have been eliminated.
For wide tables containing many columns, this can reduce disk reads and improve performance.
In many cases, ClickHouse® automatically moves suitable conditions into PREWHERE, but understanding the feature is still valuable.
Leverage Data Skipping Indexes
For columns that are frequently filtered but not part of the sorting key, data skipping indexes may help.
Example:
INDEX idx_country country TYPE set(100) GRANULARITY 1
Data skipping indexes do not behave like traditional B-tree indexes.
Instead, they store lightweight metadata that helps ClickHouse® determine whether specific data blocks can be skipped entirely.
They can be particularly useful for:
- Country filters
- Status fields
- Categories
- Event types
When used appropriately, they can significantly reduce the amount of scanned data.
Avoid Filtering on Highly Random Columns
Columns with highly random values often provide limited opportunities for data skipping.
For example:
WHERE uuid = 'f47ac10b...'
If the column is not part of the sorting key and lacks an appropriate index, ClickHouse® may need to examine a large amount of data.
Whenever possible, design sorting keys around commonly filtered columns.
Check Query Performance
ClickHouse® provides several tools for understanding query behavior.
Useful system tables include:
system.query_log
system.parts
system.tables
Example:
References
Monitoring these metrics helps identify queries that are scanning more data than necessary.
Common Filtering Mistakes
Filtering Without a Time Range
Avoid:
SELECT *
FROM logs;
Large analytical tables should almost always include meaningful filtering conditions.
Ignoring Query Patterns
Designing tables without considering how users will filter data often leads to poor performance.
Overusing Data Skipping Indexes
Indexes are helpful, but they should complement good table design rather than replace it.
Applying Functions to Indexed Columns
This can limit ClickHouse®’s abilityReferences to skip irrelevant data efficiently.
Best Practices Summary
When filtering data in ClickHouse®:
- Align filters with sorting keys whenever possible.
- Filter early in the query pipeline.
- Use time-based filters.
- Avoid unnecessary functions on filtered columns.
- Prefer IN over long OR chains.
- Understand when PREWHERE can help.
- Consider data skipping indexes for frequently filtered columns.
- Monitor query performance using system tables.
- Design tables around actual query patterns.
Final Thoughts
Efficient filtering is one of the most important aspects of query optimization in ClickHouse®.
While the database is capable of scanning enormous datasets quickly, the best-performing queries are those that help ClickHouse® eliminate unnecessary data before it is read.
By understanding how partitions, sorting keys, PREWHERE, and data skipping indexes work together, you can dramatically reduce scanned rows, improve response times, and build more efficient analytical workloads.
The goal is not simply to write queries that work, but to write queries that allow ClickHouse® to do as little work as possible.
Exploring ClickHouse® for Your Analytics?
At Quantrail Data, we help teams run ClickHouse® reliably for real-time analytics – from Kubernetes deployments and migrations to performance tuning in production.
We see these challenges firsthand while supporting demanding analytics workloads. In one recent engagement, a customer achieved near bare-metal performance with ClickHouse® in production – a story we’ve shared here:
Success Story: Quantrail Bare-Metal ClickHouse® Deployment
If you’re evaluating ClickHouse® or trying to get more out of an existing setup, we’re happy to share practical lessons from real-world deployments.
Contact
Quantrail Data
References
ClickHouse® Documentation – WHERE Clause
ClickHouse® Documentation – PREWHERE Optimization
ClickHouse® Order By
