A commonly repeated rule in ClickHouse modeling is to place low-cardinality columns first in the ORDER BY clause. While this advice is often helpful, it is incomplete and, in some cases, misleading.
In practice, there are situations where placing a high-cardinality column first is not only acceptable but optimal. Understanding when and why requires looking beyond cardinality and into how ClickHouse actually stores and prunes data.
This article explains how the table-level ORDER BY works internally in ClickHouse, when high-cardinality columns should come first, which data types are preferred, and how to make correct decisions based on real query patterns. It focuses specifically on the ORDER BY defined at table creation time in MergeTree engines, not the ORDER BY used in SELECT queries.
What ORDER BY Really Means in ClickHouse
In ClickHouse, ORDER BY does not primarily control result ordering. Instead, it defines how data is physically sorted on disk and how the primary index is built.
Key points:
- `ORDER BY` defines the sort key for MergeTree tables
- This sort key forms the primary index
- The primary index operates on granules, not individual rows
- Each granule typically contains ~8,192 rows
When a query runs, ClickHouse uses the primary index to skip entire granules that cannot satisfy query conditions. The effectiveness of this pruning is what determines performance.
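As a concrete illustration of these points, here is a minimal sketch of a MergeTree table (the table and column names are illustrative, not from the original article). The ORDER BY clause defines the on-disk sort key and primary index, and `index_granularity` controls how many rows each granule holds:

```sql
-- Hypothetical events table: ORDER BY here defines the physical sort key
-- and primary index, not the order of query results.
CREATE TABLE events
(
    user_id    UInt64,
    event_time DateTime,
    event_type LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
SETTINGS index_granularity = 8192;  -- one index mark per ~8,192 rows (the default)
```

With this layout, a filter such as `WHERE user_id = 123` lets ClickHouse skip every granule whose min/max range of `user_id` cannot contain 123.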
In ClickHouse, `ORDER BY` is about data layout and pruning, not presentation.
Cardinality vs Randomness: The Crucial Distinction
A common misconception is that high cardinality itself is the problem. In reality, the more important factor is randomness.
- Cardinality affects compression efficiency
- Randomness affects index usefulness and pruning
Examples:
- `user_id` (high cardinality, often queried by equality) → frequently a good candidate
- `UUID` (high cardinality and random) → often problematic
- Hash values → usually poor choices
- Time-ordered identifiers → often work well despite high cardinality
High cardinality is not inherently bad. High randomness usually is.
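The distinction can be made concrete with two sketches of the same table that differ only in the leading key column (both table definitions are illustrative):

```sql
-- Random leading key: neighbouring rows share nothing, so the min/max
-- ranges of granules overlap heavily and the primary index prunes little.
CREATE TABLE events_random
(
    id         UUID,      -- UUIDv4: effectively random
    event_time DateTime
)
ENGINE = MergeTree
ORDER BY id;

-- Ordered leading key: a monotonically increasing id keeps each granule's
-- min/max range narrow and disjoint, so equality and range filters prune well.
CREATE TABLE events_ordered
(
    id         UInt64,    -- e.g. an insertion-ordered sequence
    event_time DateTime
)
ENGINE = MergeTree
ORDER BY id;
```

Both keys have high cardinality; only the random one defeats the index.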
When High-Cardinality Columns Should Come First
Placing a high-cardinality column first in ORDER BY is correct when it improves granule pruning.
This typically applies when:
- Queries frequently filter by equality on that column
- The column is highly selective
- Values are not fully random
- The ingestion order aligns reasonably with the column
Common valid patterns
- `(user_id, event_time)`
- `(tenant_id, timestamp)`
- `(order_id, created_at)`
In these cases:
- Equality filters sharply narrow the scan range
- Granules become tightly grouped by the filtered value
- Subsequent range filters (like time) become more efficient
This pattern is common in multi-tenant systems and user-centric analytics.
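A hedged multi-tenant sketch of this pattern (schema and names are assumptions for illustration): the equality filter on the selective leading column narrows the scan to one tenant's granules, and the time range then operates within that slice.

```sql
-- Multi-tenant events: selective equality column first, time second.
CREATE TABLE tenant_events
(
    tenant_id UInt32,
    ts        DateTime,
    payload   String
)
ENGINE = MergeTree
ORDER BY (tenant_id, ts);

-- Reads only the granules for tenant 42 that overlap the time window:
SELECT count()
FROM tenant_events
WHERE tenant_id = 42
  AND ts >= now() - INTERVAL 1 DAY;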
When High-Cardinality Columns Should Not Come First
High-cardinality columns should not lead the ORDER BY when:
- Values are randomly distributed (UUIDv4, hashes)
- The column is rarely used in `WHERE` clauses
- Inserts arrive in a highly unordered fashion
- Compression efficiency is critical and selectivity is low
Consequences include:
- Poor compression
- Scattered granules
- Ineffective index pruning
- Increased disk and IO usage
In such cases, leading with time or another ordered column is usually better.
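A sketch of that alternative (names are illustrative): keep the random identifier as a payload column and lead the sort key with time, which matches both ingestion order and the dominant access pattern.

```sql
-- Random request ids are stored but do not lead the sort key.
CREATE TABLE request_log
(
    request_id UUID,      -- UUIDv4: kept as a payload column, not the sort key
    ts         DateTime,
    status     UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (ts, status);
```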
Preferred Data Types for ORDER BY Columns
The choice of data type significantly impacts performance.
Recommended
- `Date`, `DateTime`, `DateTime64`
- `UInt*`, `Int*`
- Fixed-width numeric types
Avoid when possible
- `String` (especially high-cardinality)
- `Nullable` columns
- Random UUIDs
Why?
- Fixed-width types compress better
- Comparisons are faster
- CPU cache behavior improves
ORDER BY columns should be cheap to compare and cheap to compress.
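A small sketch of these type preferences in practice (the schema is an assumption for illustration): a numeric surrogate replaces a string key, and the timestamp stays fixed-width even at sub-second precision.

```sql
-- Fixed-width key columns: cheap to compare, compress well.
-- Note: Nullable columns in a sort key require the allow_nullable_key
-- setting and add per-value overhead, which is why they are discouraged.
CREATE TABLE metrics
(
    metric_id UInt32,         -- numeric id instead of a String metric name
    ts        DateTime64(3),  -- millisecond precision, still fixed-width
    value     Float64
)
ENGINE = MergeTree
ORDER BY (metric_id, ts);
```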
Choosing ORDER BY Based on Query Patterns
The best ORDER BY is driven by how the table is queried, not generic rules.
A simple decision framework:
- Which columns appear most often in `WHERE`?
- Are filters mostly equality or range-based?
- Which filters are most selective?
- Is time a primary access dimension?
- Is the column ordered or random?
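One way to answer these questions with real workload data rather than intuition is to skim recent queries against the table in `system.query_log` (query logging must be enabled; the table name in the filter is a placeholder):

```sql
-- Inspect recent finished queries touching the table of interest.
SELECT query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%FROM events%'  -- 'events' is an illustrative table name
ORDER BY event_time DESC
LIMIT 50;
```

Reviewing which columns appear in `WHERE`, and whether filters are equalities or ranges, grounds the ORDER BY decision in actual usage.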
Example patterns
- Event data → `(event_time, dimension)`
- User activity → `(user_id, event_time)`
- Multi-tenant analytics → `(tenant_id, timestamp)`
- Logs → `(timestamp, severity)`
There is no universal answer – only workload-specific ones.
Common ORDER BY Mistakes
- Copying schemas from OLTP databases
- Leading with UUIDs by default
- Optimizing only for compression
- Ignoring ingestion order
- Treating cardinality as the sole metric
Most performance issues stem from misunderstanding how the primary index works.
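Rather than guessing whether the primary index is working, pruning can be verified directly. `EXPLAIN` with `indexes = 1` reports how many parts and granules the primary key eliminates for a given query (the table and filter below are illustrative):

```sql
-- Shows selected vs. total parts/granules for the primary key.
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE user_id = 123;
```

If the selected granule count is close to the total, the ORDER BY is not helping that query.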
Key Takeaways
- `ORDER BY` defines physical layout, not output order
- Cardinality alone is not the deciding factor
- Randomness is often the real problem
- High-cardinality columns can be correct when they align with query patterns
- Data types and selectivity matter
- Always design `ORDER BY` around how data is queried
Understanding these principles allows you to move beyond rules of thumb and design schemas that perform predictably at scale.
Exploring ClickHouse® for Your Analytics?
At Quantrail Data, we help teams run ClickHouse® reliably for real-time analytics, from Kubernetes deployments and migrations to performance tuning in production.
We see these challenges firsthand while supporting demanding analytics workloads. In one recent engagement, a customer achieved near bare-metal performance with ClickHouse® in production, a story we’ve shared here:
Success Story: Quantrail Bare-Metal ClickHouse® Deployment
If you’re evaluating ClickHouse® or trying to get more out of an existing setup, we’re happy to share practical lessons from real-world deployments.
Contact
Quantrail Data
References
ClickHouse Documentation – MergeTree and Primary Index
ClickHouse Documentation – Data Skipping Indexes
