
When Should High-Cardinality Columns Come First in ClickHouse ORDER BY?


A commonly repeated rule in ClickHouse modeling is to place low-cardinality columns first in the ORDER BY clause. While this advice is often helpful, it is incomplete and, in some cases, misleading.

In practice, there are situations where placing a high-cardinality column first is not only acceptable but optimal. Understanding when and why requires looking beyond cardinality and into how ClickHouse actually stores and prunes data.

This article explains how the table-level ORDER BY works internally in ClickHouse, when high-cardinality columns should come first, which data types are preferred, and how to make correct decisions based on real query patterns. It focuses specifically on the ORDER BY defined at table creation time in MergeTree engines, not the ORDER BY used in SELECT queries.

In ClickHouse, ORDER BY does not primarily control result ordering. Instead, it defines how data is physically sorted on disk and how the primary index is built.

Key points:

  • ORDER BY defines the sort key for MergeTree tables
  • This sort key forms the primary index
  • The primary index operates on granules, not individual rows
  • Each granule typically contains ~8,192 rows

When a query runs, ClickHouse uses the primary index to skip entire granules that cannot satisfy query conditions. The effectiveness of this pruning is what determines performance.

In ClickHouse, ORDER BY is about data layout and pruning, not presentation.
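As a minimal sketch (the events table and its columns are illustrative, not from the original text), the table-level ORDER BY and a quick pruning check might look like this:

```sql
-- ORDER BY here defines the on-disk sort order and the primary index,
-- not the order of SELECT results.
CREATE TABLE events
(
    tenant_id  UInt32,
    event_time DateTime,
    payload    String
)
ENGINE = MergeTree
ORDER BY (tenant_id, event_time)
SETTINGS index_granularity = 8192;  -- default granule size (~8,192 rows)

-- EXPLAIN with indexes = 1 reports how many granules the primary index
-- selects versus the table total, which makes pruning visible:
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE tenant_id = 42
  AND event_time >= now() - INTERVAL 1 DAY;
```

Running the EXPLAIN before and after changing the sort key is a practical way to verify that a candidate ORDER BY actually skips granules for your real queries.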

A common misconception is that high cardinality itself is the problem. In reality, the more important factor is randomness.

  • Cardinality affects compression efficiency
  • Randomness affects index usefulness and pruning

Examples:

  • user_id (high cardinality, often queried by equality) → frequently a good candidate
  • UUID (high cardinality and random) → often problematic
  • Hash values → usually poor choices
  • Time-ordered identifiers → often work well despite high cardinality

High cardinality is not inherently bad. High randomness usually is.

Placing a high-cardinality column first in ORDER BY is correct when it improves granule pruning.

This typically applies when:

  • Queries frequently filter by equality on that column
  • The column is highly selective
  • Values are not fully random
  • The ingestion order aligns reasonably with the column

Common valid patterns

  • (user_id, event_time)
  • (tenant_id, timestamp)
  • (order_id, created_at)

In these cases:

  • Equality filters sharply narrow the scan range
  • Granules become tightly grouped by the filtered value
  • Subsequent range filters (like time) become more efficient

This pattern is common in multi-tenant systems and user-centric analytics.
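A sketch of the first pattern (all names are hypothetical): the high-cardinality user_id leads the sort key precisely because queries filter on it by equality.

```sql
CREATE TABLE user_events
(
    user_id    UInt64,
    event_time DateTime,
    action     LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- The equality filter narrows the scan to the few granules that hold
-- this user's rows; the time filter then prunes further within them.
SELECT count()
FROM user_events
WHERE user_id = 123456
  AND event_time >= '2024-01-01 00:00:00';
```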

High-cardinality columns should not lead the ORDER BY when:

  • Values are randomly distributed (UUIDv4, hashes)
  • The column is rarely used in WHERE clauses
  • Inserts arrive in highly unordered fashion
  • Compression efficiency is critical and selectivity is low

Consequences include:

  • Poor compression
  • Scattered granules
  • Ineffective index pruning
  • Increased disk and IO usage

In such cases, leading with time or another ordered column is usually better.
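To make the contrast concrete, here is a hedged sketch (table names are invented for illustration) of the anti-pattern and a typically better alternative:

```sql
-- Anti-pattern: a random UUIDv4 as the leading sort-key column scatters
-- related rows across granules and compresses poorly.
CREATE TABLE requests_bad
(
    request_id UUID,      -- random, rarely filtered on
    ts         DateTime,
    status     UInt16
)
ENGINE = MergeTree
ORDER BY (request_id, ts);

-- Usually better: lead with the ordered time column and keep the UUID
-- as an ordinary column, queried only when needed.
CREATE TABLE requests
(
    request_id UUID,
    ts         DateTime,
    status     UInt16
)
ENGINE = MergeTree
ORDER BY (ts, status);
```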

The choice of data type significantly impacts performance.

Recommended

  • Date, DateTime, DateTime64
  • UInt*, Int*
  • Fixed-width numeric types

Avoid when possible

  • String (especially high-cardinality)
  • Nullable columns
  • Random UUIDs

Why?

  • Fixed-width types compress better
  • Comparisons are faster
  • CPU cache behavior improves

ORDER BY columns should be cheap to compare and cheap to compress.
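Putting these type recommendations together, a key-friendly schema might look like the following sketch (names are illustrative):

```sql
CREATE TABLE metrics
(
    ts        DateTime64(3),          -- fixed-width, naturally ordered
    sensor_id UInt32,                 -- fixed-width integer, not String
    region    LowCardinality(String), -- dictionary-encoded, cheap to compare
    value     Float64
)
ENGINE = MergeTree
ORDER BY (sensor_id, ts);  -- no Nullable or raw String columns in the key
```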

The best ORDER BY is driven by how the table is queried, not generic rules.

A simple decision framework:

  1. Which columns appear most often in WHERE?
  2. Are filters mostly equality or range-based?
  3. Which filters are most selective?
  4. Is time a primary access dimension?
  5. Is the column ordered or random?

Example patterns

  • Event data: (event_time, dimension)
  • User activity: (user_id, event_time)
  • Multi-tenant analytics: (tenant_id, timestamp)
  • Logs: (timestamp, severity)

There is no universal answer – only workload-specific ones.

Common mistakes to avoid:

  • Copying schemas from OLTP databases
  • Leading with UUIDs by default
  • Optimizing only for compression
  • Ignoring ingestion order
  • Treating cardinality as the sole metric

Most performance issues stem from misunderstanding how the primary index works.

Key takeaways:

  • ORDER BY defines physical layout, not output order
  • Cardinality alone is not the deciding factor
  • Randomness is often the real problem
  • High-cardinality columns can be correct when they align with query patterns
  • Data types and selectivity matter
  • Always design ORDER BY around how data is queried

Understanding these principles allows you to move beyond rules of thumb and design schemas that perform predictably at scale.

At Quantrail Data, we help teams run ClickHouse reliably for real-time analytics, from Kubernetes deployments and migrations to performance tuning in production.

We see these challenges firsthand while supporting demanding analytics workloads. In one recent engagement, a customer achieved near bare-metal performance with ClickHouse® in production, a story we’ve shared here:

Success Story: Quantrail Bare-Metal ClickHouse® Deployment

If you’re evaluating ClickHouse® or trying to get more out of an existing setup, we’re happy to share practical lessons from real-world deployments.

