
When Should High-Cardinality Columns Come First in ClickHouse ORDER BY?


A commonly repeated rule in ClickHouse modeling is to place low-cardinality columns first in the ORDER BY clause. While this advice is often helpful, it is incomplete and, in some cases, misleading.

In practice, there are situations where placing a high-cardinality column first is not only acceptable but optimal. Understanding when and why requires looking beyond cardinality and into how ClickHouse actually stores and prunes data.

This article explains how the table-level ORDER BY works internally in ClickHouse, when high-cardinality columns should come first, which data types are preferred, and how to make correct decisions based on real query patterns. It focuses specifically on the ORDER BY defined at table creation time in MergeTree engines, not the ORDER BY used in SELECT queries.

In ClickHouse, ORDER BY does not primarily control result ordering. Instead, it defines how data is physically sorted on disk and how the primary index is built.

Key points:

  • ORDER BY defines the sort key for MergeTree tables
  • This sort key forms the primary index
  • The primary index operates on granules, not individual rows
  • Each granule typically contains ~8,192 rows

When a query runs, ClickHouse uses the primary index to skip entire granules that cannot satisfy query conditions. The effectiveness of this pruning is what determines performance.

In ClickHouse, ORDER BY is about data layout and pruning, not presentation.
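As a minimal sketch (the events table and its columns are illustrative, not from the original text), the table-level ORDER BY and a quick pruning check might look like this:

```sql
-- ORDER BY here defines the on-disk sort order and the primary index,
-- not the order of SELECT results.
CREATE TABLE events
(
    tenant_id  UInt32,
    event_time DateTime,
    payload    String
)
ENGINE = MergeTree
ORDER BY (tenant_id, event_time)
SETTINGS index_granularity = 8192;  -- default granule size (~8,192 rows)

-- EXPLAIN with indexes = 1 reports how many granules the primary index
-- selects versus the table total, which makes pruning visible:
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE tenant_id = 42
  AND event_time >= now() - INTERVAL 1 DAY;
```

Running the EXPLAIN before and after changing the sort key is a practical way to verify that a candidate ORDER BY actually skips granules for your real queries.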

A common misconception is that high cardinality itself is the problem. In reality, the more important factor is randomness.

  • Cardinality affects compression efficiency
  • Randomness affects index usefulness and pruning

Examples:

  • user_id (high cardinality, often queried by equality) → frequently a good candidate
  • UUID (high cardinality and random) → often problematic
  • Hash values → usually poor choices
  • Time-ordered identifiers → often work well despite high cardinality

High cardinality is not inherently bad. High randomness usually is.

Placing a high-cardinality column first in ORDER BY is correct when it improves granule pruning.

This typically applies when:

  • Queries frequently filter by equality on that column
  • The column is highly selective
  • Values are not fully random
  • The ingestion order aligns reasonably with the column

Common valid patterns

  • (user_id, event_time)
  • (tenant_id, timestamp)
  • (order_id, created_at)

In these cases:

  • Equality filters sharply narrow the scan range
  • Granules become tightly grouped by the filtered value
  • Subsequent range filters (like time) become more efficient

This pattern is common in multi-tenant systems and user-centric analytics.
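A sketch of the first pattern (all names are hypothetical): the high-cardinality user_id leads the sort key precisely because queries filter on it by equality.

```sql
CREATE TABLE user_events
(
    user_id    UInt64,
    event_time DateTime,
    action     LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- The equality filter narrows the scan to the few granules that hold
-- this user's rows; the time filter then prunes further within them.
SELECT count()
FROM user_events
WHERE user_id = 123456
  AND event_time >= '2024-01-01 00:00:00';
```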

High-cardinality columns should not lead the ORDER BY when:

  • Values are randomly distributed (UUIDv4, hashes)
  • The column is rarely used in WHERE clauses
  • Inserts arrive in highly unordered fashion
  • Compression efficiency is critical and selectivity is low

Consequences include:

  • Poor compression
  • Scattered granules
  • Ineffective index pruning
  • Increased disk and IO usage

In such cases, leading with time or another ordered column is usually better.
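To make the contrast concrete, here is a hedged sketch (table names are invented for illustration) of the anti-pattern and a typically better alternative:

```sql
-- Anti-pattern: a random UUIDv4 as the leading sort-key column scatters
-- related rows across granules and compresses poorly.
CREATE TABLE requests_bad
(
    request_id UUID,      -- random, rarely filtered on
    ts         DateTime,
    status     UInt16
)
ENGINE = MergeTree
ORDER BY (request_id, ts);

-- Usually better: lead with the ordered time column and keep the UUID
-- as an ordinary column, queried only when needed.
CREATE TABLE requests
(
    request_id UUID,
    ts         DateTime,
    status     UInt16
)
ENGINE = MergeTree
ORDER BY (ts, status);
```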

The choice of data type significantly impacts performance.

Recommended

  • Date, DateTime, DateTime64
  • UInt*, Int*
  • Fixed-width numeric types

Avoid when possible

  • String (especially high-cardinality)
  • Nullable columns
  • Random UUIDs

Why?

  • Fixed-width types compress better
  • Comparisons are faster
  • CPU cache behavior improves

ORDER BY columns should be cheap to compare and cheap to compress.
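Putting these type recommendations together, a key-friendly schema might look like the following sketch (names are illustrative):

```sql
CREATE TABLE metrics
(
    ts        DateTime64(3),          -- fixed-width, naturally ordered
    sensor_id UInt32,                 -- fixed-width integer, not String
    region    LowCardinality(String), -- dictionary-encoded, cheap to compare
    value     Float64
)
ENGINE = MergeTree
ORDER BY (sensor_id, ts);  -- no Nullable or raw String columns in the key
```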

The best ORDER BY is driven by how the table is queried, not generic rules.

A simple decision framework:

  1. Which columns appear most often in WHERE?
  2. Are filters mostly equality or range-based?
  3. Which filters are most selective?
  4. Is time a primary access dimension?
  5. Is the column ordered or random?

Example patterns

  • Event data: (event_time, dimension)
  • User activity: (user_id, event_time)
  • Multi-tenant analytics: (tenant_id, timestamp)
  • Logs: (timestamp, severity)

There is no universal answer – only workload-specific ones.

Common mistakes to avoid:

  • Copying schemas from OLTP databases
  • Leading with UUIDs by default
  • Optimizing only for compression
  • Ignoring ingestion order
  • Treating cardinality as the sole metric

Most performance issues stem from misunderstanding how the primary index works.

Key takeaways:

  • ORDER BY defines physical layout, not output order
  • Cardinality alone is not the deciding factor
  • Randomness is often the real problem
  • High-cardinality columns can be correct when they align with query patterns
  • Data types and selectivity matter
  • Always design ORDER BY around how data is queried

Understanding these principles allows you to move beyond rules of thumb and design schemas that perform predictably at scale.

At Quantrail Data, we help teams run ClickHouse reliably for real-time analytics, from Kubernetes deployments and migrations to performance tuning in production.

We see these challenges firsthand while supporting demanding analytics workloads. In one recent engagement, a customer achieved near bare-metal performance with ClickHouse® in production, a story we’ve shared here:

Success Story: Quantrail Bare-Metal ClickHouse® Deployment

If you’re evaluating ClickHouse® or trying to get more out of an existing setup, we’re happy to share practical lessons from real-world deployments.

