All posts
ClickHouse® Data Sampling: Querying Billions of Rows Fast

ClickHouse® Data Sampling: Querying Billions of Rows Fast

July 1, 20267 min readSanjeev Kumar G
Share:

Modern analytics systems often store billions-or even trillions-of rows. While ClickHouse® is designed to process massive datasets efficiently, there are situations where you don't need an exact answer immediately.

Suppose you're exploring user behavior, monitoring application metrics, or checking whether today's conversion rate looks unusual. In these cases, an approximate answer that arrives in a fraction of the time may be more valuable than waiting for a perfectly accurate result.

This is where data sampling comes in.

Data sampling allows ClickHouse® to process only a subset of your data and produce approximate results. By reading less data, queries can consume fewer resources and often complete much faster.

However, sampling is not a universal performance optimization. It introduces a trade-off between speed and accuracy, and it only works efficiently when your table is designed to support it.

In this article, we'll explore how ClickHouse® data sampling works, when to use it, and when you should avoid it.


What Is Data Sampling?

Data sampling is the process of executing a query against only a portion of the dataset instead of scanning every row.

Imagine you have a table containing 10 billion events.

Without sampling:

  • ClickHouse® examines all 10 billion rows.

With sampling:

  • ClickHouse® may process approximately 10% of the data and estimate the result from that subset.

The obvious advantage is that less data usually means:

  • Lower disk I/O
  • Lower CPU usage
  • Faster query execution

The trade-off is equally important:

The results are approximate, not exact.

Whether that approximation is acceptable depends entirely on your use case.


How Sampling Works in ClickHouse®

One common misconception is that ClickHouse® randomly selects rows whenever you use the SAMPLE clause.

That's not how it works.

Instead, efficient sampling depends on how the table was created.

For a MergeTree table, you define a sampling key using the SAMPLE BY clause.

For example:

CREATE TABLE events
(
    user_id UInt64,
    event_time DateTime,
    event_name String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
SAMPLE BY user_id;

The sampling key determines how rows are partitioned for sampling.

When you later execute:

SELECT count()
FROM events
SAMPLE 0.1;

How ClickHouse® uses SAMPLE BY CLAUSE?

ClickHouse® doesn't randomly inspect rows during query execution.

Instead, it uses the sampling key to deterministically read roughly 10% of the data.

This deterministic behavior has an important advantage:

Running the same query multiple times with the same sampling ratio returns the same subset of data, assuming the underlying data hasn't changed.


Why Does SAMPLE BY Matter?

The SAMPLE clause is only effective when the table has been designed with a sampling expression.

Without a sampling key, ClickHouse® cannot efficiently determine which portions of the data should be read for a sampled query.

In other words, adding SAMPLE 0.1 to every query doesn't automatically make it faster.

Table design is what enables efficient sampling.

This is why it's important to think about sampling requirements during schema design rather than after the data has already been ingested.


Understanding the SAMPLE Clause

The simplest syntax looks like this:

SELECT *
FROM events
SAMPLE 0.1;

Here, 0.1 represents approximately 10% of the sampled data.

Similarly:

SAMPLE 0.25

uses roughly 25%.

And:

SAMPLE 0.5

uses roughly half of the data.

Using a larger sampling ratio generally improves accuracy but requires reading more data.

Using a smaller ratio reduces the amount of data processed but increases the potential estimation error.

There is no universally "correct" sampling ratio-it depends on how much accuracy your application requires.


Choosing a Good Sampling Key

The sampling key has a significant impact on how representative your sample will be.

A good sampling key usually has:

  • High cardinality
  • Even distribution
  • Stable values

Common choices include:

  • user_id
  • session_id
  • device_id

These columns tend to distribute data more evenly across the dataset.

On the other hand, low-cardinality columns such as:

  • country
  • status
  • gender

may produce samples that are less representative, depending on the data distribution.

Choosing an appropriate sampling key is often just as important as choosing the sampling ratio itself.


Speed vs Accuracy

Sampling always involves a compromise.

Full ScanSampled Query
Exact resultsApproximate results
Reads all matching rowsReads only a subset
Higher resource usageLower resource usage
Better for reportingBetter for exploration

How accurate the sampled result is depends on several factors, including:

  • The sampling ratio
  • The sampling key
  • The distribution of your data
  • The type of aggregation being performed

For example, estimating an average from a representative sample may work well.

Estimating rare events from a very small sample may not.


When Should You Use Data Sampling?

Sampling is particularly useful when approximate answers are sufficient.

Typical use cases include:

  • Interactive analytics
  • Dashboard exploration
  • Product analytics
  • Trend detection
  • Monitoring large event streams
  • Early investigation before running exact queries

In these situations, reducing query latency is often more valuable than achieving perfect precision.


When Should You Avoid Sampling?

Some workloads require exact answers.

Sampling is generally not appropriate for:

  • Financial reporting
  • Billing systems
  • Compliance reports
  • Tax calculations
  • Exact customer counts
  • Regulatory reporting

Whenever correctness is more important than query speed, a full scan is usually the safer choice.


Performance Expectations

It's tempting to think that sampling automatically makes every query dramatically faster.

In reality, the performance improvement depends on several factors, including:

  • Table design
  • Sampling ratio
  • Query complexity
  • Data distribution
  • Storage layout
  • Hardware resources

Queries that already read only a small portion of the table due to selective filtering may see little benefit from sampling.

Similarly, very small tables often don't benefit enough for sampling to be worthwhile.

Rather than assuming a fixed performance improvement, it's best to benchmark sampled and non-sampled queries on your own workload.


Best Practices

If you plan to use sampling in production, consider the following guidelines:

  • Design your tables with an appropriate SAMPLE BY expression.
  • Choose a high-cardinality sampling key whenever possible.
  • Periodically compare sampled results against full scans to understand the expected error.
  • Increase the sampling ratio when greater accuracy is required.
  • Clearly label sampled metrics as approximate in dashboards and reports.

These practices help users understand the trade-offs and reduce the risk of making decisions based on misunderstood data.


Conclusion

ClickHouse® data sampling is a practical feature for accelerating analytical queries when approximate results are acceptable.

Rather than processing every row, ClickHouse® can read only a subset of the data, reducing the amount of work required for many exploratory workloads. When the table is designed with an appropriate SAMPLE BY expression and the sampling key represents the data well, sampling can significantly reduce query latency while using fewer system resources.

However, sampling is not a replacement for exact analytics. It introduces estimation error, requires thoughtful schema design, and isn't suitable for workloads where every row matters.

Like many performance optimizations, data sampling is most effective when used intentionally. Understanding its strengths and limitations will help you decide when faster approximate answers are the right trade-off—and when a full scan remains the better choice.

References

clickhouse docs

Work with Quantrail

Expert ClickHouse services

We design, migrate, tune, and run ClickHouse for teams that own their data, from first architecture through day-two operations. Tell us what you are building and we will help.

Talk to an expert

Manage ClickHouse with CHOps

CHOps is our free, open-source ClickHouse admin tool: monitoring, query profiling, backups, visual access control, and alerting in one self-hosted interface, with zero agents on your servers.

Explore CHOps
Share: