Working with JSON in ClickHouse®

Mohamed Hussain S avatar
Working with JSON in ClickHouse®

JSON has become one of the most common formats for storing and exchanging data. Whether you’re ingesting application events, API payloads, logs, telemetry, or user activity streams, chances are you’ll encounter JSON at some point in your ClickHouse® journey.

The challenge is that JSON is inherently flexible, while analytical databases thrive on structure.

Over the years, ClickHouse® has evolved its JSON capabilities significantly. What started as simple JSON extraction from String columns has expanded to include native JSON support, making it easier to work with semi-structured data while still benefiting from ClickHouse®’s columnar architecture.

In this article, we’ll explore the different ways to work with JSON in ClickHouse®, when to use each approach, and the performance considerations you should keep in mind.

Modern applications generate large volumes of semi-structured data.

Common examples include:

  • Application logs
  • Event streams
  • API responses
  • User activity tracking
  • IoT telemetry
  • Observability data

A typical JSON document might look like:

{
  "user_id": 12345,
  "event_type": "login",
  "device": {
    "os": "Android",
    "version": "14"
  }
}

Unlike traditional relational schemas, JSON allows fields to evolve over time without requiring immediate schema changes.

This flexibility makes JSON attractive for ingestion pipelines, but it also introduces challenges when querying and analyzing data at scale.

Historically, one of the most common approaches was storing JSON as a String.

Example:

CREATE TABLE events
(
    timestamp DateTime,
    payload String
)
ENGINE = MergeTree
ORDER BY timestamp;

In this model, ClickHouse® stores the JSON document exactly as received.

Queries use JSON extraction functions to access individual fields.

Example:

SELECT
    JSONExtractString(payload, 'event_type')
FROM events;

Benefits:

  • Extremely flexible
  • No schema requirements
  • Easy ingestion

Drawbacks:

  • JSON must be parsed during query execution
  • Repeated extraction increases CPU usage
  • Filtering and aggregation become more expensive

This approach works well for early-stage ingestion but may become costly as query volumes increase.

ClickHouse® provides a rich set of JSON functions for extracting values from JSON documents.

Common examples include:

JSONExtractString()
JSONExtractInt()
JSONExtractFloat()
JSONExtractBool()
JSONExtract()

Example:

SELECT
    JSONExtractUInt(payload, 'user_id') AS user_id
FROM events;

These functions allow structured access without requiring a predefined schema.

However, every query still performs JSON parsing work, which can impact performance on large datasets.

To better support semi-structured workloads, ClickHouse® introduced a native JSON data type.

Example:

CREATE TABLE events
(
    timestamp DateTime,
    data JSON
)
ENGINE = MergeTree
ORDER BY timestamp;

Unlike String-based storage, the native JSON type understands the document structure internally.

This enables ClickHouse® to work with JSON more efficiently while preserving schema flexibility.

Accessing fields from a native JSON column is straightforward.

Example:

SELECT
    data.user_id,
    data.event_type
FROM events;

Nested fields can also be accessed naturally:

SELECT
    data.device.os
FROM events;

This syntax is often easier to read than multiple JSONExtract functions and reduces repetitive parsing work.

One of the key concepts behind the native JSON type is lazy parsing.

Instead of parsing an entire JSON document every time a query runs, ClickHouse® only processes the fields that are actually referenced.

For example:

SELECT
    data.user_id
FROM events;

Only the user_id path is accessed.

Unused fields remain untouched.

This approach is particularly beneficial when working with large JSON objects containing dozens or hundreds of attributes.

A common misconception is that native JSON eliminates the need for structured columns.

In reality, frequently queried fields should often become dedicated columns.

Consider fields used regularly in:

  • WHERE clauses
  • GROUP BY clauses
  • JOIN operations
  • Dashboards
  • Reporting queries

Example:

SELECT *
FROM events
WHERE data.user_id = 12345;

While this works, repeatedly filtering on JSON paths may still be more expensive than filtering on a dedicated column.

For heavily used attributes, modeling them explicitly often provides the best performance.

There is no single approach that fits every workload.

Use String + JSONExtract When:

  • Schemas change frequently
  • Query volumes are low
  • Data is primarily archived
  • JSON is rarely accessed

Use Native JSON When:

  • Schemas evolve over time
  • Only a subset of fields is queried
  • Flexibility is important
  • Semi-structured analytics are common

Using Dedicated Columns When:

  • Fields are frequently filtered
  • Fields participate in aggregations
  • Performance is critical
  • Query patterns are predictable

Many production systems use a combination of all three approaches.

A common production pattern looks like this:

CREATE TABLE events
(
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    metadata JSON
)
ENGINE = MergeTree
ORDER BY (user_id, timestamp);

In this design:

  • Frequently accessed fields become dedicated columns.
  • Less predictable attributes remain inside the JSON object.
  • Queries remain efficient.
  • Schema flexibility is preserved.

This often provides the best balance between performance and adaptability.

Storing Everything as JSON

Just because native JSON exists doesn’t mean every field should remain unstructured.

Hot query paths often belong in dedicated columns.

Excessive JSON Extraction

Repeatedly calling JSONExtract functions across large datasets can become CPU-intensive.

Ignoring Query Patterns

Design decisions should be based on how data is queried, not just how it is ingested.

Treating JSON as a Complete Schema Solution

JSON improves flexibility, but it does not eliminate the need for thoughtful data modeling.

When working with JSON in ClickHouse®:

  • Use native JSON for evolving schemas.
  • Promote frequently queried fields into dedicated columns.
  • Minimize repeated JSON parsing.
  • Design around actual query patterns.
  • Balance flexibility with performance.
  • Monitor query costs as JSON usage grows.
  • Use structured columns for critical analytical workloads.

Working with JSON in ClickHouse® is easier than ever thanks to the evolution of its JSON capabilities.

From traditional String-based storage and JSONExtract functions to the native JSON data type, ClickHouse® now offers multiple ways to handle semi-structured data efficiently.

The key is understanding that each approach serves a different purpose. Native JSON improves flexibility and reduces unnecessary parsing, while dedicated columns still deliver the best performance for heavily queried attributes.

The most successful ClickHouse® deployments rarely choose a single strategy. Instead, they combine structured columns and JSON thoughtfully, allowing schemas to evolve without sacrificing analytical performance.

At Quantrail Data, we help teams run ClickHouse® reliably for real-time analytics – from Kubernetes deployments and migrations to performance tuning in production.

We see these challenges firsthand while supporting demanding analytics workloads. In one recent engagement, a customer achieved near bare-metal performance with ClickHouse® in production – a story we’ve shared here:

Success Story: Quantrail Bare-Metal ClickHouse® Deployment
If you’re evaluating ClickHouse® or trying to get more out of an existing setup, we’re happy to share practical lessons from real-world deployments.

Contact
Quantrail Data

A Beginner’s Guide to JSON in ClickHouse®
Working with JSON in ClickHouse®: A Practical Decision Guide
JSON Type in ClickHouse®