ClickHouse® and S3 Integration: Querying Data Lakes

Modern organizations generate massive amounts of data that need to be stored and analyzed efficiently. Amazon S3 provides a scalable and cost-effective foundation for data lakes, while ClickHouse® delivers high-speed analytics. By integrating ClickHouse® with S3, users can query data directly from their data lake without loading it into database tables, enabling faster insights and lower storage costs.

What Is Amazon S3?

Amazon Simple Storage Service (S3) is a cloud object storage service that allows organizations to store and retrieve large volumes of data. It is commonly used for storing CSV, JSON, Parquet, ORC, logs, backups, and other datasets that form the foundation of modern data lakes.

Key benefits of Amazon S3 include:

Virtually unlimited storage capacity
High durability and availability
Cost-effective storage for large datasets
Seamless integration with analytics tools

What Is a Data Lake?

A data lake is a centralized repository that stores structured, semi-structured, and unstructured data in its native format.

Unlike traditional databases that require predefined schemas, data lakes allow organizations to store raw data and process it later when needed.

Typical data lake content includes:

Application logs
Business transactions
IoT sensor data
Machine learning datasets
Historical archives

Why Integrate ClickHouse® with S3?

Traditionally, data stored in cloud storage must be loaded into a database before it can be analyzed. This process consumes time, storage space, and computing resources.

With ClickHouse's native S3 integration, data can be queried directly from S3 without importing it into local ClickHouse tables.

Benefits include:

Reduced storage duplication
Faster time to first query, since no load step is required
Lower infrastructure costs
Simplified data pipelines
Flexible access to historical data

Querying S3 Data with ClickHouse

ClickHouse® provides the s3() table function, which reads files directly from Amazon S3 and exposes them as a table. Its main arguments are the object URL and the file format.

Query a CSV File

SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.csv',
    'CSVWithNames'
) LIMIT 10;

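This query treats the CSV file as a virtual table and returns 10 rows. The CSVWithNames format reads column names from the file's header row, and column types are inferred from the data.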
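
The query above only works if the object is readable without credentials. For a private bucket, the s3() function also accepts an access key pair between the URL and the format. A minimal sketch, where the key values are placeholders and not real credentials:

```sql
-- Placeholder credentials: replace with real values or a safer mechanism
-- your deployment provides. Bucket and file name are the same
-- illustrative ones used above.
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.csv',
    '<access_key_id>',
    '<secret_access_key>',
    'CSVWithNames'
)
LIMIT 10;
```

Hard-coding keys in queries is best kept to experiments, because query text can end up in logs.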

Query a Parquet File

SELECT
    customer_id,
    SUM(amount) AS total_sales
FROM s3(
    'https://my-bucket.s3.amazonaws.com/orders.parquet',
    'Parquet'
)
GROUP BY customer_id
ORDER BY total_sales DESC;

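Parquet is particularly efficient for queries like this because it is a columnar format: ClickHouse® reads only the columns the query references (here customer_id and amount) rather than the whole file.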
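
Before writing a query against an unfamiliar file, it can help to check which columns and types ClickHouse sees in it. A small sketch using the same illustrative orders.parquet object:

```sql
-- Lists the column names and types ClickHouse derives from the file,
-- without reading the full data.
DESCRIBE TABLE s3(
    'https://my-bucket.s3.amazonaws.com/orders.parquet',
    'Parquet'
);
```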

Querying Multiple Files

Data lakes often store data across thousands of partitioned files.

ClickHouse supports glob patterns (wildcards) such as * in the S3 path, so a single query can read many files at once.

SELECT count()
FROM s3(
    'https://my-bucket.s3.amazonaws.com/logs/2026/*.parquet',
    'Parquet'
);

This approach enables analytics across large datasets without manually combining files.
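
When a glob matches many objects, it is often useful to see which file each row came from. The sketch below uses the _file virtual column that the s3() function exposes, against the same illustrative path as above, to count rows per file:

```sql
-- _file holds the name of the object each row was read from.
SELECT
    _file,
    count() AS row_count
FROM s3(
    'https://my-bucket.s3.amazonaws.com/logs/2026/*.parquet',
    'Parquet'
)
GROUP BY _file
ORDER BY _file;
```

Note that * does not cross the / separator. If the files sit in nested prefixes, a pattern such as ** (recursive) or {a,b} (alternatives) may be needed, depending on the bucket layout.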

Loading Data from S3 into ClickHouse

For frequently accessed datasets, data can be loaded from Amazon S3 into ClickHouse® tables to improve query performance and reduce repeated reads from object storage.

Method 1: Create and Load in a Single Step

CREATE TABLE sales
ENGINE = MergeTree
ORDER BY customer_id AS
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.parquet',
    'Parquet'
);

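This method creates the table and loads the data in a single query, making it useful for quick analysis and testing. Column names and types are inferred from the Parquet file. Depending on the ClickHouse version and settings, inferred columns may be Nullable, and MergeTree rejects a Nullable sorting key unless the allow_nullable_key setting is enabled, so the schema may need to be declared explicitly as in Method 2.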
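
Column types can also be pinned in the single-step form by passing the optional structure argument to s3(). The sketch below borrows the column names and types from the Method 2 schema and assumes they match the file. The table name sales_typed is hypothetical.

```sql
-- The third argument declares the columns explicitly, so the new table
-- gets these non-Nullable types instead of inferred ones.
CREATE TABLE sales_typed
ENGINE = MergeTree
ORDER BY customer_id AS
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.parquet',
    'Parquet',
    'customer_id UInt32, order_id UInt64, amount Float64, order_date Date'
);
```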

Method 2: Create Schema and Insert Data

Create the table first:

CREATE TABLE sales
(
    customer_id UInt32,
    order_id UInt64,
    amount Float64,
    order_date Date
)
ENGINE = MergeTree
ORDER BY customer_id;

Then load data from S3:

INSERT INTO sales
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.parquet',
    'Parquet'
);

This method provides better control over schema design and is commonly used in production environments.
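
INSERT ... SELECT * matches columns by position, so it silently depends on the file having the same column order as the table. A slightly more defensive sketch names the columns on both sides and then runs a quick sanity check. It assumes the file contains the four columns declared above.

```sql
-- Name columns explicitly so a different column order in the file
-- cannot shift values into the wrong table column.
INSERT INTO sales (customer_id, order_id, amount, order_date)
SELECT customer_id, order_id, amount, order_date
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.parquet',
    'Parquet'
);

-- Quick check of what was loaded.
SELECT
    count() AS loaded_rows,
    min(order_date) AS first_order,
    max(order_date) AS last_order
FROM sales;
```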

Benefits

Improved query performance
Better schema management
Reduced S3 access costs
Suitable for frequently queried datasets

Supported File Formats

ClickHouse can query several popular file formats stored in S3:

Format	Use Case
CSV	General-purpose data exchange
JSON	Application and API data
Parquet	Analytics and data lakes
ORC	Big data processing
TSV	Tab-separated datasets

Among these formats, Parquet is generally recommended for analytical workloads due to its columnar storage design.
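
The format argument takes ClickHouse format names, which do not always equal the file type. For example, newline-delimited JSON (one object per line) is read with the JSONEachRow format. A minimal sketch, where events.json is a hypothetical object used only for illustration:

```sql
-- Assumes events.json contains one JSON object per line.
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/events.json',
    'JSONEachRow'
)
LIMIT 10;
```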

Best Practices

When querying data lakes with ClickHouse® and S3:

Prefer Parquet over row-based formats such as CSV for analytical queries.
Partition data by date or business dimensions.
Query only the required columns.
Compress files to reduce storage costs.
Load frequently accessed data into local ClickHouse tables.
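
As an illustration of querying only the required columns, the sketch below reads two columns from the illustrative sales.parquet file and filters by date, instead of using SELECT *. It assumes the file has the order_date and amount columns from the schema shown earlier.

```sql
-- Only order_date and amount are referenced, so the other columns
-- in the Parquet file do not need to be read.
SELECT
    order_date,
    sum(amount) AS daily_sales
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.parquet',
    'Parquet'
)
WHERE order_date >= '2026-01-01'
GROUP BY order_date
ORDER BY order_date;
```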

Common Use Cases

1. Log Analytics

Analyze application and server logs stored in S3.

2. Historical Reporting

Query archived business data without moving it into ClickHouse.

3. Data Warehousing

Use ClickHouse as a query engine on top of an S3-based data lake.

4. Business Intelligence

Power dashboards and reports directly from data lake storage.
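
For reporting and BI use cases, one possible pattern is to wrap the s3() call in a view so that dashboards query a stable name rather than a bucket URL. This is a sketch under assumptions, not a recommended architecture: the view name sales_by_customer is hypothetical, and the file is the illustrative orders.parquet from earlier.

```sql
-- The view stores only the query; each SELECT against it reads from S3 again.
CREATE VIEW sales_by_customer AS
SELECT
    customer_id,
    sum(amount) AS total_sales
FROM s3(
    'https://my-bucket.s3.amazonaws.com/orders.parquet',
    'Parquet'
)
GROUP BY customer_id;

-- A dashboard query can then use the view like any other table.
SELECT *
FROM sales_by_customer
ORDER BY total_sales DESC
LIMIT 20;
```

Because every query re-reads object storage, dashboards that refresh often are usually better served by loading the data into a MergeTree table as described above.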

Conclusion

ClickHouse® and Amazon S3 create a powerful foundation for modern data lake analytics. By enabling direct queries on data stored in S3, ClickHouse eliminates unnecessary data movement while delivering high-performance analytical processing. Whether analyzing logs, historical datasets, or business data, this integration helps organizations reduce costs, simplify data architectures, and scale analytics efficiently.