Modern organizations generate massive amounts of data that need to be stored and analyzed efficiently. Amazon S3 provides a scalable and cost-effective foundation for data lakes, while ClickHouse® delivers high-speed analytics. By integrating ClickHouse® with S3, users can query data directly from their data lake without loading it into database tables, enabling faster insights and lower storage costs.
What Is Amazon S3?
Amazon Simple Storage Service (S3) is a cloud object storage service that allows organizations to store and retrieve large volumes of data. It is commonly used for storing CSV, JSON, Parquet, ORC, logs, backups, and other datasets that form the foundation of modern data lakes.
Key benefits of Amazon S3 include:
- Virtually unlimited storage capacity
- High durability and availability
- Cost-effective storage for large datasets
- Seamless integration with analytics tools
What Is a Data Lake?
A data lake is a centralized repository that stores structured, semi-structured, and unstructured data in its native format.
Unlike traditional databases that require predefined schemas, data lakes allow organizations to store raw data and process it later when needed.
Typical data lake content includes:
- Application logs
- Business transactions
- IoT sensor data
- Machine learning datasets
- Historical archives
Why Integrate ClickHouse® with S3?
Traditionally, data in cloud storage had to be loaded into a database before it could be analyzed. That load step consumes time, storage space, and compute resources.
With ClickHouse's native S3 integration, data can be queried directly from S3 without importing it into local ClickHouse tables.
Benefits include:
- Reduced storage duplication
- Faster time to first query on large datasets, since there is no load step
- Lower infrastructure costs
- Simplified data pipelines
- Flexible access to historical data
Querying S3 Data with ClickHouse
ClickHouse® provides the s3() table function, which reads files directly from Amazon S3 and exposes them as a table inside a query.
Query a CSV File
SELECT *
FROM s3(
'https://my-bucket.s3.amazonaws.com/sales.csv',
'CSVWithNames'
) LIMIT 10;
This query treats the CSV file as a virtual table and returns the first 10 rows.
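The query above passes only a URL and a format, which presumably relies on the bucket being publicly readable or on credentials already configured for the server. The following is a minimal sketch of two related steps: inspecting the schema ClickHouse infers from the file, and reading from a private bucket by passing a key pair to s3(). The key values are placeholders, not real credentials, and the bucket and file names are the illustrative ones used in this article.

```sql
-- Show the column names and types ClickHouse infers from the file
-- before running any real query against it.
DESCRIBE TABLE s3(
    'https://my-bucket.s3.amazonaws.com/sales.csv',
    'CSVWithNames'
);

-- Same read against a private bucket. The two credential strings are
-- placeholders (assumptions); substitute your own access key pair.
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.csv',
    '<access_key_id>',
    '<secret_access_key>',
    'CSVWithNames'
)
LIMIT 10;
```

Checking the inferred schema first is a cheap way to catch a column that was read as String when a numeric type was expected.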
Query a Parquet File
SELECT
customer_id,
SUM(amount) AS total_sales
FROM s3(
'https://my-bucket.s3.amazonaws.com/orders.parquet',
'Parquet'
)
GROUP BY customer_id
ORDER BY total_sales DESC;
Parquet is particularly efficient here because it is a columnar format: ClickHouse® reads only the columns the query references (customer_id and amount) instead of entire rows.
Querying Multiple Files
Data lakes often store data across thousands of partitioned files.
ClickHouse supports glob patterns in the S3 path, so a single query can read many files at once.
SELECT count()
FROM s3(
'https://my-bucket.s3.amazonaws.com/logs/2026/*.parquet',
'Parquet'
);
This approach enables analytics across large datasets without manually combining files.
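Beyond `*`, the path also accepts numeric ranges such as `{01..03}`, and the s3() function exposes a `_file` virtual column that identifies the source object for each row. The sketch below assumes a hypothetical layout in which logs are stored under one prefix per month (logs/2026/01/, logs/2026/02/, and so on); adjust the pattern to the actual bucket structure.

```sql
-- Assumed layout (hypothetical): logs/2026/<month>/<file>.parquet
-- {01..03} expands to 01, 02, 03; * matches any characters except '/'.
SELECT
    _file AS file_name,      -- virtual column: name of the source object
    count() AS row_count
FROM s3(
    'https://my-bucket.s3.amazonaws.com/logs/2026/{01..03}/*.parquet',
    'Parquet'
)
GROUP BY file_name
ORDER BY file_name;
```

A per-file row count like this is a quick way to confirm that the pattern matches the files you expect before running a heavier aggregation.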
Loading Data from S3 into ClickHouse
For frequently accessed datasets, data can be loaded from Amazon S3 into ClickHouse® tables to improve query performance and reduce repeated reads from object storage.
Method 1: Create and Load in a Single Step
CREATE TABLE sales
ENGINE = MergeTree
ORDER BY customer_id AS
SELECT *
FROM s3(
'https://my-bucket.s3.amazonaws.com/sales.parquet',
'Parquet'
);
This method creates the table and loads the data in a single query, making it useful for quick analysis and testing.
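One caveat with relying on schema inference here: depending on the ClickHouse version and settings, columns read from Parquet may be inferred as Nullable, and MergeTree rejects Nullable sorting keys by default, so the ORDER BY customer_id clause can fail. A possible workaround, sketched below, is to pass an explicit structure as the third argument of s3(). The table name sales_typed is hypothetical, and the column list mirrors the schema shown under Method 2; it is an assumption about what the file contains.

```sql
-- sales_typed is a hypothetical table name chosen to avoid clashing
-- with the sales table used elsewhere in this article.
-- The structure string is an assumed schema for sales.parquet.
CREATE TABLE sales_typed
ENGINE = MergeTree
ORDER BY customer_id AS
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.parquet',
    'Parquet',
    'customer_id UInt32, order_id UInt64, amount Float64, order_date Date'
);
```

Declaring the structure keeps the one-step convenience while giving the sorting key a non-Nullable type.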
Method 2: Create Schema and Insert Data
Create the table first (if you already created sales with Method 1, drop it or use a different name):
CREATE TABLE sales
(
customer_id UInt32,
order_id UInt64,
amount Float64,
order_date Date
)
ENGINE = MergeTree
ORDER BY customer_id;
Then load data from S3:
INSERT INTO sales
SELECT *
FROM s3(
'https://my-bucket.s3.amazonaws.com/sales.parquet',
'Parquet'
);
This method provides better control over schema design and is commonly used in production environments.
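As a variation on the load step, the columns can be named explicitly on both sides so the mapping between file and table is visible in the query itself. This is a sketch that assumes the Parquet file uses the same column names as the table definition above. It is an alternative to the INSERT shown earlier, not an additional step, since running both would load the data twice.

```sql
-- Alternative to INSERT ... SELECT *: name each column explicitly.
-- Assumes sales.parquet contains columns with these exact names.
INSERT INTO sales (customer_id, order_id, amount, order_date)
SELECT customer_id, order_id, amount, order_date
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales.parquet',
    'Parquet'
);

-- Sanity check: compare the loaded row count with the source file.
SELECT
    (SELECT count() FROM sales) AS rows_in_table,
    (SELECT count()
     FROM s3('https://my-bucket.s3.amazonaws.com/sales.parquet', 'Parquet')
    ) AS rows_in_file;
```

If the two counts differ after a single load into an empty table, the load was incomplete or was run more than once.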
Benefits
- Improved query performance
- Better schema management
- Reduced S3 access costs
- Suitable for frequently queried datasets
Supported File Formats
ClickHouse can query several popular file formats stored in S3:
| Format | Use Case |
|---|---|
| CSV | General-purpose data exchange |
| JSON | Application and API data |
| Parquet | Analytics and data lakes |
| ORC | Big data processing |
| TSV | Tab-separated datasets |
Among these formats, Parquet is generally recommended for analytical workloads due to its columnar storage design.
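The file formats in the table map to specific ClickHouse format names, which are passed as the format argument of s3(). The sketch below shows a few of them; all file names are hypothetical and only illustrate the mapping.

```sql
-- Newline-delimited JSON (one object per line) is read with JSONEachRow.
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/events.jsonl', 'JSONEachRow')
LIMIT 10;

-- TSV with a header row.
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/metrics.tsv', 'TabSeparatedWithNames')
LIMIT 10;

-- ORC files use the ORC format name.
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/history.orc', 'ORC');

-- Compressed files: the compression method is detected from the
-- file extension (.gz here), so the format argument stays the same.
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/sales.csv.gz', 'CSVWithNames');
```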
Best Practices
When querying data lakes with ClickHouse® and S3:
- Use Parquet for better performance.
- Partition data by date or business dimensions.
- Query only the required columns.
- Compress files to reduce storage costs.
- Load frequently accessed data into local ClickHouse tables.
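Several of these practices can be combined in one query. The sketch below assumes a hypothetical layout with one S3 prefix per month (sales/2026-01/, sales/2026-02/, and so on) and the column names used earlier in this article. The path pattern limits which objects are read, and the column list limits which Parquet columns are read.

```sql
-- Assumed layout (hypothetical): sales/<YYYY-MM>/<file>.parquet
-- {01,02,03} restricts the read to the first three monthly prefixes.
SELECT
    customer_id,
    sum(amount) AS total_sales
FROM s3(
    'https://my-bucket.s3.amazonaws.com/sales/2026-{01,02,03}/*.parquet',
    'Parquet'
)
WHERE order_date >= toDate('2026-01-15')   -- row-level filter within the matched files
GROUP BY customer_id
ORDER BY total_sales DESC
LIMIT 20;
```

Pruning by path first matters because objects that do not match the pattern are never fetched, whereas the WHERE clause only filters rows from files that are already being read.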
Common Use Cases
1. Log Analytics
Analyze application and server logs stored in S3.
2. Historical Reporting
Query archived business data without moving it into ClickHouse.
3. Data Warehousing
Use ClickHouse as a query engine on top of an S3-based data lake.
4. Business Intelligence
Power dashboards and reports directly from data lake storage.
Conclusion
ClickHouse® and Amazon S3 create a powerful foundation for modern data lake analytics. By enabling direct queries on data stored in S3, ClickHouse eliminates unnecessary data movement while delivering high-performance analytical processing. Whether analyzing logs, historical datasets, or business data, this integration helps organizations reduce costs, simplify data architectures, and scale analytics efficiently.



