Image Source
Photo by Ray McKay: https://www.pexels.com/photo/view-of-a-loaded-cargo-ship-19549941/
Apache Doris is an open-source data warehouse for real-time analytics, built on a massively parallel processing (MPP) architecture. As of 2024, it is one of the fastest-growing analytical data stores and offers performance on par with industry leaders.
Let us look at a simple example of loading a file from our local disk into Apache Doris. The example uses a Kubernetes (Minikube) setup with one frontend node and one backend node, running Apache Doris 2.1.x.
Apache Doris supports uploading a file that lives outside the cluster, such as one on our local disk, to the data store via stream load, which uses the HTTP protocol. Stream load is a synchronous import method, and each load is atomic: it either succeeds as a whole or fails as a whole. We can use stream load to upload files that are up to 10 GB in size. Apache Doris supports the CSV, JSON, Parquet, and ORC formats for stream loads. Let us use the Iris dataset for this example. The prerequisites for this example are:
- An Apache Doris cluster
- Basic knowledge of the curl command
Steps
Assuming that we have Apache Doris up and running, let us first connect to the server via the web UI and create the necessary database and table.
CREATE DATABASE sample_datasets;
CREATE TABLE sample_datasets.iris(
variety VARCHAR(20) NOT NULL,
sepal_length FLOAT NOT NULL,
sepal_width FLOAT,
petal_length FLOAT,
petal_width FLOAT
)
DUPLICATE KEY(variety)
DISTRIBUTED BY HASH(variety) BUCKETS 3
PROPERTIES ('replication_num' = '1');
We use the Duplicate Key model, and the data is horizontally partitioned into three hash buckets on the variety column. Since the cluster has only one backend node, it can hold only one replica of the data, so we set replication_num to 1 in the PROPERTIES clause.
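To make the bucketing concrete, here is a minimal Python sketch of hash distribution. It is only an illustration: zlib.crc32 stands in for whatever hash function Doris uses internally, and the variety names and row counts are assumed from the usual shape of the Iris dataset (three varieties, 50 rows each). It shows that all rows with the same variety value land in the same bucket, so a low-cardinality key like this one can fill at most three buckets and may leave some of them empty.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 3  # matches BUCKETS 3 in the CREATE TABLE statement


def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a distribution-key value to a bucket index.

    zlib.crc32 is a stand-in hash for illustration only; the hash
    function Doris uses internally may differ.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


# Assumed shape of the Iris dataset: three varieties, 50 rows each.
rows = [v for v in ("Setosa", "Versicolor", "Virginica") for _ in range(50)]

# Count how many rows each bucket would receive.
rows_per_bucket = Counter(bucket_for(variety) for variety in rows)
for bucket in range(NUM_BUCKETS):
    print(f"bucket {bucket}: {rows_per_bucket.get(bucket, 0)} rows")
```

For a larger table, a distribution column with many distinct values would usually spread rows more evenly across buckets.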
Let us now upload the file using stream load. You can download the file below.
Once the file is downloaded, you can run the command below to submit a stream load job to the Apache Doris server.
We can send the request to the HTTP port of either the frontend node or a backend node directly. When the request goes to the frontend, it redirects the client to a backend, which is why curl is given --location-trusted: the option lets curl follow the redirect and resend the credentials to the new host.
curl --location-trusted -u root:"" \
-H "Expect:100-continue" \
-H "column_separator:," \
-H "columns:variety,sepal_length,sepal_width,petal_length,petal_width" \
-T iris.csv \
-XPUT http://192.168.49.2:31006/api/sample_datasets/iris/_stream_load
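For readers who want to see what the curl flags amount to, the sketch below assembles the same request in Python without sending it. The function name build_stream_load_request is hypothetical; the host, port, headers, and column list are copied from the curl command above. It shows that -u becomes an HTTP Basic Authorization header and that the load options travel as ordinary HTTP headers on a PUT request.

```python
import base64


def build_stream_load_request(host, port, database, table, columns,
                              user="root", password=""):
    """Return (url, headers) for a CSV stream load; nothing is sent."""
    url = f"http://{host}:{port}/api/{database}/{table}/_stream_load"
    # curl -u user:password becomes an HTTP Basic Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        "column_separator": ",",
        "columns": ",".join(columns),
    }
    return url, headers


url, headers = build_stream_load_request(
    "192.168.49.2", 31006, "sample_datasets", "iris",
    ["variety", "sepal_length", "sepal_width", "petal_length", "petal_width"],
)
print("PUT", url)
for name, value in headers.items():
    print(f"{name}: {value}")
```

The file contents (iris.csv in the curl command) would be the body of the PUT request.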
You should see a success message similar to the one below.
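The response body is a JSON document, and according to the stream load manual the outcome is reported in its Status field, so it is worth checking that field rather than relying on the HTTP status code alone. The following is a minimal sketch of such a check. The function name stream_load_ok is hypothetical, and the two response bodies are made-up examples trimmed to a few fields, not real server output.

```python
import json


def stream_load_ok(response_body: str) -> bool:
    """Return True if the stream load response reports that data was loaded."""
    result = json.loads(response_body)
    # "Publish Timeout" means the data was written but may become visible late.
    return result.get("Status") in ("Success", "Publish Timeout")


# Hypothetical response bodies, trimmed to the fields used here.
example_success = '{"Status": "Success", "Message": "OK", "NumberLoadedRows": 150}'
example_failure = '{"Status": "Fail", "Message": "illustrative error"}'

print(stream_load_ok(example_success))  # True
print(stream_load_ok(example_failure))  # False
```

A script that wraps the curl command could pass the captured output to a check like this and stop early when the load did not succeed.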
Once the stream load is successful, let us query the data and verify.
The loaded data is now available on the server. In the next part of this series, we will explore the advanced stream load options available in Apache Doris.
References
https://doris.apache.org/docs/2.1/data-operate/import/stream-load-manual