DuckLake Publishing
LakeXpress can automatically register exported Parquet files in a DuckLake catalog, enabling ACID transactions, time travel, and schema evolution without copying data.
Table of Contents
- What is DuckLake?
- Prerequisites
- Catalog Backend Options
- Configuration Options
- Dynamic Naming Patterns
- Usage Examples
- Data Type Mapping
- Querying Tables
- CLI Reference
What is DuckLake?
DuckLake is an open table format that adds data warehouse capabilities to your data lake using DuckDB and a lightweight metadata catalog.
- ACID Transactions: Atomic commits across multiple tables
- Time Travel: Query data at any historical point in time
- Schema Evolution: Add, rename, or modify columns without rewriting data
- Multi-Engine Access: Query from DuckDB, Python, or any DuckDB-compatible tool
Supported catalog backends:
- SQLite — File-based, suited for local development
- PostgreSQL — Production-ready with concurrent access
- MySQL — Alternative production catalog
- DuckDB — Native DuckDB file as catalog
Prerequisites
1. Storage Backend
Exported Parquet files must be stored where the engines that attach the DuckLake catalog can read them. Supported storage:
- Local filesystem
- AWS S3 (including S3-compatible: MinIO, OVH, etc.)
- Google Cloud Storage (GCS)
- Azure Blob Storage
2. Catalog Database
Choose a catalog backend:
- SQLite: No setup required (file-based)
- PostgreSQL: Requires a running PostgreSQL instance
- MySQL: Requires a running MySQL instance
- DuckDB: No setup required (file-based)
Catalog Backend Options
SQLite Catalog (Simplest)
Best for local development, single-user setups, and testing.
{
    "ducklake_sqlite": {
        "ds_type": "sqlite",
        "catalog_type": "ducklake",
        "info": {
            "filepath": "/path/to/ducklake_catalog.sqlite"
        }
    }
}
PostgreSQL Catalog (Recommended for Production)
Best for multi-user access, production workloads, and concurrent queries.
{
    "ducklake_postgres": {
        "ds_type": "postgres",
        "catalog_type": "ducklake",
        "info": {
            "server": "localhost",
            "port": 5432,
            "database": "ducklake_catalog",
            "username": "postgres",
            "password": "your-password",
            "schema": "lakexpress"
        }
    }
}
The schema field sets the PostgreSQL search_path, so the DuckLake catalog tables are created and organized under that schema.
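If you want to confirm where the catalog tables ended up after the first sync, the sketch below lists the tables in that schema. It is only an illustration, not part of LakeXpress: it assumes psycopg2 as the client library and reuses the connection details from the example above; the catalog table names themselves are created and managed by DuckLake.
# Illustrative sketch: list DuckLake catalog tables created under the configured schema.
# Assumes psycopg2 is installed and the PostgreSQL catalog from the example above.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="ducklake_catalog",
    user="postgres",
    password="your-password",
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT table_name
        FROM information_schema.tables
        WHERE table_schema = %s
        ORDER BY table_name
        """,
        ("lakexpress",),
    )
    for (name,) in cur.fetchall():
        print(name)
conn.close()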
MySQL Catalog
Best for environments already using MySQL.
{
    "ducklake_mysql": {
        "ds_type": "mysql",
        "catalog_type": "ducklake",
        "info": {
            "server": "localhost",
            "port": 3306,
            "database": "ducklake_catalog",
            "username": "root",
            "password": "your-password"
        }
    }
}
MySQL uses the database as the namespace (no separate schema concept).
DuckDB Catalog
Best for DuckDB-native workflows and embedded analytics.
{
    "ducklake_duckdb": {
        "ds_type": "duckdb",
        "catalog_type": "ducklake",
        "info": {
            "filepath": "/path/to/ducklake_catalog.duckdb"
        }
    }
}
Configuration Options
Full Credentials Example
credentials.json with S3 storage and PostgreSQL catalog:
{
    "log_db": {
        "ds_type": "postgres",
        "auth_mode": "password",
        "info": {
            "server": "localhost",
            "port": 5432,
            "database": "lakexpress_log",
            "username": "postgres",
            "password": "log-password"
        }
    },
    "source_db": {
        "ds_type": "postgres",
        "auth_mode": "password",
        "info": {
            "server": "source-server",
            "port": 5432,
            "database": "production",
            "username": "readonly_user",
            "password": "source-password"
        }
    },
    "s3_storage": {
        "ds_type": "s3",
        "auth_mode": "key",
        "info": {
            "bucket": "my-data-lake",
            "directory": "exports/",
            "aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
            "aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
            "region": "us-east-1"
        }
    },
    "ducklake_catalog": {
        "ds_type": "postgres",
        "catalog_type": "ducklake",
        "info": {
            "server": "localhost",
            "port": 5432,
            "database": "ducklake_catalog",
            "username": "postgres",
            "password": "catalog-password",
            "schema": "lakexpress"
        }
    }
}
Configuration Fields
| Field | Description | Required |
|---|---|---|
| ds_type | Catalog database type: sqlite, postgres, mysql, or duckdb | Yes |
| catalog_type | Must be "ducklake" to enable DuckLake publishing | Yes |
| info.filepath | Path to the SQLite/DuckDB catalog file | For SQLite/DuckDB |
| info.server | Database server hostname | For PostgreSQL/MySQL |
| info.port | Database server port | For PostgreSQL/MySQL |
| info.database | Database name | For PostgreSQL/MySQL |
| info.username | Database username | For PostgreSQL/MySQL |
| info.password | Database password | For PostgreSQL/MySQL |
| info.schema | Schema name (PostgreSQL only) | No |
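The sketch below shows one way to sanity-check a catalog entry in credentials.json against the required fields above before running a sync. It is an illustrative helper, not part of LakeXpress; the ducklake_catalog key and the file name are assumptions carried over from the earlier example.
# Illustrative sketch only (not part of LakeXpress): check a DuckLake catalog entry
# in credentials.json against the required fields listed above.
import json

REQUIRED_BY_TYPE = {
    "sqlite": ["filepath"],
    "duckdb": ["filepath"],
    "postgres": ["server", "port", "database", "username", "password"],
    "mysql": ["server", "port", "database", "username", "password"],
}

def check_catalog_entry(entry: dict) -> list[str]:
    problems = []
    if entry.get("catalog_type") != "ducklake":
        problems.append('catalog_type must be "ducklake"')
    ds_type = entry.get("ds_type")
    required = REQUIRED_BY_TYPE.get(ds_type)
    if required is None:
        problems.append(f"unsupported ds_type: {ds_type!r}")
        return problems
    info = entry.get("info", {})
    problems += [f"missing info.{field}" for field in required if field not in info]
    return problems

with open("credentials.json") as f:
    creds = json.load(f)
print(check_catalog_entry(creds["ducklake_catalog"]) or "OK")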
Dynamic Naming Patterns
Pattern variables control how source schemas and tables map to DuckLake schema and table names:
| Variable | Description | Example |
|---|---|---|
| {schema} | Source schema name | public |
| {table} | Source table name | orders |
| {database} | Source database name | production |
Pattern Examples
Default (mirror source structure):
--publish_schema_pattern "{schema}" \
--publish_table_pattern "{table}"
Result: public.orders → public.orders
Prefix with environment:
--publish_schema_pattern "prod_{schema}" \
--publish_table_pattern "{table}"
Result: public.orders → prod_public.orders
Flatten to single schema:
--publish_schema_pattern "datalake" \
--publish_table_pattern "{schema}_{table}"
Result: public.orders → datalake.public_orders
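The substitution itself is plain placeholder formatting. The sketch below mirrors the three examples above; LakeXpress performs the equivalent step internally, so this is only an illustration of how the patterns resolve, not the tool's actual code.
# Illustrative sketch: how the naming placeholders resolve for a single source table.
def resolve_target(schema_pattern: str, table_pattern: str,
                   database: str, schema: str, table: str) -> str:
    target_schema = schema_pattern.format(database=database, schema=schema, table=table)
    target_table = table_pattern.format(database=database, schema=schema, table=table)
    return f"{target_schema}.{target_table}"

print(resolve_target("{schema}", "{table}", "production", "public", "orders"))
# public.orders
print(resolve_target("prod_{schema}", "{table}", "production", "public", "orders"))
# prod_public.orders
print(resolve_target("datalake", "{schema}_{table}", "production", "public", "orders"))
# datalake.public_orders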
Usage Examples
Basic Export and Publish
Step 1: Create Configuration
./LakeXpress config create \
    -a ./credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id source_db \
    --source_db_name production \
    --source_schema_name public \
    --fastbcp_dir_path ./FastBCP_linux-x64/latest/ \
    --target_storage_id s3_storage \
    --publish_auth_id ducklake_catalog \
    --n_jobs 4 \
    --fastbcp_p 2
Step 2: Run Sync
./LakeXpress sync
Local Filesystem with SQLite Catalog
{
    "local_storage": {
        "ds_type": "filesystem",
        "info": {
            "directory": "/data/exports"
        }
    },
    "ducklake_local": {
        "ds_type": "sqlite",
        "catalog_type": "ducklake",
        "info": {
            "filepath": "/data/ducklake_catalog.sqlite"
        }
    }
}
./LakeXpress config create \
    -a ./credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id source_db \
    --source_db_name mydb \
    --source_schema_name public \
    --fastbcp_dir_path ./FastBCP_linux-x64/latest/ \
    --target_storage_id local_storage \
    --publish_auth_id ducklake_local
S3-Compatible Storage (MinIO)
{
    "minio_storage": {
        "ds_type": "s3",
        "auth_mode": "key",
        "info": {
            "bucket": "data-lake",
            "directory": "exports/",
            "aws_access_key_id": "minioadmin",
            "aws_secret_access_key": "minioadmin",
            "region": "us-east-1",
            "endpoint": "http://localhost:9000"
        }
    }
}
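LakeXpress writes to MinIO using the endpoint from this entry. If you later query the published tables directly from DuckDB, you will typically also need an S3 secret that points at the same endpoint. The sketch below is a minimal, assumed setup using DuckDB's httpfs secrets; the secret name minio and the option values are illustrative and simply mirror the minio_storage entry above.
# Sketch: configure DuckDB to read MinIO-hosted Parquet files when querying published tables.
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")
conn.execute("""
CREATE SECRET minio (
    TYPE S3,
    KEY_ID 'minioadmin',
    SECRET 'minioadmin',
    REGION 'us-east-1',
    ENDPOINT 'localhost:9000',
    URL_STYLE 'path',
    USE_SSL false
)
""")
# The DuckLake catalog can then be attached as shown in "Querying Tables",
# with DATA_PATH 's3://data-lake/exports/'.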
Data Type Mapping
Source database types are mapped to DuckDB types during table registration:
| Source Type | DuckDB Type |
|---|---|
| INT, INTEGER | INTEGER |
| BIGINT | BIGINT |
| SMALLINT | SMALLINT |
| TINYINT | TINYINT |
| DECIMAL, NUMERIC | DECIMAL(p,s) |
| FLOAT, REAL | FLOAT |
| DOUBLE | DOUBLE |
| VARCHAR, NVARCHAR | VARCHAR |
| TEXT | VARCHAR |
| CHAR, NCHAR | VARCHAR |
| DATE | DATE |
| TIME | TIME |
| DATETIME, TIMESTAMP | TIMESTAMP |
| BOOLEAN, BIT | BOOLEAN |
| BINARY, VARBINARY | BLOB |
| UUID, UNIQUEIDENTIFIER | UUID |
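As a quick reference, the mapping above can be restated as a simple lookup. This is only an illustration of the table, not LakeXpress's internal implementation; the VARCHAR fallback for unknown types is an assumption added here.
# Illustrative restatement of the mapping table as a lookup (not LakeXpress internals).
# DECIMAL/NUMERIC keep their source precision and scale.
SOURCE_TO_DUCKDB = {
    "INT": "INTEGER", "INTEGER": "INTEGER",
    "BIGINT": "BIGINT", "SMALLINT": "SMALLINT", "TINYINT": "TINYINT",
    "DECIMAL": "DECIMAL(p,s)", "NUMERIC": "DECIMAL(p,s)",
    "FLOAT": "FLOAT", "REAL": "FLOAT", "DOUBLE": "DOUBLE",
    "VARCHAR": "VARCHAR", "NVARCHAR": "VARCHAR", "TEXT": "VARCHAR",
    "CHAR": "VARCHAR", "NCHAR": "VARCHAR",
    "DATE": "DATE", "TIME": "TIME",
    "DATETIME": "TIMESTAMP", "TIMESTAMP": "TIMESTAMP",
    "BOOLEAN": "BOOLEAN", "BIT": "BOOLEAN",
    "BINARY": "BLOB", "VARBINARY": "BLOB",
    "UUID": "UUID", "UNIQUEIDENTIFIER": "UUID",
}

def duckdb_type(source_type: str) -> str:
    # Unknown types fall back to VARCHAR here purely for illustration.
    return SOURCE_TO_DUCKDB.get(source_type.upper(), "VARCHAR")

print(duckdb_type("nvarchar"))  # VARCHAR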
Querying Tables
After publishing, query your data with DuckDB.
Using DuckDB CLI
-- Attach the DuckLake catalog (DATA_PATH points to where Parquet files are stored)
ATTACH 'postgres:dbname=ducklake_catalog host=localhost user=postgres password=secret' AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/');
-- List tables
SHOW TABLES FROM lake.public;
-- Query data
SELECT * FROM lake.public.orders LIMIT 10;
-- Time travel: query historical version
SELECT * FROM lake.public.orders AT (TIMESTAMP => TIMESTAMP '2024-01-15 10:00:00');
Using Python
import duckdb
# Connect and attach catalog
# DATA_PATH should match the target_storage location from LakeXpress config
conn = duckdb.connect()
conn.execute("""
ATTACH 'postgres:dbname=ducklake_catalog host=localhost user=postgres password=secret'
AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/')
""")
# Query data
df = conn.execute("SELECT * FROM lake.public.orders").fetchdf()
print(df)
Using SQLite Catalog
-- Attach SQLite-based catalog (DATA_PATH points to where Parquet files are stored)
ATTACH '/path/to/ducklake_catalog.sqlite' AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/');
-- Query
SELECT * FROM lake.main.customers;
CLI Reference
Relevant CLI Options
| Option | Description |
|---|---|
| --publish_auth_id | Credential ID for the DuckLake catalog |
| --publish_schema_pattern | Pattern for the target schema name (default: {schema}) |
| --publish_table_pattern | Pattern for the target table name (default: {table}) |
| --n_jobs | Number of parallel publish jobs |
Full Command with All Options
./LakeXpress config create \
    -a ./credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id source_pg \
    --source_db_name production \
    --source_schema_name public,sales,inventory \
    --fastbcp_dir_path ./FastBCP_linux-x64/latest/ \
    --target_storage_id s3_datalake \
    --publish_auth_id ducklake_catalog \
    --publish_schema_pattern "bronze_{schema}" \
    --publish_table_pattern "{table}" \
    --n_jobs 8 \
    --fastbcp_p 4
Next Steps
- Storage Backends - Configure S3, GCS, or Azure storage
- Incremental Sync - Set up delta exports
- CLI Reference - Full command documentation