DuckLake Publishing

LakeXpress can automatically register exported Parquet files in a DuckLake catalog, enabling ACID transactions, time travel, and schema evolution without copying data.

What is DuckLake?

DuckLake is an open table format that adds data warehouse capabilities to your data lake using DuckDB and a lightweight metadata catalog.

  • ACID Transactions: Atomic commits across multiple tables
  • Time Travel: Query data at any historical point in time
  • Schema Evolution: Add, rename, or modify columns without rewriting data
  • Multi-Engine Access: Query from DuckDB, Python, or any DuckDB-compatible tool
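
A minimal sketch of what these capabilities look like from DuckDB, assuming the catalog is already attached as lake (see Querying Tables below) and that an orders table has been published:

-- Schema evolution and atomic commits on a published table.
BEGIN TRANSACTION;
ALTER TABLE lake.public.orders ADD COLUMN discount DOUBLE;  -- adds a column without rewriting Parquet
COMMIT;  -- the change only becomes visible once the commit succeeds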

Supported catalog backends:

  • SQLite — File-based, suited for local development
  • PostgreSQL — Production-ready with concurrent access
  • MySQL — Alternative production catalog
  • DuckDB — Native DuckDB file as catalog

Prerequisites

1. Storage Backend

Parquet files must be accessible from the DuckLake catalog. Supported storage:

  • Local filesystem
  • AWS S3 (including S3-compatible: MinIO, OVH, etc.)
  • Google Cloud Storage (GCS)
  • Azure Blob Storage

2. Catalog Database

Choose a catalog backend:

  • SQLite: No setup required (file-based)
  • PostgreSQL: Requires a running PostgreSQL instance
  • MySQL: Requires a running MySQL instance
  • DuckDB: No setup required (file-based)

Catalog Backend Options

SQLite Catalog (Simplest)

Best for local development, single-user setups, and testing.

{
    "ducklake_sqlite": {
        "ds_type": "sqlite",
        "catalog_type": "ducklake",
        "info": {
            "filepath": "/path/to/ducklake_catalog.sqlite"
        }
    }
}

PostgreSQL Catalog

Best for multi-user access, production workloads, and concurrent queries.

{
    "ducklake_postgres": {
        "ds_type": "postgres",
        "catalog_type": "ducklake",
        "info": {
            "server": "localhost",
            "port": 5432,
            "database": "ducklake_catalog",
            "username": "postgres",
            "password": "your-password",
            "schema": "lakexpress"
        }
    }
}

The schema field sets the PostgreSQL search_path for organizing catalog tables.
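
To confirm where the catalog tables were created, a quick check against the catalog database works; the schema name lakexpress matches the example above, and the metadata table names themselves are managed by DuckLake:

-- Run against the ducklake_catalog database (e.g. with psql) to list the
-- catalog tables created under the configured schema.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'lakexpress';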

MySQL Catalog

Best for environments already using MySQL.

{
    "ducklake_mysql": {
        "ds_type": "mysql",
        "catalog_type": "ducklake",
        "info": {
            "server": "localhost",
            "port": 3306,
            "database": "ducklake_catalog",
            "username": "root",
            "password": "your-password"
        }
    }
}

MySQL uses the database as the namespace (no separate schema concept).
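
To see that namespace in practice, you can list the tables DuckLake creates inside the catalog database; the database name below is the one from the example configuration:

-- List the DuckLake metadata tables stored in the catalog database.
SHOW TABLES FROM ducklake_catalog;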

DuckDB Catalog

Best for DuckDB-native workflows and embedded analytics.

{
    "ducklake_duckdb": {
        "ds_type": "duckdb",
        "catalog_type": "ducklake",
        "info": {
            "filepath": "/path/to/ducklake_catalog.duckdb"
        }
    }
}
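
Attaching a DuckDB-file catalog for querying should mirror the SQLite example shown in Querying Tables below; the paths are placeholders, and DATA_PATH must point at the exported Parquet location:

-- Attach the DuckDB-file catalog; DATA_PATH is where the Parquet files live.
ATTACH '/path/to/ducklake_catalog.duckdb' AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/');
SELECT * FROM lake.public.orders LIMIT 10;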

Configuration Options

Full Credentials Example

credentials.json with S3 storage and PostgreSQL catalog:

{
    "log_db": {
        "ds_type": "postgres",
        "auth_mode": "password",
        "info": {
            "server": "localhost",
            "port": 5432,
            "database": "lakexpress_log",
            "username": "postgres",
            "password": "log-password"
        }
    },
    "source_db": {
        "ds_type": "postgres",
        "auth_mode": "password",
        "info": {
            "server": "source-server",
            "port": 5432,
            "database": "production",
            "username": "readonly_user",
            "password": "source-password"
        }
    },
    "s3_storage": {
        "ds_type": "s3",
        "auth_mode": "key",
        "info": {
            "bucket": "my-data-lake",
            "directory": "exports/",
            "aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
            "aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
            "region": "us-east-1"
        }
    },
    "ducklake_catalog": {
        "ds_type": "postgres",
        "catalog_type": "ducklake",
        "info": {
            "server": "localhost",
            "port": 5432,
            "database": "ducklake_catalog",
            "username": "postgres",
            "password": "catalog-password",
            "schema": "lakexpress"
        }
    }
}

Configuration Fields

  • ds_type: catalog database type (sqlite, postgres, mysql, or duckdb). Required.
  • catalog_type: must be "ducklake" to enable DuckLake publishing. Required.
  • info.filepath: path to the SQLite/DuckDB catalog file. Required for SQLite/DuckDB.
  • info.server: database server hostname. Required for PostgreSQL/MySQL.
  • info.port: database server port. Required for PostgreSQL/MySQL.
  • info.database: database name. Required for PostgreSQL/MySQL.
  • info.username: database username. Required for PostgreSQL/MySQL.
  • info.password: database password. Required for PostgreSQL/MySQL.
  • info.schema: schema name (PostgreSQL only). Optional.

Dynamic Naming Patterns

Pattern variables control how source tables map to DuckLake table names:

  • {schema}: source schema name (e.g. public)
  • {table}: source table name (e.g. orders)
  • {database}: source database name (e.g. production)

Pattern Examples

Default (mirror source structure):

--publish_schema_pattern "{schema}" \
--publish_table_pattern "{table}"

Result: public.orders → public.orders

Prefix with environment:

--publish_schema_pattern "prod_{schema}" \
--publish_table_pattern "{table}"

Result: public.orders → prod_public.orders

Flatten to single schema:

--publish_schema_pattern "datalake" \
--publish_table_pattern "{schema}_{table}"

Result: public.orders → datalake.public_orders

Usage Examples

Basic Export and Publish

Step 1: Create Configuration

./LakeXpress config create \
  -a ./credentials.json \
  --log_db_auth_id log_db \
  --source_db_auth_id source_db \
  --source_db_name production \
  --source_schema_name public \
  --fastbcp_dir_path ./FastBCP_linux-x64/latest/ \
  --target_storage_id s3_storage \
  --publish_auth_id ducklake_catalog \
  --n_jobs 4 \
  --fastbcp_p 2

Step 2: Run Sync

./LakeXpress sync
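
A quick way to verify the publish step is to attach the catalog from DuckDB and list what was registered. This is a sketch: connection details are assumed to match the ducklake_catalog and s3_storage entries above, and DuckDB also needs S3 credentials configured (for example via CREATE SECRET).

-- Attach the catalog that LakeXpress just populated and list the registered tables.
ATTACH 'postgres:dbname=ducklake_catalog host=localhost user=postgres password=catalog-password'
    AS lake (TYPE ducklake, DATA_PATH 's3://my-data-lake/exports/');
SHOW ALL TABLES;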

Local Filesystem with SQLite Catalog

{
    "local_storage": {
        "ds_type": "filesystem",
        "info": {
            "directory": "/data/exports"
        }
    },
    "ducklake_local": {
        "ds_type": "sqlite",
        "catalog_type": "ducklake",
        "info": {
            "filepath": "/data/ducklake_catalog.sqlite"
        }
    }
}

./LakeXpress config create \
  -a ./credentials.json \
  --log_db_auth_id log_db \
  --source_db_auth_id source_db \
  --source_db_name mydb \
  --source_schema_name public \
  --fastbcp_dir_path ./FastBCP_linux-x64/latest/ \
  --target_storage_id local_storage \
  --publish_auth_id ducklake_local

S3-Compatible Storage (MinIO)

{
    "minio_storage": {
        "ds_type": "s3",
        "auth_mode": "key",
        "info": {
            "bucket": "data-lake",
            "directory": "exports/",
            "aws_access_key_id": "minioadmin",
            "aws_secret_access_key": "minioadmin",
            "region": "us-east-1",
            "endpoint": "http://localhost:9000"
        }
    }
}
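
Querying MinIO-backed data from DuckDB requires an S3 secret pointing at the MinIO endpoint. A sketch reusing the values from the minio_storage entry above; the catalog path and table name are illustrative, and URL_STYLE/USE_SSL may need adjusting for your setup:

-- Register MinIO credentials with DuckDB, then attach the DuckLake catalog.
CREATE SECRET minio (
    TYPE s3,
    KEY_ID 'minioadmin',
    SECRET 'minioadmin',
    ENDPOINT 'localhost:9000',
    URL_STYLE 'path',
    USE_SSL false
);
ATTACH '/data/ducklake_catalog.sqlite' AS lake (TYPE ducklake, DATA_PATH 's3://data-lake/exports/');
SELECT count(*) FROM lake.public.orders;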

Data Type Mapping

Source database types are mapped to DuckDB types during table registration:

  • INT, INTEGER → INTEGER
  • BIGINT → BIGINT
  • SMALLINT → SMALLINT
  • TINYINT → TINYINT
  • DECIMAL, NUMERIC → DECIMAL(p,s)
  • FLOAT, REAL → FLOAT
  • DOUBLE → DOUBLE
  • VARCHAR, NVARCHAR → VARCHAR
  • TEXT → VARCHAR
  • CHAR, NCHAR → VARCHAR
  • DATE → DATE
  • TIME → TIME
  • DATETIME, TIMESTAMP → TIMESTAMP
  • BOOLEAN, BIT → BOOLEAN
  • BINARY, VARBINARY → BLOB
  • UUID, UNIQUEIDENTIFIER → UUID
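
To check which types a published table actually received, DESCRIBE works once the catalog is attached (the table name here is illustrative):

-- Shows column names and DuckDB types for a published table.
DESCRIBE lake.public.orders;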

Querying Tables

After publishing, query your data with DuckDB.

Using DuckDB CLI

-- Attach the DuckLake catalog (DATA_PATH points to where Parquet files are stored)
ATTACH 'postgres:dbname=ducklake_catalog host=localhost user=postgres password=secret' AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/');

-- List tables
SHOW TABLES FROM lake.public;

-- Query data
SELECT * FROM lake.public.orders LIMIT 10;

-- Time travel: query historical version
SELECT * FROM lake.public.orders AS OF TIMESTAMP '2024-01-15 10:00:00';
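
DuckLake also records snapshot metadata in the catalog, which is useful for picking a time-travel target. A sketch, assuming the catalog is attached as lake as above and that your DuckLake version exposes the snapshots() function:

-- List snapshots recorded in the catalog (snapshot id, timestamp, changes).
SELECT * FROM lake.snapshots();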

Using Python

import duckdb

# Connect and attach catalog
# DATA_PATH should match the target_storage location from LakeXpress config
conn = duckdb.connect()
conn.execute("""
    ATTACH 'postgres:dbname=ducklake_catalog host=localhost user=postgres password=secret'
    AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/')
""")

# Query data
df = conn.execute("SELECT * FROM lake.public.orders").fetchdf()
print(df)

Using SQLite Catalog

-- Attach SQLite-based catalog (DATA_PATH points to where Parquet files are stored)
ATTACH '/path/to/ducklake_catalog.sqlite' AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/');

-- Query
SELECT * FROM lake.main.customers;

CLI Reference

Relevant CLI Options

  • --publish_auth_id: credential ID for the DuckLake catalog
  • --publish_schema_pattern: pattern for the target schema name (default: {schema})
  • --publish_table_pattern: pattern for the target table name (default: {table})
  • --n_jobs: number of parallel publish jobs

Full Command with All Options

./LakeXpress config create \
  -a ./credentials.json \
  --log_db_auth_id log_db \
  --source_db_auth_id source_pg \
  --source_db_name production \
  --source_schema_name public,sales,inventory \
  --fastbcp_dir_path ./FastBCP_linux-x64/latest/ \
  --target_storage_id s3_datalake \
  --publish_auth_id ducklake_catalog \
  --publish_schema_pattern "bronze_{schema}" \
  --publish_table_pattern "{table}" \
  --n_jobs 8 \
  --fastbcp_p 4

Next Steps