DuckLake Publishing

LakeXpress can automatically register exported Parquet files in a DuckLake catalog, enabling ACID transactions, time travel, and schema evolution without copying data.

What is DuckLake?

DuckLake is an open table format that adds data warehouse capabilities to your data lake using DuckDB and a lightweight metadata catalog.

  • ACID Transactions: Atomic commits across multiple tables
  • Time Travel: Query data at any historical point in time
  • Schema Evolution: Add, rename, or modify columns without rewriting data
  • Multi-Engine Access: Query from DuckDB, Python, or any DuckDB-compatible tool
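
A minimal sketch of what these capabilities look like from DuckDB, assuming the catalog is already attached as lake (see Querying Tables below) and that an orders table has been published:

-- Schema evolution and atomic commits on a published table.
BEGIN TRANSACTION;
ALTER TABLE lake.public.orders ADD COLUMN discount DOUBLE;  -- adds a column without rewriting Parquet
COMMIT;  -- the change only becomes visible once the commit succeeds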

Supported catalog backends:

  • SQLite — File-based, suited for local development
  • PostgreSQL — Production-ready with concurrent access
  • MySQL — Alternative production catalog
  • DuckDB — Native DuckDB file as catalog

Prerequisites

1. Storage Backend

Parquet files must be accessible from the DuckLake catalog. Supported storage:

  • Local filesystem
  • AWS S3 (including S3-compatible: MinIO, OVH, etc.)
  • Google Cloud Storage (GCS)
  • Azure Blob Storage

2. Catalog Database

Choose a catalog backend:

  • SQLite: No setup required (file-based)
  • PostgreSQL: Requires a running PostgreSQL instance
  • MySQL: Requires a running MySQL instance
  • DuckDB: No setup required (file-based)

Catalog Backend Options

SQLite Catalog (Simplest)

Best for local development, single-user setups, and testing.

{
    "ducklake_sqlite": {
        "ds_type": "sqlite",
        "catalog_type": "ducklake",
        "info": {
            "filepath": "/path/to/ducklake_catalog.sqlite"
        }
    }
}

PostgreSQL Catalog

Best for multi-user access, production workloads, and concurrent queries.

{
    "ducklake_postgres": {
        "ds_type": "postgres",
        "catalog_type": "ducklake",
        "info": {
            "server": "localhost",
            "port": 5432,
            "database": "ducklake_catalog",
            "username": "postgres",
            "password": "your-password",
            "schema": "lakexpress"
        }
    }
}

The schema field sets the PostgreSQL search_path for organizing catalog tables.
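
To confirm where the catalog tables were created, a quick check against the catalog database works; the schema name lakexpress matches the example above, and the metadata table names themselves are managed by DuckLake:

-- Run against the ducklake_catalog database (e.g. with psql) to list the
-- catalog tables created under the configured schema.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'lakexpress';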

MySQL Catalog

Best for environments already using MySQL.

{
    "ducklake_mysql": {
        "ds_type": "mysql",
        "catalog_type": "ducklake",
        "info": {
            "server": "localhost",
            "port": 3306,
            "database": "ducklake_catalog",
            "username": "root",
            "password": "your-password"
        }
    }
}

MySQL uses the database as the namespace (no separate schema concept).
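
To see that namespace in practice, you can list the tables DuckLake creates inside the catalog database; the database name below is the one from the example configuration:

-- List the DuckLake metadata tables stored in the catalog database.
SHOW TABLES FROM ducklake_catalog;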

DuckDB Catalog

Best for DuckDB-native workflows and embedded analytics.

{
    "ducklake_duckdb": {
        "ds_type": "duckdb",
        "catalog_type": "ducklake",
        "info": {
            "filepath": "/path/to/ducklake_catalog.duckdb"
        }
    }
}
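
Attaching a DuckDB-file catalog for querying should mirror the SQLite example shown in Querying Tables below; the paths are placeholders, and DATA_PATH must point at the exported Parquet location:

-- Attach the DuckDB-file catalog; DATA_PATH is where the Parquet files live.
ATTACH '/path/to/ducklake_catalog.duckdb' AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/');
SELECT * FROM lake.public.orders LIMIT 10;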

Configuration Options

Full Credentials Example

credentials.json with S3 storage and PostgreSQL catalog:

{
    "log_db": {
        "ds_type": "postgres",
        "auth_mode": "password",
        "info": {
            "server": "localhost",
            "port": 5432,
            "database": "lakexpress_log",
            "username": "postgres",
            "password": "log-password"
        }
    },
    "source_db": {
        "ds_type": "postgres",
        "auth_mode": "password",
        "info": {
            "server": "source-server",
            "port": 5432,
            "database": "production",
            "username": "readonly_user",
            "password": "source-password"
        }
    },
    "s3_storage": {
        "ds_type": "s3",
        "auth_mode": "key",
        "info": {
            "bucket": "my-data-lake",
            "directory": "exports/",
            "aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
            "aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
            "region": "us-east-1"
        }
    },
    "ducklake_catalog": {
        "ds_type": "postgres",
        "catalog_type": "ducklake",
        "info": {
            "server": "localhost",
            "port": 5432,
            "database": "ducklake_catalog",
            "username": "postgres",
            "password": "catalog-password",
            "schema": "lakexpress"
        }
    }
}

Configuration Fields

  • ds_type: catalog database type (sqlite, postgres, mysql, or duckdb). Required.
  • catalog_type: must be "ducklake" to enable DuckLake publishing. Required.
  • info.filepath: path to the SQLite/DuckDB catalog file. Required for SQLite/DuckDB.
  • info.server: database server hostname. Required for PostgreSQL/MySQL.
  • info.port: database server port. Required for PostgreSQL/MySQL.
  • info.database: database name. Required for PostgreSQL/MySQL.
  • info.username: database username. Required for PostgreSQL/MySQL.
  • info.password: database password. Required for PostgreSQL/MySQL.
  • info.schema: schema name (PostgreSQL only). Optional.

Dynamic Naming Patterns

Pattern variables control how source tables map to DuckLake table names:

  • {schema}: source schema name (e.g. public)
  • {table}: source table name (e.g. orders)
  • {database}: source database name (e.g. production)

Pattern Examples

Default (mirror source structure):

--publish_schema_pattern "{schema}" \
--publish_table_pattern "{table}"

Result: public.orders → public.orders

Prefix with environment:

--publish_schema_pattern "prod_{schema}" \
--publish_table_pattern "{table}"

Result: public.orders → prod_public.orders

Flatten to single schema:

--publish_schema_pattern "datalake" \
--publish_table_pattern "{schema}_{table}"

Result: public.orders → datalake.public_orders

Usage Examples

Basic Export and Publish

Step 1: Create Configuration

./LakeXpress config create \
  -a ./credentials.json \
  --log_db_auth_id log_db \
  --source_db_auth_id source_db \
  --source_db_name production \
  --source_schema_name public \
  --fastbcp_dir_path ./FastBCP_linux-x64/latest/ \
  --target_storage_id s3_storage \
  --publish_auth_id ducklake_catalog \
  --n_jobs 4 \
  --fastbcp_p 2

Step 2: Run Sync

./LakeXpress sync
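
A quick way to verify the publish step is to attach the catalog from DuckDB and list what was registered. This is a sketch: connection details are assumed to match the ducklake_catalog and s3_storage entries above, and DuckDB also needs S3 credentials configured (for example via CREATE SECRET).

-- Attach the catalog that LakeXpress just populated and list the registered tables.
ATTACH 'postgres:dbname=ducklake_catalog host=localhost user=postgres password=catalog-password'
    AS lake (TYPE ducklake, DATA_PATH 's3://my-data-lake/exports/');
SHOW ALL TABLES;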

Local Filesystem with SQLite Catalog

{
    "local_storage": {
        "ds_type": "filesystem",
        "info": {
            "directory": "/data/exports"
        }
    },
    "ducklake_local": {
        "ds_type": "sqlite",
        "catalog_type": "ducklake",
        "info": {
            "filepath": "/data/ducklake_catalog.sqlite"
        }
    }
}

./LakeXpress config create \
  -a ./credentials.json \
  --log_db_auth_id log_db \
  --source_db_auth_id source_db \
  --source_db_name mydb \
  --source_schema_name public \
  --fastbcp_dir_path ./FastBCP_linux-x64/latest/ \
  --target_storage_id local_storage \
  --publish_auth_id ducklake_local

S3-Compatible Storage (MinIO)

{
    "minio_storage": {
        "ds_type": "s3",
        "auth_mode": "key",
        "info": {
            "bucket": "data-lake",
            "directory": "exports/",
            "aws_access_key_id": "minioadmin",
            "aws_secret_access_key": "minioadmin",
            "region": "us-east-1",
            "endpoint": "http://localhost:9000"
        }
    }
}
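
Querying MinIO-backed data from DuckDB requires an S3 secret pointing at the MinIO endpoint. A sketch reusing the values from the minio_storage entry above; the catalog path and table name are illustrative, and URL_STYLE/USE_SSL may need adjusting for your setup:

-- Register MinIO credentials with DuckDB, then attach the DuckLake catalog.
CREATE SECRET minio (
    TYPE s3,
    KEY_ID 'minioadmin',
    SECRET 'minioadmin',
    ENDPOINT 'localhost:9000',
    URL_STYLE 'path',
    USE_SSL false
);
ATTACH '/data/ducklake_catalog.sqlite' AS lake (TYPE ducklake, DATA_PATH 's3://data-lake/exports/');
SELECT count(*) FROM lake.public.orders;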

Data Type Mapping

Source database types are mapped to DuckDB types during table registration:

  • INT, INTEGER → INTEGER
  • BIGINT → BIGINT
  • SMALLINT → SMALLINT
  • TINYINT → TINYINT
  • DECIMAL, NUMERIC → DECIMAL(p,s)
  • FLOAT, REAL → FLOAT
  • DOUBLE → DOUBLE
  • VARCHAR, NVARCHAR → VARCHAR
  • TEXT → VARCHAR
  • CHAR, NCHAR → VARCHAR
  • DATE → DATE
  • TIME → TIME
  • DATETIME, TIMESTAMP → TIMESTAMP
  • BOOLEAN, BIT → BOOLEAN
  • BINARY, VARBINARY → BLOB
  • UUID, UNIQUEIDENTIFIER → UUID
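
To check which types a published table actually received, DESCRIBE works once the catalog is attached (the table name here is illustrative):

-- Shows column names and DuckDB types for a published table.
DESCRIBE lake.public.orders;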

Querying Tables

After publishing, query your data with DuckDB.

Using DuckDB CLI

-- Attach the DuckLake catalog (DATA_PATH points to where Parquet files are stored)
ATTACH 'postgres:dbname=ducklake_catalog host=localhost user=postgres password=secret' AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/');

-- List tables
SHOW TABLES FROM lake.public;

-- Query data
SELECT * FROM lake.public.orders LIMIT 10;

-- Time travel: query historical version
SELECT * FROM lake.public.orders AS OF TIMESTAMP '2024-01-15 10:00:00';
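
DuckLake also records snapshot metadata in the catalog, which is useful for picking a time-travel target. A sketch, assuming the catalog is attached as lake as above and that your DuckLake version exposes the snapshots() function:

-- List snapshots recorded in the catalog (snapshot id, timestamp, changes).
SELECT * FROM lake.snapshots();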

Using Python

import duckdb

# Connect and attach catalog
# DATA_PATH should match the target_storage location from LakeXpress config
conn = duckdb.connect()
conn.execute("""
    ATTACH 'postgres:dbname=ducklake_catalog host=localhost user=postgres password=secret'
    AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/')
""")

# Query data
df = conn.execute("SELECT * FROM lake.public.orders").fetchdf()
print(df)

Using SQLite Catalog

-- Attach SQLite-based catalog (DATA_PATH points to where Parquet files are stored)
ATTACH '/path/to/ducklake_catalog.sqlite' AS lake (TYPE ducklake, DATA_PATH '/path/to/parquet/files/');

-- Query
SELECT * FROM lake.main.customers;

CLI Reference

Relevant CLI Options

  • --publish_auth_id: credential ID for the DuckLake catalog
  • --publish_schema_pattern: pattern for the target schema name (default: {schema})
  • --publish_table_pattern: pattern for the target table name (default: {table})
  • --n_jobs: number of parallel publish jobs

Full Command with All Options

./LakeXpress config create \
  -a ./credentials.json \
  --log_db_auth_id log_db \
  --source_db_auth_id source_pg \
  --source_db_name production \
  --source_schema_name public,sales,inventory \
  --fastbcp_dir_path ./FastBCP_linux-x64/latest/ \
  --target_storage_id s3_datalake \
  --publish_auth_id ducklake_catalog \
  --publish_schema_pattern "bronze_{schema}" \
  --publish_table_pattern "{table}" \
  --n_jobs 8 \
  --fastbcp_p 4

Next Steps