MotherDuck Publishing

LakeXpress creates MotherDuck tables from exported Parquet files in cloud storage (S3, GCS, Azure).

Prerequisites

1. MotherDuck Account

Sign up at motherduck.com and generate an access token:

  1. Log in to MotherDuck
  2. Go to Settings → Access Tokens
  3. Create a new token
  4. Copy the token for your credentials file
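
To sanity-check a new token before wiring it into LakeXpress, a quick DuckDB connection is enough. A minimal sketch, assuming the duckdb Python package and the my_analytics database used in the examples below:

import duckdb

# Connect to MotherDuck with the token in the connection string
# (substitute your database name and token).
conn = duckdb.connect("md:my_analytics?motherduck_token=your-motherduck-access-token")

# A trivial query confirms the token and database are accepted.
print(conn.execute("SELECT 1").fetchone())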

2. Cloud Storage Setup

Configure credentials for cloud storage and MotherDuck:

{
    "s3_datalake": {
        "ds_type": "s3",
        "auth_mode": "profile",
        "info": {
            "directory": "s3://my-datalake-bucket/lakexpress/",
            "profile": "my-aws-profile"
        }
    },
    "motherduck_prod": {
        "ds_type": "motherduck",
        "auth_mode": "token",
        "info": {
            "database": "my_analytics",
            "token": "your-motherduck-access-token"
        }
    }
}

3. Storage Access from MotherDuck (Required for External Views)

Important: When using external views (--publish_method external), MotherDuck needs credentials to read your Parquet files from cloud storage. Without this, you’ll see errors like:

IO Error: No files found that match the pattern "s3://..."
HTTP Error: Permission error: Missing or invalid credentials

Configure secrets in MotherDuck before querying external views:

For AWS S3 (most common):

CREATE SECRET aws_s3_secret (
    TYPE S3,
    KEY_ID 'your_aws_access_key_id',
    SECRET 'your_aws_secret_access_key',
    REGION 'us-east-1'  -- your bucket's region
);

To find your AWS credentials:

aws configure get aws_access_key_id
aws configure get aws_secret_access_key
aws configure get region

For GCS:

CREATE SECRET gcs_secret (
    TYPE GCS,
    KEY_ID 'your_hmac_access_key',
    SECRET 'your_hmac_secret_key'
);

For Azure:

CREATE SECRET azure_secret (
    TYPE AZURE,
    ACCOUNT_NAME 'your_storage_account',
    ACCOUNT_KEY 'your_account_key'
);

Verify secrets are configured:

SELECT * FROM duckdb_secrets();
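
The same secrets can also be created from a script by executing the SQL above over a DuckDB connection to MotherDuck. A minimal sketch, assuming the duckdb Python package and the S3 secret form shown earlier:

import duckdb

conn = duckdb.connect("md:my_analytics?motherduck_token=your-motherduck-access-token")

# Create the S3 secret MotherDuck will use to read the exported Parquet files.
conn.execute("""
    CREATE SECRET aws_s3_secret (
        TYPE S3,
        KEY_ID 'your_aws_access_key_id',
        SECRET 'your_aws_secret_access_key',
        REGION 'us-east-1'
    )
""")

# Confirm the secret is visible before querying external views.
print(conn.execute("SELECT * FROM duckdb_secrets()").fetchall())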

Token Authentication

Store your token in the credentials file:

{
    "motherduck_prod": {
        "ds_type": "motherduck",
        "auth_mode": "token",
        "info": {
            "database": "my_analytics",
            "token": "your-motherduck-access-token"
        }
    }
}

Field      Required  Description
ds_type    Yes       Must be "motherduck"
auth_mode  Yes       Set to "token"
database   Yes       Target MotherDuck database name
token      Yes       MotherDuck access token
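
As an illustration only (this is not LakeXpress's own validation code), a short script can catch a missing field before a run; the credentials.json filename and motherduck_prod entry match the examples above:

import json

REQUIRED_INFO = {"database", "token"}  # per the field table above

with open("credentials.json") as f:
    creds = json.load(f)

entry = creds["motherduck_prod"]
assert entry["ds_type"] == "motherduck"
assert entry["auth_mode"] == "token"

missing = REQUIRED_INFO - entry["info"].keys()
if missing:
    raise SystemExit(f"motherduck_prod is missing fields: {sorted(missing)}")
print("motherduck_prod credential entry looks complete")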

Environment Variable Authentication

Read the token from the motherduck_token environment variable instead of storing it in the file:

{
    "motherduck_env": {
        "ds_type": "motherduck",
        "auth_mode": "env",
        "info": {
            "database": "my_analytics"
        }
    }
}

export motherduck_token="your-motherduck-access-token"
lakexpress ...

Field      Required  Description
ds_type    Yes       Must be "motherduck"
auth_mode  Yes       Set to "env"
database   Yes       Target MotherDuck database name

Configuration Options

Option                            Description                                     Default
--publish_target ID               Credential ID for MotherDuck target (required)  -
--publish_schema_pattern PATTERN  Schema naming pattern                           {schema}
--publish_table_pattern PATTERN   Table naming pattern                            {table}
--publish_method TYPE             Table type: external or internal                external
--n_jobs N                        Parallel workers for table creation             1

Table Types

External Views (Default)

Queries data directly from cloud storage via read_parquet().

lakexpress ... --publish_target motherduck_prod --publish_method external

Generated SQL:

CREATE OR REPLACE VIEW "my_database"."tpch_1"."orders" AS
SELECT * FROM read_parquet('s3://bucket/exports/tpch_1/orders/*.parquet')

Pros:

  • No data copying, instant publishing
  • No MotherDuck storage costs
  • Always reflects latest cloud storage data
  • Best for large datasets, infrequent queries, exploration

Trade-offs:

  • Query speed depends on cloud storage latency
  • Requires cloud storage credentials in MotherDuck
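
Because the view is only a wrapper around read_parquet(), querying it and querying the files directly return the same rows. A minimal sketch, assuming the view and S3 path from the generated SQL above plus an S3 secret configured in MotherDuck:

import duckdb

conn = duckdb.connect("md:my_database?motherduck_token=your_token")

# Count through the published external view ...
via_view = conn.execute(
    'SELECT count(*) FROM "my_database"."tpch_1"."orders"'
).fetchone()[0]

# ... and directly against the Parquet files the view wraps.
direct = conn.execute(
    "SELECT count(*) FROM read_parquet('s3://bucket/exports/tpch_1/orders/*.parquet')"
).fetchone()[0]

print(via_view == direct)  # both read the same files in cloud storage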

Internal Tables

Loads data into MotherDuck's native columnar storage.

lakexpress ... --publish_target motherduck_prod --publish_method internal

Generated SQL:

CREATE OR REPLACE TABLE "my_database"."tpch_1"."orders" AS
SELECT * FROM read_parquet('s3://bucket/exports/tpch_1/orders/*.parquet')

Pros:

  • Faster queries (optimized columnar format)
  • No storage credentials needed after load
  • Better caching for repeated queries
  • Best for dashboards and production analytics

Trade-offs:

  • Slower publishing (data must load)
  • MotherDuck storage costs apply
  • Data is a point-in-time snapshot
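
Because an internal table is a snapshot, it only changes when it is re-published: re-run LakeXpress with --publish_method internal, or re-issue the generated CREATE OR REPLACE TABLE statement. A minimal sketch of a manual refresh, using the same names as the generated SQL above:

import duckdb

conn = duckdb.connect("md:my_database?motherduck_token=your_token")

# Re-run the generated CTAS to load a fresh snapshot from the
# current Parquet files in cloud storage.
conn.execute("""
    CREATE OR REPLACE TABLE "my_database"."tpch_1"."orders" AS
    SELECT * FROM read_parquet('s3://bucket/exports/tpch_1/orders/*.parquet')
""")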

Dynamic Naming Patterns

Schema and table names support token-based patterns.

Supported Tokens

Token        Description                             Example Output
{schema}     Source schema name                      tpch_1
{table}      Source table name                       customer
{database}   Source database name                    tpch
{date}       Current date (YYYYMMDD)                 20251210
{timestamp}  Current timestamp (YYYYMMDD_HHMMSS)     20251210_143022
{uuid}       UUID4 identifier (consistent per run)   a1b2c3d4
{subpath}    CLI --sub_path parameter                staging
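
The tokens behave like plain string substitutions. For illustration only (this is not LakeXpress's internal implementation), the expansion can be pictured as:

from datetime import datetime
import uuid

def expand(pattern, schema, table, database, subpath, run_uuid):
    """Illustrative expansion of the pattern tokens documented above."""
    now = datetime.now()
    return pattern.format(
        schema=schema,
        table=table,
        database=database,
        date=now.strftime("%Y%m%d"),
        timestamp=now.strftime("%Y%m%d_%H%M%S"),
        uuid=run_uuid,  # one value reused for the whole run
        subpath=subpath,
    )

run_uuid = uuid.uuid4().hex[:8]
print(expand("lx_{schema}", "tpch_1", "customer", "tpch", "staging", run_uuid))       # lx_tpch_1
print(expand("{schema}_{table}", "tpch_1", "customer", "tpch", "staging", run_uuid))  # tpch_1_customer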

Common Patterns

Prefixed Schemas

lakexpress \
  --publish_schema_pattern "lx_{schema}" \
  --publish_table_pattern "{table}" \
  --publish_target motherduck_prod \
  ...

# Results:
# Schema: lx_tpch_1
# Tables: customer, orders, lineitem

Date-Partitioned Schemas

lakexpress \
  --publish_schema_pattern "{schema}_{date}" \
  --publish_table_pattern "{table}" \
  --publish_target motherduck_prod \
  ...

# Results:
# Schema: tpch_1_20251210
# Tables: customer, orders, lineitem

Consolidated Multi-Schema

lakexpress \
  --source_schema_name schema1,schema2 \
  --publish_schema_pattern "consolidated" \
  --publish_table_pattern "{schema}_{table}" \
  --publish_target motherduck_prod \
  ...

# Results:
# Schema: consolidated
# Tables: schema1_customer, schema2_customer

Usage Examples

Example 1: Basic Export with External Views

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target motherduck_prod

Result: Database my_analytics, schema public, external views over S3 Parquet files.

Example 2: Internal Tables with Parallel Execution

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target motherduck_prod \
    --publish_method internal \
    --publish_schema_pattern "lx_{schema}" \
    --n_jobs 4

Result: Database my_analytics, schema lx_tpch_1, native tables loaded from S3.

Example 3: Daily Snapshots

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name sales \
    --target_storage_id s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target motherduck_prod \
    --publish_schema_pattern "sales_{date}" \
    --sub_path "daily/$(date +%Y%m%d)"

Result: Schema sales_20251210, data at s3://bucket/lakexpress/daily/20251210/.

Example 4: Environment Variable Token

export motherduck_token="your-token-here"

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target motherduck_env

Example 5: GCS Storage Backend

{
    "gcs_datalake": {
        "ds_type": "gcs",
        "auth_mode": "profile",
        "info": {
            "directory": "gs://my-datalake-bucket/lakexpress/",
            "profile": "/path/to/service-account.json"
        }
    },
    "motherduck_prod": {
        "ds_type": "motherduck",
        "auth_mode": "token",
        "info": {
            "database": "my_analytics",
            "token": "your-motherduck-token"
        }
    }
}

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id gcs_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target motherduck_prod

Data Type Mapping

Source types map to DuckDB-compatible types automatically.

PostgreSQL → DuckDB/MotherDuck

PostgreSQL Type   DuckDB Type
INTEGER, INT4     INTEGER
BIGINT, INT8      BIGINT
SMALLINT, INT2    SMALLINT
NUMERIC(p,s)      DECIMAL(p,s)
REAL, FLOAT4      REAL
DOUBLE PRECISION  DOUBLE
VARCHAR(n), TEXT  VARCHAR
DATE              DATE
TIMESTAMP         TIMESTAMP
TIMESTAMPTZ       TIMESTAMPTZ
TIME              TIME
BOOLEAN           BOOLEAN
BYTEA             BLOB
JSON, JSONB       JSON
UUID              UUID

SQL Server → DuckDB/MotherDuck

SQL Server Type          DuckDB Type
INT                      INTEGER
BIGINT                   BIGINT
SMALLINT                 SMALLINT
TINYINT                  TINYINT
DECIMAL(p,s)             DECIMAL(p,s)
MONEY                    DECIMAL(19,4)
FLOAT                    DOUBLE
REAL                     REAL
VARCHAR(n), NVARCHAR(n)  VARCHAR
DATE                     DATE
DATETIME, DATETIME2      TIMESTAMP
DATETIMEOFFSET           TIMESTAMPTZ
TIME                     TIME
BIT                      BOOLEAN
VARBINARY                BLOB
UNIQUEIDENTIFIER         UUID

Oracle → DuckDB/MotherDuck

Oracle Type          DuckDB Type
NUMBER               DECIMAL(38,9)
NUMBER(p,s)          DECIMAL(p,s)
FLOAT                DOUBLE
VARCHAR2, NVARCHAR2  VARCHAR
CLOB, NCLOB          VARCHAR
DATE                 TIMESTAMP
TIMESTAMP            TIMESTAMP
RAW, BLOB            BLOB
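
To check how a published table's columns came out, DESCRIBE over a MotherDuck connection lists each column with its DuckDB type. A minimal sketch, assuming the lx_tpch_1.customer table from the earlier examples:

import duckdb

conn = duckdb.connect("md:my_analytics?motherduck_token=your_token")

# Each row is a column name plus its DuckDB type, so the mapping
# (e.g. NUMERIC(p,s) -> DECIMAL(p,s)) can be verified on the published table.
for name, dtype, *_ in conn.execute("DESCRIBE my_analytics.lx_tpch_1.customer").fetchall():
    print(name, dtype)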

CLI Reference

MotherDuck Publishing Arguments

Option                            Type     Description
--publish_target ID               String   Credential ID for MotherDuck publishing (required)
--publish_schema_pattern PATTERN  String   Dynamic schema naming pattern (default: {schema})
--publish_table_pattern PATTERN   String   Dynamic table naming pattern (default: {table})
--publish_method TYPE             String   Table type: external (default) or internal
--n_jobs N                        Integer  Number of parallel workers for table creation (default: 1)

Querying MotherDuck Tables

Once published, query tables via any of these methods.

MotherDuck Web UI:

Log in to app.motherduck.com and run:

SELECT * FROM my_analytics.lx_tpch_1.customer LIMIT 10;

DuckDB CLI:

duckdb "md:my_analytics?motherduck_token=your_token"
SELECT * FROM lx_tpch_1.customer LIMIT 10;

Python (duckdb):

import duckdb

conn = duckdb.connect("md:my_analytics?motherduck_token=your_token")
df = conn.execute("SELECT * FROM lx_tpch_1.customer LIMIT 10").df()
print(df)

Python with environment variable:

import os
import duckdb

os.environ["motherduck_token"] = "your_token"
conn = duckdb.connect("md:my_analytics")
df = conn.execute("SELECT * FROM lx_tpch_1.customer LIMIT 10").df()
print(df)

Troubleshooting

Common Issues

“Authentication failed”:

  • Verify your token is valid and not expired
  • Check token permissions for the target database
  • For env mode, ensure motherduck_token is set

“Database not found”:

  • Verify the database name matches an existing MotherDuck database
  • Create the database in MotherDuck UI before publishing

“IO Error: No files found that match the pattern” or “Permission error: Missing or invalid credentials”:

  • This means MotherDuck cannot access your private S3/GCS/Azure bucket
  • Create a secret in MotherDuck with your cloud storage credentials (see Storage Access)
  • Verify secrets are configured: SELECT * FROM duckdb_secrets();
  • Test access: SELECT count(*) FROM read_parquet('s3://your-bucket/path/*.parquet');

“Cannot read file” (external views):

  • Verify cloud storage credentials in MotherDuck
  • Check that the S3/GCS/Azure path contains valid Parquet files
  • Ensure the bucket allows access from MotherDuck
  • Verify the files exist: aws s3 ls s3://your-bucket/path/ --recursive | head

“Connection timeout”:

  • Check network connectivity
  • MotherDuck requires outbound HTTPS
  • Verify no firewall is blocking the connection

Verifying Setup

Test MotherDuck connectivity:

duckdb "md:my_analytics?motherduck_token=your_token" -c "SELECT 1"

Test cloud storage access from MotherDuck:

SELECT * FROM read_parquet('s3://your-bucket/path/to/file.parquet') LIMIT 1;
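
Both checks can be combined into one short script run from anywhere the duckdb Python package is installed. A minimal sketch; adjust the database, token, and Parquet path:

import duckdb

conn = duckdb.connect("md:my_analytics?motherduck_token=your_token")

# 1. MotherDuck connectivity.
assert conn.execute("SELECT 1").fetchone()[0] == 1

# 2. Cloud storage access through the secret configured in MotherDuck.
row = conn.execute(
    "SELECT * FROM read_parquet('s3://your-bucket/path/to/file.parquet') LIMIT 1"
).fetchone()
print("storage access OK:", row is not None)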

Configuring Storage Secrets

See Storage Access from MotherDuck in Prerequisites for detailed instructions on configuring S3, GCS, or Azure secrets.

Comparison with Other Targets

Feature          MotherDuck                 BigQuery               Snowflake                  DuckLake
Query Engine     DuckDB                     BigQuery               Snowflake                  DuckDB
External Tables  Views with read_parquet()  External tables        External tables            DuckLake catalog
Native Tables    CREATE TABLE AS SELECT     Load jobs              COPY INTO                  Not supported
Serverless       Yes                        Yes                    No (warehouse required)    N/A (local)
SQL Dialect      DuckDB/PostgreSQL          BigQuery SQL           Snowflake SQL              DuckDB/PostgreSQL
Best For         Analytics, exploration     Large-scale analytics  Enterprise data warehouse  Local development