Databricks Publishing

LakeXpress creates Databricks Unity Catalog tables from exported Parquet files. It supports external tables (data stays in S3) and managed Delta tables (data is loaded into Databricks-managed storage).

Prerequisites

1. Databricks Setup

Required Resources:

  • Databricks Workspace with Unity Catalog enabled
  • SQL Warehouse (for DDL execution)
  • Catalog with CREATE SCHEMA and CREATE TABLE permissions
  • Storage credential for S3 access (external tables only)

Required Permissions:

  • CREATE SCHEMA on the target catalog
  • CREATE TABLE on the target schema
  • USE CATALOG on the target catalog
  • For external tables: access to the external location
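
If the publishing principal is missing any of these privileges, they can be granted in Databricks SQL roughly as follows. The catalog, schema, and principal names are placeholders, and the schema-level grant is only relevant when publishing into a pre-existing schema owned by someone else:

-- Illustrative grants for the principal that runs LakeXpress (names are placeholders)
GRANT USE CATALOG ON CATALOG lakexpress_catalog TO `user@example.com`;
GRANT CREATE SCHEMA ON CATALOG lakexpress_catalog TO `user@example.com`;
-- Needed only when the target schema already exists and is owned by another principal
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA lakexpress_catalog.tpch_1 TO `user@example.com`;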

2. Storage Credential Setup

For external tables, configure a storage credential for S3 access:

-- In Databricks SQL
CREATE STORAGE CREDENTIAL my_s3_credential
WITH (
    AWS_IAM_ROLE = 'arn:aws:iam::123456789012:role/DatabricksS3AccessRole'
);

CREATE EXTERNAL LOCATION my_s3_location
URL 's3://my-datalake-bucket/lakexpress/'
WITH (STORAGE CREDENTIAL my_s3_credential);
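
The principal that publishes tables also needs privileges on the external location. A minimal sketch, assuming the location name above and a placeholder principal:

-- Let the publishing principal create external tables over, and read files from, this location
GRANT CREATE EXTERNAL TABLE, READ FILES ON EXTERNAL LOCATION my_s3_location TO `user@example.com`;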

3. Credentials Configuration

Both S3 and Databricks credentials are needed in credentials.json:

{
    "aws_s3": {
        "ds_type": "s3",
        "auth_mode": "profile",
        "info": {
            "directory": "s3://your-bucket-name/path/to/exports",
            "profile": "your-aws-profile"
        }
    },
    "databricks": {
        "ds_type": "databricks",
        "auth_mode": "token",
        "info": {
            "host": "your-workspace.cloud.databricks.com",
            "http_path": "/sql/1.0/warehouses/your-warehouse-id",
            "access_token": "your-access-token",
            "catalog": "your-catalog"
        }
    }
}

Authentication

LakeXpress authenticates to Databricks with a Personal Access Token (PAT):

{
    "databricks": {
        "ds_type": "databricks",
        "auth_mode": "token",
        "info": {
            "host": "your-workspace.cloud.databricks.com",
            "http_path": "/sql/1.0/warehouses/your-warehouse-id",
            "access_token": "your-access-token",
            "catalog": "your-catalog"
        }
    }
}

Configuration Fields:

Field          Description                     Required
host           Databricks workspace hostname   Yes
http_path      SQL Warehouse HTTP path         Yes
access_token   Personal Access Token (PAT)     Yes
catalog        Target Unity Catalog            Yes
schema         Default schema                  No

To get the HTTP path:

  1. Go to Databricks SQL > SQL Warehouses
  2. Select your warehouse > Connection details
  3. Copy the HTTP path

To create a PAT:

  1. Go to User Settings > Developer > Access tokens
  2. Generate a new token
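
To confirm that the token and warehouse work before running LakeXpress, any trivial statement executed through the same SQL Warehouse will do, for example:

SELECT current_user(), current_catalog();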

Table Types

External Tables

External tables reference the exported Parquet data in S3. Databricks stores only metadata; queries read directly from S3.

  • No data loading time or Databricks storage costs
  • Data stays in place
  • Available right after export

lakexpress -a credentials.json \
    --source_db_auth_id postgres_prod \
    --target_storage_id aws_s3_datalake \
    --publish_target databricks_prod \
    --databricks_table_type external  # Default
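
For reference, external publishing issues DDL of roughly this shape; the catalog, schema, table, and S3 path below are illustrative, not the exact statements LakeXpress generates:

CREATE TABLE IF NOT EXISTS lakexpress_catalog.tpch_1.customer
USING PARQUET
LOCATION 's3://my-datalake-bucket/lakexpress/tpch_1/customer/';  -- Schema is inferred from the Parquet files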

Managed Tables (Delta)

Managed tables are Delta tables loaded into Databricks-managed storage via COPY INTO. They enable ACID transactions, time travel, and Z-ordering.

  • Faster queries with optimized storage and caching
  • Full Delta Lake features
  • Better for frequently accessed data

lakexpress -a credentials.json \
    --source_db_auth_id postgres_prod \
    --target_storage_id aws_s3_datalake \
    --publish_target databricks_prod \
    --databricks_table_type managed
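
The managed path creates a Delta table and then loads it with COPY INTO, roughly along these lines (names and path are illustrative):

-- An empty Delta table; the schema is picked up from the Parquet files during the load
CREATE TABLE IF NOT EXISTS lakexpress_catalog.tpch_1.customer;

COPY INTO lakexpress_catalog.tpch_1.customer
FROM 's3://my-datalake-bucket/lakexpress/tpch_1/customer/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');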

Configuration Options

Option                             Description                                      Default
--publish_target ID                Credential ID for Databricks target (required)   -
--publish_schema_pattern PATTERN   Schema naming pattern                            {schema}
--publish_table_pattern PATTERN    Table naming pattern                             {table}
--databricks_table_type TYPE       external or managed                              external
--databricks_catalog CATALOG       Override catalog from credentials                (from auth)
--n_jobs N                         Parallel workers for table creation              1

Dynamic Naming Patterns

Schema and table names support token-based patterns.

Supported Tokens

Token         Description                           Example Output
{schema}      Source schema name                    tpch_1
{table}       Source table name                     customer
{database}    Source database name                  tpch
{date}        Current date (YYYYMMDD)               20251210
{timestamp}   Current timestamp (YYYYMMDD_HHMMSS)   20251210_143022
{uuid}        UUID4, consistent per run             a1b2c3d4-...
{subpath}     CLI --sub_path value                  staging

Common Patterns

Prefixed Schemas

lakexpress \
    --publish_schema_pattern "bronze_{schema}" \
    --publish_table_pattern "{table}" \
    --publish_target databricks_prod \
    ...

# Result:
# lakexpress_catalog.bronze_tpch_1.customer
# lakexpress_catalog.bronze_tpch_1.orders

Date-Based Schemas

lakexpress \
    --publish_schema_pattern "{schema}_{date}" \
    --publish_table_pattern "{table}" \
    --publish_target databricks_prod \
    ...

# Result:
# lakexpress_catalog.tpch_1_20251210.customer

Medallion Architecture

# Bronze layer (raw data)
lakexpress \
    --publish_schema_pattern "bronze_{schema}" \
    --databricks_table_type external \
    --publish_target databricks_prod \
    ...

# Silver layer (curated data)
lakexpress \
    --publish_schema_pattern "silver_{schema}" \
    --databricks_table_type managed \
    --publish_target databricks_prod \
    ...

Usage Examples

Example 1: External Tables (Data in S3)

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --databricks_table_type external

Result: External Parquet tables in lakexpress_catalog.public reading from S3.

Example 2: Managed Delta Tables

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --databricks_table_type managed \
    --n_jobs 4

Result: Delta tables in lakexpress_catalog.tpch_1 loaded via COPY INTO.

Example 3: Custom Catalog and Schema Pattern

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name sales \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --databricks_catalog analytics_catalog \
    --publish_schema_pattern "raw_{schema}" \
    --databricks_table_type external

Result: Tables in analytics_catalog.raw_sales (customer, orders, etc.).

Example 4: Parallel Table Creation

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --n_jobs 8

Data Type Mapping

Source types are mapped to Databricks types automatically.

PostgreSQL to Databricks

PostgreSQL Type      Databricks Type
INTEGER, INT4        INT
BIGINT, INT8         BIGINT
SMALLINT, INT2       SMALLINT
NUMERIC(p,s)         DECIMAL(p,s)
REAL, FLOAT4         FLOAT
DOUBLE PRECISION     DOUBLE
VARCHAR(n), TEXT     STRING
DATE                 DATE
TIMESTAMP            TIMESTAMP_NTZ
TIMESTAMPTZ          TIMESTAMP
BOOLEAN              BOOLEAN
BYTEA                BINARY

SQL Server to Databricks

SQL Server Type           Databricks Type
INT                       INT
BIGINT                    BIGINT
SMALLINT                  SMALLINT
TINYINT                   TINYINT
DECIMAL(p,s)              DECIMAL(p,s)
FLOAT                     DOUBLE
REAL                      FLOAT
VARCHAR(n), NVARCHAR(n)   STRING
DATE                      DATE
DATETIME, DATETIME2       TIMESTAMP_NTZ
DATETIMEOFFSET            TIMESTAMP
BIT                       BOOLEAN
VARBINARY                 BINARY
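
After publishing, the mapped column types can be checked directly in Databricks SQL:

DESCRIBE TABLE lakexpress_catalog.tpch_1.customer;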

CLI Reference

Databricks Publishing Arguments

Option                             Type      Description
--publish_target ID                String    Credential ID for Databricks publishing (required)
--publish_schema_pattern PATTERN   String    Schema naming pattern (default: {schema})
--publish_table_pattern PATTERN    String    Table naming pattern (default: {table})
--databricks_table_type TYPE       Enum      external or managed (default: external)
--databricks_catalog CATALOG       String    Override catalog from credentials
--n_jobs N                         Integer   Parallel workers for table creation (default: 1)

Querying Tables

Databricks SQL:

SELECT * FROM lakexpress_catalog.tpch_1.customer LIMIT 10;

Databricks Notebooks (Python):

df = spark.table("lakexpress_catalog.tpch_1.customer")
df.display()

Databricks Notebooks (SQL):

%sql
SELECT * FROM lakexpress_catalog.tpch_1.customer LIMIT 10

Delta Table Features (Managed Tables Only)

Time Travel:

SELECT * FROM lakexpress_catalog.tpch_1.customer VERSION AS OF 1;

Table History:

DESCRIBE HISTORY lakexpress_catalog.tpch_1.customer;

Optimize:

OPTIMIZE lakexpress_catalog.tpch_1.customer;
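
OPTIMIZE can also Z-order the data by frequently filtered columns; the column name below is only an example:

OPTIMIZE lakexpress_catalog.tpch_1.customer ZORDER BY (c_custkey);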