Databricks Publishing

LakeXpress creates Databricks Unity Catalog tables from exported Parquet files. It supports external tables (data stays in S3) and managed Delta tables (data is loaded into Databricks-managed storage).

Prerequisites

1. Databricks Setup

Required Resources:

  • Databricks Workspace with Unity Catalog enabled
  • SQL Warehouse (for DDL execution)
  • Catalog with CREATE SCHEMA and CREATE TABLE permissions
  • Storage credential for S3 access (external tables only)

Required Permissions:

  • CREATE SCHEMA on the target catalog
  • CREATE TABLE on the target schema
  • USE CATALOG on the target catalog
  • For external tables: access to the external location
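
If the publishing principal is missing any of these privileges, they can be granted in Databricks SQL roughly as follows. The catalog, schema, and principal names are placeholders, and the schema-level grant is only relevant when publishing into a pre-existing schema owned by someone else:

-- Illustrative grants for the principal that runs LakeXpress (names are placeholders)
GRANT USE CATALOG ON CATALOG lakexpress_catalog TO `user@example.com`;
GRANT CREATE SCHEMA ON CATALOG lakexpress_catalog TO `user@example.com`;
-- Needed only when the target schema already exists and is owned by another principal
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA lakexpress_catalog.tpch_1 TO `user@example.com`;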

2. Storage Credential Setup

For external tables, configure a storage credential for S3 access:

-- In Databricks SQL
CREATE STORAGE CREDENTIAL my_s3_credential
WITH (
    AWS_IAM_ROLE = 'arn:aws:iam::123456789012:role/DatabricksS3AccessRole'
);

CREATE EXTERNAL LOCATION my_s3_location
URL 's3://my-datalake-bucket/lakexpress/'
WITH (STORAGE CREDENTIAL my_s3_credential);
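
The principal that publishes tables also needs privileges on the external location. A minimal sketch, assuming the location name above and a placeholder principal:

-- Let the publishing principal create external tables over, and read files from, this location
GRANT CREATE EXTERNAL TABLE, READ FILES ON EXTERNAL LOCATION my_s3_location TO `user@example.com`;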

3. Credentials Configuration

Both S3 and Databricks credentials are needed in credentials.json:

{
    "aws_s3": {
        "ds_type": "s3",
        "auth_mode": "profile",
        "info": {
            "directory": "s3://your-bucket-name/path/to/exports",
            "profile": "your-aws-profile"
        }
    },
    "databricks": {
        "ds_type": "databricks",
        "auth_mode": "token",
        "info": {
            "host": "your-workspace.cloud.databricks.com",
            "http_path": "/sql/1.0/warehouses/your-warehouse-id",
            "access_token": "your-access-token",
            "catalog": "your-catalog"
        }
    }
}

Authentication

LakeXpress authenticates to Databricks with a Personal Access Token (PAT):

{
    "databricks": {
        "ds_type": "databricks",
        "auth_mode": "token",
        "info": {
            "host": "your-workspace.cloud.databricks.com",
            "http_path": "/sql/1.0/warehouses/your-warehouse-id",
            "access_token": "your-access-token",
            "catalog": "your-catalog"
        }
    }
}

Configuration Fields:

Field          Description                     Required
host           Databricks workspace hostname   Yes
http_path      SQL Warehouse HTTP path         Yes
access_token   Personal Access Token (PAT)     Yes
catalog        Target Unity Catalog            Yes
schema         Default schema                  No

To get the HTTP path:

  1. Go to Databricks SQL > SQL Warehouses
  2. Select your warehouse > Connection details
  3. Copy the HTTP path

To create a PAT:

  1. Go to User Settings > Developer > Access tokens
  2. Generate a new token
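
To confirm that the token and warehouse work before running LakeXpress, any trivial statement executed through the same SQL Warehouse will do, for example:

SELECT current_user(), current_catalog();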

Table Types

External Tables

External tables reference the exported Parquet data in S3. Databricks stores only metadata; queries read directly from S3.

  • No data loading time or Databricks storage costs
  • Data stays in place
  • Available right after export

lakexpress -a credentials.json \
    --source_db_auth_id postgres_prod \
    --target_storage_id aws_s3_datalake \
    --publish_target databricks_prod \
    --databricks_table_type external  # Default
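
For reference, external publishing issues DDL of roughly this shape; the catalog, schema, table, and S3 path below are illustrative, not the exact statements LakeXpress generates:

CREATE TABLE IF NOT EXISTS lakexpress_catalog.tpch_1.customer
USING PARQUET
LOCATION 's3://my-datalake-bucket/lakexpress/tpch_1/customer/';  -- Schema is inferred from the Parquet files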

Managed Tables (Delta)

Managed tables are Delta tables loaded into Databricks-managed storage via COPY INTO. They enable ACID transactions, time travel, and Z-ordering.

  • Faster queries with optimized storage and caching
  • Full Delta Lake features
  • Better for frequently accessed data

lakexpress -a credentials.json \
    --source_db_auth_id postgres_prod \
    --target_storage_id aws_s3_datalake \
    --publish_target databricks_prod \
    --databricks_table_type managed
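
The managed path creates a Delta table and then loads it with COPY INTO, roughly along these lines (names and path are illustrative):

-- An empty Delta table; the schema is picked up from the Parquet files during the load
CREATE TABLE IF NOT EXISTS lakexpress_catalog.tpch_1.customer;

COPY INTO lakexpress_catalog.tpch_1.customer
FROM 's3://my-datalake-bucket/lakexpress/tpch_1/customer/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');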

Configuration Options

Option                             Description                                      Default
--publish_target ID                Credential ID for Databricks target (required)   -
--publish_schema_pattern PATTERN   Schema naming pattern                            {schema}
--publish_table_pattern PATTERN    Table naming pattern                             {table}
--databricks_table_type TYPE       external or managed                              external
--databricks_catalog CATALOG       Override catalog from credentials                (from auth)
--n_jobs N                         Parallel workers for table creation              1

Dynamic Naming Patterns

Schema and table names support token-based patterns.

Supported Tokens

Token         Description                           Example Output
{schema}      Source schema name                    tpch_1
{table}       Source table name                     customer
{database}    Source database name                  tpch
{date}        Current date (YYYYMMDD)               20251210
{timestamp}   Current timestamp (YYYYMMDD_HHMMSS)   20251210_143022
{uuid}        UUID4, consistent per run             a1b2c3d4-...
{subpath}     CLI --sub_path value                  staging

Common Patterns

Prefixed Schemas

lakexpress \
    --publish_schema_pattern "bronze_{schema}" \
    --publish_table_pattern "{table}" \
    --publish_target databricks_prod \
    ...

# Result:
# lakexpress_catalog.bronze_tpch_1.customer
# lakexpress_catalog.bronze_tpch_1.orders

Date-Based Schemas

lakexpress \
    --publish_schema_pattern "{schema}_{date}" \
    --publish_table_pattern "{table}" \
    --publish_target databricks_prod \
    ...

# Result:
# lakexpress_catalog.tpch_1_20251210.customer

Medallion Architecture

# Bronze layer (raw data)
lakexpress \
    --publish_schema_pattern "bronze_{schema}" \
    --databricks_table_type external \
    --publish_target databricks_prod \
    ...

# Silver layer (curated data)
lakexpress \
    --publish_schema_pattern "silver_{schema}" \
    --databricks_table_type managed \
    --publish_target databricks_prod \
    ...

Usage Examples

Example 1: External Tables (Data in S3)

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --databricks_table_type external

Result: External Parquet tables in lakexpress_catalog.public reading from S3.

Example 2: Managed Delta Tables

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --databricks_table_type managed \
    --n_jobs 4

Result: Delta tables in lakexpress_catalog.tpch_1 loaded via COPY INTO.

Example 3: Custom Catalog and Schema Pattern

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name sales \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --databricks_catalog analytics_catalog \
    --publish_schema_pattern "raw_{schema}" \
    --databricks_table_type external

Result: Tables in analytics_catalog.raw_sales (customer, orders, etc.).

Example 4: Parallel Table Creation

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --n_jobs 8

Data Type Mapping

Source types are mapped to Databricks types automatically.

PostgreSQL to Databricks

PostgreSQL Type      Databricks Type
INTEGER, INT4        INT
BIGINT, INT8         BIGINT
SMALLINT, INT2       SMALLINT
NUMERIC(p,s)         DECIMAL(p,s)
REAL, FLOAT4         FLOAT
DOUBLE PRECISION     DOUBLE
VARCHAR(n), TEXT     STRING
DATE                 DATE
TIMESTAMP            TIMESTAMP_NTZ
TIMESTAMPTZ          TIMESTAMP
BOOLEAN              BOOLEAN
BYTEA                BINARY

SQL Server to Databricks

SQL Server Type           Databricks Type
INT                       INT
BIGINT                    BIGINT
SMALLINT                  SMALLINT
TINYINT                   TINYINT
DECIMAL(p,s)              DECIMAL(p,s)
FLOAT                     DOUBLE
REAL                      FLOAT
VARCHAR(n), NVARCHAR(n)   STRING
DATE                      DATE
DATETIME, DATETIME2       TIMESTAMP_NTZ
DATETIMEOFFSET            TIMESTAMP
BIT                       BOOLEAN
VARBINARY                 BINARY
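
After publishing, the mapped column types can be checked directly in Databricks SQL:

DESCRIBE TABLE lakexpress_catalog.tpch_1.customer;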

CLI Reference

Databricks Publishing Arguments

Option                             Type      Description
--publish_target ID                String    Credential ID for Databricks publishing (required)
--publish_schema_pattern PATTERN   String    Schema naming pattern (default: {schema})
--publish_table_pattern PATTERN    String    Table naming pattern (default: {table})
--databricks_table_type TYPE       Enum      external or managed (default: external)
--databricks_catalog CATALOG       String    Override catalog from credentials
--n_jobs N                         Integer   Parallel workers for table creation (default: 1)

Querying Tables

Databricks SQL:

SELECT * FROM lakexpress_catalog.tpch_1.customer LIMIT 10;

Databricks Notebooks (Python):

df = spark.table("lakexpress_catalog.tpch_1.customer")
df.display()

Databricks Notebooks (SQL):

%sql
SELECT * FROM lakexpress_catalog.tpch_1.customer LIMIT 10

Delta Table Features (Managed Tables Only)

Time Travel:

SELECT * FROM lakexpress_catalog.tpch_1.customer VERSION AS OF 1;

Table History:

DESCRIBE HISTORY lakexpress_catalog.tpch_1.customer;

Optimize:

OPTIMIZE lakexpress_catalog.tpch_1.customer;
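
OPTIMIZE can also Z-order the data by frequently filtered columns; the column name below is only an example:

OPTIMIZE lakexpress_catalog.tpch_1.customer ZORDER BY (c_custkey);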