# Databricks Publishing

LakeXpress creates Databricks Unity Catalog tables from exported Parquet files. It supports external tables (data stays in S3) and managed Delta tables (data loaded into Databricks).
## Table of Contents
- Prerequisites
- Authentication
- Table Types
- Configuration Options
- Dynamic Naming Patterns
- Usage Examples
- Data Type Mapping
- CLI Reference
- Querying Tables
- Delta Table Features
## Prerequisites

### 1. Databricks Setup

**Required Resources:**
- Databricks Workspace with Unity Catalog enabled
- SQL Warehouse (for DDL execution)
- Catalog with CREATE SCHEMA and CREATE TABLE permissions
- Storage credential for S3 access (external tables only)
**Required Permissions** (a grant sketch follows this list):

- `CREATE SCHEMA` on the target catalog
- `CREATE TABLE` on the target schema
- `USE CATALOG` on the target catalog
- For external tables: access to the external location
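If you prefer to script these grants, a minimal sketch using the `databricks-sql-connector` package is below. This is not part of LakeXpress; the principal `lakexpress-principal` and the object names are hypothetical placeholders, and the target schema must already exist before `CREATE TABLE` can be granted on it.

```python
# Sketch only, not LakeXpress functionality.
# Requires: pip install databricks-sql-connector
from databricks import sql

GRANT_STATEMENTS = [
    "GRANT USE CATALOG ON CATALOG your_catalog TO `lakexpress-principal`",
    "GRANT CREATE SCHEMA ON CATALOG your_catalog TO `lakexpress-principal`",
    "GRANT CREATE TABLE ON SCHEMA your_catalog.your_schema TO `lakexpress-principal`",
    # External tables also need read access on the external location:
    "GRANT READ FILES ON EXTERNAL LOCATION my_s3_location TO `lakexpress-principal`",
]

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="your-access-token",
) as connection:
    with connection.cursor() as cursor:
        for statement in GRANT_STATEMENTS:
            cursor.execute(statement)
```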
### 2. Storage Credential Setup
For external tables, configure a storage credential for S3 access:
```sql
-- In Databricks SQL
CREATE STORAGE CREDENTIAL my_s3_credential
WITH (
  AWS_IAM_ROLE = 'arn:aws:iam::123456789012:role/DatabricksS3AccessRole'
);

CREATE EXTERNAL LOCATION my_s3_location
URL 's3://my-datalake-bucket/lakexpress/'
WITH (STORAGE CREDENTIAL my_s3_credential);
```
### 3. Credentials Configuration

Both S3 and Databricks credentials are needed in `credentials.json`:
```json
{
  "aws_s3": {
    "ds_type": "s3",
    "auth_mode": "profile",
    "info": {
      "directory": "s3://your-bucket-name/path/to/exports",
      "profile": "your-aws-profile"
    }
  },
  "databricks": {
    "ds_type": "databricks",
    "auth_mode": "token",
    "info": {
      "host": "your-workspace.cloud.databricks.com",
      "http_path": "/sql/1.0/warehouses/your-warehouse-id",
      "access_token": "your-access-token",
      "catalog": "your-catalog"
    }
  }
}
```
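Before a run, it can help to confirm the file parses and both entries carry the expected fields. A minimal sketch (not part of LakeXpress), assuming the key names shown above:

```python
# Sketch: verify credentials.json parses and both entries are present.
import json

with open("credentials.json") as f:
    creds = json.load(f)

for auth_id in ("aws_s3", "databricks"):
    entry = creds[auth_id]  # raises KeyError if the entry is missing
    print(auth_id, entry["ds_type"], entry["auth_mode"], sorted(entry["info"]))
```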
## Authentication

LakeXpress authenticates to Databricks with a Personal Access Token (PAT):
```json
{
  "databricks": {
    "ds_type": "databricks",
    "auth_mode": "token",
    "info": {
      "host": "your-workspace.cloud.databricks.com",
      "http_path": "/sql/1.0/warehouses/your-warehouse-id",
      "access_token": "your-access-token",
      "catalog": "your-catalog"
    }
  }
}
```
**Configuration Fields:**

| Field | Description | Required |
|---|---|---|
| `host` | Databricks workspace hostname | Yes |
| `http_path` | SQL Warehouse HTTP path | Yes |
| `access_token` | Personal Access Token (PAT) | Yes |
| `catalog` | Target Unity Catalog | Yes |
| `schema` | Default schema | No |
To get the HTTP path:

1. Go to Databricks SQL > SQL Warehouses
2. Select your warehouse > Connection details
3. Copy the HTTP path

To create a PAT:

1. Go to User Settings > Developer > Access tokens
2. Generate a new token
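To confirm the host, HTTP path, and token before wiring them into `credentials.json`, a quick smoke test with the `databricks-sql-connector` package (a sketch, not part of LakeXpress):

```python
# Sketch: verify the warehouse accepts the PAT and reports the expected
# catalog and user. Requires: pip install databricks-sql-connector
from databricks import sql

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/your-warehouse-id",
    access_token="your-access-token",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT current_catalog(), current_user()")
        print(cursor.fetchone())
```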
## Table Types

### External Tables

External tables reference Parquet data in S3. Databricks stores only metadata; queries read directly from S3.
- No data loading time or Databricks storage costs
- Data stays in place
- Available right after export
```bash
lakexpress -a credentials.json \
    --source_db_auth_id postgres_prod \
    --target_storage_id aws_s3_datalake \
    --publish_target databricks_prod \
    --databricks_table_type external  # Default
```
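For intuition, the DDL an external-table publish issues per table is roughly shaped like the sketch below. This is an illustration only, not LakeXpress's actual code; the catalog name and S3 layout are hypothetical placeholders.

```python
# Sketch of external-table DDL: metadata only, data stays in S3.
def external_table_ddl(catalog: str, schema: str, table: str, location: str) -> list[str]:
    return [
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        # USING PARQUET + LOCATION registers the files without copying them
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} "
        f"USING PARQUET LOCATION '{location}'",
    ]

for stmt in external_table_ddl(
    "lakexpress_catalog", "public", "customer",
    "s3://your-bucket-name/path/to/exports/public/customer/",
):
    print(stmt)
```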
### Managed Tables (Delta)

Managed tables are Delta tables loaded into Databricks-managed storage via COPY INTO. They enable ACID transactions, time travel, and Z-ordering.
- Faster queries with optimized storage and caching
- Full Delta Lake features
- Better for frequently accessed data
```bash
lakexpress -a credentials.json \
    --source_db_auth_id postgres_prod \
    --target_storage_id aws_s3_datalake \
    --publish_target databricks_prod \
    --databricks_table_type managed
```
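The managed path creates a Delta table and loads it with COPY INTO; conceptually it looks like the sketch below. Again an illustration with placeholder names, not LakeXpress's actual statements.

```python
# Sketch of managed-table loading: an empty Delta table plus COPY INTO.
def managed_table_load(catalog: str, schema: str, table: str, location: str) -> list[str]:
    fq_table = f"{catalog}.{schema}.{table}"
    return [
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        # Schemaless Delta table; COPY INTO infers columns from the files
        f"CREATE TABLE IF NOT EXISTS {fq_table}",
        f"COPY INTO {fq_table} FROM '{location}' "
        "FILEFORMAT = PARQUET COPY_OPTIONS ('mergeSchema' = 'true')",
    ]

for stmt in managed_table_load(
    "lakexpress_catalog", "tpch_1", "customer",
    "s3://your-bucket-name/path/to/exports/tpch_1/customer/",
):
    print(stmt)
```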
## Configuration Options

| Option | Description | Default |
|---|---|---|
| `--publish_target ID` | Credential ID for Databricks target (required) | - |
| `--publish_schema_pattern PATTERN` | Schema naming pattern | `{schema}` |
| `--publish_table_pattern PATTERN` | Table naming pattern | `{table}` |
| `--databricks_table_type TYPE` | `external` or `managed` | `external` |
| `--databricks_catalog CATALOG` | Override catalog from credentials | (from auth) |
| `--n_jobs N` | Parallel workers for table creation | `1` |
## Dynamic Naming Patterns
Schema and table names support token-based patterns.
### Supported Tokens

| Token | Description | Example Output |
|---|---|---|
| `{schema}` | Source schema name | `tpch_1` |
| `{table}` | Source table name | `customer` |
| `{database}` | Source database name | `tpch` |
| `{date}` | Current date (YYYYMMDD) | `20251210` |
| `{timestamp}` | Current timestamp (YYYYMMDD_HHMMSS) | `20251210_143022` |
| `{uuid}` | UUID4, consistent per run | `a1b2c3d4-...` |
| `{subpath}` | CLI `--sub_path` value | `staging` |
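The expansion semantics amount to simple string substitution; the sketch below mimics them (it is not LakeXpress's implementation).

```python
# Sketch of token expansion; mirrors the token table above.
import datetime
import uuid

RUN_UUID = str(uuid.uuid4())  # {uuid} stays constant for the whole run

def expand(pattern: str, schema: str, table: str, database: str, subpath: str = "") -> str:
    now = datetime.datetime.now()
    return pattern.format(
        schema=schema,
        table=table,
        database=database,
        date=now.strftime("%Y%m%d"),
        timestamp=now.strftime("%Y%m%d_%H%M%S"),
        uuid=RUN_UUID,
        subpath=subpath,
    )

print(expand("bronze_{schema}", "tpch_1", "customer", "tpch"))  # bronze_tpch_1
print(expand("{schema}_{date}", "tpch_1", "customer", "tpch"))  # tpch_1_20251210 on 2025-12-10
```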
### Common Patterns

#### Prefixed Schemas
```bash
lakexpress \
    --publish_schema_pattern "bronze_{schema}" \
    --publish_table_pattern "{table}" \
    --publish_target databricks_prod \
    ...

# Result:
# lakexpress_catalog.bronze_tpch_1.customer
# lakexpress_catalog.bronze_tpch_1.orders
```
#### Date-Based Schemas

```bash
lakexpress \
    --publish_schema_pattern "{schema}_{date}" \
    --publish_table_pattern "{table}" \
    --publish_target databricks_prod \
    ...

# Result:
# lakexpress_catalog.tpch_1_20251210.customer
```
#### Medallion Architecture

```bash
# Bronze layer (raw data)
lakexpress \
    --publish_schema_pattern "bronze_{schema}" \
    --databricks_table_type external \
    --publish_target databricks_prod \
    ...

# Silver layer (curated data)
lakexpress \
    --publish_schema_pattern "silver_{schema}" \
    --databricks_table_type managed \
    --publish_target databricks_prod \
    ...
```
## Usage Examples

### Example 1: External Tables (Data in S3)
```bash
lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --databricks_table_type external
```

**Result:** External Parquet tables in `lakexpress_catalog.public` reading from S3.
### Example 2: Managed Delta Tables

```bash
lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --databricks_table_type managed \
    --n_jobs 4
```

**Result:** Delta tables in `lakexpress_catalog.tpch_1` loaded via COPY INTO.
### Example 3: Custom Catalog and Schema Pattern

```bash
lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name sales \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --databricks_catalog analytics_catalog \
    --publish_schema_pattern "raw_{schema}" \
    --databricks_table_type external
```

**Result:** Tables in `analytics_catalog.raw_sales` (customer, orders, etc.).
### Example 4: Parallel Table Creation

```bash
lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target databricks_prod \
    --n_jobs 8
```
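Conceptually, `--n_jobs 8` means the per-table creation statements are issued by a pool of eight workers, along the lines of this sketch (not LakeXpress's actual code):

```python
# Sketch: issue per-table DDL from a bounded worker pool.
from concurrent.futures import ThreadPoolExecutor

def create_table(table: str) -> str:
    # Placeholder: execute this table's DDL against the SQL Warehouse.
    return f"created {table}"

tables = ["customer", "orders", "lineitem", "part", "supplier"]
with ThreadPoolExecutor(max_workers=8) as pool:  # --n_jobs 8
    for result in pool.map(create_table, tables):
        print(result)
```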
## Data Type Mapping

Source types are mapped to Databricks types automatically.

### PostgreSQL to Databricks
| PostgreSQL Type | Databricks Type |
|---|---|
| `INTEGER`, `INT4` | `INT` |
| `BIGINT`, `INT8` | `BIGINT` |
| `SMALLINT`, `INT2` | `SMALLINT` |
| `NUMERIC(p,s)` | `DECIMAL(p,s)` |
| `REAL`, `FLOAT4` | `FLOAT` |
| `DOUBLE PRECISION` | `DOUBLE` |
| `VARCHAR(n)`, `TEXT` | `STRING` |
| `DATE` | `DATE` |
| `TIMESTAMP` | `TIMESTAMP_NTZ` |
| `TIMESTAMPTZ` | `TIMESTAMP` |
| `BOOLEAN` | `BOOLEAN` |
| `BYTEA` | `BINARY` |
### SQL Server to Databricks

| SQL Server Type | Databricks Type |
|---|---|
| `INT` | `INT` |
| `BIGINT` | `BIGINT` |
| `SMALLINT` | `SMALLINT` |
| `TINYINT` | `TINYINT` |
| `DECIMAL(p,s)` | `DECIMAL(p,s)` |
| `FLOAT` | `DOUBLE` |
| `REAL` | `FLOAT` |
| `VARCHAR(n)`, `NVARCHAR(n)` | `STRING` |
| `DATE` | `DATE` |
| `DATETIME`, `DATETIME2` | `TIMESTAMP_NTZ` |
| `DATETIMEOFFSET` | `TIMESTAMP` |
| `BIT` | `BOOLEAN` |
| `VARBINARY` | `BINARY` |
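A mapping like the two tables above is straightforward to represent in code; the sketch below (an illustration, not LakeXpress's implementation) shows the PostgreSQL side, including the parameterized `NUMERIC(p,s)` case.

```python
# Sketch of the PostgreSQL -> Databricks mapping from the table above.
import re

PG_TO_DATABRICKS = {
    "INTEGER": "INT", "INT4": "INT",
    "BIGINT": "BIGINT", "INT8": "BIGINT",
    "SMALLINT": "SMALLINT", "INT2": "SMALLINT",
    "REAL": "FLOAT", "FLOAT4": "FLOAT",
    "DOUBLE PRECISION": "DOUBLE",
    "TEXT": "STRING",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP_NTZ",
    "TIMESTAMPTZ": "TIMESTAMP",
    "BOOLEAN": "BOOLEAN",
    "BYTEA": "BINARY",
}

def map_pg_type(pg_type: str) -> str:
    t = pg_type.strip().upper()
    m = re.fullmatch(r"NUMERIC\((\d+)\s*,\s*(\d+)\)", t)
    if m:  # NUMERIC(p,s) keeps its precision and scale
        return f"DECIMAL({m.group(1)},{m.group(2)})"
    if re.fullmatch(r"VARCHAR\(\d+\)", t):  # length is dropped
        return "STRING"
    return PG_TO_DATABRICKS[t]

print(map_pg_type("NUMERIC(12, 2)"))  # DECIMAL(12,2)
```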
## CLI Reference

### Databricks Publishing Arguments

| Option | Type | Description |
|---|---|---|
| `--publish_target ID` | String | Credential ID for Databricks publishing (required) |
| `--publish_schema_pattern PATTERN` | String | Schema naming pattern (default: `{schema}`) |
| `--publish_table_pattern PATTERN` | String | Table naming pattern (default: `{table}`) |
| `--databricks_table_type TYPE` | Enum | `external` or `managed` (default: `external`) |
| `--databricks_catalog CATALOG` | String | Override catalog from credentials |
| `--n_jobs N` | Integer | Parallel workers for table creation (default: `1`) |
## Querying Tables

**Databricks SQL:**

```sql
SELECT * FROM lakexpress_catalog.tpch_1.customer LIMIT 10;
```

**Databricks Notebooks (Python):**

```python
df = spark.table("lakexpress_catalog.tpch_1.customer")
df.display()
```

**Databricks Notebooks (SQL):**

```sql
%sql
SELECT * FROM lakexpress_catalog.tpch_1.customer LIMIT 10
```
## Delta Table Features (Managed Tables Only)

**Time Travel:**

```sql
SELECT * FROM lakexpress_catalog.tpch_1.customer VERSION AS OF 1;
```

**Table History:**

```sql
DESCRIBE HISTORY lakexpress_catalog.tpch_1.customer;
```

**Optimize:**

```sql
OPTIMIZE lakexpress_catalog.tpch_1.customer;
```