AWS Glue Publishing

LakeXpress creates AWS Glue Data Catalog tables from exported Parquet files, enabling queries via Athena, Redshift Spectrum, and EMR.

Table of Contents

Prerequisites

1. AWS Glue Permissions

Required IAM permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:GetDatabase",
                "glue:DeleteDatabase",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:DeleteTable",
                "glue:UpdateTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:GetPartitions"
            ],
            "Resource": "*"
        }
    ]
}

2. S3 Storage Setup

Both S3 and Glue credentials are required in credentials.json:

{
    "aws_s3_datalake": {
        "ds_type": "s3",
        "auth_mode": "profile",
        "info": {
            "directory": "s3://my-datalake-bucket/lakexpress/",
            "profile": "my-aws-profile"
        }
    },
    "glue_catalog": {
        "ds_type": "aws_glue",
        "auth_mode": "profile",
        "info": {
            "profile": "my-aws-profile",
            "region": "us-east-1"
        }
    }
}

Critical: The S3 bucket must be accessible from the Glue Data Catalog in the specified region.

Authentication Modes

Uses AWS CLI credentials from ~/.aws/credentials:

{
    "glue_catalog": {
        "ds_type": "aws_glue",
        "auth_mode": "profile",
        "info": {
            "profile": "my-aws-profile",
            "region": "us-east-1"
        }
    }
}
  • No secrets in config files
  • Supports MFA and SSO profiles
  • Easy credential rotation

Keys Authentication

Uses explicit AWS access keys:

{
    "glue_catalog": {
        "ds_type": "aws_glue",
        "auth_mode": "keys",
        "info": {
            "aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
            "aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
            "region": "us-east-1"
        }
    }
}

Suited for CI/CD pipelines, containers, or cross-account access without AWS CLI.

Role Authentication

Assumes an IAM role via STS:

{
    "glue_catalog": {
        "ds_type": "aws_glue",
        "auth_mode": "role",
        "info": {
            "role_arn": "arn:aws:iam::123456789012:role/GluePublishRole",
            "external_id": "optional-external-id",
            "region": "us-east-1"
        }
    }
}

Uses temporary credentials. Ideal for cross-account publishing and fine-grained access control.

Configuration Options

Option Description Default
--publish_target ID Credential ID for Glue target (required) -
--publish_schema_pattern PATTERN Glue database naming pattern {schema}
--publish_table_pattern PATTERN Table naming pattern {table}
--glue_skip_existing Skip existing tables instead of recreating false
--n_jobs N Parallel workers for table creation 1

Dynamic Naming Patterns

Database and table names support token-based patterns.

Supported Tokens

Token Description Example Output
{schema} Source schema name tpch_1
{table} Source table name customer
{database} Source database name tpch
{date} Current date (YYYYMMDD) 20251210
{timestamp} Current timestamp (YYYYMMDD_HHMMSS) 20251210_143022
{uuid} UUID4 (consistent per run) a1b2c3d4-...
{subpath} CLI --sub_path value staging

Common Patterns

Date-Partitioned Databases

lakexpress \
  --publish_schema_pattern "lx_{schema}_{date}" \
  --publish_table_pattern "{table}" \
  --publish_target glue_catalog \
  ...

# Results:
# Database: lx_tpch_1_20251210
# Tables: customer, orders, lineitem

Prefixed Databases

lakexpress \
  --publish_schema_pattern "datalake_{schema}" \
  --publish_table_pattern "{table}" \
  --publish_target glue_catalog \
  ...

# Results:
# Database: datalake_tpch_1
# Tables: customer, orders, lineitem

Consolidated Multi-Schema

lakexpress \
  --source_schema_name schema1,schema2 \
  --publish_schema_pattern "consolidated" \
  --publish_table_pattern "{schema}_{table}" \
  --publish_target glue_catalog \
  ...

# Results:
# Database: consolidated
# Tables: schema1_customer, schema2_customer

Usage Examples

Example 1: Basic Export to Glue

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target glue_catalog

Creates database public with all tables from the public schema.

Example 2: Custom Naming with Parallel Execution

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target glue_catalog \
    --publish_schema_pattern "lx_{schema}" \
    --publish_table_pattern "{table}" \
    --n_jobs 4

Creates database lx_tpch_1 with tables built in parallel.

Example 3: Skip Existing Tables

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target glue_catalog \
    --glue_skip_existing

Example 4: Date-Based Snapshots

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name sales \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target glue_catalog \
    --publish_schema_pattern "sales_{date}" \
    --sub_path "daily/$(date +%Y%m%d)"

Creates database sales_20251210, data stored at s3://bucket/lakexpress/daily/20251210/.

Data Type Mapping

Source types map automatically to Glue-compatible types.

PostgreSQL → Glue

PostgreSQL Type Glue Type
INTEGER, INT4 int
BIGINT, INT8 bigint
SMALLINT, INT2 smallint
NUMERIC(p,s) decimal(p,s)
REAL, FLOAT4 float
DOUBLE PRECISION double
VARCHAR(n), TEXT string
DATE date
TIMESTAMP timestamp
BOOLEAN boolean
BYTEA binary

SQL Server → Glue

SQL Server Type Glue Type
INT int
BIGINT bigint
SMALLINT smallint
TINYINT tinyint
DECIMAL(p,s) decimal(p,s)
FLOAT double
REAL float
VARCHAR(n), NVARCHAR(n) string
DATE date
DATETIME, DATETIME2 timestamp
BIT boolean
VARBINARY binary

CLI Reference

Glue Publishing Arguments

Option Type Description
--publish_target ID String Credential ID for Glue publishing (required)
--publish_schema_pattern PATTERN String Database naming pattern (default: {schema})
--publish_table_pattern PATTERN String Table naming pattern (default: {table})
--glue_skip_existing Flag Skip existing tables instead of drop and recreate
--n_jobs N Integer Parallel workers for table creation (default: 1)

Querying Glue Tables

Amazon Athena:

SELECT * FROM lx_tpch_1.customer LIMIT 10;

AWS Redshift Spectrum:

SELECT * FROM spectrum_schema.customer LIMIT 10;

Amazon EMR (Spark):

df = spark.table("lx_tpch_1.customer")
df.show(10)