AWS Glue Publishing
LakeXpress creates AWS Glue Data Catalog tables from exported Parquet files, enabling queries via Athena, Redshift Spectrum, and EMR.
Table of Contents
- Prerequisites
- Authentication Modes
- Configuration Options
- Dynamic Naming Patterns
- Usage Examples
- Data Type Mapping
- CLI Reference
Prerequisites
1. AWS Glue Permissions
Required IAM permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:GetDatabase",
"glue:DeleteDatabase",
"glue:CreateTable",
"glue:GetTable",
"glue:DeleteTable",
"glue:UpdateTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:GetPartitions"
],
"Resource": "*"
}
]
}
2. S3 Storage Setup
Both S3 and Glue credentials are required in credentials.json:
{
"aws_s3_datalake": {
"ds_type": "s3",
"auth_mode": "profile",
"info": {
"directory": "s3://my-datalake-bucket/lakexpress/",
"profile": "my-aws-profile"
}
},
"glue_catalog": {
"ds_type": "aws_glue",
"auth_mode": "profile",
"info": {
"profile": "my-aws-profile",
"region": "us-east-1"
}
}
}
Critical: The S3 bucket must be accessible from the Glue Data Catalog in the specified region.
Authentication Modes
Profile Authentication (Recommended)
Uses AWS CLI credentials from ~/.aws/credentials:
{
"glue_catalog": {
"ds_type": "aws_glue",
"auth_mode": "profile",
"info": {
"profile": "my-aws-profile",
"region": "us-east-1"
}
}
}
- No secrets in config files
- Supports MFA and SSO profiles
- Easy credential rotation
Keys Authentication
Uses explicit AWS access keys:
{
"glue_catalog": {
"ds_type": "aws_glue",
"auth_mode": "keys",
"info": {
"aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
"aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"region": "us-east-1"
}
}
}
Suited for CI/CD pipelines, containers, or cross-account access without AWS CLI.
Role Authentication
Assumes an IAM role via STS:
{
"glue_catalog": {
"ds_type": "aws_glue",
"auth_mode": "role",
"info": {
"role_arn": "arn:aws:iam::123456789012:role/GluePublishRole",
"external_id": "optional-external-id",
"region": "us-east-1"
}
}
}
Uses temporary credentials. Ideal for cross-account publishing and fine-grained access control.
Configuration Options
| Option | Description | Default |
|---|---|---|
--publish_target ID |
Credential ID for Glue target (required) | - |
--publish_schema_pattern PATTERN |
Glue database naming pattern | {schema} |
--publish_table_pattern PATTERN |
Table naming pattern | {table} |
--glue_skip_existing |
Skip existing tables instead of recreating | false |
--n_jobs N |
Parallel workers for table creation | 1 |
Dynamic Naming Patterns
Database and table names support token-based patterns.
Supported Tokens
| Token | Description | Example Output |
|---|---|---|
{schema} |
Source schema name | tpch_1 |
{table} |
Source table name | customer |
{database} |
Source database name | tpch |
{date} |
Current date (YYYYMMDD) | 20251210 |
{timestamp} |
Current timestamp (YYYYMMDD_HHMMSS) | 20251210_143022 |
{uuid} |
UUID4 (consistent per run) | a1b2c3d4-... |
{subpath} |
CLI --sub_path value |
staging |
Common Patterns
Date-Partitioned Databases
lakexpress \
--publish_schema_pattern "lx_{schema}_{date}" \
--publish_table_pattern "{table}" \
--publish_target glue_catalog \
...
# Results:
# Database: lx_tpch_1_20251210
# Tables: customer, orders, lineitem
Prefixed Databases
lakexpress \
--publish_schema_pattern "datalake_{schema}" \
--publish_table_pattern "{table}" \
--publish_target glue_catalog \
...
# Results:
# Database: datalake_tpch_1
# Tables: customer, orders, lineitem
Consolidated Multi-Schema
lakexpress \
--source_schema_name schema1,schema2 \
--publish_schema_pattern "consolidated" \
--publish_table_pattern "{schema}_{table}" \
--publish_target glue_catalog \
...
# Results:
# Database: consolidated
# Tables: schema1_customer, schema2_customer
Usage Examples
Example 1: Basic Export to Glue
lakexpress -a credentials.json \
--log_db_auth_id log_db \
--source_db_auth_id postgres_prod \
--source_schema_name public \
--target_storage_id aws_s3_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target glue_catalog
Creates database public with all tables from the public schema.
Example 2: Custom Naming with Parallel Execution
lakexpress -a credentials.json \
--log_db_auth_id log_db \
--source_db_auth_id postgres_prod \
--source_schema_name tpch_1 \
--target_storage_id aws_s3_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target glue_catalog \
--publish_schema_pattern "lx_{schema}" \
--publish_table_pattern "{table}" \
--n_jobs 4
Creates database lx_tpch_1 with tables built in parallel.
Example 3: Skip Existing Tables
lakexpress -a credentials.json \
--log_db_auth_id log_db \
--source_db_auth_id postgres_prod \
--source_schema_name tpch_1 \
--target_storage_id aws_s3_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target glue_catalog \
--glue_skip_existing
Example 4: Date-Based Snapshots
lakexpress -a credentials.json \
--log_db_auth_id log_db \
--source_db_auth_id postgres_prod \
--source_schema_name sales \
--target_storage_id aws_s3_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target glue_catalog \
--publish_schema_pattern "sales_{date}" \
--sub_path "daily/$(date +%Y%m%d)"
Creates database sales_20251210, data stored at s3://bucket/lakexpress/daily/20251210/.
Data Type Mapping
Source types map automatically to Glue-compatible types.
PostgreSQL → Glue
| PostgreSQL Type | Glue Type |
|---|---|
INTEGER, INT4 |
int |
BIGINT, INT8 |
bigint |
SMALLINT, INT2 |
smallint |
NUMERIC(p,s) |
decimal(p,s) |
REAL, FLOAT4 |
float |
DOUBLE PRECISION |
double |
VARCHAR(n), TEXT |
string |
DATE |
date |
TIMESTAMP |
timestamp |
BOOLEAN |
boolean |
BYTEA |
binary |
SQL Server → Glue
| SQL Server Type | Glue Type |
|---|---|
INT |
int |
BIGINT |
bigint |
SMALLINT |
smallint |
TINYINT |
tinyint |
DECIMAL(p,s) |
decimal(p,s) |
FLOAT |
double |
REAL |
float |
VARCHAR(n), NVARCHAR(n) |
string |
DATE |
date |
DATETIME, DATETIME2 |
timestamp |
BIT |
boolean |
VARBINARY |
binary |
CLI Reference
Glue Publishing Arguments
| Option | Type | Description |
|---|---|---|
--publish_target ID |
String | Credential ID for Glue publishing (required) |
--publish_schema_pattern PATTERN |
String | Database naming pattern (default: {schema}) |
--publish_table_pattern PATTERN |
String | Table naming pattern (default: {table}) |
--glue_skip_existing |
Flag | Skip existing tables instead of drop and recreate |
--n_jobs N |
Integer | Parallel workers for table creation (default: 1) |
Querying Glue Tables
Amazon Athena:
SELECT * FROM lx_tpch_1.customer LIMIT 10;
AWS Redshift Spectrum:
SELECT * FROM spectrum_schema.customer LIMIT 10;
Amazon EMR (Spark):
df = spark.table("lx_tpch_1.customer")
df.show(10)