BigQuery Publishing

LakeXpress can automatically create BigQuery tables from Parquet files stored in GCS.

Table of Contents

  • Prerequisites
  • Authentication
  • Configuration Options
  • Table Types
  • Dynamic Naming Patterns
  • Usage Examples
  • Data Type Mapping
  • CLI Reference
  • Querying BigQuery Tables
  • Troubleshooting

Prerequisites

1. Google Cloud Permissions

Required IAM roles:

  • BigQuery Data Editor (roles/bigquery.dataEditor) - Create datasets and tables
  • BigQuery Job User (roles/bigquery.jobUser) - Run load jobs (native tables)
  • Storage Object Viewer (roles/storage.objectViewer) - Read GCS files (external tables)

Or a custom role with these permissions:

bigquery.datasets.create
bigquery.datasets.get
bigquery.tables.create
bigquery.tables.get
bigquery.tables.delete
bigquery.tables.updateData
bigquery.jobs.create
storage.objects.get
storage.objects.list
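
If you prefer a custom role over the predefined ones, it can be created with gcloud roughly as follows. This is a sketch: the role ID lakexpress_publisher, the project ID, and the service account email are example values, not names LakeXpress requires.

# Create a custom role carrying the permissions listed above.
gcloud iam roles create lakexpress_publisher \
    --project=my-gcp-project \
    --title="LakeXpress BigQuery Publisher" \
    --permissions=bigquery.datasets.create,bigquery.datasets.get,bigquery.tables.create,bigquery.tables.get,bigquery.tables.delete,bigquery.tables.updateData,bigquery.jobs.create,storage.objects.get,storage.objects.list

# Grant the custom role to the service account LakeXpress will use.
gcloud projects add-iam-policy-binding my-gcp-project \
    --member="serviceAccount:lakexpress@my-gcp-project.iam.gserviceaccount.com" \
    --role="projects/my-gcp-project/roles/lakexpress_publisher"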

2. GCS Storage Setup

Your credentials.json needs both GCS and BigQuery entries:

{
    "gcs_datalake": {
        "ds_type": "gcs",
        "auth_mode": "profile",
        "info": {
            "directory": "gs://my-datalake-bucket/lakexpress/",
            "profile": "/path/to/service-account.json"
        }
    },
    "bigquery_prod": {
        "ds_type": "bigquery",
        "auth_mode": "profile",
        "info": {
            "project_id": "my-gcp-project",
            "location": "US",
            "profile": "/path/to/service-account.json"
        }
    }
}

The same service account JSON file can be used for both GCS and BigQuery.

Authentication

LakeXpress uses service account authentication for BigQuery.

{
    "bigquery_prod": {
        "ds_type": "bigquery",
        "auth_mode": "profile",
        "info": {
            "project_id": "my-gcp-project",
            "location": "US",
            "profile": "/path/to/service-account.json"
        }
    }
}

| Field | Required | Description |
|-------|----------|-------------|
| ds_type | Yes | Must be "bigquery" |
| auth_mode | Yes | Must be "profile" |
| project_id | Yes | GCP project ID where datasets will be created |
| location | No | BigQuery dataset location (default: "US") |
| profile | Yes | Path to service account JSON key file |

Common locations: US, EU, us-central1, europe-west1.

Creating a Service Account

  1. Go to Google Cloud Console > IAM & Admin > Service Accounts
  2. Create a service account with BigQuery and Storage permissions
  3. Create and download a JSON key
  4. Reference the key file path in your credentials
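
The same steps can be scripted with the gcloud CLI. Below is a sketch with example names (service account lakexpress, project my-gcp-project, key written to ./service-account.json):

# 1. Create the service account.
gcloud iam service-accounts create lakexpress \
    --project=my-gcp-project \
    --display-name="LakeXpress publisher"

# 2. Grant the BigQuery and Storage roles listed under Prerequisites.
for role in roles/bigquery.dataEditor roles/bigquery.jobUser roles/storage.objectViewer; do
    gcloud projects add-iam-policy-binding my-gcp-project \
        --member="serviceAccount:lakexpress@my-gcp-project.iam.gserviceaccount.com" \
        --role="$role"
done

# 3. Create and download a JSON key; reference this path in credentials.json.
gcloud iam service-accounts keys create ./service-account.json \
    --iam-account=lakexpress@my-gcp-project.iam.gserviceaccount.com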

Configuration Options

| Option | Description | Default |
|--------|-------------|---------|
| --publish_target ID | Credential ID for BigQuery target (required) | - |
| --publish_schema_pattern PATTERN | BigQuery dataset naming pattern | {schema} |
| --publish_table_pattern PATTERN | Table naming pattern | {table} |
| --publish_method TYPE | Table type: external or internal | external |
| --n_jobs N | Parallel workers for table creation | 1 |

Table Types

External Tables (Default)

Queries data directly from GCS Parquet files without loading into BigQuery storage.

lakexpress ... --publish_target bigquery_prod --publish_method external

  • Fast publishing, no data copying
  • No BigQuery storage costs
  • Always reflects latest GCS data
  • Best for large datasets with infrequent queries
  • Query speed depends on GCS read performance
  • No clustering or partitioning support
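
For orientation, the sketch below creates a comparable external table by hand with the bq CLI; it illustrates the table type rather than the exact calls LakeXpress issues, and the dataset, table, and GCS path are example values.

# Generate an external table definition from the Parquet files in GCS.
bq mkdef --source_format=PARQUET \
    "gs://my-datalake-bucket/lakexpress/tpch_1/customer/*.parquet" > customer_def.json

# Create the external table from that definition.
bq mk --external_table_definition=customer_def.json lx_tpch_1.customer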

Native Tables

Loads data from GCS into BigQuery’s columnar storage.

lakexpress ... --publish_target bigquery_prod --publish_method internal

  • Faster queries via optimized columnar storage
  • Full BigQuery features (clustering, partitioning, materialized views)
  • Best for frequently queried data
  • Slower publishing due to data loading
  • Incurs BigQuery storage costs
  • Data is a point-in-time snapshot
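
The comparable manual operation for a native table is a load job from GCS; again a sketch for orientation rather than the exact call LakeXpress issues, with example names and paths.

# Load the GCS Parquet files into a native BigQuery table (schema inferred from Parquet).
bq load --source_format=PARQUET \
    lx_tpch_1.customer \
    "gs://my-datalake-bucket/lakexpress/tpch_1/customer/*.parquet"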

Dynamic Naming Patterns

Dataset and table names support token-based patterns.

Supported Tokens

| Token | Description | Example Output |
|-------|-------------|----------------|
| {schema} | Source schema name | tpch_1 |
| {table} | Source table name | customer |
| {database} | Source database name | tpch |
| {date} | Current date (YYYYMMDD) | 20251210 |
| {timestamp} | Current timestamp (YYYYMMDD_HHMMSS) | 20251210_143022 |
| {uuid} | UUID4 identifier (consistent per run) | a1b2c3d4 |
| {subpath} | CLI --sub_path parameter | staging |

Common Patterns

Prefixed Datasets

lakexpress \
  --publish_schema_pattern "lx_{schema}" \
  --publish_table_pattern "{table}" \
  --publish_target bigquery_prod \
  ...

# Results:
# Dataset: lx_tpch_1
# Tables: customer, orders, lineitem

Date-Partitioned Datasets

lakexpress \
  --publish_schema_pattern "{schema}_{date}" \
  --publish_table_pattern "{table}" \
  --publish_target bigquery_prod \
  ...

# Results:
# Dataset: tpch_1_20251210
# Tables: customer, orders, lineitem

Consolidated Multi-Schema

lakexpress \
  --source_schema_name schema1,schema2 \
  --publish_schema_pattern "consolidated" \
  --publish_table_pattern "{schema}_{table}" \
  --publish_target bigquery_prod \
  ...

# Results:
# Dataset: consolidated
# Tables: schema1_customer, schema2_customer

Usage Examples

Example 1: External Tables (Default)

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id gcs_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target bigquery_prod

Creates dataset public with external tables pointing to GCS Parquet files.

Example 2: Native Tables with Parallel Execution

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id gcs_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target bigquery_prod \
    --publish_method internal \
    --publish_schema_pattern "lx_{schema}" \
    --n_jobs 4

Creates dataset lx_tpch_1 with native tables loaded from GCS.

Example 3: Daily Snapshots

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name sales \
    --target_storage_id gcs_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target bigquery_prod \
    --publish_schema_pattern "sales_{date}" \
    --sub_path "daily/$(date +%Y%m%d)"

Creates dataset sales_20251210 with data stored in gs://bucket/lakexpress/daily/20251210/.

Example 4: EU Data Residency

{
    "bigquery_eu": {
        "ds_type": "bigquery",
        "auth_mode": "profile",
        "info": {
            "project_id": "my-eu-project",
            "location": "EU",
            "profile": "/path/to/service-account.json"
        }
    }
}

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name gdpr_data \
    --target_storage_id gcs_eu \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target bigquery_eu

Data Type Mapping

Source database types are mapped to BigQuery types automatically.

PostgreSQL to BigQuery

| PostgreSQL Type | BigQuery Type |
|-----------------|---------------|
| INTEGER, INT4 | INT64 |
| BIGINT, INT8 | INT64 |
| SMALLINT, INT2 | INT64 |
| NUMERIC(p,s) | NUMERIC |
| REAL, FLOAT4 | FLOAT64 |
| DOUBLE PRECISION | FLOAT64 |
| VARCHAR(n), TEXT | STRING |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP |
| TIMESTAMPTZ | TIMESTAMP |
| TIME | TIME |
| BOOLEAN | BOOL |
| BYTEA | BYTES |
| JSON, JSONB | JSON |
| UUID | STRING |

SQL Server to BigQuery

| SQL Server Type | BigQuery Type |
|-----------------|---------------|
| INT | INT64 |
| BIGINT | INT64 |
| SMALLINT | INT64 |
| TINYINT | INT64 |
| DECIMAL(p,s) | NUMERIC |
| MONEY | NUMERIC |
| FLOAT | FLOAT64 |
| REAL | FLOAT64 |
| VARCHAR(n), NVARCHAR(n) | STRING |
| DATE | DATE |
| DATETIME, DATETIME2 | DATETIME |
| DATETIMEOFFSET | TIMESTAMP |
| TIME | TIME |
| BIT | BOOL |
| VARBINARY | BYTES |
| UNIQUEIDENTIFIER | STRING |

Oracle to BigQuery

| Oracle Type | BigQuery Type |
|-------------|---------------|
| NUMBER | NUMERIC |
| NUMBER(p,s) | NUMERIC |
| FLOAT | FLOAT64 |
| VARCHAR2, NVARCHAR2 | STRING |
| CLOB, NCLOB | STRING |
| DATE | TIMESTAMP |
| TIMESTAMP | TIMESTAMP |
| RAW, BLOB | BYTES |
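
To confirm how a published table's columns were mapped, inspect its schema after a run; the project, dataset, and table names below are examples.

# Show the BigQuery schema of a published table.
bq show --schema --format=prettyjson my-project:lx_tpch_1.customer

# Or query INFORMATION_SCHEMA for column names and types.
bq query --use_legacy_sql=false \
  'SELECT column_name, data_type
   FROM `my-project.lx_tpch_1.INFORMATION_SCHEMA.COLUMNS`
   WHERE table_name = "customer"'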

CLI Reference

BigQuery Publishing Arguments

| Option | Type | Description |
|--------|------|-------------|
| --publish_target ID | String | Credential ID for BigQuery publishing (required) |
| --publish_schema_pattern PATTERN | String | Dynamic dataset naming pattern (default: {schema}) |
| --publish_table_pattern PATTERN | String | Dynamic table naming pattern (default: {table}) |
| --publish_method TYPE | String | Table type: external (default) or internal |
| --n_jobs N | Integer | Parallel workers for table creation (default: 1) |

Querying BigQuery Tables

BigQuery Console:

SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10;

bq CLI:

bq query --use_legacy_sql=false \
  'SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10'

Python (google-cloud-bigquery):

from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10"
df = client.query(query).to_dataframe()
print(df)

pandas-gbq:

import pandas_gbq

query = "SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10"
df = pandas_gbq.read_gbq(query, project_id="my-project")
print(df)

Troubleshooting

Common Issues

“Permission denied” errors:

  • Verify the service account has BigQuery Data Editor role
  • Check GCS bucket access from the service account
  • Ensure project ID in credentials matches the target project
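
A quick way to check what the service account has actually been granted (the email and project ID are examples):

# List the roles bound to the service account on the project.
gcloud projects get-iam-policy my-gcp-project \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:lakexpress@my-gcp-project.iam.gserviceaccount.com" \
    --format="table(bindings.role)"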

“Dataset not found” errors:

  • Check that location matches your GCS bucket region
  • Cross-region access between GCS and BigQuery may cause issues

“Invalid table” errors for external tables:

  • Verify GCS path contains valid Parquet files
  • Check schema mapping for your data types
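
To check the files themselves (bucket, prefix, and file name are example values):

# List the Parquet files the external table points at.
gsutil ls gs://my-datalake-bucket/lakexpress/tpch_1/customer/

# Valid Parquet files begin with the 4-byte magic "PAR1".
gsutil cat -r 0-3 gs://my-datalake-bucket/lakexpress/tpch_1/customer/part-0000.parquet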

Verifying Setup

Test BigQuery connectivity:

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id gcs_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target bigquery_prod \
    --tables customer \
    --dry_run

Validate credentials:

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    '/path/to/service-account.json'
)
client = bigquery.Client(credentials=credentials, project='my-project')
print(list(client.list_datasets()))
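
The same check can be done from the command line with the gcloud and bq CLIs (project ID is an example):

# Authenticate as the service account, then list datasets in the target project.
gcloud auth activate-service-account --key-file=/path/to/service-account.json
bq ls --project_id=my-project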