BigQuery Publishing

LakeXpress can automatically create BigQuery tables from Parquet files stored in GCS.

Table of Contents

  • Prerequisites
  • Authentication
  • Configuration Options
  • Table Types
  • Dynamic Naming Patterns
  • Usage Examples
  • Data Type Mapping
  • CLI Reference
  • Querying BigQuery Tables
  • Troubleshooting

Prerequisites

1. Google Cloud Permissions

Required IAM roles:

  • BigQuery Data Editor (roles/bigquery.dataEditor) - Create datasets and tables
  • BigQuery Job User (roles/bigquery.jobUser) - Run load jobs (native tables)
  • Storage Object Viewer (roles/storage.objectViewer) - Read GCS files (external tables)

Or a custom role with these permissions:

bigquery.datasets.create
bigquery.datasets.get
bigquery.tables.create
bigquery.tables.get
bigquery.tables.delete
bigquery.tables.updateData
bigquery.jobs.create
storage.objects.get
storage.objects.list
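
If you prefer a custom role over the predefined ones, it can be created with gcloud roughly as follows. This is a sketch: the role ID lakexpress_publisher, the project ID, and the service account email are example values, not names LakeXpress requires.

# Create a custom role carrying the permissions listed above.
gcloud iam roles create lakexpress_publisher \
    --project=my-gcp-project \
    --title="LakeXpress BigQuery Publisher" \
    --permissions=bigquery.datasets.create,bigquery.datasets.get,bigquery.tables.create,bigquery.tables.get,bigquery.tables.delete,bigquery.tables.updateData,bigquery.jobs.create,storage.objects.get,storage.objects.list

# Grant the custom role to the service account LakeXpress will use.
gcloud projects add-iam-policy-binding my-gcp-project \
    --member="serviceAccount:lakexpress@my-gcp-project.iam.gserviceaccount.com" \
    --role="projects/my-gcp-project/roles/lakexpress_publisher"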

2. GCS Storage Setup

Your credentials.json needs both GCS and BigQuery entries:

{
    "gcs_datalake": {
        "ds_type": "gcs",
        "auth_mode": "profile",
        "info": {
            "directory": "gs://my-datalake-bucket/lakexpress/",
            "profile": "/path/to/service-account.json"
        }
    },
    "bigquery_prod": {
        "ds_type": "bigquery",
        "auth_mode": "profile",
        "info": {
            "project_id": "my-gcp-project",
            "location": "US",
            "profile": "/path/to/service-account.json"
        }
    }
}

The same service account JSON file can be used for both GCS and BigQuery.

Authentication

LakeXpress uses service account authentication for BigQuery.

{
    "bigquery_prod": {
        "ds_type": "bigquery",
        "auth_mode": "profile",
        "info": {
            "project_id": "my-gcp-project",
            "location": "US",
            "profile": "/path/to/service-account.json"
        }
    }
}

| Field | Required | Description |
|-------|----------|-------------|
| ds_type | Yes | Must be "bigquery" |
| auth_mode | Yes | Must be "profile" |
| project_id | Yes | GCP project ID where datasets will be created |
| location | No | BigQuery dataset location (default: "US") |
| profile | Yes | Path to service account JSON key file |

Common locations: US, EU, us-central1, europe-west1.

Creating a Service Account

  1. Go to Google Cloud Console > IAM & Admin > Service Accounts
  2. Create a service account with BigQuery and Storage permissions
  3. Create and download a JSON key
  4. Reference the key file path in your credentials
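
The same steps can be scripted with the gcloud CLI. Below is a sketch with example names (service account lakexpress, project my-gcp-project, key written to ./service-account.json):

# 1. Create the service account.
gcloud iam service-accounts create lakexpress \
    --project=my-gcp-project \
    --display-name="LakeXpress publisher"

# 2. Grant the BigQuery and Storage roles listed under Prerequisites.
for role in roles/bigquery.dataEditor roles/bigquery.jobUser roles/storage.objectViewer; do
    gcloud projects add-iam-policy-binding my-gcp-project \
        --member="serviceAccount:lakexpress@my-gcp-project.iam.gserviceaccount.com" \
        --role="$role"
done

# 3. Create and download a JSON key; reference this path in credentials.json.
gcloud iam service-accounts keys create ./service-account.json \
    --iam-account=lakexpress@my-gcp-project.iam.gserviceaccount.com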

Configuration Options

| Option | Description | Default |
|--------|-------------|---------|
| --publish_target ID | Credential ID for BigQuery target (required) | - |
| --publish_schema_pattern PATTERN | BigQuery dataset naming pattern | {schema} |
| --publish_table_pattern PATTERN | Table naming pattern | {table} |
| --publish_method TYPE | Table type: external or internal | external |
| --n_jobs N | Parallel workers for table creation | 1 |

Table Types

External Tables (Default)

Queries data directly from GCS Parquet files without loading into BigQuery storage.

lakexpress ... --publish_target bigquery_prod --publish_method external

  • Fast publishing, no data copying
  • No BigQuery storage costs
  • Always reflects latest GCS data
  • Best for large datasets with infrequent queries
  • Query speed depends on GCS read performance
  • No clustering or partitioning support
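
For orientation, the sketch below creates a comparable external table by hand with the bq CLI; it illustrates the table type rather than the exact calls LakeXpress issues, and the dataset, table, and GCS path are example values.

# Generate an external table definition from the Parquet files in GCS.
bq mkdef --source_format=PARQUET \
    "gs://my-datalake-bucket/lakexpress/tpch_1/customer/*.parquet" > customer_def.json

# Create the external table from that definition.
bq mk --external_table_definition=customer_def.json lx_tpch_1.customer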

Native Tables

Loads data from GCS into BigQuery’s columnar storage.

lakexpress ... --publish_target bigquery_prod --publish_method internal

  • Faster queries via optimized columnar storage
  • Full BigQuery features (clustering, partitioning, materialized views)
  • Best for frequently queried data
  • Slower publishing due to data loading
  • Incurs BigQuery storage costs
  • Data is a point-in-time snapshot
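
The comparable manual operation for a native table is a load job from GCS; again a sketch for orientation rather than the exact call LakeXpress issues, with example names and paths.

# Load the GCS Parquet files into a native BigQuery table (schema inferred from Parquet).
bq load --source_format=PARQUET \
    lx_tpch_1.customer \
    "gs://my-datalake-bucket/lakexpress/tpch_1/customer/*.parquet"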

Dynamic Naming Patterns

Dataset and table names support token-based patterns.

Supported Tokens

| Token | Description | Example Output |
|-------|-------------|----------------|
| {schema} | Source schema name | tpch_1 |
| {table} | Source table name | customer |
| {database} | Source database name | tpch |
| {date} | Current date (YYYYMMDD) | 20251210 |
| {timestamp} | Current timestamp (YYYYMMDD_HHMMSS) | 20251210_143022 |
| {uuid} | UUID4 identifier (consistent per run) | a1b2c3d4 |
| {subpath} | CLI --sub_path parameter | staging |

Common Patterns

Prefixed Datasets

lakexpress \
  --publish_schema_pattern "lx_{schema}" \
  --publish_table_pattern "{table}" \
  --publish_target bigquery_prod \
  ...

# Results:
# Dataset: lx_tpch_1
# Tables: customer, orders, lineitem

Date-Partitioned Datasets

lakexpress \
  --publish_schema_pattern "{schema}_{date}" \
  --publish_table_pattern "{table}" \
  --publish_target bigquery_prod \
  ...

# Results:
# Dataset: tpch_1_20251210
# Tables: customer, orders, lineitem

Consolidated Multi-Schema

lakexpress \
  --source_schema_name schema1,schema2 \
  --publish_schema_pattern "consolidated" \
  --publish_table_pattern "{schema}_{table}" \
  --publish_target bigquery_prod \
  ...

# Results:
# Dataset: consolidated
# Tables: schema1_customer, schema2_customer

Usage Examples

Example 1: External Tables (Default)

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id gcs_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target bigquery_prod

Creates dataset public with external tables pointing to GCS Parquet files.

Example 2: Native Tables with Parallel Execution

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id gcs_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target bigquery_prod \
    --publish_method internal \
    --publish_schema_pattern "lx_{schema}" \
    --n_jobs 4

Creates dataset lx_tpch_1 with native tables loaded from GCS.

Example 3: Daily Snapshots

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name sales \
    --target_storage_id gcs_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target bigquery_prod \
    --publish_schema_pattern "sales_{date}" \
    --sub_path "daily/$(date +%Y%m%d)"

Creates dataset sales_20251210 with data stored in gs://bucket/lakexpress/daily/20251210/.

Example 4: EU Data Residency

{
    "bigquery_eu": {
        "ds_type": "bigquery",
        "auth_mode": "profile",
        "info": {
            "project_id": "my-eu-project",
            "location": "EU",
            "profile": "/path/to/service-account.json"
        }
    }
}

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name gdpr_data \
    --target_storage_id gcs_eu \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target bigquery_eu

Data Type Mapping

Source database types are mapped to BigQuery types automatically.

PostgreSQL to BigQuery

| PostgreSQL Type | BigQuery Type |
|-----------------|---------------|
| INTEGER, INT4 | INT64 |
| BIGINT, INT8 | INT64 |
| SMALLINT, INT2 | INT64 |
| NUMERIC(p,s) | NUMERIC |
| REAL, FLOAT4 | FLOAT64 |
| DOUBLE PRECISION | FLOAT64 |
| VARCHAR(n), TEXT | STRING |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP |
| TIMESTAMPTZ | TIMESTAMP |
| TIME | TIME |
| BOOLEAN | BOOL |
| BYTEA | BYTES |
| JSON, JSONB | JSON |
| UUID | STRING |

SQL Server to BigQuery

| SQL Server Type | BigQuery Type |
|-----------------|---------------|
| INT | INT64 |
| BIGINT | INT64 |
| SMALLINT | INT64 |
| TINYINT | INT64 |
| DECIMAL(p,s) | NUMERIC |
| MONEY | NUMERIC |
| FLOAT | FLOAT64 |
| REAL | FLOAT64 |
| VARCHAR(n), NVARCHAR(n) | STRING |
| DATE | DATE |
| DATETIME, DATETIME2 | DATETIME |
| DATETIMEOFFSET | TIMESTAMP |
| TIME | TIME |
| BIT | BOOL |
| VARBINARY | BYTES |
| UNIQUEIDENTIFIER | STRING |

Oracle to BigQuery

| Oracle Type | BigQuery Type |
|-------------|---------------|
| NUMBER | NUMERIC |
| NUMBER(p,s) | NUMERIC |
| FLOAT | FLOAT64 |
| VARCHAR2, NVARCHAR2 | STRING |
| CLOB, NCLOB | STRING |
| DATE | TIMESTAMP |
| TIMESTAMP | TIMESTAMP |
| RAW, BLOB | BYTES |
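
To confirm how a published table's columns were mapped, inspect its schema after a run; the project, dataset, and table names below are examples.

# Show the BigQuery schema of a published table.
bq show --schema --format=prettyjson my-project:lx_tpch_1.customer

# Or query INFORMATION_SCHEMA for column names and types.
bq query --use_legacy_sql=false \
  'SELECT column_name, data_type
   FROM `my-project.lx_tpch_1.INFORMATION_SCHEMA.COLUMNS`
   WHERE table_name = "customer"'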

CLI Reference

BigQuery Publishing Arguments

| Option | Type | Description |
|--------|------|-------------|
| --publish_target ID | String | Credential ID for BigQuery publishing (required) |
| --publish_schema_pattern PATTERN | String | Dynamic dataset naming pattern (default: {schema}) |
| --publish_table_pattern PATTERN | String | Dynamic table naming pattern (default: {table}) |
| --publish_method TYPE | String | Table type: external (default) or internal |
| --n_jobs N | Integer | Parallel workers for table creation (default: 1) |

Querying BigQuery Tables

BigQuery Console:

SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10;

bq CLI:

bq query --use_legacy_sql=false \
  'SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10'

Python (google-cloud-bigquery):

from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10"
df = client.query(query).to_dataframe()
print(df)

pandas-gbq:

import pandas_gbq

query = "SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10"
df = pandas_gbq.read_gbq(query, project_id="my-project")
print(df)

Troubleshooting

Common Issues

“Permission denied” errors:

  • Verify the service account has BigQuery Data Editor role
  • Check GCS bucket access from the service account
  • Ensure project ID in credentials matches the target project
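
A quick way to check what the service account has actually been granted (the email and project ID are examples):

# List the roles bound to the service account on the project.
gcloud projects get-iam-policy my-gcp-project \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:lakexpress@my-gcp-project.iam.gserviceaccount.com" \
    --format="table(bindings.role)"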

“Dataset not found” errors:

  • Check that location matches your GCS bucket region
  • Cross-region access between GCS and BigQuery may cause issues

“Invalid table” errors for external tables:

  • Verify GCS path contains valid Parquet files
  • Check schema mapping for your data types
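
To check the files themselves (bucket, prefix, and file name are example values):

# List the Parquet files the external table points at.
gsutil ls gs://my-datalake-bucket/lakexpress/tpch_1/customer/

# Valid Parquet files begin with the 4-byte magic "PAR1".
gsutil cat -r 0-3 gs://my-datalake-bucket/lakexpress/tpch_1/customer/part-0000.parquet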

Verifying Setup

Test BigQuery connectivity:

lakexpress -a credentials.json \
    --log_db_auth_id log_db \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id gcs_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target bigquery_prod \
    --tables customer \
    --dry_run

Validate credentials:

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    '/path/to/service-account.json'
)
client = bigquery.Client(credentials=credentials, project='my-project')
print(list(client.list_datasets()))
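
The same check can be done from the command line with the gcloud and bq CLIs (project ID is an example):

# Authenticate as the service account, then list datasets in the target project.
gcloud auth activate-service-account --key-file=/path/to/service-account.json
bq ls --project_id=my-project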