BigQuery Publishing
LakeXpress can automatically create BigQuery tables from Parquet files stored in GCS.
Table of Contents
- Prerequisites
- Authentication
- Configuration Options
- Table Types
- Dynamic Naming Patterns
- Usage Examples
- Data Type Mapping
- CLI Reference
- Troubleshooting
Prerequisites
1. Google Cloud Permissions
Required IAM roles:
- BigQuery Data Editor (roles/bigquery.dataEditor) - Create datasets and tables
- BigQuery Job User (roles/bigquery.jobUser) - Run load jobs (native tables)
- Storage Object Viewer (roles/storage.objectViewer) - Read GCS files (external tables)
Or a custom role with these permissions:
bigquery.datasets.create
bigquery.datasets.get
bigquery.tables.create
bigquery.tables.get
bigquery.tables.delete
bigquery.tables.updateData
bigquery.jobs.create
storage.objects.get
storage.objects.list
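To confirm that a service account actually holds these permissions in the target project, you can call the Resource Manager testIamPermissions API. The sketch below is a convenience check, not part of LakeXpress; it assumes the google-cloud-resource-manager package is installed and uses an illustrative project ID.
from google.cloud import resourcemanager_v3

# Permissions required for publishing (copied from the list above)
REQUIRED = [
    "bigquery.datasets.create", "bigquery.datasets.get",
    "bigquery.tables.create", "bigquery.tables.get",
    "bigquery.tables.delete", "bigquery.tables.updateData",
    "bigquery.jobs.create",
    "storage.objects.get", "storage.objects.list",
]

client = resourcemanager_v3.ProjectsClient()  # uses application default credentials
granted = client.test_iam_permissions(
    request={"resource": "projects/my-gcp-project", "permissions": REQUIRED}
).permissions
print("Missing permissions:", sorted(set(REQUIRED) - set(granted)))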
2. GCS Storage Setup
Your credentials.json needs both GCS and BigQuery entries:
{
"gcs_datalake": {
"ds_type": "gcs",
"auth_mode": "profile",
"info": {
"directory": "gs://my-datalake-bucket/lakexpress/",
"profile": "/path/to/service-account.json"
}
},
"bigquery_prod": {
"ds_type": "bigquery",
"auth_mode": "profile",
"info": {
"project_id": "my-gcp-project",
"location": "US",
"profile": "/path/to/service-account.json"
}
}
}
The same service account JSON file can be used for both GCS and BigQuery.
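If you want a quick sanity check of this file before running a publish, the short sketch below (plain Python, using the entry IDs gcs_datalake and bigquery_prod from the example above) simply verifies that both entries exist and that the referenced key file is present on disk.
import json, os

with open("credentials.json") as f:
    creds = json.load(f)

for entry_id in ("gcs_datalake", "bigquery_prod"):
    info = creds[entry_id]["info"]          # raises KeyError if the entry is missing
    key_path = info["profile"]
    status = "key found" if os.path.exists(key_path) else "key MISSING"
    print(f"{entry_id}: {status} ({key_path})")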
Authentication
LakeXpress uses service account authentication for BigQuery.
{
"bigquery_prod": {
"ds_type": "bigquery",
"auth_mode": "profile",
"info": {
"project_id": "my-gcp-project",
"location": "US",
"profile": "/path/to/service-account.json"
}
}
}
| Field | Required | Description |
|---|---|---|
| ds_type | Yes | Must be "bigquery" |
| auth_mode | Yes | Must be "profile" |
| project_id | Yes | GCP project ID where datasets will be created |
| location | No | BigQuery dataset location (default: "US") |
| profile | Yes | Path to service account JSON key file |
Common locations: US, EU, us-central1, europe-west1.
Creating a Service Account
- Go to Google Cloud Console > IAM & Admin > Service Accounts
- Create a service account with BigQuery and Storage permissions
- Create and download a JSON key
- Reference the key file path in your credentials
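As a quick check that the downloaded key is usable, you can load it with google-auth (installed alongside google-cloud-bigquery) and confirm it belongs to the expected project; the path below is illustrative.
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "/path/to/service-account.json"
)
print(creds.service_account_email)  # the service account you just created
print(creds.project_id)             # should match project_id in credentials.json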
Configuration Options
| Option | Description | Default |
|---|---|---|
| --publish_target ID | Credential ID for BigQuery target (required) | - |
| --publish_schema_pattern PATTERN | BigQuery dataset naming pattern | {schema} |
| --publish_table_pattern PATTERN | Table naming pattern | {table} |
| --publish_method TYPE | Table type: external or internal | external |
| --n_jobs N | Parallel workers for table creation | 1 |
Table Types
External Tables (Default)
Queries data directly from GCS Parquet files without loading into BigQuery storage.
lakexpress ... --publish_target bigquery_prod --publish_method external
- Fast publishing, no data copying
- No BigQuery storage costs
- Always reflects latest GCS data
- Best for large datasets with infrequent queries
- Query speed depends on GCS read performance
- No clustering or partitioning support
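For reference, an external Parquet table over GCS like the ones LakeXpress publishes can be expressed with the BigQuery Python client roughly as follows. Dataset, table, and URI names are illustrative; this is a sketch of the concept, not LakeXpress's internal code.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
client.create_dataset("tpch_1", exists_ok=True)

# External table definition: BigQuery reads the Parquet files in place
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = [
    "gs://my-datalake-bucket/lakexpress/tpch_1/customer/*.parquet"
]

table = bigquery.Table("my-gcp-project.tpch_1.customer")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)  # queries hit the GCS files directly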
Native Tables
Loads data from GCS into BigQuery’s columnar storage.
lakexpress ... --publish_target bigquery_prod --publish_method internal
- Faster queries via optimized columnar storage
- Full BigQuery features (clustering, partitioning, materialized views)
- Best for frequently queried data
- Slower publishing due to data loading
- Incurs BigQuery storage costs
- Data is a point-in-time snapshot
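The equivalent operation for a native table is a BigQuery load job from the GCS Parquet files. The sketch below shows the general shape with the Python client (illustrative names, not LakeXpress's internal code); the write disposition shown is one possible overwrite policy, not necessarily the one LakeXpress applies.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace existing rows
)
job = client.load_table_from_uri(
    "gs://my-datalake-bucket/lakexpress/lx_tpch_1/customer/*.parquet",
    "my-gcp-project.lx_tpch_1.customer",
    job_config=job_config,
)
job.result()  # blocks until the load finishes; data now lives in BigQuery storage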
Dynamic Naming Patterns
Dataset and table names support token-based patterns.
Supported Tokens
| Token | Description | Example Output |
|---|---|---|
| {schema} | Source schema name | tpch_1 |
| {table} | Source table name | customer |
| {database} | Source database name | tpch |
| {date} | Current date (YYYYMMDD) | 20251210 |
| {timestamp} | Current timestamp (YYYYMMDD_HHMMSS) | 20251210_143022 |
| {uuid} | UUID4 identifier (consistent per run) | a1b2c3d4 |
| {subpath} | CLI --sub_path parameter | staging |
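Conceptually, pattern expansion is plain token substitution over the values in the table above. The Python sketch below only illustrates that behaviour; it is not LakeXpress's implementation.
from datetime import datetime

def expand(pattern, schema, table, database, subpath="", run_uuid="a1b2c3d4"):
    now = datetime.now()
    return pattern.format(
        schema=schema,
        table=table,
        database=database,
        date=now.strftime("%Y%m%d"),
        timestamp=now.strftime("%Y%m%d_%H%M%S"),
        uuid=run_uuid,      # stays constant for every table in a single run
        subpath=subpath,
    )

print(expand("lx_{schema}", "tpch_1", "customer", "tpch"))       # lx_tpch_1
print(expand("{schema}_{table}", "tpch_1", "customer", "tpch"))  # tpch_1_customer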
Common Patterns
Prefixed Datasets
lakexpress \
--publish_schema_pattern "lx_{schema}" \
--publish_table_pattern "{table}" \
--publish_target bigquery_prod \
...
# Results:
# Dataset: lx_tpch_1
# Tables: customer, orders, lineitem
Date-Partitioned Datasets
lakexpress \
--publish_schema_pattern "{schema}_{date}" \
--publish_table_pattern "{table}" \
--publish_target bigquery_prod \
...
# Results:
# Dataset: tpch_1_20251210
# Tables: customer, orders, lineitem
Consolidated Multi-Schema
lakexpress \
--source_schema_name schema1,schema2 \
--publish_schema_pattern "consolidated" \
--publish_table_pattern "{schema}_{table}" \
--publish_target bigquery_prod \
...
# Results:
# Dataset: consolidated
# Tables: schema1_customer, schema2_customer
Usage Examples
Example 1: External Tables (Default)
lakexpress -a credentials.json \
--log_db_auth_id log_db \
--source_db_auth_id postgres_prod \
--source_schema_name public \
--target_storage_id gcs_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target bigquery_prod
Creates dataset public with external tables pointing to GCS Parquet files.
Example 2: Native Tables with Parallel Execution
lakexpress -a credentials.json \
--log_db_auth_id log_db \
--source_db_auth_id postgres_prod \
--source_schema_name tpch_1 \
--target_storage_id gcs_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target bigquery_prod \
--publish_method internal \
--publish_schema_pattern "lx_{schema}" \
--n_jobs 4
Creates dataset lx_tpch_1 with native tables loaded from GCS.
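Conceptually, --n_jobs 4 lets several table publishes run at once, much like issuing multiple BigQuery load jobs concurrently. The sketch below illustrates that idea with the Python client and illustrative names; it is not LakeXpress's actual implementation.
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

def load(table):
    uri = f"gs://my-datalake-bucket/lakexpress/tpch_1/{table}/*.parquet"
    job = client.load_table_from_uri(
        uri, f"my-gcp-project.lx_tpch_1.{table}", job_config=job_config
    )
    return job.result()  # each worker waits for its own load job

# Four workers, mirroring --n_jobs 4
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(load, ["customer", "orders", "lineitem"]))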
Example 3: Daily Snapshots
lakexpress -a credentials.json \
--log_db_auth_id log_db \
--source_db_auth_id postgres_prod \
--source_schema_name sales \
--target_storage_id gcs_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target bigquery_prod \
--publish_schema_pattern "sales_{date}" \
--sub_path "daily/$(date +%Y%m%d)"
Creates dataset sales_20251210, data stored in gs://bucket/lakexpress/daily/20251210/.
Example 4: EU Data Residency
{
"bigquery_eu": {
"ds_type": "bigquery",
"auth_mode": "profile",
"info": {
"project_id": "my-eu-project",
"location": "EU",
"profile": "/path/to/service-account.json"
}
}
}
lakexpress -a credentials.json \
--log_db_auth_id log_db \
--source_db_auth_id postgres_prod \
--source_schema_name gdpr_data \
--target_storage_id gcs_eu \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target bigquery_eu
Data Type Mapping
Source database types are mapped to BigQuery types automatically.
PostgreSQL to BigQuery
| PostgreSQL Type | BigQuery Type |
|---|---|
| INTEGER, INT4 | INT64 |
| BIGINT, INT8 | INT64 |
| SMALLINT, INT2 | INT64 |
| NUMERIC(p,s) | NUMERIC |
| REAL, FLOAT4 | FLOAT64 |
| DOUBLE PRECISION | FLOAT64 |
| VARCHAR(n), TEXT | STRING |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP |
| TIMESTAMPTZ | TIMESTAMP |
| TIME | TIME |
| BOOLEAN | BOOL |
| BYTEA | BYTES |
| JSON, JSONB | JSON |
| UUID | STRING |
SQL Server to BigQuery
| SQL Server Type | BigQuery Type |
|---|---|
| INT | INT64 |
| BIGINT | INT64 |
| SMALLINT | INT64 |
| TINYINT | INT64 |
| DECIMAL(p,s) | NUMERIC |
| MONEY | NUMERIC |
| FLOAT | FLOAT64 |
| REAL | FLOAT64 |
| VARCHAR(n), NVARCHAR(n) | STRING |
| DATE | DATE |
| DATETIME, DATETIME2 | DATETIME |
| DATETIMEOFFSET | TIMESTAMP |
| TIME | TIME |
| BIT | BOOL |
| VARBINARY | BYTES |
| UNIQUEIDENTIFIER | STRING |
Oracle to BigQuery
| Oracle Type | BigQuery Type |
|---|---|
| NUMBER | NUMERIC |
| NUMBER(p,s) | NUMERIC |
| FLOAT | FLOAT64 |
| VARCHAR2, NVARCHAR2 | STRING |
| CLOB, NCLOB | STRING |
| DATE | TIMESTAMP |
| TIMESTAMP | TIMESTAMP |
| RAW, BLOB | BYTES |
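After publishing, you can confirm how columns were mapped by inspecting the table schema with the BigQuery Python client (the table name below is illustrative). Note that field_type reports legacy type names, e.g. INTEGER for INT64 and FLOAT for FLOAT64.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table = client.get_table("my-project.lx_tpch_1.customer")
for field in table.schema:
    print(field.name, field.field_type, field.mode)  # e.g. c_custkey INTEGER NULLABLE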
CLI Reference
BigQuery Publishing Arguments
| Option | Type | Description |
|---|---|---|
| --publish_target ID | String | Credential ID for BigQuery publishing (required) |
| --publish_schema_pattern PATTERN | String | Dynamic dataset naming pattern (default: {schema}) |
| --publish_table_pattern PATTERN | String | Dynamic table naming pattern (default: {table}) |
| --publish_method TYPE | String | Table type: external (default) or internal |
| --n_jobs N | Integer | Parallel workers for table creation (default: 1) |
Querying BigQuery Tables
BigQuery Console:
SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10;
bq CLI:
bq query --use_legacy_sql=false \
'SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10'
Python (google-cloud-bigquery):
from google.cloud import bigquery
client = bigquery.Client()
query = "SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10"
df = client.query(query).to_dataframe()
print(df)
pandas-gbq:
import pandas_gbq
query = "SELECT * FROM `my-project.lx_tpch_1.customer` LIMIT 10"
df = pandas_gbq.read_gbq(query, project_id="my-project")
print(df)
Troubleshooting
Common Issues
“Permission denied” errors:
- Verify the service account has BigQuery Data Editor role
- Check GCS bucket access from the service account
- Ensure project ID in credentials matches the target project
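For the GCS side of the check above, a short google-cloud-storage snippet using the same service account (bucket and prefix are illustrative) quickly shows whether the key can read the data lake files:
from google.cloud import storage
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "/path/to/service-account.json"
)
client = storage.Client(credentials=creds, project="my-gcp-project")
for blob in client.list_blobs("my-datalake-bucket", prefix="lakexpress/", max_results=5):
    print(blob.name)  # a permission error here points at GCS access, not BigQuery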
“Dataset not found” errors:
- Check that location matches your GCS bucket region
- Cross-region access between GCS and BigQuery may cause issues
“Invalid table” errors for external tables:
- Verify GCS path contains valid Parquet files
- Check schema mapping for your data types
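To confirm the files themselves are readable Parquet, one option is to open the GCS prefix with pyarrow (an extra dependency, not required by LakeXpress; bucket and prefix below are illustrative):
import pyarrow.dataset as ds
from pyarrow import fs

gcs = fs.GcsFileSystem()  # uses application default credentials
dataset = ds.dataset(
    "my-datalake-bucket/lakexpress/tpch_1/customer/",  # note: no gs:// scheme here
    format="parquet",
    filesystem=gcs,
)
print(dataset.schema)        # column names and Arrow types
print(dataset.count_rows())  # fails early if the files are not valid Parquet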
Verifying Setup
Test BigQuery connectivity:
lakexpress -a credentials.json \
--log_db_auth_id log_db \
--source_db_auth_id postgres_prod \
--source_schema_name public \
--target_storage_id gcs_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target bigquery_prod \
--tables customer \
--dry_run
Validate credentials:
from google.cloud import bigquery
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file(
'/path/to/service-account.json'
)
client = bigquery.Client(credentials=credentials, project='my-project')
print(list(client.list_datasets()))