FastBCP Configuration

LakeXpress uses FastBCP for high-performance data export with per-table control over export methods.

Configuration Approaches

1. Auto-Detection (Default)

LakeXpress selects the export method based on table size, database type, and available features. No configuration needed.

2. Database Table Configuration (Production)

Store table-specific settings in the partition_columns table of the export logging database. Use this for persistent, shared configuration across runs.

3. CLI Configuration (Development/Testing)

Use --fastbcp_table_config for quick, one-off testing.

Available Export Methods

Methods Without Key Column

Method Database Description Best For
None All Sequential export (no parallelism) Small tables (< 100K rows)
Ctid PostgreSQL Tuple ID-based parallel export Medium to large PostgreSQL tables
Rowid Oracle Row ID-based parallel export Medium to large Oracle tables
Physloc SQL Server Physical location-based parallel export Medium to large SQL Server tables

Methods With Key Column

Method Description Key Column Requirements Best For
Random Distributes using column values Integer or bigint column Tables with sequential IDs
DataDriven Splits on distinct column values Column with distinct values Tables partitioned by year, region, etc.
RangeId Range partitioning via min/max Numeric or date column with good distribution Large tables with sortable columns
Ntile Even distribution via column values Numeric column Tables requiring even distribution

Production: Database Configuration

Creating Table Configuration

Insert into the partition_columns table:

INSERT INTO partition_columns (
    source_db_type,
    source_database,
    source_schema,
    source_table,
    fastbcp_method,
    fastbcp_parallel_degree,
    fastbcp_distribution_key,
    env_name
) VALUES (
    'postgres',           -- Database type: postgres, oracle, mssql
    'tpch',              -- Database name
    'tpch_1',            -- Schema name
    'lineitem',          -- Table name
    'DataDriven',        -- FastBCP method
    8,                   -- Parallel degree (--fastbcp_p equivalent)
    'YEAR(l_shipdate)',  -- Distribution key column/expression
    'production'         -- Environment name (optional)
);

Examples

PostgreSQL - Ctid Method (No Key Column)

INSERT INTO partition_columns (
    source_db_type, source_database, source_schema, source_table,
    fastbcp_method, fastbcp_parallel_degree, fastbcp_distribution_key
) VALUES (
    'postgres', 'analytics', 'public', 'events',
    'Ctid', 4, NULL
);

Oracle - DataDriven Method

INSERT INTO partition_columns (
    source_db_type, source_database, source_schema, source_table,
    fastbcp_method, fastbcp_parallel_degree, fastbcp_distribution_key
) VALUES (
    'oracle', 'PRODDB', 'SALES', 'TRANSACTIONS',
    'DataDriven', 8, 'YEAR(TRANSACTION_DATE)'
);

SQL Server - Physloc Method

INSERT INTO partition_columns (
    source_db_type, source_database, source_schema, source_table,
    fastbcp_method, fastbcp_parallel_degree, fastbcp_distribution_key
) VALUES (
    'mssql', 'SalesDB', 'dbo', 'Orders',
    'Physloc', 4, NULL
);

RangeId Method with Numeric Column

INSERT INTO partition_columns (
    source_db_type, source_database, source_schema, source_table,
    fastbcp_method, fastbcp_parallel_degree, fastbcp_distribution_key
) VALUES (
    'postgres', 'ecommerce', 'public', 'customer',
    'RangeId', 2, 'c_custkey'
);

Development/Testing: CLI Configuration

Format

--fastbcp_table_config "table:method:key_column:parallel_degree"
  • Empty fields use defaults: :: means default key and parallel degree
  • Separate multiple tables with semicolons

Single Table Example

./LakeXpress -a auth.json --log_db_auth_id log_db \
        --source_db_auth_id source_pg \
        --source_schema_name tpch_1 \
        --fastbcp_table_config "lineitem:DataDriven:YEAR(l_shipdate):8" \
        --output_dir ./exports \
        --fastbcp_dir_path /opt/fastbcp

Multiple Tables Example

./LakeXpress -a auth.json --log_db_auth_id log_db \
        --source_db_auth_id source_pg \
        --source_schema_name tpch_1 \
        --fastbcp_table_config "lineitem:DataDriven:YEAR(l_shipdate):8;orders:Ctid::4;customer:RangeId:c_custkey:2" \
        --output_dir ./exports \
        --fastbcp_dir_path /opt/fastbcp

This configures:

  • lineitem: DataDriven, partition by YEAR(l_shipdate), 8 parallel processes
  • orders: Ctid, no key column, 4 parallel processes
  • customer: RangeId, partition by c_custkey, 2 parallel processes

Using Defaults

# Use default parallel degree (from --fastbcp_p or 1)
--fastbcp_table_config "orders:Ctid::"

# Use default method and parallel degree (auto-detection)
--fastbcp_table_config "orders:::"

Configuration Priority

From highest to lowest:

  1. partition_columns table - Database configuration
  2. –fastbcp_table_config - CLI configuration
  3. Auto-detection - Automatic method selection

Database table configuration always overrides CLI configuration.

Performance Tuning

Parallel Degree (--fastbcp_p)

Controls parallelism within large tables:

--fastbcp_p 4  # Export large tables using 4 parallel processes

Guidelines:

  • < 100K rows: Use 1 (default)
  • 100K - 10M rows: Use 2-4
  • 10M - 100M rows: Use 4-8
  • > 100M rows: Use 8-16

Higher parallelism is not always faster. Factor in CPU cores, source database load, and network bandwidth.

Large Table Threshold

--large_table_threshold 500000  # Tables with < 500K rows use sequential export

Default: 100,000 rows. Tables below this threshold export sequentially.

Table-Level Parallelism (--n_jobs)

Controls how many tables export simultaneously:

--n_jobs 8  # Export 8 tables in parallel

Example configurations:

# Balanced (medium server)
--n_jobs 4 --fastbcp_p 2

# Aggressive (large server)
--n_jobs 8 --fastbcp_p 4

# Conservative
--n_jobs 2 --fastbcp_p 1

Method Selection Guide

PostgreSQL Tables

Small tables (< 100K rows):

fastbcp_method = 'None'

Medium to large tables:

fastbcp_method = 'Ctid'
fastbcp_parallel_degree = 4

Tables with a good distribution column:

fastbcp_method = 'DataDriven'
fastbcp_distribution_key = 'YEAR(created_date)'
fastbcp_parallel_degree = 8

Oracle Tables

Rowid-based exports require SELECT_CATALOG_ROLE.

Small tables:

fastbcp_method = 'None'

Medium to large tables with SELECT_CATALOG_ROLE:

fastbcp_method = 'Rowid'
fastbcp_parallel_degree = 4

Without SELECT_CATALOG_ROLE, or with a distribution column:

fastbcp_method = 'DataDriven'
fastbcp_distribution_key = 'EXTRACT(YEAR FROM ORDER_DATE)'
fastbcp_parallel_degree = 8

SQL Server Tables

Small tables:

fastbcp_method = 'None'

Medium to large tables:

fastbcp_method = 'Physloc'
fastbcp_parallel_degree = 4

Tables with a good distribution column:

fastbcp_method = 'DataDriven'
fastbcp_distribution_key = 'YEAR(OrderDate)'
fastbcp_parallel_degree = 8

Distribution Key Examples

Good Distribution Keys

Year-based:

-- PostgreSQL
'YEAR(created_date)'
'EXTRACT(YEAR FROM order_date)'

-- Oracle
'EXTRACT(YEAR FROM ORDER_DATE)'
'TO_CHAR(CREATE_DATE, ''YYYY'')'

-- SQL Server
'YEAR(OrderDate)'
'DATEPART(YEAR, CreatedDate)'

Region/Category:

'region_id'
'category'
'department_code'

Numeric range:

'customer_id'
'order_id'
'account_number'

Poor Distribution Keys

Avoid:

  • Boolean columns - Only 2 values
  • Low cardinality columns - Too few distinct values
  • Highly skewed columns - Most rows share one value
  • NULL-heavy columns - Many NULL values

Monitoring and Troubleshooting

Check Export Performance

SELECT
    source_table,
    status,
    started_at,
    finished_at,
    finished_at - started_at AS duration,
    row_count
FROM jobs
WHERE run_id = 'your-run-id'
ORDER BY duration DESC;

Identify Slow Tables

SELECT
    source_table,
    finished_at - started_at AS duration,
    row_count,
    row_count / EXTRACT(EPOCH FROM (finished_at - started_at)) AS rows_per_second
FROM jobs
WHERE run_id = 'your-run-id' AND status = 'completed'
ORDER BY duration DESC
LIMIT 10;

Optimize Slow Tables

  1. Increase parallel degree (--fastbcp_p)
  2. Try a different FastBCP method
  3. Verify distribution key cardinality
  4. Check source DB for locks, slow queries, or resource constraints

Complete Example

./LakeXpress -a auth.json \
  --log_db_auth_id log_db_postgres \
  --source_db_auth_id source_postgres \
  --source_schema_name public \
  --fastbcp_dir_path ./FastBCP_linux-x64/latest/ \
  --n_jobs 8 \
  --fastbcp_p 4 \
  --large_table_threshold 500000 \
  --fastbcp_table_config "huge_table:DataDriven:YEAR(created):16;medium_table:Ctid::4" \
  --output_dir ./exports
  • 8 tables exported simultaneously (--n_jobs 8)
  • 4-way parallelism for large tables (--fastbcp_p 4)
  • Tables under 500K rows export sequentially
  • huge_table: DataDriven, 16-way parallelism
  • medium_table: Ctid, 4-way parallelism
  • All other tables use auto-detection

See Also