Available on the Enterprise plan, deployed via the Helm chart.

Pre-aggregate workers handle two jobs:
  1. Materializations — Run a query against your warehouse, convert the results to a materialization format, and upload to S3
  2. DuckDB queries — When a user query matches a pre-aggregate, read the materialized data from S3 and execute the query using DuckDB
Both job types are distributed via the NATS pre-aggregate stream.

Pre-aggregate materializations

Scheduled jobs materialize warehouse query results and store them on S3.

Pre-aggregate queries

When a user query matches a pre-aggregate, the worker serves it using DuckDB against materialized data on S3 — without hitting your data warehouse.

Example configuration

A complete Helm values configuration with NATS, warehouse worker, and pre-aggregate worker:
```yaml
nats:
  enabled: true
  config:
    cluster:
      enabled: false
    jetstream:
      enabled: true
      fileStore:
        enabled: false
      memoryStore:
        enabled: true
        maxSize: 1Gi

warehouseNatsWorker:
  enabled: true
  replicas: 1
  concurrency: 100
  resources:
    requests:
      cpu: 250m
      memory: 1.5Gi
    limits:
      memory: 1.5Gi

preAggregateNatsWorker:
  enabled: true
  replicas: 1
  concurrency: 100
  resources:
    requests:
      cpu: 650m
      memory: 4Gi
      ephemeral-storage: 9Gi
    limits:
      memory: 4Gi
      ephemeral-storage: 9Gi
```
The Helm chart auto-configures these environment variables:
| Variable | Set from | Value |
|---|---|---|
| `NATS_ENABLED` | `nats.enabled: true` | `"true"` |
| `NATS_URL` | `nats.enabled: true` | `nats://<release>-nats:4222` |
| `NATS_WORKER_CONCURRENCY` | `preAggregateNatsWorker.concurrency` | `100` |
| `PRE_AGGREGATES_ENABLED` | `preAggregateNatsWorker.enabled: true` | `"true"` |
| `PRE_AGGREGATES_PARQUET_ENABLED` | `preAggregateNatsWorker.enabled: true` | `"true"` |
See the overview for details on JetStream configuration options.

S3 storage configuration

Pre-aggregates require a dedicated S3 bucket separate from your main Lightdash results cache bucket. This prevents query history cleanup from deleting active materialization files.
| Variable | Required | Description | Fallback |
|---|---|---|---|
| `S3_ENDPOINT` | Yes | S3-compatible endpoint URL | |
| `PRE_AGGREGATE_RESULTS_S3_BUCKET` | Yes | Dedicated bucket for materialized data | |
| `PRE_AGGREGATE_RESULTS_S3_REGION` | Yes | S3 region for the bucket | |
| `PRE_AGGREGATE_RESULTS_S3_ACCESS_KEY` | No | Access key for the bucket | `S3_ACCESS_KEY` |
| `PRE_AGGREGATE_RESULTS_S3_SECRET_KEY` | No | Secret key for the bucket | `S3_SECRET_KEY` |
S3_ENDPOINT and S3_FORCE_PATH_STYLE are inherited from your base S3 configuration. Access keys fall back to the base S3 credentials if not set separately.
```yaml
configMap:
  S3_ENDPOINT: "https://s3.us-east-1.amazonaws.com"
  PRE_AGGREGATE_RESULTS_S3_BUCKET: "my-lightdash-pre-aggs"
  PRE_AGGREGATE_RESULTS_S3_REGION: "us-east-1"
secrets:
  PRE_AGGREGATE_RESULTS_S3_ACCESS_KEY: "AKIA..."
  PRE_AGGREGATE_RESULTS_S3_SECRET_KEY: "..."
```
We recommend setting a retention / lifecycle policy on the pre-aggregate bucket to automatically clean up old files. Lightdash manages its own materializations, but a lifecycle policy prevents orphaned files from accumulating. Choose a retention period that makes sense for your deployment.
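As one option, a plain S3 lifecycle rule can expire pre-aggregate objects after a fixed period. The 30-day window and rule ID below are illustrative assumptions; pick values that match your deployment:

```json
{
  "Rules": [
    {
      "ID": "expire-old-pre-aggregates",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```

Apply it with `aws s3api put-bucket-lifecycle-configuration --bucket my-lightdash-pre-aggs --lifecycle-configuration file://lifecycle.json` (or the equivalent for your S3-compatible provider).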

Configuration reference

All configuration is set through your Helm `values.yaml` under `preAggregateNatsWorker`:

Scaling

| Helm value | Default | Description |
|---|---|---|
| `preAggregateNatsWorker.replicas` | `1` | Number of worker pods. Scale horizontally for more parallel capacity. |
| `preAggregateNatsWorker.concurrency` | `100` | Maximum concurrent jobs per pod. Maps to the `NATS_WORKER_CONCURRENCY` env var. |
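For example, a hypothetical scaled-out configuration trades per-pod concurrency for more replicas, which bounds DuckDB memory per pod while keeping total capacity the same:

```yaml
preAggregateNatsWorker:
  enabled: true
  replicas: 3      # illustrative: three pods for parallel capacity
  concurrency: 50  # lower per-pod cap; total capacity is replicas x concurrency
```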

Resources

| Helm value | Recommended (request) | Recommended (limit) | Description |
|---|---|---|---|
| `preAggregateNatsWorker.resources.requests.cpu` | `650m` | | CPU request per pod |
| `preAggregateNatsWorker.resources.requests.memory` | `4Gi` | `4Gi` | Memory request and limit per pod |
| `preAggregateNatsWorker.resources.requests.ephemeral-storage` | `9Gi` | `9Gi` | Local disk for temporary files during materialization |
Pre-aggregate workers need significantly more resources than warehouse workers because they run DuckDB in-process for both materializing data and serving queries against materialized data.
Ephemeral storage is critical. During materialization, warehouse query results are written to a local temporary file before being converted and uploaded to S3. Large materializations can consume several gigabytes of local disk. If the pod runs out of ephemeral storage, it will be evicted.

DuckDB memory tuning

DuckDB runs inside the pre-aggregate worker process. There are two types of DuckDB instances:
| Instance type | Used for | Memory limit | Concurrency |
|---|---|---|---|
| Shared query instance | Serving pre-aggregate queries | Configurable (see below) | Shared across all concurrent queries |
| Isolated materialization instances | Converting and uploading results | 256MB per instance, 1 thread | One per active materialization |
By default, the shared query instance has no memory cap. Under concurrent load, this can cause OOM kills. Set a limit:
```yaml
configMap:
  PRE_AGGREGATE_DUCKDB_QUERY_MEMORY_LIMIT: "3GB"
```
Sizing guideline: Start with 2–3GB and adjust based on observed memory usage. The limit should leave enough headroom for the Node.js process, active materializations, and OS overhead within the pod’s total memory.
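As a sketch of that budget using the recommended 4Gi pod: a 3GB DuckDB cap leaves roughly 1GB for the Node.js process, any active materialization instances (256MB each), and OS overhead:

```yaml
preAggregateNatsWorker:
  resources:
    requests:
      memory: 4Gi
    limits:
      memory: 4Gi

configMap:
  # ~3GB for the shared DuckDB query instance leaves ~1GB of headroom
  # within the 4Gi pod for Node.js, materializations, and the OS
  PRE_AGGREGATE_DUCKDB_QUERY_MEMORY_LIMIT: "3GB"
```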

Optional environment variables

These can be set via `extraEnv` or `configMap` if you need to override the defaults:

| Variable | Default | Description |
|---|---|---|
| `NATS_QUEUE_TIMEOUT_MS` | `180000` (3 min) | How long a message can wait in the queue before being discarded. |
| `PRE_AGGREGATE_DUCKDB_QUERY_MEMORY_LIMIT` | Unlimited | Memory cap for the shared DuckDB query instance (e.g., `2GB`, `3GB`). When unset, DuckDB will use all available pod memory and is likely to cause OOM kills under concurrent load. We strongly recommend setting this. |
| `PRE_AGGREGATES_MAX_ROWS` | Unlimited | Maximum rows per materialization. Results are truncated to this limit with a warning. Can also be set per pre-aggregate in dbt YAML via `max_rows`. |
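For instance, to raise the queue timeout to 5 minutes and cap materializations at 100,000 rows (both values illustrative):

```yaml
configMap:
  NATS_QUEUE_TIMEOUT_MS: "300000"
  PRE_AGGREGATES_MAX_ROWS: "100000"
```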

Troubleshooting

Pre-aggregate queries hitting the warehouse instead of DuckDB

  1. Verify that PRE_AGGREGATES_ENABLED is set to "true" on the pre-aggregate worker pod
  2. Verify that PRE_AGGREGATES_PARQUET_ENABLED is set to "true" on the pre-aggregate worker pod
  3. Confirm the pre-aggregate worker pod is running and healthy
  4. Check that an active materialization exists in Project Settings > Pre-aggregates
  5. Review query matching rules — see monitoring and debugging

Worker OOM kills

Pre-aggregate workers run DuckDB in-process, which can consume significant memory:
  1. Set PRE_AGGREGATE_DUCKDB_QUERY_MEMORY_LIMIT (e.g., 3GB) to cap DuckDB memory
  2. Increase the worker’s memory request and limit
  3. Reduce concurrency to limit parallel DuckDB queries

Materialization failures

Common causes:
  - **S3 access denied**: Verify `PRE_AGGREGATE_RESULTS_S3_*` credentials and bucket permissions.
  - **Warehouse timeout**: Large materializations may exceed warehouse query timeout limits.
  - **Disk pressure**: Materialization writes temporary files to local disk. Increase ephemeral storage if you see pod evictions.
  - **Too many rows**: Keep materializations small; we recommend under 100,000 rows for optimal performance. Enforce a limit with `max_rows` in your pre-aggregate definition or the `PRE_AGGREGATES_MAX_ROWS` environment variable.
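As a rough sketch of the per-pre-aggregate limit, assuming a pre-aggregate defined under a model's `meta` block — the nesting and names here are hypothetical, and only the `max_rows` key comes from the table above; consult your Lightdash pre-aggregates documentation for the exact schema:

```yaml
# Hypothetical nesting; only the max_rows key is documented above
models:
  - name: orders
    meta:
      pre_aggregates:
        daily_revenue:
          max_rows: 100000
```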