_Available on the Enterprise plan · Deployed via the Helm chart_
Pre-aggregate workers handle two types of jobs:
- **Materializations** — run a query against your warehouse, convert the results to a materialization format, and upload them to S3
- **DuckDB queries** — when a user query matches a pre-aggregate, read the materialized data from S3 and execute the query using DuckDB
Both job types are distributed via the NATS pre-aggregate stream.
## Pre-aggregate materializations

Scheduled jobs materialize warehouse query results and store them on S3.
## Pre-aggregate queries

When a user query matches a pre-aggregate, the worker serves it using DuckDB against materialized data on S3, without hitting your data warehouse.
## Prerequisites

## Example configuration
A complete Helm values configuration with NATS, warehouse worker, and pre-aggregate worker:
```yaml
nats:
  enabled: true
  config:
    cluster:
      enabled: false
    jetstream:
      enabled: true
      fileStore:
        enabled: false
      memoryStore:
        enabled: true
        maxSize: 1Gi

warehouseNatsWorker:
  enabled: true
  replicas: 1
  concurrency: 100
  resources:
    requests:
      cpu: 250m
      memory: 1.5Gi
    limits:
      memory: 1.5Gi

preAggregateNatsWorker:
  enabled: true
  replicas: 1
  concurrency: 100
  resources:
    requests:
      cpu: 650m
      memory: 4Gi
      ephemeral-storage: 9Gi
    limits:
      memory: 4Gi
      ephemeral-storage: 9Gi
```
The Helm chart auto-configures these environment variables:
| Variable | Set from | Value |
|---|---|---|
| `NATS_ENABLED` | `nats.enabled: true` | `"true"` |
| `NATS_URL` | `nats.enabled: true` | `nats://<release>-nats:4222` |
| `NATS_WORKER_CONCURRENCY` | `preAggregateNatsWorker.concurrency` | `100` |
| `PRE_AGGREGATES_ENABLED` | `preAggregateNatsWorker.enabled: true` | `"true"` |
| `PRE_AGGREGATES_PARQUET_ENABLED` | `preAggregateNatsWorker.enabled: true` | `"true"` |
See the overview for details on JetStream configuration options.
## S3 storage configuration
Pre-aggregates require a dedicated S3 bucket separate from your main Lightdash results cache bucket. This prevents query history cleanup from deleting active materialization files.
| Variable | Required | Description | Fallback |
|---|---|---|---|
| `S3_ENDPOINT` | Yes | S3-compatible endpoint URL | — |
| `PRE_AGGREGATE_RESULTS_S3_BUCKET` | Yes | Dedicated bucket for materialized data | — |
| `PRE_AGGREGATE_RESULTS_S3_REGION` | Yes | S3 region for the bucket | — |
| `PRE_AGGREGATE_RESULTS_S3_ACCESS_KEY` | No | Access key for the bucket | `S3_ACCESS_KEY` |
| `PRE_AGGREGATE_RESULTS_S3_SECRET_KEY` | No | Secret key for the bucket | `S3_SECRET_KEY` |
`S3_ENDPOINT` and `S3_FORCE_PATH_STYLE` are inherited from your base S3 configuration. Access keys fall back to the base S3 credentials if not set separately.
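The fallback can be illustrated with a short sketch. The variable names come from the table above, but the `resolve` helper is hypothetical, not Lightdash's actual code:

```python
import os

def resolve(pre_agg_var, base_var):
    """Pre-aggregate-specific credential wins; otherwise fall back to
    the base S3 credential (sketch of the documented fallback only)."""
    return os.environ.get(pre_agg_var) or os.environ.get(base_var)

os.environ["S3_ACCESS_KEY"] = "base-key"
# Only the base credential is set, so the worker falls back to it
print(resolve("PRE_AGGREGATE_RESULTS_S3_ACCESS_KEY", "S3_ACCESS_KEY"))  # base-key

os.environ["PRE_AGGREGATE_RESULTS_S3_ACCESS_KEY"] = "dedicated-key"
# The dedicated credential now takes precedence
print(resolve("PRE_AGGREGATE_RESULTS_S3_ACCESS_KEY", "S3_ACCESS_KEY"))  # dedicated-key
```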
**AWS S3**

```yaml
configMap:
  S3_ENDPOINT: "https://s3.us-east-1.amazonaws.com"
  PRE_AGGREGATE_RESULTS_S3_BUCKET: "my-lightdash-pre-aggs"
  PRE_AGGREGATE_RESULTS_S3_REGION: "us-east-1"
secrets:
  PRE_AGGREGATE_RESULTS_S3_ACCESS_KEY: "AKIA..."
  PRE_AGGREGATE_RESULTS_S3_SECRET_KEY: "..."
```

**Google Cloud Storage**

GCS is S3-compatible via HMAC keys. Generate an HMAC key pair in Cloud Storage > Settings > Interoperability.

```yaml
configMap:
  S3_ENDPOINT: "https://storage.googleapis.com"
  PRE_AGGREGATE_RESULTS_S3_BUCKET: "my-lightdash-pre-aggs"
  PRE_AGGREGATE_RESULTS_S3_REGION: "auto"
secrets:
  PRE_AGGREGATE_RESULTS_S3_ACCESS_KEY: "GOOG..."
  PRE_AGGREGATE_RESULTS_S3_SECRET_KEY: "..."
```

**MinIO**

```yaml
configMap:
  S3_ENDPOINT: "https://minio.example.com"
  S3_FORCE_PATH_STYLE: "true"
  PRE_AGGREGATE_RESULTS_S3_BUCKET: "lightdash-pre-aggs"
  PRE_AGGREGATE_RESULTS_S3_REGION: "us-east-1"
secrets:
  PRE_AGGREGATE_RESULTS_S3_ACCESS_KEY: "minioadmin"
  PRE_AGGREGATE_RESULTS_S3_SECRET_KEY: "..."
```
We recommend setting a retention / lifecycle policy on the pre-aggregate bucket to automatically clean up old files. Lightdash manages its own materializations, but a lifecycle policy prevents orphaned files from accumulating. Choose a retention period that makes sense for your deployment.
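As an illustration, a minimal AWS lifecycle configuration that expires objects after 30 days; the rule ID and retention period are placeholders to adapt to your deployment. It can be applied with `aws s3api put-bucket-lifecycle-configuration`:

```json
{
  "Rules": [
    {
      "ID": "expire-old-pre-aggregates",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```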
## Configuration reference

All configuration is set through your Helm `values.yaml` under `preAggregateNatsWorker`:
### Scaling
| Helm value | Default | Description |
|---|---|---|
| `preAggregateNatsWorker.replicas` | 1 | Number of worker pods. Scale horizontally for more parallel capacity. |
| `preAggregateNatsWorker.concurrency` | 100 | Maximum concurrent jobs per pod. Maps to the `NATS_WORKER_CONCURRENCY` env var. |
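For example, to double parallel capacity across pods while halving per-pod concurrency (the values here are illustrative, not recommendations):

```yaml
preAggregateNatsWorker:
  replicas: 2      # two worker pods
  concurrency: 50  # at most 50 concurrent jobs per pod
```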
### Resources
| Helm value | Recommended (request) | Recommended (limit) | Description |
|---|---|---|---|
| `preAggregateNatsWorker.resources.requests.cpu` | 650m | — | CPU request per pod |
| `preAggregateNatsWorker.resources.{requests,limits}.memory` | 4Gi | 4Gi | Memory request and limit per pod |
| `preAggregateNatsWorker.resources.{requests,limits}.ephemeral-storage` | 9Gi | 9Gi | Local disk for temporary files during materialization |
Pre-aggregate workers need significantly more resources than warehouse workers because they run DuckDB in-process for both materializing data and serving queries against materialized data.
Ephemeral storage is critical. During materialization, warehouse query results are written to a local temporary file before being converted and uploaded to S3. Large materializations can consume several gigabytes of local disk. If the pod runs out of ephemeral storage, it will be evicted.
### DuckDB memory tuning
DuckDB runs inside the pre-aggregate worker process. There are two types of DuckDB instances:
| Instance type | Used for | Memory limit | Concurrency |
|---|---|---|---|
| Shared query instance | Serving pre-aggregate queries | Configurable (see below) | Shared across all concurrent queries |
| Isolated materialization instances | Converting and uploading results | 256MB per instance, 1 thread | One per active materialization |
By default, the shared query instance has no memory cap. Under concurrent load, this can cause OOM kills. Set a limit:
```yaml
configMap:
  PRE_AGGREGATE_DUCKDB_QUERY_MEMORY_LIMIT: "3GB"
```
**Sizing guideline:** Start with 2–3GB and adjust based on observed memory usage. The limit should leave enough headroom for the Node.js process, active materializations, and OS overhead within the pod's total memory.
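As a back-of-the-envelope aid, that guideline can be expressed as a headroom calculation. The Node.js baseline (0.7GB) and OS overhead (0.3GB) used here are illustrative assumptions, not Lightdash defaults; only the 256MB-per-materialization cap comes from the table above.

```python
def duckdb_query_limit_gb(pod_memory_gb, node_baseline_gb=0.7,
                          active_materializations=4, os_overhead_gb=0.3):
    """Headroom left for the shared DuckDB query instance after
    subtracting the Node.js process, isolated materialization
    instances (256MB each), and OS overhead."""
    materialization_gb = active_materializations * 0.25
    return round(pod_memory_gb - node_baseline_gb
                 - materialization_gb - os_overhead_gb, 2)

# Recommended 4Gi pod: 4.0 - 0.7 - (4 * 0.25) - 0.3 = 2.0GB,
# consistent with the 2-3GB starting point above
print(duckdb_query_limit_gb(4.0))
```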
### Optional environment variables

These can be set via `extraEnv` or `configMap` if you need to override the defaults:
| Variable | Default | Description |
|---|---|---|
| `NATS_QUEUE_TIMEOUT_MS` | `180000` (3 min) | How long a message can wait in the queue before being discarded. |
| `PRE_AGGREGATE_DUCKDB_QUERY_MEMORY_LIMIT` | Unlimited | Memory cap for the shared DuckDB query instance (e.g., `2GB`, `3GB`). When unset, DuckDB can use all available pod memory, which is likely to cause OOM kills under concurrent load. We strongly recommend setting this. |
| `PRE_AGGREGATES_MAX_ROWS` | Unlimited | Maximum rows per materialization. Results are truncated to this limit with a warning. Can also be set per pre-aggregate in dbt YAML via `max_rows`. |
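For example, both defaults can be overridden through the worker's `configMap`; the values here are illustrative:

```yaml
configMap:
  NATS_QUEUE_TIMEOUT_MS: "300000"    # allow queued jobs to wait up to 5 minutes
  PRE_AGGREGATES_MAX_ROWS: "100000"  # truncate materializations at 100k rows
```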
## Troubleshooting

### Pre-aggregate queries hitting the warehouse instead of DuckDB
- Verify that `PRE_AGGREGATES_ENABLED` is set to `"true"` on the pre-aggregate worker pod
- Verify that `PRE_AGGREGATES_PARQUET_ENABLED` is set to `"true"` on the pre-aggregate worker pod
- Confirm the pre-aggregate worker pod is running and healthy
- Check that an active materialization exists in Project Settings > Pre-aggregates
- Review query matching rules — see monitoring and debugging
### Worker OOM kills

Pre-aggregate workers run DuckDB, which can consume significant memory:
- Set `PRE_AGGREGATE_DUCKDB_QUERY_MEMORY_LIMIT` (e.g., `3GB`) to cap DuckDB memory
- Increase the worker's memory request and limit
- Reduce `concurrency` to limit parallel DuckDB queries
### Materialization failures

Common causes:
- **S3 access denied** — Verify `PRE_AGGREGATE_RESULTS_S3_*` credentials and bucket permissions
- **Warehouse timeout** — Large materializations may exceed warehouse query timeout limits
- **Disk pressure** — Materialization writes temporary files to local disk. Increase ephemeral storage if you see evictions.
- **Too many rows** — Materializations should not contain very large datasets. We recommend keeping materializations under 100,000 rows for optimal performance. You can use `max_rows` in your pre-aggregate definition or the `PRE_AGGREGATES_MAX_ROWS` environment variable to enforce a limit.