Lightdash can expose Prometheus metrics so that you can monitor the performance and health of your instance. This guide explains how to enable and configure these metrics for a self-hosted Lightdash deployment.

Enabling Prometheus metrics

By default, Prometheus metrics are disabled in Lightdash. To enable them, set the following environment variable:

LIGHTDASH_PROMETHEUS_ENABLED=true

Configuration options

You can customize the Prometheus metrics endpoint using the following environment variables:

| Variable | Description | Required? | Default |
| --- | --- | --- | --- |
| LIGHTDASH_PROMETHEUS_ENABLED | Enables/Disables Prometheus metrics endpoint | | false |
| LIGHTDASH_PROMETHEUS_PORT | Port for Prometheus metrics endpoint | | 9090 |
| LIGHTDASH_PROMETHEUS_PATH | Path for Prometheus metrics endpoint | | /metrics |
| LIGHTDASH_PROMETHEUS_PREFIX | Prefix for metric names | | |
| LIGHTDASH_GC_DURATION_BUCKETS | Buckets for duration histogram in seconds | | 0.001, 0.01, 0.1, 1, 2, 5 |
| LIGHTDASH_EVENT_LOOP_MONITORING_PRECISION | Precision for event loop monitoring in milliseconds. Must be greater than zero. | | 10 |
| LIGHTDASH_PROMETHEUS_LABELS | Labels to add to all metrics. Must be valid JSON | | |
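
The constraints in the table can be checked before Lightdash starts. The following is a minimal Python sketch, not part of Lightdash: the helper names `validate_prometheus_env` and `metrics_url` are hypothetical, the `localhost` host is an assumption, and the bucket list is assumed to be comma-separated as in the default shown above.

```python
import json
import os
from typing import Mapping


def validate_prometheus_env(env: Mapping[str, str]) -> list[str]:
    """Return a list of problems found in the Prometheus-related variables."""
    problems = []

    # LIGHTDASH_PROMETHEUS_LABELS must be valid JSON. A JSON object that maps
    # label names to values is assumed here.
    labels = env.get("LIGHTDASH_PROMETHEUS_LABELS")
    if labels is not None:
        try:
            if not isinstance(json.loads(labels), dict):
                problems.append("LIGHTDASH_PROMETHEUS_LABELS is not a JSON object")
        except json.JSONDecodeError:
            problems.append("LIGHTDASH_PROMETHEUS_LABELS is not valid JSON")

    # The event loop monitoring precision must be greater than zero.
    precision = env.get("LIGHTDASH_EVENT_LOOP_MONITORING_PRECISION", "10")
    try:
        if float(precision) <= 0:
            problems.append("LIGHTDASH_EVENT_LOOP_MONITORING_PRECISION must be > 0")
    except ValueError:
        problems.append("LIGHTDASH_EVENT_LOOP_MONITORING_PRECISION is not a number")

    # Buckets are assumed to be a comma-separated list of durations in seconds.
    buckets = env.get("LIGHTDASH_GC_DURATION_BUCKETS", "0.001, 0.01, 0.1, 1, 2, 5")
    try:
        for bucket in buckets.split(","):
            float(bucket)
    except ValueError:
        problems.append("LIGHTDASH_GC_DURATION_BUCKETS contains a non-numeric value")

    return problems


def metrics_url(env: Mapping[str, str], host: str = "localhost") -> str:
    """Build the metrics URL from the port and path, using the documented defaults."""
    port = env.get("LIGHTDASH_PROMETHEUS_PORT", "9090")
    path = env.get("LIGHTDASH_PROMETHEUS_PATH", "/metrics")
    return f"http://{host}:{port}{path}"


if __name__ == "__main__":
    for problem in validate_prometheus_env(os.environ):
        print(problem)
    print(metrics_url(os.environ))
```

With no variables set, `metrics_url({})` returns `http://localhost:9090/metrics`, which matches the default port and path in the table.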

Available metrics

Lightdash exposes the following metrics:

Process metrics

These metrics provide information about the Node.js process running Lightdash:

| Metric | Type | Description |
| --- | --- | --- |
| process_cpu_user_seconds_total | counter | Total user CPU time spent in seconds |
| process_cpu_system_seconds_total | counter | Total system CPU time spent in seconds |
| process_cpu_seconds_total | counter | Total user and system CPU time spent in seconds |
| process_start_time_seconds | gauge | Start time of the process since unix epoch in seconds |
| process_resident_memory_bytes | gauge | Resident memory size in bytes |
| process_virtual_memory_bytes | gauge | Virtual memory size in bytes |
| process_heap_bytes | gauge | Process heap size in bytes |
| process_open_fds | gauge | Number of open file descriptors |
| process_max_fds | gauge | Maximum number of open file descriptors |

Node.js metrics

These metrics provide information about the Node.js runtime:

| Metric | Type | Description |
| --- | --- | --- |
| nodejs_eventloop_lag_seconds | gauge | Lag of event loop in seconds |
| nodejs_eventloop_lag_min_seconds | gauge | The minimum recorded event loop delay |
| nodejs_eventloop_lag_max_seconds | gauge | The maximum recorded event loop delay |
| nodejs_eventloop_lag_mean_seconds | gauge | The mean of the recorded event loop delays |
| nodejs_eventloop_lag_stddev_seconds | gauge | The standard deviation of the recorded event loop delays |
| nodejs_eventloop_lag_p50_seconds | gauge | The 50th percentile of the recorded event loop delays |
| nodejs_eventloop_lag_p90_seconds | gauge | The 90th percentile of the recorded event loop delays |
| nodejs_eventloop_lag_p99_seconds | gauge | The 99th percentile of the recorded event loop delays |
| nodejs_active_resources | gauge | Number of active resources that are currently keeping the event loop alive, grouped by async resource type |
| nodejs_active_resources_total | gauge | Total number of active resources |
| nodejs_active_handles | gauge | Number of active libuv handles grouped by handle type |
| nodejs_active_handles_total | gauge | Total number of active handles |
| nodejs_active_requests | gauge | Number of active libuv requests grouped by request type |
| nodejs_active_requests_total | gauge | Total number of active requests |
| nodejs_heap_size_total_bytes | gauge | Process heap size from Node.js in bytes |
| nodejs_heap_size_used_bytes | gauge | Process heap size used from Node.js in bytes |
| nodejs_external_memory_bytes | gauge | Node.js external memory size in bytes |
| nodejs_heap_space_size_total_bytes | gauge | Process heap space size total from Node.js in bytes |
| nodejs_heap_space_size_used_bytes | gauge | Process heap space size used from Node.js in bytes |
| nodejs_heap_space_size_available_bytes | gauge | Process heap space size available from Node.js in bytes |
| nodejs_version_info | gauge | Node.js version info |
| nodejs_gc_duration_seconds | histogram | Garbage collection duration by kind |
| nodejs_eventloop_utilization | gauge | The calculated Event Loop Utilization (ELU) as a percentage |

PostgreSQL metrics

These metrics provide information about the PostgreSQL connection pool:

| Metric | Type | Description |
| --- | --- | --- |
| pg_pool_max_size | gauge | Max size of the PG pool |
| pg_pool_size | gauge | Current size of the PG pool |
| pg_active_connections | gauge | Number of active connections in the PG pool |
| pg_idle_connections | gauge | Number of idle connections in the PG pool |
| pg_queued_queries | gauge | Number of queries waiting in the PG pool queue |
| pg_connection_acquire_time | histogram | Time to acquire a connection from the PG pool in milliseconds |
| pg_query_duration | histogram | Histogram of PG query execution time in milliseconds |
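
Prometheus histograms are exposed as `_bucket`, `_sum`, and `_count` series, so an average can be derived from two scrapes. The sketch below is illustrative Python with a hypothetical helper name, `mean_between_scrapes`. It assumes each scrape has already been reduced to a plain dictionary of series name to value and that the histogram carries no extra labels.

```python
from typing import Mapping, Optional


def mean_between_scrapes(
    earlier: Mapping[str, float],
    later: Mapping[str, float],
    name: str,
) -> Optional[float]:
    """Mean observation recorded by histogram `name` between two scrapes.

    Returns None when no new observations were recorded, or when the
    counters went backwards (for example after a process restart).
    """
    new_observations = later[f"{name}_count"] - earlier[f"{name}_count"]
    if new_observations <= 0:
        return None
    added_total = later[f"{name}_sum"] - earlier[f"{name}_sum"]
    return added_total / new_observations


# Illustrative values only, not real Lightdash output.
earlier = {"pg_query_duration_sum": 1000.0, "pg_query_duration_count": 40.0}
later = {"pg_query_duration_sum": 1250.0, "pg_query_duration_count": 50.0}

# 250 ms of query time over 10 new queries: a mean of 25.0 ms per query.
print(mean_between_scrapes(earlier, later, "pg_query_duration"))
```

Because `pg_query_duration` is recorded in milliseconds, the result is also in milliseconds.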

Queue metrics

| Metric | Type | Description |
| --- | --- | --- |
| queue_size | gauge | Number of jobs in the queue |
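
To read any of these metrics programmatically, the text returned by the metrics endpoint can be parsed line by line. The following Python sketch handles only the simple cases of the Prometheus text format: it skips comment lines and keeps label sets as raw text. The `parse_samples` helper and the example payload are illustrative, not real Lightdash output. If `LIGHTDASH_PROMETHEUS_PREFIX` is set, the metric names in a real response will differ.

```python
import re

# One sample line: name, optional {labels}, value, optional timestamp.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?P<labels>\{.*\})?"
    r"\s+(?P<value>\S+)"
    r"(?:\s+-?\d+)?$"
)


def parse_samples(text: str) -> dict[str, float]:
    """Map each series (name plus raw label text) to its value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            key = match["name"] + (match["labels"] or "")
            samples[key] = float(match["value"])
    return samples


# Illustrative payload with made-up values.
EXAMPLE = """\
# HELP queue_size Number of jobs in the queue
# TYPE queue_size gauge
queue_size 12
# TYPE nodejs_active_handles gauge
nodejs_active_handles{type="Socket"} 3
nodejs_active_handles_total 3
"""

samples = parse_samples(EXAMPLE)
print(samples["queue_size"])
```

For anything beyond a quick check, an established Prometheus client library is likely a better choice than a hand-written parser.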

Using metrics for monitoring and alerting

You can use these metrics to create dashboards and alerts in your monitoring system. Some common use cases include:

  • Monitoring memory usage and setting alerts for potential memory leaks
  • Tracking PostgreSQL connection pool utilization
  • Monitoring event loop lag to detect performance issues
  • Setting up alerts for high CPU usage

For example, you might want to create alerts for:

  • High memory usage: process_resident_memory_bytes > threshold
  • Event loop lag: nodejs_eventloop_lag_p99_seconds > threshold
  • Database connection pool saturation: pg_active_connections / pg_pool_max_size > 0.8
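
The three example conditions can be sketched as a small Python check over a dictionary of current metric values. In practice these conditions would usually be defined as alerting rules in your monitoring system; the code below only illustrates the logic. The function name, the alert names, and the memory and lag thresholds are hypothetical. Only the 0.8 saturation ratio comes from the example above.

```python
from typing import Mapping


def firing_alerts(
    samples: Mapping[str, float],
    memory_limit_bytes: float,
    lag_limit_seconds: float,
    pool_saturation_limit: float = 0.8,
) -> list[str]:
    """Return the names of the example alerts whose condition currently holds."""
    alerts = []

    # High memory usage: process_resident_memory_bytes > threshold
    if samples["process_resident_memory_bytes"] > memory_limit_bytes:
        alerts.append("high_memory_usage")

    # Event loop lag: nodejs_eventloop_lag_p99_seconds > threshold
    if samples["nodejs_eventloop_lag_p99_seconds"] > lag_limit_seconds:
        alerts.append("event_loop_lag")

    # Pool saturation: pg_active_connections / pg_pool_max_size > 0.8
    pool_max = samples["pg_pool_max_size"]
    if pool_max > 0 and samples["pg_active_connections"] / pool_max > pool_saturation_limit:
        alerts.append("pg_pool_saturation")

    return alerts


# Illustrative values and thresholds only.
current = {
    "process_resident_memory_bytes": 1.5e9,
    "nodejs_eventloop_lag_p99_seconds": 0.05,
    "pg_active_connections": 9,
    "pg_pool_max_size": 10,
}

# 9 of 10 connections are active (0.9 > 0.8), so only the pool alert fires.
print(firing_alerts(current, memory_limit_bytes=2e9, lag_limit_seconds=0.1))
```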

Setting up a Prometheus server

If you don’t already have a Prometheus server set up, here are some resources to help you get started:

  • General Prometheus setup
  • Setting up Prometheus in Google Cloud Platform (GCP)
  • Setting up Prometheus in Amazon Web Services (AWS)