
Monitoring & Diagnostics

```yaml
server:
  diagnostics: 127.0.0.1:9090
  log_level: info
```
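
Once the diagnostics server is up, the documented probe endpoints give a quick smoke test (adjust the host and port to match your `diagnostics` setting):

```sh
# Readiness probe: prints 200 once all pipelines have initialized
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9090/ready

# Canonical status payload: check the ready component
curl -s http://127.0.0.1:9090/admin/v1/status | jq '.ready'
```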

The log_level field controls the verbosity of FastForward’s own stderr output. Supported values, from most verbose to least:

| Value | When to use |
| --- | --- |
| `trace` | Deep debugging of I/O loops and buffer internals (very noisy) |
| `debug` | Investigating specific pipeline behavior or connection issues |
| `info` | Default. Startup, shutdown, config reload, and periodic summary lines |
| `warn` | Only warnings and errors (recommended for high-throughput production) |
| `error` | Only unrecoverable or action-required errors |

You can change the level at runtime by sending a PUT request:

```sh
curl -X PUT http://localhost:9090/admin/v1/log_level \
  -H 'Content-Type: application/json' \
  -d '"debug"'
```

| Endpoint | Description |
| --- | --- |
| `GET /live` | Liveness probe (process/control-plane only) |
| `GET /ready` | Readiness probe (200 once initialized) |
| `GET /admin/v1/status` | Canonical rich status JSON (live, ready, component health, per-pipeline detail) |
| `GET /admin/v1/stats` | Flattened JSON for polling/benchmarks |
| `GET /admin/v1/config` | View active YAML configuration (disabled by default; enable with `FFWD_UNSAFE_EXPOSE_CONFIG=1`) |
| `GET /admin/v1/logs` | View recent log lines from stderr |
| `GET /admin/v1/history` | Time-series data for dashboard charts |
| `GET /admin/v1/traces` | Detailed latency spans for recent batches |
| `GET /` | HTML dashboard |

Visiting http://<host>:9090/ in a browser opens a built-in, single-page HTML dashboard. It provides a live view of:

- Pipeline throughput — lines/sec per input and output, with sparkline charts.
- Flush behavior — ratio of size-triggered vs. timeout-triggered flushes.
- Stage latency — time spent in the scan, transform, and output stages.
- Error counts — recent transport errors or dropped batches.

The dashboard pulls data from /admin/v1/history and refreshes automatically. No external dependencies or authentication are required. It is a useful first-look tool before diving into the JSON API or Grafana dashboards.
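
To inspect the raw data behind those charts, you can query the history endpoint directly (its payload shape is not documented here, so start with an unfiltered dump):

```sh
curl -s http://localhost:9090/admin/v1/history | jq .
```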

`GET /admin/v1/status` returns the canonical health payload. Here is an example response (the field reference below explains each entry):

```json
{
  "live": { "status": "live" },
  "ready": { "status": "ready" },
  "pipelines": [
    {
      "name": "host-logs",
      "inputs": [
        {
          "type": "file",
          "lines_total": 184200,
          "bytes_total": 52428800,
          "errors_total": 0
        }
      ],
      "transform": {
        "lines_in": 184200,
        "lines_out": 91040,
        "filter_drop_rate": 0.506
      },
      "outputs": [
        {
          "type": "otlp",
          "lines_total": 91040,
          "bytes_total": 11206400,
          "errors_total": 0
        }
      ]
    }
  ],
  "system": {
    "uptime_seconds": 3621,
    "version": "0.14.0"
  }
}
```

Pipelines are returned as an array. Use `jq '.pipelines[0]'` to access the first pipeline, or filter by name with `jq '.pipelines[] | select(.name == "host-logs")'`.
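
For example, to read one pipeline's filter drop rate straight from the status payload (field paths follow the example above; the pipeline name is illustrative):

```sh
curl -s http://localhost:9090/admin/v1/status \
  | jq '.pipelines[] | select(.name == "host-logs") | .transform.filter_drop_rate'
```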

| Field | Description |
| --- | --- |
| `live` | true when the process is running and the control plane is healthy |
| `ready` | true once all pipelines have completed initialization |
| `system.uptime_seconds` | Seconds since the process started |
| `system.version` | FastForward binary version |
| `pipelines[].inputs[].lines_total` | Total lines read by this input since startup |
| `pipelines[].inputs[].bytes_total` | Total bytes read by this input |
| `pipelines[].inputs[].errors_total` | Cumulative read errors (file permissions, connection resets, etc.) |
| `pipelines[].inputs[].transport` | Transport-specific metrics (see Transport observability below) |
| `pipelines[].transform.lines_in` | Lines entering the SQL transform stage |
| `pipelines[].transform.lines_out` | Lines emitted after filtering |
| `pipelines[].transform.filter_drop_rate` | Fraction of input lines dropped by the filter (higher = more aggressive filter) |
| `pipelines[].outputs[].lines_total` | Lines successfully delivered to the destination |
| `pipelines[].outputs[].errors_total` | Delivery failures (connection errors, timeouts, HTTP 5xx) |
| `pipelines[].outputs[].last_flush_age_secs` | Seconds since the last successful flush; useful for staleness alerts |
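
As a concrete use of `last_flush_age_secs`, a minimal staleness check might look like this (a sketch; the 120 s threshold matches the alerting guidance below):

```sh
#!/bin/sh
# Fail if any output has not flushed successfully in the last 120 s.
STALE=$(curl -s http://localhost:9090/admin/v1/status \
  | jq '[.pipelines[].outputs[] | select(.last_flush_age_secs > 120)] | length')
if [ "$STALE" -gt 0 ]; then
  echo "ALERT: $STALE output(s) have not flushed in over 120 s"
  exit 1
fi
```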

The diagnostics WebSocket and JSON endpoints use the metric names below (dot-separated prefix, underscore-separated suffix). The OTLP push path (`metrics_endpoint`) uses fully underscore-separated names (e.g. `ffwd_input_lines`). See What gets pushed for details.

| Metric | Description |
| --- | --- |
| `ffwd.input_lines` | Total lines read across all inputs |
| `ffwd.input_bytes` | Total bytes read across all inputs |
| `ffwd.output_lines` | Total lines delivered to outputs |
| `ffwd.output_bytes` | Total bytes delivered to outputs |
| `ffwd.output_errors` | Cumulative output delivery errors |
| `ffwd.stage_nanos` (scan) | Time spent in the scan/parse stage (ns) |
| `ffwd.stage_nanos` (transform) | Time spent in the SQL transform stage (ns) |
| `ffwd.stage_nanos` (output) | Time spent serializing output batches (ns) |
| `ffwd.send_nanos` | Time spent transmitting batches to the destination (ns) |
| `ffwd.queue_wait_nanos` | Time a batch waited in the channel before processing (ns) |
| `ffwd.batches` | Total batches flushed |
| `ffwd.batch_rows` | Total rows across all flushed batches |
| `ffwd.dropped_batches` | Batches discarded due to scan, transform, or output errors |
| `ffwd.backpressure_stalls` | Times input stalled on a full channel |
| `ffwd.inflight_batches` | Batches currently in-flight |
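
`/admin/v1/stats` returns these counters as flattened JSON for polling. Assuming the flattened keys use the same dot-prefixed names (an assumption; dump the payload once to confirm), a rough throughput sample looks like:

```sh
# Sample the input-line counter twice, 10 s apart, and print lines/sec.
A=$(curl -s http://localhost:9090/admin/v1/stats | jq '."ffwd.input_lines"')
sleep 10
B=$(curl -s http://localhost:9090/admin/v1/stats | jq '."ffwd.input_lines"')
echo "lines/sec: $(( (B - A) / 10 ))"
```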

The `/admin/v1/status` endpoint includes a `transport` object inside each input's JSON representation, with metrics specific to that transport type (see the jq sketch after this list):

- File: exposes `consecutive_error_polls`, reflecting the current file-tail pressure and backoff state.
- TCP: exposes `accepted_connections` (total connections accepted) and `active_connections` (currently connected clients).
- UDP: exposes `drops_detected` (datagrams dropped due to kernel buffer overflows) and `recv_buffer_size` (the actual kernel receive buffer size applied).
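
To pull these out for every input, a jq filter along these lines should work (a sketch based on the example payload structure above):

```sh
# Show each input's type alongside its transport-specific metrics.
curl -s http://localhost:9090/admin/v1/status \
  | jq '.pipelines[] | {pipeline: .name, inputs: [.inputs[] | {type, transport}]}'
```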

Use the status endpoint and OTLP metrics to build alerts for the conditions that matter most. The table below lists recommended thresholds as starting points — adjust them based on your traffic patterns and SLOs.

| Condition | Metric / check | Suggested threshold | Severity |
| --- | --- | --- | --- |
| Process down | `/live` returns non-200 | 2 consecutive failures (30 s apart) | Critical |
| Not ready | `/ready` returns non-200 | > 60 s after container start | Warning |
| Output stale | `last_flush_age_secs` | > 120 s | Critical |
| Delivery errors | `output.errors_total` rate | > 0 sustained for 5 min | Warning |
| Input errors | `input.errors_total` rate | > 0 sustained for 5 min | Warning |
| High drop rate | `transform.filter_drop_rate` | > 0.99 (dropping >99% of lines) | Info |
| Memory pressure | Container memory usage | > 85 % of limit | Warning |
| CPU saturation | `ffwd.stage_nanos` rate | Approaching `--cpus` limit | Warning |
| UDP drops | `transport.drops_detected` rate | > 0 sustained for 2 min | Warning |
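
If the pushed OTLP metrics land in Prometheus via a collector (an assumption about your setup), the delivery-errors row above could become a rule like the following, using the underscore-separated push names:

```yaml
# Hypothetical Prometheus alerting rule; the metric name follows the
# underscore-separated OTLP push convention (ffwd_output_errors).
groups:
  - name: fastforward
    rules:
      - alert: FFWDDeliveryErrors
        expr: rate(ffwd_output_errors[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "FastForward delivery errors sustained for 5 min"
```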

In addition to the pull-based diagnostics API, FastForward can push its own internal metrics to an OpenTelemetry Collector over OTLP/HTTP.

```yaml
server:
  metrics_endpoint: https://otel-collector:4318
  metrics_interval_secs: 60
```

| Field | Description |
| --- | --- |
| `metrics_endpoint` | URL of the OTLP HTTP receiver (typically port 4318) |
| `metrics_interval_secs` | How often FastForward pushes a metrics batch (default: 60) |

The counters and histograms listed in the Key metrics table above are exported as OTLP metrics over HTTP. The push path uses the OpenTelemetry SDK, which registers instruments with underscore-separated names (e.g. `ffwd_input_lines`, `ffwd_output_bytes`). Each metric carries `pipeline`, `input`, or `output` attributes so you can filter by component. The payload uses OTLP protobuf encoding.
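
On the receiving side, a minimal OpenTelemetry Collector configuration that accepts these pushes could look like this (a sketch; the `debug` exporter just prints whatever arrives):

```yaml
# Accept OTLP over HTTP on 4318 and print incoming metrics for verification.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]
```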

If pushed metrics do not show up at the collector, work through these checks:

```sh
# 1. Confirm FastForward is sending metrics (look for export lines in debug logs)
docker logs ffwd 2>&1 | grep -i "metrics export"

# 2. Query the collector's own metrics to see ingest counts
curl -s http://otel-collector:8888/metrics | grep otelcol_receiver_accepted_metric_points

# 3. Check the status endpoint for push errors
curl -s http://localhost:9090/admin/v1/status | jq '.metrics_push'
```