Configuration Reference
FastForward commands that operate on pipeline config (for example run, validate, dry-run, and effective-config) accept a YAML file via --config <config.yaml>.
ff send accepts a destination-only config with top-level output, injects stdin as the input, drains the output, and exits.
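A minimal destination-only config for ff send might look like this (stdout is used so the example is self-contained):

```yaml
# send.yaml — destination-only config for `ff send`
output:
  type: stdout
  format: json
```

Invoked as, for example, `cat app.log | ff send --config send.yaml` (assuming ff send takes the same --config flag as the other commands), stdin becomes the input and the process exits once the output drains.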
Overview
FastForward pipeline configs use a top-level pipelines map. Each named
pipeline defines inputs, optional transform, optional enrichment,
optional resource_attrs, and outputs.
Environment variables are expanded using ${VAR} syntax anywhere in the file.
If a variable is not set, config loading fails fast with a validation error.
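For example, a bearer token can be injected from the environment (ES_TOKEN is an illustrative variable name):

```yaml
outputs:
  - type: elasticsearch
    endpoint: https://es-cluster:9200
    auth:
      bearer_token: "${ES_TOKEN}"  # config loading fails if ES_TOKEN is unset
```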
Config layout
```yaml
pipelines:
  errors:
    inputs:
      - name: pod_logs
        type: file
        path: /var/log/pods/**/*.log
        format: cri
    transform: SELECT * FROM logs WHERE level = 'ERROR'
    outputs:
      - type: otlp
        endpoint: http://otel-collector:4318/v1/logs
  debug:
    inputs:
      - type: file
        path: /var/log/pods/**/*.log
        format: cri
    outputs:
      - type: stdout
        format: json

server:
  diagnostics: 0.0.0.0:9090
```

Input configuration
Each pipeline requires at least one input. Use a single mapping for one input or a YAML sequence for multiple inputs.
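Both forms are accepted; the paths below are illustrative:

```yaml
pipelines:
  single:
    inputs:          # one input as a single mapping
      type: file
      path: /var/log/app.log
    outputs:
      - type: stdout
  several:
    inputs:          # multiple inputs as a YAML sequence
      - type: file
        path: /var/log/app.log
      - type: stdin
    outputs:
      - type: stdout
```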
Common fields
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Input type. See Input types. |
| name | string | No | Friendly name shown in diagnostics. |
| format | string | No | Log format. See Formats. Defaults to auto. |
| source_metadata | string | No | Source metadata style. Defaults to none. Use fastforward for the internal __source_id, ecs for public ECS columns such as file.path, otel for public OpenTelemetry columns such as log.file.path, or vector for public Vector-style columns such as file. Public source path styles require inputs that expose source path snapshots: file inputs expose filesystem paths; S3 inputs expose object keys. |
file input
Tail one or more log files that match a glob pattern.
| Field | Type | Required | Description |
|---|---|---|---|
| path | string | Yes | Glob pattern, e.g. /var/log/pods/**/*.log. |
| poll_interval_ms | integer | No | How often to poll the file when tailing, in milliseconds (default: 50). |
| read_buf_size | integer | No | Buffer size for file reads in bytes (default: 262144, max: 4194304). |
| per_file_read_budget_bytes | integer | No | Maximum bytes read per file per poll (default: 262144). |
| adaptive_fast_polls_max | integer | No | Immediate repoll budget after a read-budget hit (default: 8; set 0 to disable adaptive fast repolls). |
```yaml
input:
  type: file
  path: /var/log/pods/**/*.log
  format: cri
```

s3 input
Read objects from AWS S3 or an S3-compatible endpoint. The S3 input can poll
ListObjectsV2 by prefix, or process object notifications from SQS when
sqs_queue_url is set. S3 support is behind the s3 feature.
| Field | Type | Required | Description |
|---|---|---|---|
| s3.bucket | string | Yes | Bucket name. |
| s3.region | string | No | AWS region. Defaults to us-east-1. |
| s3.endpoint | string | No | S3-compatible endpoint URL, for example http://localhost:9000. Path-style addressing is used when set. |
| s3.prefix | string | No | Only process object keys with this prefix. |
| s3.sqs_queue_url | string | No | SQS queue URL for event-driven object discovery. Omit to poll the bucket prefix. |
| s3.start_after | string | No | Initial ListObjectsV2 StartAfter key for prefix polling. |
| s3.access_key_id | string | No | AWS access key ID. Falls back to AWS_ACCESS_KEY_ID. |
| s3.secret_access_key | string | No | AWS secret access key. Falls back to AWS_SECRET_ACCESS_KEY. |
| s3.session_token | string | No | AWS session token. Falls back to AWS_SESSION_TOKEN. |
| s3.part_size_bytes | integer | No | Range-GET part size. Defaults to 1048576. |
| s3.max_concurrent_fetches | integer | No | Maximum concurrent range GETs per object. Defaults to 16. |
| s3.max_concurrent_objects | integer | No | Maximum objects fetched at once. Defaults to 4. |
| s3.visibility_timeout_secs | integer | No | SQS visibility timeout. Defaults to 300 and must be at least 30. |
| s3.compression | enum | No | Compression override: auto, gzip, zstd, snappy, or none. Defaults to auto. |
| s3.poll_interval_ms | integer | No | Prefix polling interval in milliseconds. Defaults to 5000. |
When source_metadata is ecs, otel, or vector, S3 exposes the
object key as the source path value for each row. fastforward only attaches
the internal __source_id, and none attaches no source metadata.
```yaml
input:
  type: s3
  format: json
  source_metadata: ecs
  s3:
    bucket: app-logs
    region: us-east-1
    prefix: prod/
```

stdin input
Read data from standard input. This input has no input-specific fields.
It is primarily used by ff send for command-line piping.
```yaml
input:
  type: stdin
  format: auto
```

Supported formats: auto, cri, json, and raw.
generator input
Emit synthetic records for benchmarks, demos, and pipeline tests.
| Field | Type | Required | Description |
|---|---|---|---|
| generator | object | No | Generator settings. If omitted, runtime defaults are used, including batch_size: 1000 and events_per_sec: 0. See Input Types for the detailed nested fields. |
| generator.profile | string | No | Generator profile: logs (default), record, envoy, cri_k8s, wide, narrow, cloud_trail. See Generator profiles. |
```yaml
input:
  type: generator
  generator:
    events_per_sec: 50000
    batch_size: 4096
```

udp input
Listen for log lines on a UDP socket.
| Field | Type | Required | Description |
|---|---|---|---|
| listen | string | Yes | host:port, e.g. 0.0.0.0:514. |
```yaml
input:
  type: udp
  listen: 0.0.0.0:514
  format: json
```

tcp input
Accept log lines on a TCP socket.
| Field | Type | Required | Description |
|---|---|---|---|
| listen | string | Yes | host:port, e.g. 0.0.0.0:5140. |
| tls | object | No | Optional server TLS options (cert_file, key_file, client_ca_file, require_client_auth). client_ca_file is valid only when require_client_auth: true. |
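A TLS-enabled listener with mutual authentication might look like this (the certificate paths are illustrative):

```yaml
input:
  type: tcp
  listen: 0.0.0.0:5140
  format: json
  tls:
    cert_file: /etc/ffwd/tls/server.crt    # illustrative paths
    key_file: /etc/ffwd/tls/server.key
    require_client_auth: true
    client_ca_file: /etc/ffwd/tls/clients-ca.crt  # valid only with require_client_auth: true
```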
```yaml
input:
  type: tcp
  listen: 0.0.0.0:5140
  format: json
```

otlp input
Receive OTLP log records from another agent or SDK.
| Field | Type | Required | Description |
|---|---|---|---|
| listen | string | Yes | host:port, e.g. 0.0.0.0:4318. |
| protobuf_decode_mode | string | No | Experimental protobuf decoder: prost (default), projected_fallback, or projected_only. Projected modes require a build with the otlp-research feature. |
```yaml
input:
  type: otlp
  listen: 0.0.0.0:4318
  protobuf_decode_mode: prost
```

http input
Receive newline-delimited payloads over HTTP POST.
| Field | Type | Required | Description |
|---|---|---|---|
| listen | string | Yes | host:port, e.g. 0.0.0.0:8081. |
| http.path | string | No | Route path. Must start with /. Defaults to /. |
| http.strict_path | boolean | No | When true (default), require an exact path match. |
| http.method | string | No | Accepted method. Defaults to POST. |
| http.max_request_body_size | integer | No | Maximum request body size in bytes. Defaults to 10 MiB. |
| http.max_drained_bytes_per_poll | integer | No | Maximum bytes drained from the internal request queue per poll. Defaults to 1 GiB. |
| http.response_code | integer | No | Success code. One of 200, 201, 202, 204 (default 200). |
| http.response_body | string | No | Optional static success response body. Not allowed when http.response_code: 204. |
```yaml
input:
  type: http
  listen: 0.0.0.0:8081
  format: json
  http:
    path: /ingest
    strict_path: true
    method: POST
    max_request_body_size: 10485760
    max_drained_bytes_per_poll: 1073741824
    response_code: 200
    response_body: '{"ok":true}'
```

linux_ebpf_sensor input
Linux eBPF sensor input for platform-native ingestion. This input is
Arrow-native and does not support format.
| Field | Type | Required | Description |
|---|---|---|---|
| sensor.poll_interval_ms | integer | No | Periodic sample cadence in milliseconds. Must be >= 1. Defaults to 10000. |
| sensor.control_path | string | No | Optional JSON control-plane file path for runtime reload. |
| sensor.control_reload_interval_ms | integer | No | Reload check interval in milliseconds. Must be >= 1. Defaults to 1000. |
| sensor.enabled_families | array[string] | No | Optional enabled signal families for this target (process, file, network, dns, authz on Linux). Omit to use defaults; set [] to disable all families. |
| sensor.emit_signal_rows | boolean | No | Emit periodic per-family sample rows. Defaults to true. |
```yaml
input:
  type: linux_ebpf_sensor
  sensor:
    poll_interval_ms: 2000
```

macos_es_sensor input
macOS EndpointSecurity sensor input. This input is Arrow-native and does not
support format.
| Field | Type | Required | Description |
|---|---|---|---|
| sensor.poll_interval_ms | integer | No | Periodic sample cadence in milliseconds. Must be >= 1. Defaults to 10000. |
| sensor.control_path | string | No | Optional JSON control-plane file path for runtime reload. |
| sensor.control_reload_interval_ms | integer | No | Reload check interval in milliseconds. Must be >= 1. Defaults to 1000. |
| sensor.enabled_families | array[string] | No | Optional enabled signal families (process, file, network, dns, module, authz on macOS). Omit to use defaults; set [] to disable all families. |
| sensor.emit_signal_rows | boolean | No | Emit periodic per-family sample rows. Defaults to true. |
```yaml
input:
  type: macos_es_sensor
```

windows_ebpf_sensor input
Windows eBPF sensor input. This input is Arrow-native and does not support
format.
| Field | Type | Required | Description |
|---|---|---|---|
| sensor.poll_interval_ms | integer | No | Periodic sample cadence in milliseconds. Must be >= 1. Defaults to 10000. |
| sensor.control_path | string | No | Optional JSON control-plane file path for runtime reload. |
| sensor.control_reload_interval_ms | integer | No | Reload check interval in milliseconds. Must be >= 1. Defaults to 1000. |
| sensor.enabled_families | array[string] | No | Optional enabled signal families (process, file, network, dns, module, registry, authz on Windows). Omit to use defaults; set [] to disable all families. |
| sensor.emit_signal_rows | boolean | No | Emit periodic per-family sample rows. Defaults to true. |
```yaml
input:
  type: windows_ebpf_sensor
```

macos_log input
Read macOS unified log entries by running log stream. This input is only
available on macOS and emits structured rows parsed from the command output.
| Field | Type | Required | Description |
|---|---|---|---|
| macos_log.level | string | No | Optional log level filter. Must not be empty when set. |
| macos_log.subsystem | string | No | Optional subsystem filter. Must not be empty when set. |
| macos_log.process | string | No | Optional process filter. Must not be empty when set. |
```yaml
pipelines:
  default:
    inputs:
      - type: macos_log
        macos_log:
          level: info
          subsystem: com.example.app
    outputs:
      - type: stdout
        format: json
```

journald input
Read structured entries from the systemd journal using either the native sd_journal
C API or a journalctl subprocess.
| Field | Type | Required | Description |
|---|---|---|---|
| journald.include_units | list | No | Systemd units to include. The .service suffix is appended automatically when omitted. |
| journald.exclude_units | list | No | Systemd units to exclude. |
| journald.identifiers | list | No | Syslog identifiers (SYSLOG_IDENTIFIER=) to include. |
| journald.priorities | list | No | Priority/log levels to include (e.g. 0, 3, info, err). |
| journald.cursor_path | string | No | Path to persist the cursor for resume after restarts. |
| journald.include_boot_id | bool | No | Include the _BOOT_ID field (default: false). |
| journald.current_boot_only | bool | No | Only include entries from the current boot (default: true). |
| journald.since_now | bool | No | Only include entries appended after start (default: false). |
| journald.journalctl_path | string | No | Path to the journalctl binary. Defaults to journalctl on PATH. |
| journald.journal_directory | string | No | Custom journal directory (--directory=<path>). |
| journald.journal_namespace | string | No | Journal namespace (--namespace=<ns>). |
| journald.backend | enum | No | auto (default), native (require the sd_journal API), or subprocess (always use journalctl). |
```yaml
pipelines:
  default:
    inputs:
      - type: journald
        journald:
          include_units:
            - nginx
            - redis
          priorities:
            - err
            - warning
          cursor_path: /var/lib/ffwd/journald.cursor
```

host_metrics input
Host metrics input that collects process snapshots, CPU, memory, and network
statistics via sysinfo. This input is Arrow-native and does not support
format. The OS-specific implementation is selected at compile time based on
the build target.
| Field | Type | Required | Description |
|---|---|---|---|
| sensor.poll_interval_ms | integer | No | Periodic sample cadence in milliseconds. Must be >= 1. Defaults to 10000. |
| sensor.control_path | string | No | Optional JSON control-plane file path for runtime reload. |
| sensor.control_reload_interval_ms | integer | No | Reload check interval in milliseconds. Must be >= 1. Defaults to 1000. |
| sensor.enabled_families | array[string] | No | Optional enabled signal families. Omit to use defaults; set [] to disable all families. |
| sensor.emit_signal_rows | boolean | No | Emit periodic per-family sample rows. Defaults to true. |
| sensor.max_rows_per_poll | integer | No | Upper bound on data rows returned per collection cycle. Defaults to 256. Set to 0 or omit for the default. |
| sensor.max_process_rows_per_poll | integer | No | Upper bound on process snapshot rows returned per collection cycle. Defaults to 1024. Set to 0 or omit for the default. |
| sensor.scrapers | array[string] | No | List of scrapers to run. Supported values are: cpu, memory, disk, network, filesystem. |
| sensor.collection_interval_ms | integer | No | Metrics collection cadence in milliseconds. Defaults to 10000. |
| sensor.disk_include_devices | array[string] | No | Optional list of disk devices to include in scraping. |
| sensor.disk_exclude_devices | array[string] | No | Optional list of disk devices to exclude from scraping. |
| sensor.network_include_interfaces | array[string] | No | Optional list of network interfaces to include in scraping. |
| sensor.network_exclude_interfaces | array[string] | No | Optional list of network interfaces to exclude from scraping. |
| sensor.filesystem_include_mount_points | array[string] | No | Optional list of filesystem mount points to include in scraping. |
| sensor.filesystem_exclude_mount_points | array[string] | No | Optional list of filesystem mount points to exclude from scraping. |
```yaml
input:
  type: host_metrics
  sensor:
    poll_interval_ms: 5000
    collection_interval_ms: 5000
    scrapers: ["cpu", "memory"]
```

arrow_ipc input
Receive Arrow IPC stream payloads over HTTP POST and forward decoded
RecordBatch values directly into the pipeline (scanner bypass).
| Field | Type | Required | Description |
|---|---|---|---|
| listen | string | Yes | host:port, e.g. 0.0.0.0:4319. |
Behavior:
- The route is fixed to POST /v1/arrow for MVP. arrow_ipc is Arrow-native and rejects format.
- Canonical payload types are application/vnd.apache.arrow.stream and application/vnd.apache.arrow.stream+zstd. Content-Encoding: zstd is also supported for compressed Arrow stream payloads.
- The receiver currently decodes by payload bytes and may still accept requests with missing or other content-type headers; use the canonical content types for predictable interoperability.
Input types
| Value | Status | Description |
|---|---|---|
| file | Implemented | Tail files matching a glob pattern. |
| s3 | Implemented | Read objects from AWS S3 or an S3-compatible endpoint. |
| stdin | Implemented | Read piped stdin until EOF, then drain outputs and exit. |
| generator | Implemented | Emit synthetic JSON-like records from an in-process source. |
| udp | Implemented | Receive log lines over UDP. |
| tcp | Implemented | Accept log lines over TCP. |
| otlp | Implemented | Receive OTLP logs over a bound listen address. |
| http | Implemented | Receive newline-delimited payloads via HTTP POST. |
| linux_ebpf_sensor | Implemented | Linux eBPF sensor input (Arrow-native control + signal rows). |
| macos_es_sensor | Implemented | macOS EndpointSecurity sensor input (Arrow-native control + signal rows). |
| macos_log | Implemented | Read macOS unified log entries from the log stream command. |
| windows_ebpf_sensor | Implemented | Windows eBPF sensor input (Arrow-native control + signal rows). |
| journald | Beta | Read structured journal entries from systemd journald. |
| host_metrics | Implemented | Host metrics input — process snapshots, CPU, memory, network stats via sysinfo (Arrow-native). |
| arrow_ipc | Implemented | Receive Arrow IPC stream batches via HTTP POST /v1/arrow. |
Formats
The format field controls how raw bytes from the input are parsed into log records.
linux_ebpf_sensor, macos_es_sensor, windows_ebpf_sensor, and arrow_ipc are
Arrow-native and reject format.
| Value | Description |
|---|---|
| auto | Auto-detect (default). Tries CRI first, then JSON, then raw. |
| cri | CRI container log format (<timestamp> <stream> <flags> <message>). Multi-line log reassembly via the P partial flag is supported. |
| json | Newline-delimited JSON. Each line must be a single JSON object. |
| raw | Treat each line as an opaque string stored in body. |
| logfmt | Key=value pairs (e.g. level=info msg="hello"). Not yet implemented. |
| console | Human-readable coloured output for interactive debugging. Output mode only. |
Output configuration
Each pipeline requires at least one output.
Common fields
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Output type. See Output types. |
| name | string | No | Friendly name shown in diagnostics. |
otlp output
Send log records as OTLP protobuf to an OpenTelemetry collector.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| endpoint | string | Yes | — | Full collector URL, e.g. http://otel-collector:4317 (gRPC) or http://otel-collector:4318/v1/logs (HTTP). |
| protocol | enum | No | http | http or grpc. Invalid values are rejected while parsing config. |
| compression | enum | No | none | zstd, gzip, or none for the request body. Invalid values are rejected while parsing config. |
| auth | object | No | — | Optional bearer token or custom headers for HTTP auth. |
| tls | object | No | — | Optional TLS client options (ca_file, cert_file, key_file, insecure_skip_verify) for HTTPS endpoints. |
| headers | map[string,string] | No | — | Additional static HTTP headers to send with every export request. |
| retry_attempts | integer | No | — | Maximum export retry attempts. |
| retry_initial_backoff_ms | integer | No | — | Initial backoff delay in milliseconds. |
| retry_max_backoff_ms | integer | No | — | Maximum backoff delay in milliseconds. |
| request_timeout_ms | integer | No | — | Export request timeout in milliseconds. |
| batch_size | integer | No | — | Maximum rows per OTLP request. |
| batch_timeout_ms | integer | No | — | Maximum time to buffer rows before exporting. |
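The retry, timeout, and batching fields above can be combined in one output; the values below are illustrative, not recommendations:

```yaml
output:
  type: otlp
  endpoint: http://otel-collector:4318/v1/logs
  protocol: http
  compression: gzip
  headers:
    x-team: platform          # illustrative static header
  retry_attempts: 5
  retry_initial_backoff_ms: 200
  retry_max_backoff_ms: 5000
  request_timeout_ms: 10000
  batch_size: 8192
  batch_timeout_ms: 1000
```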
```yaml
pipelines:
  default:
    inputs:
      - type: stdin
        format: json
    outputs:
      - type: otlp
        endpoint: http://otel-collector:4317
        protocol: grpc
        compression: zstd
```

http output
Send newline-delimited JSON rows to an HTTP endpoint with optional request-body compression and auth headers.
| Field | Type | Required | Description |
|---|---|---|---|
| endpoint | string | Yes | Full URL, e.g. http://ingest.example.com/logs. |
| format | enum | No | Must be json when set. |
| compression | enum | No | zstd, gzip, or none. |
| auth | object | No | Optional bearer token or custom headers for HTTP auth. |
```yaml
pipelines:
  default:
    inputs:
      - type: stdin
        format: json
    outputs:
      - type: http
        endpoint: https://ingest.example.com/logs
        compression: zstd
```

stdout output
Print records to standard output for local debugging.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| format | string | No | console | json (newline-delimited JSON), console (coloured text), or text (raw text). |
```yaml
output:
  type: stdout
  format: console
```

elasticsearch output
Ship to Elasticsearch via the Bulk API.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| endpoint | string | Yes | — | Elasticsearch base URL. |
| index | string | No | logs | Target index name. Must not be empty, and must not contain Elasticsearch-reserved characters or prefixes. |
| compression | enum | No | none | gzip or none. zstd is rejected for Elasticsearch by validation. |
| request_mode | enum | No | buffered | buffered or streaming. Invalid values are rejected while parsing config; streaming currently requires compression: none. |
| request_timeout_ms | integer | No | 30000 | HTTP request timeout in milliseconds. Must be >= 1 when set. |
| tls | object | No | — | Optional TLS client options (ca_file, cert_file, key_file, insecure_skip_verify) for HTTPS endpoints. |
| auth | object | No | — | Optional bearer token or custom headers for HTTP auth. |
| retry | object | No | — | Optional retry configuration (max_attempts, initial_backoff_secs, max_backoff_secs). |
| batch | object | No | — | Optional batching configuration (max_bytes, max_events, timeout_secs). |
Bulk payloads are split before they exceed 5242880 bytes (5 MiB). That limit is internal and is not a YAML field.
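The nested retry and batch objects use the field names listed in the table above; the values here are illustrative:

```yaml
output:
  type: elasticsearch
  endpoint: https://es-cluster:9200
  index: logs
  retry:
    max_attempts: 5
    initial_backoff_secs: 1
    max_backoff_secs: 30
  batch:
    max_bytes: 1048576
    max_events: 5000
    timeout_secs: 5
```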
```yaml
output:
  type: elasticsearch
  endpoint: https://es-cluster:9200
  index: logs
  compression: gzip
  request_mode: buffered
  auth:
    bearer_token: "${ES_TOKEN}"
```

loki output
Push to Grafana Loki.
| Field | Type | Required | Description |
|---|---|---|---|
| endpoint | string | Yes | Loki base URL. The /loki/api/v1/push path is appended automatically. |
| request_timeout_ms | integer | No | HTTP request timeout in milliseconds (default: 30000). Must be >= 1 when set. |
| tls | object | No | Optional TLS client options (ca_file, cert_file, key_file, insecure_skip_verify) for HTTPS endpoints. |
| tenant_id | string | No | Optional value sent as X-Scope-OrgID for multi-tenant Loki deployments. |
| static_labels | map[string,string] | No | Static labels applied to every pushed log stream. Keys and values must be non-empty. |
| label_columns | array[string] | No | Additional log columns to promote as Loki labels. |
| auth | object | No | Optional bearer token or custom headers for HTTP auth. |
| retry | object | No | Optional retry configuration (max_attempts, initial_backoff_secs, max_backoff_secs). |
| batch | object | No | Optional batching configuration (max_bytes, max_events, timeout_secs). |
```yaml
output:
  type: loki
  endpoint: http://loki:3100
  tenant_id: team-a
  static_labels:
    app: ffwd
    env: prod
  label_columns: [service, level]
  auth:
    bearer_token: "${LOKI_TOKEN}"
```

compression is not supported for Loki outputs.
file output
Write records to a file.
| Field | Type | Required | Description |
|---|---|---|---|
| path | string | Yes | Destination file path. Parent directory must already exist and be writable. |
| format | string | No | json for NDJSON output, or text to write raw lines. |
```yaml
output:
  type: file
  path: /var/log/ffwd/capture.ndjson
  format: json
```

tcp output
Send newline-delimited JSON records to a TCP endpoint.
| Field | Type | Required | Description |
|---|---|---|---|
| endpoint | string | Yes | Host:port destination (for example tcp.example.com:9000). |
TCP output currently emits newline-delimited JSON. encoding, framing, tls,
keepalive, timeout_secs, retry, and batch are rejected until the sink
runtime implements them.
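A minimal TCP output looks like this (the destination host is illustrative):

```yaml
output:
  type: tcp
  endpoint: tcp.example.com:9000
```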
udp output
Send newline-delimited JSON records as UDP datagrams.
| Field | Type | Required | Description |
|---|---|---|---|
| endpoint | string | Yes | Host:port destination (for example udp.example.com:514). |
UDP output currently emits newline-delimited JSON using the built-in datagram
size. encoding and max_datagram_size_bytes are rejected until the sink
runtime implements them.
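A minimal UDP output looks like this (the destination host is illustrative):

```yaml
output:
  type: udp
  endpoint: udp.example.com:514
```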
null output
Drop records intentionally for tests and benchmark baselines. The type value must
be quoted as a YAML string; unquoted type: null is YAML’s null value and is
rejected. Null outputs do not accept sink-specific fields such as endpoint,
format, auth, tls, retry, or batch controls.
```yaml
output:
  type: "null"
```

Output types
| Value | Status | Description |
|---|---|---|
| otlp | Implemented | OTLP protobuf over HTTP or gRPC. |
| http | Implemented | POST newline-delimited JSON rows to an HTTP endpoint. |
| stdout | Implemented | Print to stdout (JSON, console, or text). |
| elasticsearch | Implemented | Elasticsearch Bulk API with index/compression/request-mode controls. |
| loki | Implemented | Grafana Loki push API with label grouping. |
| file | Implemented | Write NDJSON or text to a local file. |
| null | Implemented | Drop records intentionally for tests and benchmark baselines. |
| tcp | Implemented | Send records to a TCP endpoint. |
| udp | Implemented | Send records to a UDP endpoint. |
| arrow_ipc | Implemented | Send Arrow IPC payloads to an HTTP endpoint. |
SQL transform
The optional transform field contains a DataFusion SQL query that is applied to every
Arrow RecordBatch produced by the scanner. The source table is always named logs.
```yaml
transform: SELECT level, message, status FROM logs WHERE status >= 400
```

Multi-line SQL is supported with YAML block scalars:
```yaml
transform: |
  SELECT
    level,
    message,
    regexp_extract(message, 'request_id=([a-f0-9-]+)', 1) AS request_id,
    status
  FROM logs
  WHERE level IN ('ERROR', 'WARN') AND status >= 400
```

Column naming convention
The scanner maps each JSON field to a typed Arrow column using the field’s base name (no type suffix):
| JSON value type | Arrow column type | Column name | Example |
|---|---|---|---|
| String | StringArray | {field} | level |
| Integer | Int64Array | {field} | status |
| Float | Float64Array | {field} | latency_ms |
| Boolean | StringArray ("true"/"false") | {field} | enabled |
| Null | null in column | {field} | — |
| Object / Array | StringArray (raw JSON) | {field} | metadata |
When a field contains mixed types across rows, the scanner emits a single
Struct column under the field’s base name containing one child per observed
type (e.g., a status Struct with int and str children). Legacy
single-underscore suffixed columns (status_int, level_str) are not emitted.
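Assuming DataFusion's bracket-style struct field access, the int child of a mixed-type status column could be selected like this (a sketch, not syntax confirmed for this runtime):

```yaml
transform: |
  SELECT status['int'] AS status_int
  FROM logs
```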
Special columns attached by the runtime after scan, plus format-derived columns:
| Column | Type | Description |
|---|---|---|
| body | string | Original input line (when input line capture is enabled, e.g. line_field: body, or when a non-JSON CRI line is wrapped for scanner safety). |
| __source_id | uint64 | FastForward internal row-level source identity when source_metadata: fastforward is set. SQL can reference it, but user-facing sinks drop this internal column unless SQL aliases it to a public name. |
| file.path | string | ECS-style source file path when source_metadata: ecs is set. Quote it in SQL as "file.path". |
| log.file.path | string | OpenTelemetry-style source file path when source_metadata: otel is set. Quote it in SQL as "log.file.path". |
| file | string | Vector-style source file path when source_metadata: vector is set. |
| _timestamp | string | Timestamp from the CRI header as an RFC 3339 string (CRI inputs only). |
| _stream | string | CRI stream name (stdout / stderr). |
Source metadata is never written into raw input bytes. It is carried beside
scanner-ready chunks and, when source_metadata is not none, materialized as
table columns before SQL runs. SQL does no hidden pruning or widening: SELECT *
returns the columns that exist in the table. User-facing sinks drop known
FastForward internal columns such as __source_id by default; alias an
internal column to a public name when it should be emitted. User payload fields
that happen to start with __ are not treated as internal. Public source path
styles currently require inputs that expose source path snapshots. File inputs
use filesystem paths; S3 inputs use object keys. Use fastforward for source
identity on inputs that do not expose public source descriptors.
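For example, to keep source identity in the emitted rows, alias the internal column to a public name in the transform (a sketch; the alias name is arbitrary):

```yaml
transform: |
  SELECT *, __source_id AS source_id
  FROM logs
```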
Built-in UDFs
| Function | Signature | Description |
|---|---|---|
| int(expr) | int(any) → int64 | Cast any value to int64. Returns NULL on failure. |
| float(expr) | float(any) → float64 | Cast any value to float64. Returns NULL on failure. |
| grok(input, pattern) | grok(utf8, utf8) → Struct | Apply a Grok pattern to input and return the captures as a struct. |
| regexp_extract(input, pattern, group) | regexp_extract(utf8, utf8, int64) → utf8 | Return capture group group from a regex match. |
Examples:
```sql
-- Cast a string column to int
SELECT int(status) AS status FROM logs

-- Extract a field with Grok
SELECT grok(message, '%{IP:client} %{WORD:method} %{URIPATHPARAM:path}') AS parsed FROM logs

-- Extract a named group with regex
SELECT regexp_extract(message, 'user=([a-z]+)', 1) AS user FROM logs

-- Type-cast from environment-injected string
SELECT float(duration) AS duration_ms FROM logs
```

Enrichment tables
Enrichment tables are one-row (or multi-row) Arrow tables registered in DataFusion
alongside the logs table. Use CROSS JOIN for one-row tables or LEFT JOIN for
multi-row lookup tables.
```yaml
enrichment:
  - type: host_info
  - type: process_info
  - type: network_info
  - type: container_info
  - type: k8s_cluster_info
  - type: k8s_path
  - type: static
    table_name: labels
    labels:
      environment: production
      region: us-east-1
  - type: kv_file
    table_name: os_release
    path: /etc/os-release
  - type: env_vars
    table_name: deploy_meta
    prefix: FFWD_META_
  - type: csv
    table_name: assets
    path: /etc/ffwd/assets.csv
  - type: jsonl
    table_name: ip_owners
    path: /etc/ffwd/ip-owners.jsonl
  - type: geo_database
    format: mmdb
    path: /data/GeoLite2-City.mmdb
```

host_info enrichment
System host metadata, resolved once at startup. Fixed table name: host_info.
Extended fields are sourced from /etc/os-release, /proc/sys/kernel/osrelease,
/etc/machine-id, and /proc/sys/kernel/random/boot_id on Linux; they degrade
to empty strings on other platforms.
| Field | Description |
|---|---|
| style | Column naming convention: raw (default), ecs / beats, or otel. |
```yaml
enrichment:
  - type: host_info
    style: ecs   # Use ECS/Beats dotted column names
```

Column names by style:
| Semantic | raw (default) | ecs / beats | otel |
|---|---|---|---|
| hostname | hostname | host.hostname | host.name |
| OS type | os_type | host.os.type | os.type |
| architecture | os_arch | host.architecture | host.arch |
| OS name | os_name | host.os.name | os.name |
| OS family | os_family | host.os.family | os.family |
| OS version | os_version | host.os.version | os.version |
| kernel | os_kernel | host.os.kernel | os.kernel |
| machine id | host_id | host.id | host.id |
| boot id | boot_id | host.boot.id | host.boot.id |
```sql
SELECT l.*, h.hostname, h.os_type, h.os_name, h.os_kernel
FROM logs l CROSS JOIN host_info h
```

process_info enrichment
Agent self-metadata, resolved once at startup. Fixed table name: process_info.
| Column | Description |
|---|---|
| agent_name | Always ffwd. |
| agent_version | Semantic version of the running binary. |
| pid | Process ID (as string). |
| start_time | ISO 8601 UTC timestamp captured when the process_info enrichment table is constructed during pipeline startup. |
```sql
SELECT l.*, p.agent_version, p.start_time
FROM logs l CROSS JOIN process_info p
```

network_info enrichment
Network interface metadata from procfs, resolved once at startup. Fixed table name: network_info.
| Column | Description |
|---|---|
| hostname | System hostname. |
| primary_ipv4 | Lexicographically first non-loopback IPv4 address, or empty. On multihomed hosts this may not match the default-route interface; use all_ipv4 for full coverage. |
| primary_ipv6 | Lexicographically first non-loopback, non-link-local IPv6 address, or empty. Same caveat as primary_ipv4. |
| all_ipv4 | Comma-separated list of all non-loopback IPv4 addresses. |
| all_ipv6 | Comma-separated list of all non-loopback, non-link-local IPv6 addresses. |
```sql
SELECT l.*, n.primary_ipv4
FROM logs l CROSS JOIN network_info n
```

container_info enrichment

Container runtime detection from /proc/self/cgroup and /.dockerenv, resolved once
at startup. Fixed table name: container_info.
| Column | Description |
|---|---|
| container_id | 64-character hex container ID, or empty if not in a container. |
| container_runtime | docker, containerd, cri-o, kubernetes, unknown, or empty. |
```sql
SELECT l.*, c.container_id, c.container_runtime
FROM logs l CROSS JOIN container_info c
```

k8s_cluster_info enrichment

Kubernetes cluster metadata from the downward API environment variables, resolved
once at startup. Fixed table name: k8s_cluster_info.
Populate these via fieldRef in your DaemonSet pod spec:
```yaml
env:
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: K8S_CLUSTER_NAME
    value: my-cluster  # or from a ConfigMap
  - name: K8S_SERVICE_ACCOUNT
    valueFrom:
      fieldRef:
        fieldPath: spec.serviceAccountName
```

| Column | Description |
|---|---|
| namespace | Pod namespace (from mounted service account path). |
| pod_name | Pod name (from HOSTNAME). |
| node_name | Node name (from K8S_NODE_NAME or NODE_NAME). |
| service_account | Service account name (from K8S_SERVICE_ACCOUNT or SERVICE_ACCOUNT). |
| cluster_name | Cluster name (from K8S_CLUSTER_NAME or CLUSTER_NAME). |
```sql
SELECT l.*, k.namespace, k.node_name, k.cluster_name
FROM logs l CROSS JOIN k8s_cluster_info k
```

k8s_path enrichment

Parses Kubernetes pod log paths (e.g.
/var/log/pods/<namespace>_<pod>_<uid>/<container>/) to extract metadata.
Queries that join a source path to this table should set source_metadata: ecs
or another public path style on the file input and quote dotted column names in
SQL, for example "file.path". Source paths are never written into raw log
bytes.
Columns exposed by the enrichment table (named k8s_pods by default; set table_name: k8s in config to use the shorter k8s alias):
| Column | Description |
|---|---|
log_path_prefix | Directory prefix used as join key. |
namespace | Kubernetes namespace. |
pod_name | Pod name. |
pod_uid | Pod UID. |
container_name | Container name. |
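A hedged join sketch, assuming the default k8s_pods table name, a file input with source_metadata: ecs, and a starts_with-style prefix predicate (the exact predicate available depends on the SQL engine):

```sql
SELECT l.*, k.namespace, k.pod_name, k.container_name
FROM logs l
LEFT JOIN k8s_pods k
  ON starts_with(l."file.path", k.log_path_prefix)
```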
static enrichment
A one-row table with user-defined label columns from the YAML config.
```yaml
enrichment:
  - type: static
    table_name: labels
    labels:
      environment: production
      cluster: us-east-1
      tier: backend
```

```sql
SELECT l.*, lbl.environment, lbl.cluster
FROM logs l CROSS JOIN labels lbl
```

env_vars enrichment

A one-row table populated from environment variables matching a name prefix. The prefix is stripped and the remainder lower-cased to form column names.
```yaml
enrichment:
  - type: env_vars
    table_name: deploy_meta
    prefix: FFWD_META_
```

With FFWD_META_CLUSTER=prod and FFWD_META_REGION=us-east-1 set, the table
exposes cluster and region columns.

```sql
SELECT l.*, m.cluster, m.region
FROM logs l CROSS JOIN deploy_meta m
```

kv_file enrichment

A one-row table parsed from a KEY=value properties file. Supports unquoted,
double-quoted, and single-quoted values. Lines starting with # are comments.
Column names are keys lower-cased.
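For illustration, a fragment in the style of /etc/os-release:

```
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
VERSION_ID="12"
# lines starting with # are ignored
```

would yield pretty_name and version_id columns.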
```yaml
enrichment:
  - type: kv_file
    table_name: os_release
    path: /etc/os-release
    refresh_interval: 3600  # optional: reload every N seconds; must be >= 1 when set
```

```sql
SELECT l.*, os.pretty_name, os.version_id
FROM logs l CROSS JOIN os_release os
```

Useful for /etc/os-release, .env files, or ConfigMap-mounted metadata files.
csv enrichment
A multi-row lookup table loaded from a CSV file. All columns are UTF-8 strings
and are materialized internally as Arrow Utf8View columns for SQL execution.
The first row must be column headers. Empty cells are empty strings; missing
trailing cells are NULL.
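An illustrative assets.csv with hostname, owner, and team columns (hypothetical values):

```
hostname,owner,team
web-01,alice,platform
web-02,,sre
db-01,carol
```

Here web-02's empty owner cell loads as an empty string, while db-01's missing trailing team cell loads as NULL.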
```yaml
enrichment:
  - type: csv
    table_name: assets
    path: /etc/ffwd/assets.csv
    refresh_interval: 3600  # optional: reload every N seconds; must be >= 1 when set
```

```sql
SELECT l.*, a.owner, a.team
FROM logs l LEFT JOIN assets a ON l.hostname = a.hostname
```

jsonl enrichment

A multi-row lookup table loaded from a JSON Lines file (one JSON object per line). Columns are the union of all keys across all rows.
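An illustrative ip-owners.jsonl (hypothetical values):

```
{"ip": "10.0.0.12", "owner": "payments"}
{"ip": "10.0.0.13", "owner": "checkout", "team": "sre"}
```

The resulting table has ip, owner, and team columns — the union of keys across both rows.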
```yaml
enrichment:
  - type: jsonl
    table_name: ip_owners
    path: /etc/ffwd/ip-owners.jsonl
    refresh_interval: 1800  # optional: reload every N seconds; must be >= 1 when set
```

```sql
SELECT l.*, ipl.owner
FROM logs l LEFT JOIN ip_owners ipl ON l.client_ip = ipl.ip
```

geo_database enrichment

Registers a GeoIP database for use with the geo_lookup() SQL function.
Supports MaxMind MMDB and CSV IP-range formats.
```yaml
# MaxMind MMDB format
enrichment:
  - type: geo_database
    format: mmdb
    path: /data/GeoLite2-City.mmdb
    refresh_interval: 86400  # optional: reload daily; must be >= 1 when set
```

```yaml
# CSV IP-range format (DB-IP Lite compatible)
enrichment:
  - type: geo_database
    format: csv_range
    path: /data/dbip-city-lite.csv
```

```sql
SELECT l.*,
       geo_lookup(l.client_ip).country_code AS country,
       geo_lookup(l.client_ip).city AS city,
       geo_lookup(l.client_ip).latitude AS lat,
       geo_lookup(l.client_ip).longitude AS lon
FROM logs l
```

The geo_lookup() function returns a struct with these fields:
| Field | Type | Description |
|---|---|---|
| country_code | string | ISO 3166-1 two-letter code (e.g. US). |
| country_name | string | Full English country name. |
| city | string | City name. |
| region | string | State or subdivision name. |
| latitude | float | Decimal degrees. |
| longitude | float | Decimal degrees. |
| asn | integer | Autonomous System Number. |
| org | string | Organization name for the ASN. |
Server configuration
The optional server block controls the diagnostics server and observability settings.
| Field | Type | Default | Description |
|---|---|---|---|
| diagnostics | string | none | host:port to listen on for HTTP diagnostics. See Diagnostics API. |
| log_level | string | info | Log verbosity. One of error, warn, info, debug, trace. |
| metrics_endpoint | string | none | OTLP endpoint for periodic metrics push, e.g. http://otel-collector:4318. |
| metrics_interval_secs | integer | 60 | Push interval for OTLP metrics in seconds. |
```yaml
server:
  diagnostics: 0.0.0.0:9090
  log_level: info
  metrics_endpoint: http://otel-collector:4318
  metrics_interval_secs: 30
```

Diagnostics API

When server.diagnostics is configured, FastForward exposes an HTTP API for monitoring and troubleshooting.
| Route | Method | Description |
|---|---|---|
| / | GET | Dashboard HTML (visual explorer for metrics and traces). |
| /live | GET | Liveness probe. Returns 200 OK if the process and control plane are running. |
| /ready | GET | Readiness probe. Returns 200 OK when required components are initialized and in a ready health state; returns 503 while components are still starting, stopping, stopped, failed, or otherwise not ready. |
| /admin/v1/status | GET | Canonical rich status payload with live/ready state, component health, and per-pipeline counters. |
| /admin/v1/stats | GET | Aggregate process stats (uptime, RSS, CPU, aggregate line counts). |
| /admin/v1/config | GET | Currently loaded YAML configuration and its file path (disabled by default; enable with FFWD_UNSAFE_EXPOSE_CONFIG=1). May expose sensitive values; do not enable in shared or production environments unless strictly required. |
| /admin/v1/logs | GET | Recent log lines from FastForward’s own stderr (ring buffer). |
| /admin/v1/history | GET | Time-series data (1-hour window) for dashboard charts. |
| /admin/v1/traces | GET | Recent batch processing spans for detailed latency analysis. |
For input diagnostics, bytes_total reflects source payload bytes accepted at
the input boundary. For structured receivers such as OTLP, this is the
accepted request-body size as received on the wire, not the in-memory Arrow
batch footprint or the post-decompression payload size.
Storage configuration
The optional storage block controls where FastForward persists state (checkpoints, disk
queue).
| Field | Type | Default | Description |
|---|---|---|---|
| data_dir | string | none | Directory for state files. Created if it does not exist. |
```yaml
storage:
  data_dir: /var/lib/ffwd
```

Environment variable substitution

Any value in the config file can reference an environment variable with ${VAR}.
Variable names must start with an ASCII letter or _, and then contain only
ASCII letters, digits, or _. $VAR stays literal, and default expressions
such as ${VAR:fallback} are rejected because : is not valid in variable
names.
```yaml
output:
  type: otlp
  endpoint: ${OTEL_COLLECTOR_ADDR}

server:
  metrics_endpoint: ${METRICS_PUSH_URL}
```

If the variable is not set, config loading fails fast with a validation error.
An unterminated reference such as ${VAR is preserved literally so existing
config text is not rewritten accidentally; completed placeholders before that
literal tail are still expanded.
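A sketch of that behavior with illustrative values, assuming PORT is set to 9090:

```yaml
# "${PORT}" is a completed placeholder and expands to 9090;
# "${HOST" has no closing brace, so that tail is kept literally,
# leaving the value as "9090-${HOST"
endpoint_suffix: "${PORT}-${HOST"
```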
Environment variables are expanded as string data. The typed config schema then parses those strings into the field type, so numeric and boolean fields can read values directly from the environment without treating the env value as YAML:
```yaml
pipelines:
  app:
    workers: ${FFWD_WORKERS}
```

String fields remain strings even when the expanded value looks like a YAML number, boolean, or null:

```yaml
input:
  type: file
  path: ${LOG_PATH}
```

Placeholders embedded inside longer strings always remain strings:

```yaml
output:
  type: file
  path: "/var/log/${SERVICE_NAME}.jsonl"
```

Environment variables can also appear in mapping keys. If expansion produces duplicate keys, config loading fails.
Complete example
```yaml
pipelines:
  app:
    inputs:
      - name: pod_logs
        type: file
        path: /var/log/pods/**/*.log
        format: cri
    transform: |
      SELECT l.level, l.message, l.status, lbl.environment
      FROM logs l CROSS JOIN labels lbl
      WHERE l.level IN ('ERROR', 'WARN') OR l.status >= 500
    outputs:
      - name: collector
        type: otlp
        endpoint: ${OTEL_ENDPOINT}
        protocol: grpc
        compression: zstd
      - name: debug
        type: stdout
        format: console
    enrichment:
      - type: host_info
      - type: process_info
      - type: network_info
      - type: container_info
      - type: k8s_cluster_info
      - type: static
        table_name: labels
        labels:
          environment: ${ENVIRONMENT}
          cluster: ${CLUSTER_NAME}

server:
  diagnostics: 0.0.0.0:9090
  log_level: info
  metrics_endpoint: ${OTEL_ENDPOINT}
  metrics_interval_secs: 60

storage:
  data_dir: /var/lib/ffwd
```