Configuration Reference
FastForward commands that operate on pipeline config (for example run, validate, dry-run, and effective-config) accept a YAML file via --config <config.yaml>.
ff send accepts a destination-only config with top-level output, injects stdin as the input, drains the output, and exits.
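A minimal destination-only config for ff send might look like this (stdout is used so the example is self-contained):

```yaml
# send.yaml — destination-only config for `ff send`
output:
  type: stdout
  format: json
```

Invoked as, for example, `cat app.log | ff send --config send.yaml` (assuming ff send takes the same --config flag as the other commands), stdin becomes the input and the process exits once the output drains.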
Overview
FastForward pipeline configs use a top-level pipelines map. Each named
pipeline defines inputs, optional transform, optional enrichment,
optional resource_attrs, and outputs.
Environment variables are expanded using ${VAR} syntax anywhere in the file.
If a variable is not set, config loading fails fast with a validation error.
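For example, a bearer token can be injected from the environment (ES_TOKEN is an illustrative variable name):

```yaml
outputs:
  - type: elasticsearch
    endpoint: https://es-cluster:9200
    auth:
      bearer_token: "${ES_TOKEN}"  # config loading fails if ES_TOKEN is unset
```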
Config layout
```yaml
pipelines:
  errors:
    inputs:
      - name: pod_logs
        type: file
        path: /var/log/pods/**/*.log
        format: cri
    transform: SELECT * FROM logs WHERE level = 'ERROR'
    outputs:
      - type: otlp
        endpoint: http://otel-collector:4318/v1/logs
  debug:
    inputs:
      - type: file
        path: /var/log/pods/**/*.log
        format: cri
    outputs:
      - type: stdout
        format: json

server:
  diagnostics: 0.0.0.0:9090
```

Input configuration
Each pipeline requires at least one input. Use a single mapping for one input or a YAML sequence for multiple inputs.
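Both forms are accepted; the paths below are illustrative:

```yaml
pipelines:
  single:
    inputs:          # one input as a single mapping
      type: file
      path: /var/log/app.log
    outputs:
      - type: stdout
  several:
    inputs:          # multiple inputs as a YAML sequence
      - type: file
        path: /var/log/app.log
      - type: stdin
    outputs:
      - type: stdout
```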
Common fields
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Input type. See Input types. |
| name | string | No | Friendly name shown in diagnostics. |
| format | string | No | Log format. See Formats. Defaults to auto. |
| source_metadata | string | No | Source metadata style. Defaults to none. Use fastforward for the internal __source_id, ecs for public ECS columns such as file.path, otel for public OpenTelemetry columns such as log.file.path, or vector for public Vector-style columns such as file. Public source path styles require inputs that expose source path snapshots: file inputs expose filesystem paths; S3 inputs expose object keys. |
file input
Tail one or more log files that match a glob pattern.
| Field | Type | Required | Description |
|---|---|---|---|
| path | string | Yes | Glob pattern, e.g. /var/log/pods/**/*.log. |
| poll_interval_ms | integer | No | How often to poll the file when tailing, in milliseconds (default: 50). |
| read_buf_size | integer | No | Buffer size for file reads in bytes (default: 262144, max: 4194304). |
| per_file_read_budget_bytes | integer | No | Maximum bytes read per file per poll (default: 262144). |
| adaptive_fast_polls_max | integer | No | Immediate repoll budget after a read-budget hit (default: 8; set 0 to disable adaptive fast repolls). |
```yaml
input:
  type: file
  path: /var/log/pods/**/*.log
  format: cri
```

s3 input
Read objects from AWS S3 or an S3-compatible endpoint. The S3 input can poll
ListObjectsV2 by prefix, or process object notifications from SQS when
sqs_queue_url is set. S3 support is behind the s3 feature.
| Field | Type | Required | Description |
|---|---|---|---|
| s3.bucket | string | Yes | Bucket name. |
| s3.region | string | No | AWS region. Defaults to us-east-1. |
| s3.endpoint | string | No | S3-compatible endpoint URL, for example http://localhost:9000. Path-style addressing is used when set. |
| s3.prefix | string | No | Only process object keys with this prefix. |
| s3.sqs_queue_url | string | No | SQS queue URL for event-driven object discovery. Omit to poll the bucket prefix. |
| s3.start_after | string | No | Initial ListObjectsV2 StartAfter key for prefix polling. |
| s3.access_key_id | string | No | AWS access key ID. Falls back to AWS_ACCESS_KEY_ID. |
| s3.secret_access_key | string | No | AWS secret access key. Falls back to AWS_SECRET_ACCESS_KEY. |
| s3.session_token | string | No | AWS session token. Falls back to AWS_SESSION_TOKEN. |
| s3.part_size_bytes | integer | No | Range-GET part size. Defaults to 1048576. |
| s3.max_concurrent_fetches | integer | No | Maximum concurrent range GETs per object. Defaults to 16. |
| s3.max_concurrent_objects | integer | No | Maximum objects fetched at once. Defaults to 4. |
| s3.visibility_timeout_secs | integer | No | SQS visibility timeout. Defaults to 300 and must be at least 30. |
| s3.compression | enum | No | Compression override: auto, gzip, zstd, snappy, or none. Defaults to auto. |
| s3.poll_interval_ms | integer | No | Prefix polling interval in milliseconds. Defaults to 5000. |
When source_metadata is ecs, otel, or vector, S3 exposes the
object key as the source path value for each row. fastforward only attaches
the internal __source_id, and none attaches no source metadata.
```yaml
input:
  type: s3
  format: json
  source_metadata: ecs
  s3:
    bucket: app-logs
    region: us-east-1
    prefix: prod/
```

stdin input
Read data from standard input. This input has no input-specific fields.
It is primarily used by ff send for command-line piping.
```yaml
input:
  type: stdin
  format: auto
```

Supported formats: auto, cri, json, and raw.
generator input
Emit synthetic records for benchmarks, demos, and pipeline tests.
| Field | Type | Required | Description |
|---|---|---|---|
| generator | object | No | Generator settings. If omitted, runtime defaults are used, including batch_size: 1000 and events_per_sec: 0. See Input Types for the detailed nested fields. |
| generator.profile | string | No | Generator profile: logs (default), record, envoy, cri_k8s, wide, narrow, cloud_trail. See Generator profiles. |
```yaml
input:
  type: generator
  generator:
    events_per_sec: 50000
    batch_size: 4096
```

udp input
Listen for log lines on a UDP socket.
| Field | Type | Required | Description |
|---|---|---|---|
| listen | string | Yes | host:port, e.g. 0.0.0.0:514. |
```yaml
input:
  type: udp
  listen: 0.0.0.0:514
  format: json
```

tcp input
Accept log lines on a TCP socket.
| Field | Type | Required | Description |
|---|---|---|---|
| listen | string | Yes | host:port, e.g. 0.0.0.0:5140. |
| tls | object | No | Optional server TLS options (cert_file, key_file, client_ca_file, require_client_auth). client_ca_file is valid only when require_client_auth: true. |
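A TLS-enabled listener with mutual authentication might look like this (the certificate paths are illustrative):

```yaml
input:
  type: tcp
  listen: 0.0.0.0:5140
  format: json
  tls:
    cert_file: /etc/ffwd/tls/server.crt    # illustrative paths
    key_file: /etc/ffwd/tls/server.key
    require_client_auth: true
    client_ca_file: /etc/ffwd/tls/clients-ca.crt  # valid only with require_client_auth: true
```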
```yaml
input:
  type: tcp
  listen: 0.0.0.0:5140
  format: json
```

otlp input
Receive OTLP log records from another agent or SDK.
| Field | Type | Required | Description |
|---|---|---|---|
| listen | string | Yes | host:port, e.g. 0.0.0.0:4318. |
| protobuf_decode_mode | string | No | Experimental protobuf decoder: prost (default), projected_fallback, or projected_only. Projected modes require a build with the otlp-research feature. |
```yaml
input:
  type: otlp
  listen: 0.0.0.0:4318
  protobuf_decode_mode: prost
```

http input
Receive newline-delimited payloads over HTTP POST.
| Field | Type | Required | Description |
|---|---|---|---|
| listen | string | Yes | host:port, e.g. 0.0.0.0:8081. |
| http.path | string | No | Route path. Must start with /. Defaults to /. |
| http.strict_path | boolean | No | When true (default), require an exact path match. |
| http.method | string | No | Accepted method. Defaults to POST. |
| http.max_request_body_size | integer | No | Maximum request body size in bytes. Defaults to 10 MiB. |
| http.max_drained_bytes_per_poll | integer | No | Maximum bytes drained from the internal request queue per poll. Defaults to 1 GiB. |
| http.response_code | integer | No | Success code. One of 200, 201, 202, 204 (default 200). |
| http.response_body | string | No | Optional static success response body. Not allowed when http.response_code: 204. |
```yaml
input:
  type: http
  listen: 0.0.0.0:8081
  format: json
  http:
    path: /ingest
    strict_path: true
    method: POST
    max_request_body_size: 10485760
    max_drained_bytes_per_poll: 1073741824
    response_code: 200
    response_body: '{"ok":true}'
```

linux_ebpf_sensor input
Linux eBPF sensor input for platform-native ingestion. This input is
Arrow-native and does not support format.
| Field | Type | Required | Description |
|---|---|---|---|
| sensor.poll_interval_ms | integer | No | Periodic sample cadence in milliseconds. Must be >= 1. Defaults to 10000. |
| sensor.control_path | string | No | Optional JSON control-plane file path for runtime reload. |
| sensor.control_reload_interval_ms | integer | No | Reload check interval in milliseconds. Must be >= 1. Defaults to 1000. |
| sensor.enabled_families | array[string] | No | Optional enabled signal families for this target (process, file, network, dns, authz on Linux). Omit to use defaults; set [] to disable all families. |
| sensor.emit_signal_rows | boolean | No | Emit periodic per-family sample rows. Defaults to true. |
```yaml
input:
  type: linux_ebpf_sensor
  sensor:
    poll_interval_ms: 2000
```

macos_es_sensor input
macOS EndpointSecurity sensor input. This input is Arrow-native and does not
support format.
| Field | Type | Required | Description |
|---|---|---|---|
| sensor.poll_interval_ms | integer | No | Periodic sample cadence in milliseconds. Must be >= 1. Defaults to 10000. |
| sensor.control_path | string | No | Optional JSON control-plane file path for runtime reload. |
| sensor.control_reload_interval_ms | integer | No | Reload check interval in milliseconds. Must be >= 1. Defaults to 1000. |
| sensor.enabled_families | array[string] | No | Optional enabled signal families (process, file, network, dns, module, authz on macOS). Omit to use defaults; set [] to disable all families. |
| sensor.emit_signal_rows | boolean | No | Emit periodic per-family sample rows. Defaults to true. |
```yaml
input:
  type: macos_es_sensor
```

windows_ebpf_sensor input
Windows eBPF sensor input. This input is Arrow-native and does not support
format.
| Field | Type | Required | Description |
|---|---|---|---|
| sensor.poll_interval_ms | integer | No | Periodic sample cadence in milliseconds. Must be >= 1. Defaults to 10000. |
| sensor.control_path | string | No | Optional JSON control-plane file path for runtime reload. |
| sensor.control_reload_interval_ms | integer | No | Reload check interval in milliseconds. Must be >= 1. Defaults to 1000. |
| sensor.enabled_families | array[string] | No | Optional enabled signal families (process, file, network, dns, module, registry, authz on Windows). Omit to use defaults; set [] to disable all families. |
| sensor.emit_signal_rows | boolean | No | Emit periodic per-family sample rows. Defaults to true. |
```yaml
input:
  type: windows_ebpf_sensor
```

macos_log input
Read macOS unified log entries by running log stream. This input is only
available on macOS and emits structured rows parsed from the command output.
| Field | Type | Required | Description |
|---|---|---|---|
| macos_log.level | string | No | Optional log level filter. Must not be empty when set. |
| macos_log.subsystem | string | No | Optional subsystem filter. Must not be empty when set. |
| macos_log.process | string | No | Optional process filter. Must not be empty when set. |
```yaml
pipelines:
  default:
    inputs:
      - type: macos_log
        macos_log:
          level: info
          subsystem: com.example.app
    outputs:
      - type: stdout
        format: json
```

journald input
Read structured entries from the systemd journal using either the native sd_journal
C API or a journalctl subprocess.
| Field | Type | Required | Description |
|---|---|---|---|
| journald.include_units | list | No | Systemd units to include. The .service suffix is appended automatically when omitted. |
| journald.exclude_units | list | No | Systemd units to exclude. |
| journald.identifiers | list | No | Syslog identifiers (SYSLOG_IDENTIFIER=) to include. |
| journald.priorities | list | No | Priority/log levels to include (e.g. 0, 3, info, err). |
| journald.cursor_path | string | No | Path to persist the cursor for resume after restarts. |
| journald.include_boot_id | bool | No | Include the _BOOT_ID field (default: false). |
| journald.current_boot_only | bool | No | Only include entries from the current boot (default: true). |
| journald.since_now | bool | No | Only include entries appended after start (default: false). |
| journald.journalctl_path | string | No | Path to the journalctl binary. Defaults to journalctl on PATH. |
| journald.journal_directory | string | No | Custom journal directory (--directory=<path>). |
| journald.journal_namespace | string | No | Journal namespace (--namespace=<ns>). |
| journald.backend | enum | No | auto (default), native (require the sd_journal API), or subprocess (always use journalctl). |
```yaml
pipelines:
  default:
    inputs:
      - type: journald
        journald:
          include_units:
            - nginx
            - redis
          priorities:
            - err
            - warning
          cursor_path: /var/lib/ffwd/journald.cursor
```

host_metrics input
Host metrics input that collects process snapshots, CPU, memory, and network
statistics via sysinfo. This input is Arrow-native and does not support
format. The OS-specific implementation is selected at compile time based on
the build target.
| Field | Type | Required | Description |
|---|---|---|---|
| sensor.poll_interval_ms | integer | No | Periodic sample cadence in milliseconds. Must be >= 1. Defaults to 10000. |
| sensor.control_path | string | No | Optional JSON control-plane file path for runtime reload. |
| sensor.control_reload_interval_ms | integer | No | Reload check interval in milliseconds. Must be >= 1. Defaults to 1000. |
| sensor.enabled_families | array[string] | No | Optional enabled signal families. Omit to use defaults; set [] to disable all families. |
| sensor.emit_signal_rows | boolean | No | Emit periodic per-family sample rows. Defaults to true. |
| sensor.max_rows_per_poll | integer | No | Upper bound on data rows returned per collection cycle. Defaults to 256. Set to 0 or omit for the default. |
| sensor.max_process_rows_per_poll | integer | No | Upper bound on process snapshot rows returned per collection cycle. Defaults to 1024. Set to 0 or omit for the default. |
| sensor.scrapers | array[string] | No | List of scrapers to run. Supported values are: cpu, memory, disk, network, filesystem. |
| sensor.collection_interval_ms | integer | No | Metrics collection cadence in milliseconds. Defaults to 10000. |
| sensor.disk_include_devices | array[string] | No | Optional list of disk devices to include in scraping. |
| sensor.disk_exclude_devices | array[string] | No | Optional list of disk devices to exclude from scraping. |
| sensor.network_include_interfaces | array[string] | No | Optional list of network interfaces to include in scraping. |
| sensor.network_exclude_interfaces | array[string] | No | Optional list of network interfaces to exclude from scraping. |
| sensor.filesystem_include_mount_points | array[string] | No | Optional list of filesystem mount points to include in scraping. |
| sensor.filesystem_exclude_mount_points | array[string] | No | Optional list of filesystem mount points to exclude from scraping. |
```yaml
input:
  type: host_metrics
  sensor:
    poll_interval_ms: 5000
    collection_interval_ms: 5000
    scrapers: ["cpu", "memory"]
```

arrow_ipc input
Receive Arrow IPC stream payloads over HTTP POST and forward decoded
RecordBatch values directly into the pipeline (scanner bypass).
| Field | Type | Required | Description |
|---|---|---|---|
| listen | string | Yes | host:port, e.g. 0.0.0.0:4319. |
Behavior:
- The route is fixed to POST /v1/arrow for MVP. arrow_ipc is Arrow-native and rejects format.
- Canonical payload types are application/vnd.apache.arrow.stream and application/vnd.apache.arrow.stream+zstd. Content-Encoding: zstd is also supported for compressed Arrow stream payloads.
- The receiver currently decodes by payload bytes and may still accept requests with missing or other content-type headers; use the canonical content types for predictable interoperability.
Input types
| Value | Status | Description |
|---|---|---|
| file | Implemented | Tail files matching a glob pattern. |
| s3 | Implemented | Read objects from AWS S3 or an S3-compatible endpoint. |
| stdin | Implemented | Read piped stdin until EOF, then drain outputs and exit. |
| generator | Implemented | Emit synthetic JSON-like records from an in-process source. |
| udp | Implemented | Receive log lines over UDP. |
| tcp | Implemented | Accept log lines over TCP. |
| otlp | Implemented | Receive OTLP logs over a bound listen address. |
| http | Implemented | Receive newline-delimited payloads via HTTP POST. |
| linux_ebpf_sensor | Implemented | Linux eBPF sensor input (Arrow-native control + signal rows). |
| macos_es_sensor | Implemented | macOS EndpointSecurity sensor input (Arrow-native control + signal rows). |
| macos_log | Implemented | Read macOS unified log entries from the log stream command. |
| windows_ebpf_sensor | Implemented | Windows eBPF sensor input (Arrow-native control + signal rows). |
| journald | Beta | Read structured journal entries from systemd journald. |
| host_metrics | Implemented | Host metrics input — process snapshots, CPU, memory, network stats via sysinfo (Arrow-native). |
| arrow_ipc | Implemented | Receive Arrow IPC stream batches via HTTP POST /v1/arrow. |
Formats
The format field controls how raw bytes from the input are parsed into log records.
linux_ebpf_sensor, macos_es_sensor, windows_ebpf_sensor, and arrow_ipc are
Arrow-native and reject format.
| Value | Description |
|---|---|
| auto | Auto-detect (default). Tries CRI first, then JSON, then raw. |
| cri | CRI container log format (<timestamp> <stream> <flags> <message>). Multi-line log reassembly via the P partial flag is supported. |
| json | Newline-delimited JSON. Each line must be a single JSON object. |
| raw | Treat each line as an opaque string stored in body. |
| logfmt | Key=value pairs (e.g. level=info msg="hello"). Not yet implemented. |
| console | Human-readable coloured output for interactive debugging. Output mode only. |
Output configuration
Each pipeline requires at least one output.
Common fields
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Output type. See Output types. |
| name | string | No | Friendly name shown in diagnostics. |
otlp output
Send log records as OTLP protobuf to an OpenTelemetry collector.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| endpoint | string | Yes | — | Full collector URL, e.g. http://otel-collector:4317 (gRPC) or http://otel-collector:4318/v1/logs (HTTP). |
| protocol | enum | No | http | http or grpc. Invalid values are rejected while parsing config. |
| compression | enum | No | none | zstd, gzip, or none for the request body. Invalid values are rejected while parsing config. |
| auth | object | No | — | Optional bearer token or custom headers for HTTP auth. |
| tls | object | No | — | Optional TLS client options (ca_file, cert_file, key_file, insecure_skip_verify) for HTTPS endpoints. |
| headers | map[string,string] | No | — | Additional static HTTP headers to send with every export request. |
| retry_attempts | integer | No | — | Maximum export retry attempts. |
| retry_initial_backoff_ms | integer | No | — | Initial backoff delay in milliseconds. |
| retry_max_backoff_ms | integer | No | — | Maximum backoff delay in milliseconds. |
| request_timeout_ms | integer | No | — | Export request timeout in milliseconds. |
| batch_size | integer | No | — | Maximum rows per OTLP request. |
| batch_timeout_ms | integer | No | — | Maximum time to buffer rows before exporting. |
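The retry, timeout, and batching fields above can be combined in one output; the values below are illustrative, not recommendations:

```yaml
output:
  type: otlp
  endpoint: http://otel-collector:4318/v1/logs
  protocol: http
  compression: gzip
  headers:
    x-team: platform          # illustrative static header
  retry_attempts: 5
  retry_initial_backoff_ms: 200
  retry_max_backoff_ms: 5000
  request_timeout_ms: 10000
  batch_size: 8192
  batch_timeout_ms: 1000
```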
```yaml
pipelines:
  default:
    inputs:
      - type: stdin
        format: json
    outputs:
      - type: otlp
        endpoint: http://otel-collector:4317
        protocol: grpc
        compression: zstd
```

http output
Send newline-delimited JSON rows to an HTTP endpoint with optional request-body compression and auth headers.
| Field | Type | Required | Description |
|---|---|---|---|
| endpoint | string | Yes | Full URL, e.g. http://ingest.example.com/logs. |
| format | enum | No | Must be json when set. |
| compression | enum | No | zstd, gzip, or none. |
| auth | object | No | Optional bearer token or custom headers for HTTP auth. |
```yaml
pipelines:
  default:
    inputs:
      - type: stdin
        format: json
    outputs:
      - type: http
        endpoint: https://ingest.example.com/logs
        compression: zstd
```

stdout output
Print records to standard output for local debugging.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| format | string | No | console | json (newline-delimited JSON), console (coloured text), or text (raw text). |
```yaml
output:
  type: stdout
  format: console
```

elasticsearch output
Ship to Elasticsearch via the Bulk API.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| endpoint | string | Yes | — | Elasticsearch base URL. |
| index | string | No | logs | Target index name. Must not be empty, and must not contain Elasticsearch-reserved characters or prefixes. |
| compression | enum | No | none | gzip or none. zstd is rejected for Elasticsearch by validation. |
| request_mode | enum | No | buffered | buffered or streaming. Invalid values are rejected while parsing config; streaming currently requires compression: none. |
| request_timeout_ms | integer | No | 30000 | HTTP request timeout in milliseconds. Must be >= 1 when set. |
| tls | object | No | — | Optional TLS client options (ca_file, cert_file, key_file, insecure_skip_verify) for HTTPS endpoints. |
| auth | object | No | — | Optional bearer token or custom headers for HTTP auth. |
| retry | object | No | — | Optional retry configuration (max_attempts, initial_backoff_secs, max_backoff_secs). |
| batch | object | No | — | Optional batching configuration (max_bytes, max_events, timeout_secs). |
Bulk payloads are split before they exceed 5242880 bytes (5 MiB). That limit is internal and is not a YAML field.
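The nested retry and batch objects use the field names listed in the table above; the values here are illustrative:

```yaml
output:
  type: elasticsearch
  endpoint: https://es-cluster:9200
  index: logs
  retry:
    max_attempts: 5
    initial_backoff_secs: 1
    max_backoff_secs: 30
  batch:
    max_bytes: 1048576
    max_events: 5000
    timeout_secs: 5
```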
```yaml
output:
  type: elasticsearch
  endpoint: https://es-cluster:9200
  index: logs
  compression: gzip
  request_mode: buffered
  auth:
    bearer_token: "${ES_TOKEN}"
```

loki output
Push to Grafana Loki.
| Field | Type | Required | Description |
|---|---|---|---|
| endpoint | string | Yes | Loki base URL. The /loki/api/v1/push path is appended automatically. |
| request_timeout_ms | integer | No | HTTP request timeout in milliseconds (default: 30000). Must be >= 1 when set. |
| tls | object | No | Optional TLS client options (ca_file, cert_file, key_file, insecure_skip_verify) for HTTPS endpoints. |
| tenant_id | string | No | Optional value sent as X-Scope-OrgID for multi-tenant Loki deployments. |
| static_labels | map[string,string] | No | Static labels applied to every pushed log stream. Keys and values must be non-empty. |
| label_columns | array[string] | No | Additional log columns to promote as Loki labels. |
| auth | object | No | Optional bearer token or custom headers for HTTP auth. |
| retry | object | No | Optional retry configuration (max_attempts, initial_backoff_secs, max_backoff_secs). |
| batch | object | No | Optional batching configuration (max_bytes, max_events, timeout_secs). |
```yaml
output:
  type: loki
  endpoint: http://loki:3100
  tenant_id: team-a
  static_labels:
    app: ffwd
    env: prod
  label_columns: [service, level]
  auth:
    bearer_token: "${LOKI_TOKEN}"
```

compression is not supported for Loki outputs.
file output
Write records to a file.
| Field | Type | Required | Description |
|---|---|---|---|
| path | string | Yes | Destination file path. Parent directory must already exist and be writable. |
| format | string | No | json for NDJSON output, or text to write raw lines. |
```yaml
output:
  type: file
  path: /var/log/ffwd/capture.ndjson
  format: json
```

tcp output
Send newline-delimited JSON records to a TCP endpoint.
| Field | Type | Required | Description |
|---|---|---|---|
| endpoint | string | Yes | Host:port destination (for example tcp.example.com:9000). |
TCP output currently emits newline-delimited JSON. encoding, framing, tls,
keepalive, timeout_secs, retry, and batch are rejected until the sink
runtime implements them.
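A minimal TCP output looks like this (the destination host is illustrative):

```yaml
output:
  type: tcp
  endpoint: tcp.example.com:9000
```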
udp output
Send newline-delimited JSON records as UDP datagrams.
| Field | Type | Required | Description |
|---|---|---|---|
| endpoint | string | Yes | Host:port destination (for example udp.example.com:514). |
UDP output currently emits newline-delimited JSON using the built-in datagram
size. encoding and max_datagram_size_bytes are rejected until the sink
runtime implements them.
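A minimal UDP output looks like this (the destination host is illustrative):

```yaml
output:
  type: udp
  endpoint: udp.example.com:514
```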
null output
Drop records intentionally for tests and benchmark baselines. The type value must
be quoted as a YAML string; unquoted type: null is YAML’s null value and is
rejected. Null outputs do not accept sink-specific fields such as endpoint,
format, auth, tls, retry, or batch controls.
```yaml
output:
  type: "null"
```

Output types
| Value | Status | Description |
|---|---|---|
| otlp | Implemented | OTLP protobuf over HTTP or gRPC. |
| http | Implemented | POST newline-delimited JSON rows to an HTTP endpoint. |
| stdout | Implemented | Print to stdout (JSON, console, or text). |
| elasticsearch | Implemented | Elasticsearch Bulk API with index/compression/request-mode controls. |
| loki | Implemented | Grafana Loki push API with label grouping. |
| file | Implemented | Write NDJSON or text to a local file. |
| null | Implemented | Drop records intentionally for tests and benchmark baselines. |
| tcp | Implemented | Send records to a TCP endpoint. |
| udp | Implemented | Send records to a UDP endpoint. |
| arrow_ipc | Implemented | Send Arrow IPC payloads to an HTTP endpoint. |
SQL transform
The optional transform field contains a DataFusion SQL query that is applied to every
Arrow RecordBatch produced by the scanner. The source table is always named logs.
```yaml
transform: SELECT level, message, status FROM logs WHERE status >= 400
```

Multi-line SQL is supported with YAML block scalars:
```yaml
transform: |
  SELECT
    level,
    message,
    regexp_extract(message, 'request_id=([a-f0-9-]+)', 1) AS request_id,
    status
  FROM logs
  WHERE level IN ('ERROR', 'WARN') AND status >= 400
```

Column naming convention
The scanner maps each JSON field to a typed Arrow column using the field’s base name (no type suffix):
| JSON value type | Arrow column type | Column name | Example |
|---|---|---|---|
| String | StringArray | {field} | level |
| Integer | Int64Array | {field} | status |
| Float | Float64Array | {field} | latency_ms |
| Boolean | StringArray ("true"/"false") | {field} | enabled |
| Null | null in column | {field} | — |
| Object / Array | StringArray (raw JSON) | {field} | metadata |
When a field contains mixed types across rows, the scanner emits a single
Struct column under the field’s base name containing one child per observed
type (e.g., a status Struct with int and str children). Legacy
single-underscore suffixed columns (status_int, level_str) are not emitted.
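Assuming DataFusion's bracket-style struct field access, the int child of a mixed-type status column could be selected like this (a sketch, not syntax confirmed for this runtime):

```yaml
transform: |
  SELECT status['int'] AS status_int
  FROM logs
```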
Special columns attached by the runtime after scan, plus format-derived columns:
| Column | Type | Description |
|---|---|---|
| body | string | Original input line (when input line capture is enabled, e.g. line_field: body, or when a non-JSON CRI line is wrapped for scanner safety). |
| __source_id | uint64 | FastForward internal row-level source identity when source_metadata: fastforward is set. SQL can reference it, but user-facing sinks drop this internal column unless SQL aliases it to a public name. |
| file.path | string | ECS-style source file path when source_metadata: ecs is set. Quote it in SQL as "file.path". |
| log.file.path | string | OpenTelemetry-style source file path when source_metadata: otel is set. Quote it in SQL as "log.file.path". |
| file | string | Vector-style source file path when source_metadata: vector is set. |
| _timestamp | string | Timestamp from the CRI header as an RFC 3339 string (CRI inputs only). |
| _stream | string | CRI stream name (stdout / stderr). |
Source metadata is never written into raw input bytes. It is carried beside
scanner-ready chunks and, when source_metadata is not none, materialized as
table columns before SQL runs. SQL does no hidden pruning or widening: SELECT *
returns the columns that exist in the table. User-facing sinks drop known
FastForward internal columns such as __source_id by default; alias an
internal column to a public name when it should be emitted. User payload fields
that happen to start with __ are not treated as internal. Public source path
styles currently require inputs that expose source path snapshots. File inputs
use filesystem paths; S3 inputs use object keys. Use fastforward for source
identity on inputs that do not expose public source descriptors.
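For example, to keep source identity in the emitted rows, alias the internal column to a public name in the transform (a sketch; the alias name is arbitrary):

```yaml
transform: |
  SELECT *, __source_id AS source_id
  FROM logs
```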
Built-in UDFs
| Function | Signature | Description |
|---|---|---|
| int(expr) | int(any) → int64 | Cast any value to int64. Returns NULL on failure. |
| float(expr) | float(any) → float64 | Cast any value to float64. Returns NULL on failure. |
| grok(input, pattern) | grok(utf8, utf8) → Struct | Apply a Grok pattern to input and return the captures as a struct. |
| regexp_extract(input, pattern, group) | regexp_extract(utf8, utf8, int64) → utf8 | Return capture group group from a regex match. |
Examples:
```sql
-- Cast a string column to int
SELECT int(status) AS status FROM logs

-- Extract a field with Grok
SELECT grok(message, '%{IP:client} %{WORD:method} %{URIPATHPARAM:path}') AS parsed FROM logs

-- Extract a named group with regex
SELECT regexp_extract(message, 'user=([a-z]+)', 1) AS user FROM logs

-- Type-cast from environment-injected string
SELECT float(duration) AS duration_ms FROM logs
```

Enrichment tables
Enrichment tables are one-row (or multi-row) Arrow tables registered in DataFusion
alongside the logs table. Use CROSS JOIN for one-row tables or LEFT JOIN for
multi-row lookup tables.
```yaml
enrichment:
  - type: host_info
  - type: process_info
  - type: network_info
  - type: container_info
  - type: k8s_cluster_info
  - type: k8s_path
  - type: static
    table_name: labels
    labels:
      environment: production
      region: us-east-1
  - type: kv_file
    table_name: os_release
    path: /etc/os-release
  - type: env_vars
    table_name: deploy_meta
    prefix: FFWD_META_
  - type: csv
    table_name: assets
    path: /etc/ffwd/assets.csv
  - type: jsonl
    table_name: ip_owners
    path: /etc/ffwd/ip-owners.jsonl
  - type: geo_database
    format: mmdb
    path: /data/GeoLite2-City.mmdb
```

host_info enrichment
System host metadata, resolved once at startup. Fixed table name: host_info.
Extended fields are sourced from /etc/os-release, /proc/sys/kernel/osrelease,
/etc/machine-id, and /proc/sys/kernel/random/boot_id on Linux; they degrade
to empty strings on other platforms.
| Field | Description |
|---|---|
| style | Column naming convention: raw (default), ecs / beats, or otel. |
```yaml
enrichment:
  - type: host_info
    style: ecs   # Use ECS/Beats dotted column names
```

Column names by style:
| Semantic | raw (default) | ecs / beats | otel |
|---|---|---|---|
| hostname | hostname | host.hostname | host.name |
| OS type | os_type | host.os.type | os.type |
| architecture | os_arch | host.architecture | host.arch |
| OS name | os_name | host.os.name | os.name |
| OS family | os_family | host.os.family | os.family |
| OS version | os_version | host.os.version | os.version |
| kernel | os_kernel | host.os.kernel | os.kernel |
| machine id | host_id | host.id | host.id |
| boot id | boot_id | host.boot.id | host.boot.id |
```sql
SELECT l.*, h.hostname, h.os_type, h.os_name, h.os_kernel
FROM logs l CROSS JOIN host_info h
```

process_info enrichment
Agent self-metadata, resolved once at startup. Fixed table name: process_info.
| Column | Description |
|---|---|
| agent_name | Always ffwd. |
| agent_version | Semantic version of the running binary. |
| pid | Process ID (as string). |
| start_time | ISO 8601 UTC timestamp captured when the process_info enrichment table is constructed during pipeline startup. |
```sql
SELECT l.*, p.agent_version, p.start_time
FROM logs l CROSS JOIN process_info p
```

network_info enrichment
Network interface metadata from procfs, resolved once at startup. Fixed table name: network_info.
| Column | Description |
|---|---|
| hostname | System hostname. |
| primary_ipv4 | Lexicographically first non-loopback IPv4 address, or empty. On multihomed hosts this may not match the default-route interface; use all_ipv4 for full coverage. |
| primary_ipv6 | Lexicographically first non-loopback, non-link-local IPv6 address, or empty. Same caveat as primary_ipv4. |
| all_ipv4 | Comma-separated list of all non-loopback IPv4 addresses. |
| all_ipv6 | Comma-separated list of all non-loopback, non-link-local IPv6 addresses. |
```sql
SELECT l.*, n.primary_ipv4
FROM logs l CROSS JOIN network_info n
```

container_info enrichment

Container runtime detection from /proc/self/cgroup and /.dockerenv, resolved once
at startup. Fixed table name: container_info.
| Column | Description |
|---|---|
| container_id | 64-character hex container ID, or empty if not in a container. |
| container_runtime | docker, containerd, cri-o, kubernetes, unknown, or empty. |
```sql
SELECT l.*, c.container_id, c.container_runtime
FROM logs l CROSS JOIN container_info c
```

k8s_cluster_info enrichment

Kubernetes cluster metadata from the downward API environment variables, resolved
once at startup. Fixed table name: k8s_cluster_info.
Populate these via fieldRef in your DaemonSet pod spec:
```yaml
env:
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: K8S_CLUSTER_NAME
    value: my-cluster  # or from a ConfigMap
  - name: K8S_SERVICE_ACCOUNT
    valueFrom:
      fieldRef:
        fieldPath: spec.serviceAccountName
```

| Column | Description |
|---|---|
| namespace | Pod namespace (from mounted service account path). |
| pod_name | Pod name (from HOSTNAME). |
| node_name | Node name (from K8S_NODE_NAME or NODE_NAME). |
| service_account | Service account name (from K8S_SERVICE_ACCOUNT or SERVICE_ACCOUNT). |
| cluster_name | Cluster name (from K8S_CLUSTER_NAME or CLUSTER_NAME). |
```sql
SELECT l.*, k.namespace, k.node_name, k.cluster_name
FROM logs l CROSS JOIN k8s_cluster_info k
```

k8s_path enrichment

Parses Kubernetes pod log paths (e.g.
/var/log/pods/<namespace>_<pod>_<uid>/<container>/) to extract metadata.
Queries that join a source path to this table should set source_metadata: ecs
or another public path style on the file input and quote dotted column names in
SQL, for example "file.path". Source paths are never written into raw log
bytes.
Columns exposed by the enrichment table (named k8s_pods by default; set table_name: k8s in config to use the shorter k8s alias):
| Column | Description |
|---|---|
log_path_prefix | Directory prefix used as join key. |
namespace | Kubernetes namespace. |
pod_name | Pod name. |
pod_uid | Pod UID. |
container_name | Container name. |
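A hedged join sketch, assuming the default k8s_pods table name, a file input with source_metadata: ecs, and a starts_with-style prefix predicate (the exact predicate available depends on the SQL engine):

```sql
SELECT l.*, k.namespace, k.pod_name, k.container_name
FROM logs l
LEFT JOIN k8s_pods k
  ON starts_with(l."file.path", k.log_path_prefix)
```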
static enrichment
A one-row table with user-defined label columns from the YAML config.
```yaml
enrichment:
  - type: static
    table_name: labels
    labels:
      environment: production
      cluster: us-east-1
      tier: backend
```

```sql
SELECT l.*, lbl.environment, lbl.cluster
FROM logs l CROSS JOIN labels lbl
```

env_vars enrichment

A one-row table populated from environment variables matching a name prefix. The prefix is stripped and the remainder lower-cased to form column names.
```yaml
enrichment:
  - type: env_vars
    table_name: deploy_meta
    prefix: FFWD_META_
```

With FFWD_META_CLUSTER=prod and FFWD_META_REGION=us-east-1 set, the table
exposes cluster and region columns.

```sql
SELECT l.*, m.cluster, m.region
FROM logs l CROSS JOIN deploy_meta m
```

kv_file enrichment

A one-row table parsed from a KEY=value properties file. Supports unquoted,
double-quoted, and single-quoted values. Lines starting with # are comments.
Column names are keys lower-cased.
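For illustration, a fragment in the style of /etc/os-release:

```
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
VERSION_ID="12"
# lines starting with # are ignored
```

would yield pretty_name and version_id columns.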
```yaml
enrichment:
  - type: kv_file
    table_name: os_release
    path: /etc/os-release
    refresh_interval: 3600  # optional: reload every N seconds; must be >= 1 when set
```

```sql
SELECT l.*, os.pretty_name, os.version_id
FROM logs l CROSS JOIN os_release os
```

Useful for /etc/os-release, .env files, or ConfigMap-mounted metadata files.
csv enrichment
A multi-row lookup table loaded from a CSV file. All columns are UTF-8 strings
and are materialized internally as Arrow Utf8View columns for SQL execution.
The first row must be column headers. Empty cells are empty strings; missing
trailing cells are NULL.
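An illustrative assets.csv with hostname, owner, and team columns (hypothetical values):

```
hostname,owner,team
web-01,alice,platform
web-02,,sre
db-01,carol
```

Here web-02's empty owner cell loads as an empty string, while db-01's missing trailing team cell loads as NULL.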
```yaml
enrichment:
  - type: csv
    table_name: assets
    path: /etc/ffwd/assets.csv
    refresh_interval: 3600  # optional: reload every N seconds; must be >= 1 when set
```

```sql
SELECT l.*, a.owner, a.team
FROM logs l LEFT JOIN assets a ON l.hostname = a.hostname
```

jsonl enrichment

A multi-row lookup table loaded from a JSON Lines file (one JSON object per line). Columns are the union of all keys across all rows.
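An illustrative ip-owners.jsonl (hypothetical values):

```
{"ip": "10.0.0.12", "owner": "payments"}
{"ip": "10.0.0.13", "owner": "checkout", "team": "sre"}
```

The resulting table has ip, owner, and team columns — the union of keys across both rows.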
```yaml
enrichment:
  - type: jsonl
    table_name: ip_owners
    path: /etc/ffwd/ip-owners.jsonl
    refresh_interval: 1800  # optional: reload every N seconds; must be >= 1 when set
```

```sql
SELECT l.*, ipl.owner
FROM logs l LEFT JOIN ip_owners ipl ON l.client_ip = ipl.ip
```

geo_database enrichment

Registers a GeoIP database for use with the geo_lookup() SQL function.
Supports MaxMind MMDB and CSV IP-range formats.
```yaml
# MaxMind MMDB format
enrichment:
  - type: geo_database
    format: mmdb
    path: /data/GeoLite2-City.mmdb
    refresh_interval: 86400  # optional: reload daily; must be >= 1 when set
```

```yaml
# CSV IP-range format (DB-IP Lite compatible)
enrichment:
  - type: geo_database
    format: csv_range
    path: /data/dbip-city-lite.csv
```

```sql
SELECT l.*,
       geo_lookup(l.client_ip).country_code AS country,
       geo_lookup(l.client_ip).city AS city,
       geo_lookup(l.client_ip).latitude AS lat,
       geo_lookup(l.client_ip).longitude AS lon
FROM logs l
```

The geo_lookup() function returns a struct with these fields:
| Field | Type | Description |
|---|---|---|
| country_code | string | ISO 3166-1 two-letter code (e.g. US). |
| country_name | string | Full English country name. |
| city | string | City name. |
| region | string | State or subdivision name. |
| latitude | float | Decimal degrees. |
| longitude | float | Decimal degrees. |
| asn | integer | Autonomous System Number. |
| org | string | Organization name for the ASN. |
Server configuration
The optional server block controls the diagnostics server and observability settings.
| Field | Type | Default | Description |
|---|---|---|---|
| diagnostics | string | none | host:port to listen on for HTTP diagnostics. See Diagnostics API. |
| log_level | string | info | Log verbosity. One of error, warn, info, debug, trace. |
| metrics_endpoint | string | none | OTLP endpoint for periodic metrics push, e.g. http://otel-collector:4318. |
| metrics_interval_secs | integer | 60 | Push interval for OTLP metrics in seconds. |
```yaml
server:
  diagnostics: 0.0.0.0:9090
  log_level: info
  metrics_endpoint: http://otel-collector:4318
  metrics_interval_secs: 30
```

Diagnostics API

When server.diagnostics is configured, FastForward exposes an HTTP API for monitoring and troubleshooting.
| Route | Method | Description |
|---|---|---|
| / | GET | Dashboard HTML (visual explorer for metrics and traces). |
| /live | GET | Liveness probe. Returns 200 OK if the process and control plane are running. |
| /ready | GET | Readiness probe. Returns 200 OK when required components are initialized and in a ready health state; returns 503 while components are still starting, stopping, stopped, failed, or otherwise not ready. |
| /admin/v1/status | GET | Canonical rich status payload with live/ready state, component health, and per-pipeline counters. |
| /admin/v1/stats | GET | Aggregate process stats (uptime, RSS, CPU, aggregate line counts). |
| /admin/v1/config | GET | Currently loaded YAML configuration and its file path (disabled by default; enable with FFWD_UNSAFE_EXPOSE_CONFIG=1). May expose sensitive values; do not enable in shared or production environments unless strictly required. |
| /admin/v1/logs | GET | Recent log lines from FastForward’s own stderr (ring buffer). |
| /admin/v1/history | GET | Time-series data (1-hour window) for dashboard charts. |
| /admin/v1/traces | GET | Recent batch processing spans for detailed latency analysis. |
For input diagnostics, bytes_total reflects source payload bytes accepted at
the input boundary. For structured receivers such as OTLP, this is the
accepted request-body size as received on the wire, not the in-memory Arrow
batch footprint or the post-decompression payload size.
Storage configuration
The optional storage block controls where FastForward persists state (checkpoints, disk
queue).
| Field | Type | Default | Description |
|---|---|---|---|
| data_dir | string | none | Directory for state files. Created if it does not exist. |
```yaml
storage:
  data_dir: /var/lib/ffwd
```

Environment variable substitution

Any value in the config file can reference an environment variable with ${VAR}.
Variable names must start with an ASCII letter or _, and then contain only
ASCII letters, digits, or _. $VAR stays literal, and default expressions
such as ${VAR:fallback} are rejected because : is not valid in variable
names.
```yaml
output:
  type: otlp
  endpoint: ${OTEL_COLLECTOR_ADDR}

server:
  metrics_endpoint: ${METRICS_PUSH_URL}
```

If the variable is not set, config loading fails fast with a validation error.
An unterminated reference such as ${VAR is preserved literally so existing
config text is not rewritten accidentally; completed placeholders before that
literal tail are still expanded.
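A sketch of that behavior with illustrative values, assuming PORT is set to 9090:

```yaml
# "${PORT}" is a completed placeholder and expands to 9090;
# "${HOST" has no closing brace, so that tail is kept literally,
# leaving the value as "9090-${HOST"
endpoint_suffix: "${PORT}-${HOST"
```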
Environment variables are expanded as string data. The typed config schema then parses those strings into the field type, so numeric and boolean fields can read values directly from the environment without treating the env value as YAML:
```yaml
pipelines:
  app:
    workers: ${FFWD_WORKERS}
```

String fields remain strings even when the expanded value looks like a YAML number, boolean, or null:

```yaml
input:
  type: file
  path: ${LOG_PATH}
```

Placeholders embedded inside longer strings always remain strings:

```yaml
output:
  type: file
  path: "/var/log/${SERVICE_NAME}.jsonl"
```

Environment variables can also appear in mapping keys. If expansion produces duplicate keys, config loading fails.
Complete example
```yaml
pipelines:
  app:
    inputs:
      - name: pod_logs
        type: file
        path: /var/log/pods/**/*.log
        format: cri
    transform: |
      SELECT l.level, l.message, l.status, lbl.environment
      FROM logs l CROSS JOIN labels lbl
      WHERE l.level IN ('ERROR', 'WARN') OR l.status >= 500
    outputs:
      - name: collector
        type: otlp
        endpoint: ${OTEL_ENDPOINT}
        protocol: grpc
        compression: zstd
      - name: debug
        type: stdout
        format: console
    enrichment:
      - type: host_info
      - type: process_info
      - type: network_info
      - type: container_info
      - type: k8s_cluster_info
      - type: static
        table_name: labels
        labels:
          environment: ${ENVIRONMENT}
          cluster: ${CLUSTER_NAME}

server:
  diagnostics: 0.0.0.0:9090
  log_level: info
  metrics_endpoint: ${OTEL_ENDPOINT}
  metrics_interval_secs: 60

storage:
  data_dir: /var/lib/ffwd
```