Observability & Alerting

The homelab has a complete observability stack (SigNoz v0.113, OTel, Linkerd) with a fully operational alerting pipeline. 22 alert rules monitor infrastructure health, application availability, and GitOps state. Alerts are synced to SigNoz via the signoz-dashboard-sidecar and notify through Incident.io.

Current State

What works: 22 alert rules are synced, evaluating, and generating alert history in SigNoz v0.113. The signoz-dashboard-sidecar reconciles alert ConfigMaps every 5 minutes, creating or updating rules via the SigNoz API. Incident.io notification channel (incidentio) is configured via webhook, with Bearer token auth sourced from 1Password.

Alert inventory:

11 HTTPCheck alerts — monitor service availability via httpcheck.status metric
7 Kubernetes health alerts — node conditions (disk/memory/PID pressure, readiness), pod health (OOMKilled, pending, restart rate)
4 ArgoCD app state alerts — degraded, missing, out-of-sync, suspended

Remaining gaps:

~17 services lack HTTPCheck alerts (#445, #444)
No dead man's switch for the httpcheck receiver itself
Sidecar's own logs are not collected by the OTel collector (uses slog to stdout but not instrumented)

Alert ConfigMap Format (SigNoz v0.113)

All alerts are Kubernetes ConfigMaps with the signoz.io/alert: "true" label, discovered and synced by the signoz-dashboard-sidecar. The alert JSON lives in data.alert.json.

Required Fields

json

{
  "alert": "Alert Name",
  "alertType": "METRICS_BASED_ALERT",
  "ruleType": "threshold_rule",
  "version": "v5",
  "broadcastToAll": false,
  "disabled": false,
  "evalWindow": "10m0s",
  "frequency": "2m0s",
  "severity": "critical",
  "labels": { ... },
  "annotations": { "summary": "...", "description": "..." },
  "condition": { ... },
  "preferredChannels": ["incidentio"]
}

Query Format (v5)

SigNoz v0.113 uses the v5 query builder format with a queries array (not the older builderQueries map):

json

"compositeQuery": {
  "queries": [
    {
      "type": "builder_query",
      "spec": {
        "name": "A",
        "signal": "metrics",
        "stepInterval": 60,
        "aggregations": [
          {
            "timeAggregation": "avg",
            "spaceAggregation": "avg",
            "metricName": "httpcheck.status"
          }
        ],
        "filter": {
          "expression": "http.url = 'https://example.jomcgi.dev'"
        },
        "groupBy": [],
        "order": [],
        "disabled": false
      }
    }
  ],
  "panelType": "graph",
  "queryType": "builder"
}

Key differences from older SigNoz versions:

queries is an array of {type, spec} objects (not a builderQueries map of {queryName, dataSource, ...})
Aggregation uses timeAggregation/spaceAggregation/metricName (not aggregateOperator/aggregateAttribute)
Filters use expression string syntax (not items array with key/op/value objects)

Threshold Configuration (CRITICAL)

Thresholds must be defined in two places on the condition object — both are required:

json

"condition": {
  "compositeQuery": { ... },
  "selectedQueryName": "A",
  "op": "2",
  "target": 1,
  "matchType": "5",
  "targetUnit": "",
  "thresholds": {
    "kind": "basic",
    "spec": [
      {
        "name": "critical",
        "target": 1,
        "targetUnit": "",
        "matchType": "5",
        "op": "2",
        "channels": ["incidentio"]
      }
    ]
  }
}

Why both are required: SigNoz v0.113's PostableRule.processRuleDefaults() defaults the internal schemaVersion to "v1" (this is separate from the top-level "version": "v5" which controls query format). For v1 schema, the threshold evaluation reads from the legacy condition-level fields (op, target, matchType, targetUnit), not from the thresholds block. Without the legacy fields, BasicRuleThreshold.TargetValue (*float64) is nil, causing a panic during evaluation when a query returns data.

The thresholds block is still needed for the SigNoz UI to render threshold configuration correctly.

Comparison Operators (`op`)

Value	Meaning	Use case
`"1"`	Greater than	Restart count > N
`"2"`	Less than	httpcheck.status < 1
`"3"`	Equal to	Pod phase == Pending
`"4"`	Not equal to	Status != expected

Match Types (`matchType`)

Value	Meaning	Use case
`"1"`	Once in eval window	OOMKilled (any occurrence)
`"3"`	Always in eval window	Node pressure (sustained)
`"5"`	N consecutive times (count = eval/frequency)	HTTPCheck (5 failures in 10min)

Alert Categories

HTTPCheck Alerts

Monitor service availability via the OTel httpcheck receiver. Pattern: avg(httpcheck.status) where http.url = '<url>', alert when < 1 for 5 consecutive checks.

Service	URL	Location
ArgoCD	`https://argocd.jomcgi.dev/healthz`	`overlays/cluster-critical/argocd/`
Longhorn	`https://longhorn.jomcgi.dev`	`overlays/cluster-critical/longhorn/`
SigNoz	`https://signoz.jomcgi.dev/api/v1/health`	`overlays/cluster-critical/signoz/`
hikes.jomcgi.dev	`https://hikes.jomcgi.dev`	`overlays/cluster-critical/signoz/`
jomcgi.dev	`https://jomcgi.dev`	`overlays/cluster-critical/signoz/`
trips pages	`https://trips.jomcgi.dev`	`overlays/cluster-critical/signoz/`
marine	`https://marine.jomcgi.dev/health`	`overlays/dev/marine/`
api-gateway	`https://api.jomcgi.dev/status.json`	`overlays/prod/api-gateway/`
todo	`https://todo.jomcgi.dev`	`overlays/prod/todo/`
todo-admin	`https://todo-admin.jomcgi.dev/health`	`overlays/prod/todo/`
img	`https://img.jomcgi.dev/health`	`overlays/prod/trips/`

ArgoCD App State Alerts

Monitor GitOps application health via the argocd_app_info metric. Pattern: count(argocd_app_info) where health_status = '<state>', grouped by app name and namespace.

Located in overlays/cluster-critical/signoz/alerts/:

argocd-app-degraded.yaml — health_status = Degraded (warning)
argocd-app-missing.yaml — health_status = Missing (critical)
argocd-app-outofsync.yaml — sync_status = OutOfSync (warning)
argocd-app-suspended.yaml — health_status = Suspended (warning)

Kubernetes Infrastructure Alerts

Monitor node and pod health via k8s receiver metrics.

Located in overlays/cluster-critical/signoz/alerts/:

node-disk-pressure.yaml — k8s.node.condition_disk_pressure > 0 always (warning)
node-memory-pressure.yaml — k8s.node.condition_memory_pressure > 0 always (warning)
node-pid-pressure.yaml — k8s.node.condition_pid_pressure > 0 always (warning)
node-not-ready.yaml — k8s.node.condition_ready < 1 for 5 consecutive (critical)
pod-oomkilled.yaml — k8s.container.restarts where last_terminated_reason = OOMKilled > 0 once (critical)
pod-pending.yaml — k8s.pod.phase == 1 (Pending) for 5 consecutive over 15min (warning)
pod-restart-rate.yaml — k8s.container.restarts > 3 once in 15min (warning)

Sidecar Architecture

The signoz-dashboard-sidecar (Go, in charts/signoz-dashboard-sidecar/) watches for ConfigMaps labeled signoz.io/alert: "true" across all namespaces. It:

Watches — Kubernetes watch on ConfigMaps with the alert label
Reconciles — Every 5 minutes, force-updates all known alerts (drift correction)
Syncs — POSTs to create or PUTs to update alerts via SigNoz's POST /api/v1/rules API
Tracks state — Stores {ConfigMap UID → AlertState(ID, ContentHash)} in a signoz-dashboard-sidecar-state ConfigMap

On 404 (alert deleted from SigNoz), the sidecar recreates the alert automatically.

Remaining Action Items

Add ~17 missing HTTPCheck alerts (#445, #444) — Cluster-critical and prod services still need monitoring
Add dead man's switch alert — Fire when count(httpcheck.status) == 0 over 10 minutes
Instrument sidecar logging — Sidecar logs (slog to stdout) are not collected by the OTel collector

Observability & Alerting ​

Current State ​

Alert ConfigMap Format (SigNoz v0.113) ​

Required Fields ​

Query Format (v5) ​

Threshold Configuration (CRITICAL) ​

Comparison Operators (op) ​

Match Types (matchType) ​

Alert Categories ​

HTTPCheck Alerts ​

ArgoCD App State Alerts ​

Kubernetes Infrastructure Alerts ​

Sidecar Architecture ​

Remaining Action Items ​

Observability & Alerting

Current State

Alert ConfigMap Format (SigNoz v0.113)

Required Fields

Query Format (v5)

Threshold Configuration (CRITICAL)

Comparison Operators (`op`)

Match Types (`matchType`)

Alert Categories

HTTPCheck Alerts

ArgoCD App State Alerts

Kubernetes Infrastructure Alerts

Sidecar Architecture

Remaining Action Items