Skip to content

Observability & Alerting

The homelab has a complete observability stack (SigNoz v0.113, OTel, Linkerd) with a fully operational alerting pipeline. 22 alert rules monitor infrastructure health, application availability, and GitOps state. Alerts are synced to SigNoz via the signoz-dashboard-sidecar and notify through Incident.io.

Current State

What works: 22 alert rules are synced, evaluating, and generating alert history in SigNoz v0.113. The signoz-dashboard-sidecar reconciles alert ConfigMaps every 5 minutes, creating or updating rules via the SigNoz API. Incident.io notification channel (incidentio) is configured via webhook, with Bearer token auth sourced from 1Password.

Alert inventory:

  • 11 HTTPCheck alerts — monitor service availability via httpcheck.status metric
  • 7 Kubernetes health alerts — node conditions (disk/memory/PID pressure, readiness), pod health (OOMKilled, pending, restart rate)
  • 4 ArgoCD app state alerts — degraded, missing, out-of-sync, suspended

Remaining gaps:

  • ~17 services lack HTTPCheck alerts (#445, #444)
  • No dead man's switch for the httpcheck receiver itself
  • Sidecar's own logs are not collected by the OTel collector (uses slog to stdout but not instrumented)

Alert ConfigMap Format (SigNoz v0.113)

All alerts are Kubernetes ConfigMaps with the signoz.io/alert: "true" label, discovered and synced by the signoz-dashboard-sidecar. The alert JSON lives in data.alert.json.

Required Fields

json
{
  "alert": "Alert Name",
  "alertType": "METRICS_BASED_ALERT",
  "ruleType": "threshold_rule",
  "version": "v5",
  "broadcastToAll": false,
  "disabled": false,
  "evalWindow": "10m0s",
  "frequency": "2m0s",
  "severity": "critical",
  "labels": { ... },
  "annotations": { "summary": "...", "description": "..." },
  "condition": { ... },
  "preferredChannels": ["incidentio"]
}

Query Format (v5)

SigNoz v0.113 uses the v5 query builder format with a queries array (not the older builderQueries map):

json
"compositeQuery": {
  "queries": [
    {
      "type": "builder_query",
      "spec": {
        "name": "A",
        "signal": "metrics",
        "stepInterval": 60,
        "aggregations": [
          {
            "timeAggregation": "avg",
            "spaceAggregation": "avg",
            "metricName": "httpcheck.status"
          }
        ],
        "filter": {
          "expression": "http.url = 'https://example.jomcgi.dev'"
        },
        "groupBy": [],
        "order": [],
        "disabled": false
      }
    }
  ],
  "panelType": "graph",
  "queryType": "builder"
}

Key differences from older SigNoz versions:

  • queries is an array of {type, spec} objects (not a builderQueries map of {queryName, dataSource, ...})
  • Aggregation uses timeAggregation/spaceAggregation/metricName (not aggregateOperator/aggregateAttribute)
  • Filters use expression string syntax (not items array with key/op/value objects)

Threshold Configuration (CRITICAL)

Thresholds must be defined in two places on the condition object — both are required:

json
"condition": {
  "compositeQuery": { ... },
  "selectedQueryName": "A",
  "op": "2",
  "target": 1,
  "matchType": "5",
  "targetUnit": "",
  "thresholds": {
    "kind": "basic",
    "spec": [
      {
        "name": "critical",
        "target": 1,
        "targetUnit": "",
        "matchType": "5",
        "op": "2",
        "channels": ["incidentio"]
      }
    ]
  }
}

Why both are required: SigNoz v0.113's PostableRule.processRuleDefaults() defaults the internal schemaVersion to "v1" (this is separate from the top-level "version": "v5" which controls query format). For v1 schema, the threshold evaluation reads from the legacy condition-level fields (op, target, matchType, targetUnit), not from the thresholds block. Without the legacy fields, BasicRuleThreshold.TargetValue (*float64) is nil, causing a panic during evaluation when a query returns data.

The thresholds block is still needed for the SigNoz UI to render threshold configuration correctly.

Comparison Operators (op)

ValueMeaningUse case
"1"Greater thanRestart count > N
"2"Less thanhttpcheck.status < 1
"3"Equal toPod phase == Pending
"4"Not equal toStatus != expected

Match Types (matchType)

ValueMeaningUse case
"1"Once in eval windowOOMKilled (any occurrence)
"3"Always in eval windowNode pressure (sustained)
"5"N consecutive times (count = eval/frequency)HTTPCheck (5 failures in 10min)

Alert Categories

HTTPCheck Alerts

Monitor service availability via the OTel httpcheck receiver. Pattern: avg(httpcheck.status) where http.url = '<url>', alert when < 1 for 5 consecutive checks.

ServiceURLLocation
ArgoCDhttps://argocd.jomcgi.dev/healthzoverlays/cluster-critical/argocd/
Longhornhttps://longhorn.jomcgi.devoverlays/cluster-critical/longhorn/
SigNozhttps://signoz.jomcgi.dev/api/v1/healthoverlays/cluster-critical/signoz/
hikes.jomcgi.devhttps://hikes.jomcgi.devoverlays/cluster-critical/signoz/
jomcgi.devhttps://jomcgi.devoverlays/cluster-critical/signoz/
trips pageshttps://trips.jomcgi.devoverlays/cluster-critical/signoz/
marinehttps://marine.jomcgi.dev/healthoverlays/dev/marine/
api-gatewayhttps://api.jomcgi.dev/status.jsonoverlays/prod/api-gateway/
todohttps://todo.jomcgi.devoverlays/prod/todo/
todo-adminhttps://todo-admin.jomcgi.dev/healthoverlays/prod/todo/
imghttps://img.jomcgi.dev/healthoverlays/prod/trips/

ArgoCD App State Alerts

Monitor GitOps application health via the argocd_app_info metric. Pattern: count(argocd_app_info) where health_status = '<state>', grouped by app name and namespace.

Located in overlays/cluster-critical/signoz/alerts/:

  • argocd-app-degraded.yaml — health_status = Degraded (warning)
  • argocd-app-missing.yaml — health_status = Missing (critical)
  • argocd-app-outofsync.yaml — sync_status = OutOfSync (warning)
  • argocd-app-suspended.yaml — health_status = Suspended (warning)

Kubernetes Infrastructure Alerts

Monitor node and pod health via k8s receiver metrics.

Located in overlays/cluster-critical/signoz/alerts/:

  • node-disk-pressure.yamlk8s.node.condition_disk_pressure > 0 always (warning)
  • node-memory-pressure.yamlk8s.node.condition_memory_pressure > 0 always (warning)
  • node-pid-pressure.yamlk8s.node.condition_pid_pressure > 0 always (warning)
  • node-not-ready.yamlk8s.node.condition_ready < 1 for 5 consecutive (critical)
  • pod-oomkilled.yamlk8s.container.restarts where last_terminated_reason = OOMKilled > 0 once (critical)
  • pod-pending.yamlk8s.pod.phase == 1 (Pending) for 5 consecutive over 15min (warning)
  • pod-restart-rate.yamlk8s.container.restarts > 3 once in 15min (warning)

Sidecar Architecture

The signoz-dashboard-sidecar (Go, in charts/signoz-dashboard-sidecar/) watches for ConfigMaps labeled signoz.io/alert: "true" across all namespaces. It:

  1. Watches — Kubernetes watch on ConfigMaps with the alert label
  2. Reconciles — Every 5 minutes, force-updates all known alerts (drift correction)
  3. Syncs — POSTs to create or PUTs to update alerts via SigNoz's POST /api/v1/rules API
  4. Tracks state — Stores {ConfigMap UID → AlertState(ID, ContentHash)} in a signoz-dashboard-sidecar-state ConfigMap

On 404 (alert deleted from SigNoz), the sidecar recreates the alert automatically.

Remaining Action Items

  1. Add ~17 missing HTTPCheck alerts (#445, #444) — Cluster-critical and prod services still need monitoring
  2. Add dead man's switch alert — Fire when count(httpcheck.status) == 0 over 10 minutes
  3. Instrument sidecar logging — Sidecar logs (slog to stdout) are not collected by the OTel collector