# Observability & Alerting
The homelab has a complete observability stack (SigNoz v0.113, OTel, Linkerd) with a fully operational alerting pipeline. 22 alert rules monitor infrastructure health, application availability, and GitOps state. Alerts are synced to SigNoz via the `signoz-dashboard-sidecar` and notify through Incident.io.
## Current State
What works: 22 alert rules are synced, evaluating, and generating alert history in SigNoz v0.113. The `signoz-dashboard-sidecar` reconciles alert ConfigMaps every 5 minutes, creating or updating rules via the SigNoz API. The Incident.io notification channel (`incidentio`) is configured via webhook, with Bearer token auth sourced from 1Password.
Alert inventory:
- 11 HTTPCheck alerts — monitor service availability via the `httpcheck.status` metric
- 7 Kubernetes health alerts — node conditions (disk/memory/PID pressure, readiness), pod health (OOMKilled, pending, restart rate)
- 4 ArgoCD app state alerts — degraded, missing, out-of-sync, suspended
Remaining gaps:
- ~17 services lack HTTPCheck alerts (#445, #444)
- No dead man's switch for the httpcheck receiver itself
- Sidecar's own logs are not collected by the OTel collector (uses `slog` to stdout but is not instrumented)
## Alert ConfigMap Format (SigNoz v0.113)
All alerts are Kubernetes ConfigMaps with the `signoz.io/alert: "true"` label, discovered and synced by the `signoz-dashboard-sidecar`. The alert JSON lives in `data.alert.json`.
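A minimal sketch of such a ConfigMap (the name and namespace are illustrative, and the alert JSON is abbreviated — see the required fields below for the full shape):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-service-down   # illustrative name
  namespace: signoz            # illustrative namespace
  labels:
    signoz.io/alert: "true"    # label the sidecar watches for
data:
  alert.json: |
    {
      "alert": "example-service down",
      "alertType": "METRICS_BASED_ALERT",
      "version": "v5",
      "preferredChannels": ["incidentio"]
    }
```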
### Required Fields
```json
{
  "alert": "Alert Name",
  "alertType": "METRICS_BASED_ALERT",
  "ruleType": "threshold_rule",
  "version": "v5",
  "broadcastToAll": false,
  "disabled": false,
  "evalWindow": "10m0s",
  "frequency": "2m0s",
  "severity": "critical",
  "labels": { ... },
  "annotations": { "summary": "...", "description": "..." },
  "condition": { ... },
  "preferredChannels": ["incidentio"]
}
```

### Query Format (v5)
SigNoz v0.113 uses the v5 query builder format with a `queries` array (not the older `builderQueries` map):
```json
"compositeQuery": {
  "queries": [
    {
      "type": "builder_query",
      "spec": {
        "name": "A",
        "signal": "metrics",
        "stepInterval": 60,
        "aggregations": [
          {
            "timeAggregation": "avg",
            "spaceAggregation": "avg",
            "metricName": "httpcheck.status"
          }
        ],
        "filter": {
          "expression": "http.url = 'https://example.jomcgi.dev'"
        },
        "groupBy": [],
        "order": [],
        "disabled": false
      }
    }
  ],
  "panelType": "graph",
  "queryType": "builder"
}
```

Key differences from older SigNoz versions:
- `queries` is an array of `{type, spec}` objects (not a `builderQueries` map of `{queryName, dataSource, ...}`)
- Aggregation uses `timeAggregation`/`spaceAggregation`/`metricName` (not `aggregateOperator`/`aggregateAttribute`)
- Filters use `expression` string syntax (not an `items` array with `key`/`op`/`value` objects)
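For comparison, the same `http.url` filter in the older `items` syntax looked roughly like this (the exact field names varied across pre-v5 SigNoz versions, so treat this as an approximate sketch):

```json
"filter": {
  "items": [
    {
      "key": { "key": "http.url", "dataType": "string" },
      "op": "=",
      "value": "https://example.jomcgi.dev"
    }
  ],
  "op": "AND"
}
```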
### Threshold Configuration (CRITICAL)
Thresholds must be defined in two places on the `condition` object — both are required:
```json
"condition": {
  "compositeQuery": { ... },
  "selectedQueryName": "A",
  "op": "2",
  "target": 1,
  "matchType": "5",
  "targetUnit": "",
  "thresholds": {
    "kind": "basic",
    "spec": [
      {
        "name": "critical",
        "target": 1,
        "targetUnit": "",
        "matchType": "5",
        "op": "2",
        "channels": ["incidentio"]
      }
    ]
  }
}
```

Why both are required: SigNoz v0.113's `PostableRule.processRuleDefaults()` defaults the internal `schemaVersion` to `"v1"` (this is separate from the top-level `"version": "v5"`, which controls the query format). Under the v1 schema, threshold evaluation reads the legacy condition-level fields (`op`, `target`, `matchType`, `targetUnit`), not the `thresholds` block. Without the legacy fields, `BasicRuleThreshold.TargetValue` (`*float64`) is `nil`, causing a panic during evaluation when a query returns data.
The `thresholds` block is still needed for the SigNoz UI to render threshold configuration correctly.
### Comparison Operators (`op`)
| Value | Meaning | Use case |
|---|---|---|
| `"1"` | Greater than | Restart count > N |
| `"2"` | Less than | `httpcheck.status` < 1 |
| `"3"` | Equal to | Pod phase == Pending |
| `"4"` | Not equal to | Status != expected |
### Match Types (`matchType`)
| Value | Meaning | Use case |
|---|---|---|
| `"1"` | Once in eval window | OOMKilled (any occurrence) |
| `"3"` | Always in eval window | Node pressure (sustained) |
| `"5"` | N consecutive times (count = evalWindow / frequency) | HTTPCheck (5 failures in 10min) |
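The consecutive-evaluation count for `matchType` `"5"` falls out of the duration fields in the alert JSON; a small Go sketch of that arithmetic (the function name is illustrative, not part of the sidecar):

```go
package main

import (
	"fmt"
	"time"
)

// consecutiveCount derives the number of consecutive failing
// evaluations implied by matchType "5": evalWindow / frequency.
// The duration strings use Go syntax, matching the alert JSON
// fields ("10m0s", "2m0s").
func consecutiveCount(evalWindow, frequency string) (int, error) {
	w, err := time.ParseDuration(evalWindow)
	if err != nil {
		return 0, err
	}
	f, err := time.ParseDuration(frequency)
	if err != nil {
		return 0, err
	}
	return int(w / f), nil
}

func main() {
	n, _ := consecutiveCount("10m0s", "2m0s")
	fmt.Println(n) // 5 consecutive failed checks before the alert fires
}
```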
## Alert Categories
### HTTPCheck Alerts
Monitor service availability via the OTel httpcheck receiver. Pattern: `avg(httpcheck.status)` where `http.url = '<url>'`, alert when < 1 for 5 consecutive checks.
| Service | URL | Location |
|---|---|---|
| ArgoCD | https://argocd.jomcgi.dev/healthz | overlays/cluster-critical/argocd/ |
| Longhorn | https://longhorn.jomcgi.dev | overlays/cluster-critical/longhorn/ |
| SigNoz | https://signoz.jomcgi.dev/api/v1/health | overlays/cluster-critical/signoz/ |
| hikes.jomcgi.dev | https://hikes.jomcgi.dev | overlays/cluster-critical/signoz/ |
| jomcgi.dev | https://jomcgi.dev | overlays/cluster-critical/signoz/ |
| trips pages | https://trips.jomcgi.dev | overlays/cluster-critical/signoz/ |
| marine | https://marine.jomcgi.dev/health | overlays/dev/marine/ |
| api-gateway | https://api.jomcgi.dev/status.json | overlays/prod/api-gateway/ |
| todo | https://todo.jomcgi.dev | overlays/prod/todo/ |
| todo-admin | https://todo-admin.jomcgi.dev/health | overlays/prod/todo/ |
| img | https://img.jomcgi.dev/health | overlays/prod/trips/ |
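These alerts assume the OTel collector's `httpcheck` receiver is probing each URL and emitting `httpcheck.status`. A minimal receiver config might look like the following sketch (the endpoint and interval are illustrative, not the cluster's actual config):

```yaml
receivers:
  httpcheck:
    targets:
      - endpoint: https://example.jomcgi.dev
        method: GET
    collection_interval: 60s
```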
### ArgoCD App State Alerts
Monitor GitOps application health via the `argocd_app_info` metric. Pattern: `count(argocd_app_info)` where `health_status = '<state>'`, grouped by app name and namespace.
Located in `overlays/cluster-critical/signoz/alerts/`:
- `argocd-app-degraded.yaml` — health_status = Degraded (warning)
- `argocd-app-missing.yaml` — health_status = Missing (critical)
- `argocd-app-outofsync.yaml` — sync_status = OutOfSync (warning)
- `argocd-app-suspended.yaml` — health_status = Suspended (warning)
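As a sketch, the v5 builder spec for the degraded-app pattern might look like this (the `groupBy` entry shape and exact label keys are assumptions; they depend on the ArgoCD metrics exporter and the v5 schema):

```json
{
  "name": "A",
  "signal": "metrics",
  "stepInterval": 60,
  "aggregations": [
    {
      "timeAggregation": "count",
      "spaceAggregation": "sum",
      "metricName": "argocd_app_info"
    }
  ],
  "filter": {
    "expression": "health_status = 'Degraded'"
  },
  "groupBy": [
    { "name": "name" },
    { "name": "namespace" }
  ]
}
```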
### Kubernetes Infrastructure Alerts
Monitor node and pod health via k8s receiver metrics.
Located in `overlays/cluster-critical/signoz/alerts/`:
- `node-disk-pressure.yaml` — `k8s.node.condition_disk_pressure` > 0 always (warning)
- `node-memory-pressure.yaml` — `k8s.node.condition_memory_pressure` > 0 always (warning)
- `node-pid-pressure.yaml` — `k8s.node.condition_pid_pressure` > 0 always (warning)
- `node-not-ready.yaml` — `k8s.node.condition_ready` < 1 for 5 consecutive (critical)
- `pod-oomkilled.yaml` — `k8s.container.restarts` where `last_terminated_reason = OOMKilled` > 0 once (critical)
- `pod-pending.yaml` — `k8s.pod.phase` == 1 (Pending) for 5 consecutive over 15min (warning)
- `pod-restart-rate.yaml` — `k8s.container.restarts` > 3 once in 15min (warning)
## Sidecar Architecture
The `signoz-dashboard-sidecar` (Go, in `charts/signoz-dashboard-sidecar/`) watches for ConfigMaps labeled `signoz.io/alert: "true"` across all namespaces. It:
- Watches — Kubernetes watch on ConfigMaps with the alert label
- Reconciles — every 5 minutes, force-updates all known alerts (drift correction)
- Syncs — POSTs to create or PUTs to update alerts via SigNoz's `POST /api/v1/rules` API
- Tracks state — stores `{ConfigMap UID → AlertState(ID, ContentHash)}` in a `signoz-dashboard-sidecar-state` ConfigMap
On 404 (alert deleted from SigNoz), the sidecar recreates the alert automatically.
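The stored `ContentHash` lets the sidecar tell whether an alert's JSON actually changed on a watch event (the periodic reconcile force-updates regardless). A minimal sketch of that state tracking — type, field, and function names here are illustrative, not the sidecar's actual code:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// AlertState mirrors what the sidecar stores per ConfigMap UID:
// the SigNoz rule ID plus a hash of the alert JSON.
type AlertState struct {
	ID          string
	ContentHash string
}

// contentHash fingerprints the alert JSON so unchanged content
// can be detected without comparing full payloads.
func contentHash(alertJSON string) string {
	sum := sha256.Sum256([]byte(alertJSON))
	return hex.EncodeToString(sum[:])
}

// needsSync reports whether the alert must be pushed to SigNoz:
// either the UID is unknown, or the content hash changed.
func needsSync(state map[string]AlertState, uid, alertJSON string) bool {
	s, known := state[uid]
	return !known || s.ContentHash != contentHash(alertJSON)
}

func main() {
	state := map[string]AlertState{}
	uid := "example-uid" // ConfigMap UID (placeholder)

	fmt.Println(needsSync(state, uid, `{"alert":"A"}`)) // true: unknown UID
	state[uid] = AlertState{ID: "1", ContentHash: contentHash(`{"alert":"A"}`)}
	fmt.Println(needsSync(state, uid, `{"alert":"A"}`)) // false: unchanged
	fmt.Println(needsSync(state, uid, `{"alert":"B"}`)) // true: content changed
}
```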
## Remaining Action Items
- Add ~17 missing HTTPCheck alerts (#445, #444) — Cluster-critical and prod services still need monitoring
- Add dead man's switch alert — Fire when `count(httpcheck.status) == 0` over 10 minutes
- Instrument sidecar logging — Sidecar logs (`slog` to stdout) are not collected by the OTel collector