Naming convention
All metrics use the decdn_ prefix. Counters carry a _total suffix; histograms and gauges carry unit suffixes (_seconds, _bytes, _ratio).
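The convention can be checked mechanically, for example in a unit test over the metric names a node registers. The following is a minimal sketch using only the Rust standard library; `MetricKind`, `check_name`, and `unit_suffix` are hypothetical names introduced for illustration, and the rule that non-counters never end in _total is an assumption, not something the convention states. Dimensionless gauges such as decdn_probe_hold_slots_used carry no unit suffix, so the unit lookup is informational rather than a hard check.

```rust
/// Metric kinds named in the convention.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetricKind {
    Counter,
    Gauge,
    Histogram,
}

const PREFIX: &str = "decdn_";
const COUNTER_SUFFIX: &str = "_total";
const UNIT_SUFFIXES: [&str; 3] = ["_seconds", "_bytes", "_ratio"];

/// Checks the prefix rule and the counter suffix rule.
pub fn check_name(name: &str, kind: MetricKind) -> Result<(), &'static str> {
    if !name.starts_with(PREFIX) || name.len() == PREFIX.len() {
        return Err("name must start with decdn_ and have a body");
    }
    let is_total = name.ends_with(COUNTER_SUFFIX);
    match kind {
        MetricKind::Counter if !is_total => Err("counters must end in _total"),
        // Assumption: _total is reserved for counters.
        MetricKind::Gauge | MetricKind::Histogram if is_total => {
            Err("_total is reserved for counters")
        }
        _ => Ok(()),
    }
}

/// Returns the unit suffix a name carries, if any.
/// Dimensionless metrics (slot counts, table sizes) return None.
pub fn unit_suffix(name: &str) -> Option<&'static str> {
    UNIT_SUFFIXES.iter().copied().find(|s| name.ends_with(*s))
}
```

Running `check_name` over every registered name at startup, or in CI, would catch a missing prefix or a counter without _total before the metric reaches a dashboard.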
Metric tiers
Metrics are classified as mandatory (M) or recommended (R).
- Mandatory: your node must expose these at startup. They cover slash-safety and core operational health.
- Recommended: operationally useful but not required. Enable per your monitoring posture.
The eight subsystems
| Subsystem | Metric prefix | Examples |
|---|---|---|
| Delivery | decdn_streams_*, decdn_bytes_* | Stream open/close counts, bytes served |
| Cache | decdn_cache_* | Hits, misses, evictions, bytes cached, probe-hold saturation |
| Payment channels | decdn_channels_*, decdn_vouchers_* | Opens, closes, disputes, voucher signatures accepted |
| Gossip | decdn_gossip_*, decdn_peer_table_size | Announcements, peers seen, topic lag |
| Blacklist | decdn_blacklist_* | Sync lag, version behind, enforced rejections |
| Origin (origin-backed only) | decdn_origin_* | Pulls, bytes, errors by type |
| Probe | decdn_probe_* | Incoming probes, hold slots used/max |
| Slash-safety | various | Enumerated below |
The slash-risk metrics (all mandatory)
These seven metrics map directly to slash-risk conditions. All nodes must expose them.

| Metric | Meaning | Alert threshold (recommended) |
|---|---|---|
| decdn_probe_hold_violations_total | Blob evicted after you promised has_blob=true | > 0 in the last hour |
| decdn_probe_hold_slots_used | In-flight probe-hold count | > 80% of decdn_probe_hold_slots_max |
| decdn_probe_hold_slots_max | Capacity | n/a |
| decdn_rate_bounds_clamp_events_total | Your advertised rate was clamped to governance bounds | > 0 ever; investigate immediately |
| decdn_blacklist_sync_lag_seconds | Time since last blacklist poll | > 300 s (5 min) |
| decdn_blacklist_version_behind | How many blacklist versions you’re behind | > 1 |
| decdn_slash_evidence_exposure_total | You self-detected a contradiction in your own behavior | > 0 ever; this is the “I am about to be slashed” alert |
decdn_slash_evidence_exposure_total is the critical one. The node monitors its own behavior against its own signed messages — if it detects inconsistency (e.g., served different bytes than the cache now contains), it increments this counter. A non-zero value means you have live evidence against yourself that someone could submit.
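The recommended thresholds for the seven slash-risk metrics can be expressed as a single check over a snapshot of their values. This is a rough sketch, not the node's own logic: `SlashRiskSnapshot` and `breached` are hypothetical names, and the snapshot is assumed to already hold the one-hour increase of decdn_probe_hold_violations_total (computing that delta is left to whatever scrapes the counter).

```rust
/// Hypothetical snapshot of the seven slash-risk metrics.
#[derive(Debug, Clone)]
pub struct SlashRiskSnapshot {
    /// Assumption: increase of decdn_probe_hold_violations_total over the last hour.
    pub probe_hold_violations_last_hour: u64,
    pub probe_hold_slots_used: u64,
    pub probe_hold_slots_max: u64,
    pub rate_bounds_clamp_events_total: u64,
    pub blacklist_sync_lag_seconds: f64,
    pub blacklist_version_behind: u64,
    pub slash_evidence_exposure_total: u64,
}

/// Returns the names of metrics whose recommended alert threshold is exceeded.
pub fn breached(s: &SlashRiskSnapshot) -> Vec<&'static str> {
    let mut out = Vec::new();
    if s.probe_hold_violations_last_hour > 0 {
        out.push("decdn_probe_hold_violations_total");
    }
    // used / max > 0.80, written in integers as used * 5 > max * 4.
    if s.probe_hold_slots_max > 0
        && s.probe_hold_slots_used.saturating_mul(5) > s.probe_hold_slots_max.saturating_mul(4)
    {
        out.push("decdn_probe_hold_slots_used");
    }
    if s.rate_bounds_clamp_events_total > 0 {
        out.push("decdn_rate_bounds_clamp_events_total");
    }
    if s.blacklist_sync_lag_seconds > 300.0 {
        out.push("decdn_blacklist_sync_lag_seconds");
    }
    if s.blacklist_version_behind > 1 {
        out.push("decdn_blacklist_version_behind");
    }
    if s.slash_evidence_exposure_total > 0 {
        out.push("decdn_slash_evidence_exposure_total");
    }
    out
}
```

All comparisons are strict, matching the "greater than" wording of the thresholds, so a node sitting exactly at 80% slot usage or exactly 300 s of lag does not alert.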
The /health endpoint
The endpoint reports one of three states: ready, degraded, or not_ready.
- ready: all seven acceptance criteria (staking & registration) are met.
- degraded: running, but some non-critical condition is unmet (e.g., only 1 watchtower when there should be 2+).
- not_ready: a mandatory readiness condition is missing (no stake, no gossip subscription, blacklist too far behind).
A node reporting ready should accept paid deliveries. Load balancers and orchestrators should drain degraded and not_ready nodes from traffic.
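A load balancer or orchestrator hook could apply the drain rule roughly as follows. This sketch assumes the status string has already been extracted from the /health response (the response body format is not specified here); `HealthStatus` and `keep_in_rotation` are hypothetical names, and treating an unrecognized status like not_ready is an assumption made to fail closed.

```rust
/// The three states reported by /health.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthStatus {
    Ready,
    Degraded,
    NotReady,
}

impl HealthStatus {
    /// Parses the status string; returns None for anything unrecognized.
    pub fn parse(s: &str) -> Option<HealthStatus> {
        match s.trim() {
            "ready" => Some(HealthStatus::Ready),
            "degraded" => Some(HealthStatus::Degraded),
            "not_ready" => Some(HealthStatus::NotReady),
            _ => None,
        }
    }

    /// Only ready nodes stay in rotation; degraded and not_ready are drained.
    pub fn should_route_traffic(self) -> bool {
        self == HealthStatus::Ready
    }
}

/// Assumption: an unknown or unparseable status is drained, like not_ready.
pub fn keep_in_rotation(status_text: &str) -> bool {
    HealthStatus::parse(status_text).map_or(false, HealthStatus::should_route_traffic)
}
```

Keeping degraded as a distinct variant, even though it is drained just like not_ready, leaves room to alert on it differently: a degraded node is still running and may only need a second watchtower.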
Where to expose them
Default: :9090/metrics (Prometheus) and :9090/health. Both share a port by default; split them if you want different access controls.
Recommended alerting
- P1 (immediate response):
  - decdn_slash_evidence_exposure_total > 0
  - decdn_blacklist_sync_lag_seconds > 600 (10 min)
  - decdn_probe_hold_slots_used / decdn_probe_hold_slots_max > 0.95
- P2 (response within hours):
  - decdn_blacklist_sync_lag_seconds > 300 (5 min)
  - decdn_rate_bounds_clamp_events_total > 0
  - Watchtower heartbeat failure
- P3 (business hours):
  - Cache hit rate anomalies
  - Peer table under threshold
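The P1 and P2 tiers above reduce to a small triage function that returns the most urgent tier triggered. This is an illustrative sketch only: `AlertInputs`, `Priority`, and `triage` are hypothetical names, the watchtower heartbeat is assumed to arrive as a boolean, and the P3 conditions are left out because no numeric thresholds are given for them. In practice these rules would more likely live in your alerting system than in node code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Priority {
    P1,
    P2,
}

/// Hypothetical inputs for the P1/P2 rules.
#[derive(Debug, Clone)]
pub struct AlertInputs {
    pub slash_evidence_exposure_total: u64,
    pub blacklist_sync_lag_seconds: f64,
    pub probe_hold_slots_used: u64,
    pub probe_hold_slots_max: u64,
    pub rate_bounds_clamp_events_total: u64,
    /// Assumption: heartbeat state is reduced to a boolean upstream.
    pub watchtower_heartbeat_ok: bool,
}

/// Returns the most urgent tier triggered, or None if nothing fires.
pub fn triage(m: &AlertInputs) -> Option<Priority> {
    // used / max > 0.95, written in integers as used * 20 > max * 19.
    let slots_over_95 = m.probe_hold_slots_max > 0
        && m.probe_hold_slots_used.saturating_mul(20)
            > m.probe_hold_slots_max.saturating_mul(19);

    if m.slash_evidence_exposure_total > 0
        || m.blacklist_sync_lag_seconds > 600.0
        || slots_over_95
    {
        return Some(Priority::P1);
    }
    if m.blacklist_sync_lag_seconds > 300.0
        || m.rate_bounds_clamp_events_total > 0
        || !m.watchtower_heartbeat_ok
    {
        return Some(Priority::P2);
    }
    None
}
```

Checking P1 first means blacklist sync lag escalates naturally: it pages at P2 once it exceeds 5 minutes and is promoted to P1 once it exceeds 10.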
Structured logging
Use tracing with the JSON layer. Attach these fields to log events wherever they apply: node_id, channel_id, hash, peer_id, offset, nonce. They make correlation across the delivery, channel, and gossip subsystems tractable.