Naming convention

All metrics use the decdn_ prefix. Counters carry a _total suffix; histograms and gauges carry a unit suffix (_seconds, _bytes, _ratio) where a unit applies.
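
The convention above can be checked mechanically. The following is a minimal sketch, not part of any node tooling: the helper names check_metric_name and unit_of are hypothetical, and the metric kind ("counter", "gauge", "histogram") is assumed to be known to the caller.

```python
# Hypothetical helpers illustrating the naming convention; not a shipped API.
UNIT_SUFFIXES = ("_seconds", "_bytes", "_ratio")


def check_metric_name(name: str, kind: str) -> list:
    """Return a list of naming problems for one metric (empty if none)."""
    problems = []
    if not name.startswith("decdn_"):
        problems.append("missing decdn_ prefix")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counter without _total suffix")
    return problems


def unit_of(name: str):
    """Return the unit implied by the suffix ("seconds", "bytes", "ratio"), or None."""
    for suffix in UNIT_SUFFIXES:
        if name.endswith(suffix):
            return suffix.lstrip("_")
    return None
```

A check like this could run in CI against the list of metrics a build registers, so that a misnamed counter is caught before dashboards depend on it.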

Metric tiers

Metrics are classified as mandatory (M) or recommended (R).
  • Mandatory — your node must expose these at startup. They cover slash-safety and core operational health.
  • Recommended — operationally useful but not required. Enable per your monitoring posture.

The eight subsystems

| Subsystem | Metric prefix | Examples |
|---|---|---|
| Delivery | decdn_streams_*, decdn_bytes_* | Stream open/close counts, bytes served |
| Cache | decdn_cache_* | Hits, misses, evictions, bytes cached, probe-hold saturation |
| Payment channels | decdn_channels_*, decdn_vouchers_* | Opens, closes, disputes, voucher signatures accepted |
| Gossip | decdn_gossip_*, decdn_peer_table_size | Announcements, peers seen, topic lag |
| Blacklist | decdn_blacklist_* | Sync lag, version behind, enforced rejections |
| Origin (origin-backed only) | decdn_origin_* | Pulls, bytes, errors by type |
| Probe | decdn_probe_* | Incoming probes, hold slots used/max |
| Slash-safety | various | Enumerated below |

The slash-risk metrics (all mandatory)

These seven metrics map directly to slash-risk conditions. All nodes must expose them.

| Metric | Meaning | Alert threshold (recommended) |
|---|---|---|
| decdn_probe_hold_violations_total | Blob evicted after you promised has_blob=true | > 0 in the last hour |
| decdn_probe_hold_slots_used | In-flight probe-hold count | > 80% of _slots_max |
| decdn_probe_hold_slots_max | Capacity | |
| decdn_rate_bounds_clamp_events_total | Your advertised rate was clamped to governance bounds | > 0 ever; investigate immediately |
| decdn_blacklist_sync_lag_seconds | Time since last blacklist poll | > 300 s (5 min) |
| decdn_blacklist_version_behind | How many blacklist versions you’re behind | > 1 |
| decdn_slash_evidence_exposure_total | You self-detected a contradiction in your own behavior | > 0 ever; this is the “I am about to be slashed” alert |

decdn_slash_evidence_exposure_total is the critical one. The node monitors its own behavior against its own signed messages — if it detects inconsistency (e.g., served different bytes than the cache now contains), it increments this counter. A non-zero value means you have live evidence against yourself that someone could submit.
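
As an illustration of the recommended thresholds, the sketch below parses a scrape in Prometheus text exposition format and evaluates the seven slash-risk metrics. The function names parse_metrics and slash_risk_alerts are hypothetical, the series are assumed to be exposed without labels, and the "last hour" condition is approximated by comparing against a second snapshot taken an hour earlier.

```python
MANDATORY = (
    "decdn_probe_hold_violations_total",
    "decdn_probe_hold_slots_used",
    "decdn_probe_hold_slots_max",
    "decdn_rate_bounds_clamp_events_total",
    "decdn_blacklist_sync_lag_seconds",
    "decdn_blacklist_version_behind",
    "decdn_slash_evidence_exposure_total",
)


def parse_metrics(text: str) -> dict:
    """Parse unlabeled samples from Prometheus text exposition format."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blanks, comments (# HELP / # TYPE), and labeled series (assumption:
        # the slash-risk metrics carry no labels).
        if len(parts) < 2 or parts[0].startswith("#") or "{" in parts[0]:
            continue
        values[parts[0]] = float(parts[1])
    return values


def slash_risk_alerts(now: dict, hour_ago: dict) -> list:
    """Return human-readable alerts for the recommended thresholds."""
    missing = [m for m in MANDATORY if m not in now]
    if missing:
        return ["missing mandatory metric: " + m for m in missing]
    alerts = []
    violations = "decdn_probe_hold_violations_total"
    if now[violations] - hour_ago.get(violations, 0.0) > 0:
        alerts.append("probe-hold violation in the last hour")
    cap = now["decdn_probe_hold_slots_max"]
    if cap > 0 and now["decdn_probe_hold_slots_used"] / cap > 0.80:
        alerts.append("probe-hold slots above 80% of capacity")
    if now["decdn_rate_bounds_clamp_events_total"] > 0:
        alerts.append("advertised rate was clamped")
    if now["decdn_blacklist_sync_lag_seconds"] > 300:
        alerts.append("blacklist sync lag above 300 s")
    if now["decdn_blacklist_version_behind"] > 1:
        alerts.append("more than one blacklist version behind")
    if now["decdn_slash_evidence_exposure_total"] > 0:
        alerts.append("slash evidence exposure is non-zero")
    return alerts
```

In practice these rules would more likely live in the monitoring system's own alerting configuration; the Python form is only meant to make the thresholds concrete.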

The /health endpoint

GET /health
Returns JSON with status ready | degraded | not_ready:
{
  "status": "ready",
  "version": "0.1.0",
  "peer_count": 128,
  "channels_open": 14,
  "channel_balances_total_usdc": "1234000000",
  "blacklist_sync_state": {
    "version": 42,
    "lag_seconds": 15
  },
  "watchtowers_registered": 3
}
Status semantics:
  • ready — all seven acceptance criteria (staking & registration) are met.
  • degraded — running but some non-critical condition is unmet (e.g., only 1 watchtower, should be 2+).
  • not_ready — missing a mandatory readiness condition (no stake, no gossip subscription, blacklist too far behind).
Only nodes reporting ready should accept paid deliveries. Load balancers and orchestrators should drain nodes reporting degraded or not_ready.
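
A load balancer or orchestrator hook might apply this rule roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the default port 9090 is taken from the defaults described in this document, and any fetch or parse failure is treated as "do not route", which is a design choice of the example rather than documented node behavior.

```python
import json
import urllib.request


def should_route_paid_traffic(health_body: str) -> bool:
    """True only when the health document reports status "ready"."""
    try:
        health = json.loads(health_body)
    except json.JSONDecodeError:
        return False
    return isinstance(health, dict) and health.get("status") == "ready"


def node_is_routable(base_url: str = "http://127.0.0.1:9090", timeout: float = 2.0) -> bool:
    """Fetch /health and apply the rule above; unreachable nodes are drained."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers connection errors, timeouts, and non-2xx HTTP responses.
        return False
    return should_route_paid_traffic(body)
```

Failing closed keeps a node out of rotation whenever its state is unknown, which matches the intent that only ready nodes accept paid deliveries.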

Where to expose them

By default, metrics are served at :9090/metrics (Prometheus format) and health at :9090/health. Both share one port; split them if you want different access controls.

Suggested alert priorities:
  • P1 (immediate response):
    • decdn_slash_evidence_exposure_total > 0
    • decdn_blacklist_sync_lag_seconds > 600 (10 min)
    • decdn_probe_hold_slots_used / _slots_max > 0.95
  • P2 (response within hours):
    • decdn_blacklist_sync_lag_seconds > 300 (5 min)
    • decdn_rate_bounds_clamp_events_total > 0
    • Watchtower heartbeat failure
  • P3 (business hours):
    • Cache hit rate anomalies
    • Peer table under threshold
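
The metric-based P1 and P2 conditions in the list above can be expressed as a small classifier. This is a sketch only: alert_priority is a hypothetical name, the input is assumed to be a dict of current metric values, and the watchtower-heartbeat and P3 conditions are left out because this document does not name metrics for them.

```python
from typing import Optional


def alert_priority(m: dict) -> Optional[str]:
    """Map current metric values to "P1", "P2", or None (no metric-based alert).

    Absent metrics are treated as zero here; a production rule would more
    likely alert on the absence itself.
    """
    lag = m.get("decdn_blacklist_sync_lag_seconds", 0.0)
    used = m.get("decdn_probe_hold_slots_used", 0.0)
    cap = m.get("decdn_probe_hold_slots_max", 0.0)
    saturation = used / cap if cap > 0 else 0.0

    if (
        m.get("decdn_slash_evidence_exposure_total", 0.0) > 0
        or lag > 600
        or saturation > 0.95
    ):
        return "P1"
    if lag > 300 or m.get("decdn_rate_bounds_clamp_events_total", 0.0) > 0:
        return "P2"
    return None
```

Checking P1 first means a blacklist lag above 600 s is reported once at the higher priority rather than also matching the 300 s P2 rule.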

Structured logging

Use tracing with the JSON layer:
[log]
format = "json"
level  = "info"
Key log fields to standardize on: node_id, channel_id, hash, peer_id, offset, nonce. These make correlation across delivery + channel + gossip subsystems tractable.
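
To show why shared field names help, here is a sketch of a log consumer that groups JSON log lines by one of the standardized fields. The function name group_by_field is hypothetical, and the exact record layout is an assumption: depending on how the JSON layer is configured, event fields may appear at the top level or nested under a "fields" object, so the sketch checks both.

```python
import json
from collections import defaultdict


def group_by_field(lines, field: str) -> dict:
    """Group parsed JSON log records by the value of one field.

    Lines that are not valid JSON objects, or that lack the field, are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        nested = record.get("fields")
        if field in record:
            key = record[field]
        elif isinstance(nested, dict) and field in nested:
            key = nested[field]
        else:
            continue
        groups[str(key)].append(record)
    return dict(groups)
```

Grouping by channel_id, for example, would pull together delivery, channel, and gossip events that refer to the same payment channel, provided every subsystem logs the field under the same name.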