Observability

Naming convention

All metrics use the decdn_ prefix. Counters carry a _total suffix; histograms and gauges carry a unit suffix (_seconds, _bytes, _ratio) where a unit applies. Plain counts such as decdn_peer_table_size carry none.

Metric tiers

Metrics are classified as mandatory (M) or recommended (R).

Mandatory — your node must expose these at startup. They cover slash-safety and core operational health.
Recommended — operationally useful but not required. Enable per your monitoring posture.

The eight subsystems

Subsystem	Metric prefix	Examples
Delivery	`decdn_streams_*`, `decdn_bytes_*`	Stream open/close counts, bytes served
Cache	`decdn_cache_*`	Hits, misses, evictions, bytes cached, probe-hold saturation
Payment channels	`decdn_channels_*`, `decdn_vouchers_*`	Opens, closes, disputes, voucher signatures accepted
Gossip	`decdn_gossip_*`, `decdn_peer_table_size`	Announcements, peers seen, topic lag
Blacklist	`decdn_blacklist_*`	Sync lag, version behind, enforced rejections
Origin (origin-backed only)	`decdn_origin_*`	Pulls, bytes, errors by type
Probe	`decdn_probe_*`	Incoming probes, hold slots used/max
Slash-safety	various	Enumerated below

The slash-risk metrics (all mandatory)

These seven metrics map directly to slash-risk conditions. All nodes must expose them.

Metric	Meaning	Alert threshold (recommended)
`decdn_probe_hold_violations_total`	Blob evicted after you promised `has_blob=true`	> 0 in the last hour
`decdn_probe_hold_slots_used`	In-flight probe-hold count	> 80% of `decdn_probe_hold_slots_max`
`decdn_probe_hold_slots_max`	Capacity	—
`decdn_rate_bounds_clamp_events_total`	Your advertised rate was clamped to governance bounds	> 0 ever — investigate immediately
`decdn_blacklist_sync_lag_seconds`	Time since last blacklist poll	> 300 s (5 min)
`decdn_blacklist_version_behind`	How many blacklist versions you’re behind	> 1
`decdn_slash_evidence_exposure_total`	You self-detected a contradiction in your own behavior	> 0 ever — this is the “I am about to be slashed” alert

decdn_slash_evidence_exposure_total is the critical one. The node checks its own behavior against the messages it has signed; if it detects an inconsistency (e.g., it served bytes that differ from what the cache now contains), it increments this counter. A non-zero value means evidence against you already exists and anyone could submit it.
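To make the self-check concrete, here is one way such an audit could look. It is purely illustrative: the node's actual detection logic is not specified on this page, and the `SelfAudit` class, its method names, and the use of SHA-256 digests are all assumptions.

```python
import hashlib


class SelfAudit:
    """Hypothetical self-check: compare bytes served earlier with the cache's current bytes."""

    def __init__(self) -> None:
        self.served: dict[str, str] = {}  # blob hash -> digest of the bytes actually served
        self.slash_evidence_exposure_total = 0  # plays the role of the counter

    def record_served(self, blob_hash: str, data: bytes) -> None:
        """Remember a digest of what was served under this blob hash."""
        self.served[blob_hash] = hashlib.sha256(data).hexdigest()

    def audit(self, blob_hash: str, cached: bytes) -> bool:
        """Return True if consistent; on a mismatch, increment the counter and return False."""
        expected = self.served.get(blob_hash)
        if expected is None:
            return True  # nothing was served under this hash, so nothing can contradict it
        if hashlib.sha256(cached).hexdigest() != expected:
            self.slash_evidence_exposure_total += 1
            return False
        return True
```

The point of the sketch is the shape of the check: the counter only moves when the node's own records contradict each other, which is why any non-zero value deserves immediate attention.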

The `/health` endpoint

GET /health

Returns JSON whose status field is one of ready, degraded, or not_ready:

{
  "status": "ready",
  "version": "0.1.0",
  "peer_count": 128,
  "channels_open": 14,
  "channel_balances_total_usdc": "1234000000",
  "blacklist_sync_state": {
    "version": 42,
    "lag_seconds": 15
  },
  "watchtowers_registered": 3
}

Status semantics:

ready — all seven acceptance criteria (staking & registration) are met.
degraded — running but some non-critical condition is unmet (e.g., only 1 watchtower, should be 2+).
not_ready — missing a mandatory readiness condition (no stake, no gossip subscription, blacklist too far behind).

Only nodes reporting ready should accept paid deliveries. Load balancers and orchestrators should drain degraded and not_ready nodes.
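A load balancer or orchestrator hook can apply this rule to the response body directly. The sketch below assumes the body has already been fetched from GET /health; `should_route` is a hypothetical helper name, and treating an unparseable body as not routable is a conservative choice of this sketch, not documented behavior.

```python
import json


def should_route(health_body: str) -> bool:
    """Return True only when the node reports status "ready"."""
    try:
        doc = json.loads(health_body)
    except json.JSONDecodeError:
        return False  # assumption: an unparseable response means "do not route"
    return isinstance(doc, dict) and doc.get("status") == "ready"
```

For example, `should_route('{"status": "degraded"}')` returns `False`, so a degraded node is drained just like a not_ready one.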

Where to expose them

By default, metrics are served at :9090/metrics (Prometheus format) and health at :9090/health. Both share one port by default; serve them on separate ports if they need different access controls.
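For ad hoc checks outside a full Prometheus setup, the text returned by the metrics endpoint can be read with a few lines of code. This is a deliberately small sketch: `parse_metrics` is a hypothetical helper, it handles only unlabeled samples, and a real deployment would normally rely on Prometheus itself or an official client library.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabeled samples from Prometheus text exposition output.

    Labeled samples (name{...} value) are skipped to keep the sketch short.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue  # malformed line or labeled sample
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # value is not a number; ignore the line
    return samples
```

Feeding it the body of a scrape yields a plain dictionary keyed by metric name, which is enough to compare gauges such as decdn_probe_hold_slots_used against their thresholds.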

Recommended alerting

P1 (immediate response):
- decdn_slash_evidence_exposure_total > 0
- decdn_blacklist_sync_lag_seconds > 600 (10 min)
- decdn_probe_hold_slots_used / decdn_probe_hold_slots_max > 0.95
P2 (response within hours):
- decdn_blacklist_sync_lag_seconds > 300 (5 min)
- decdn_rate_bounds_clamp_events_total > 0
- Watchtower heartbeat failure
P3 (business hours):
- Cache hit rate anomalies
- Peer table under threshold
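The metric-based P1 and P2 conditions above can be expressed as a single classification function. This is a sketch under assumptions: `classify_alerts` is a hypothetical name, the input is a dictionary of current sample values, and missing metrics are treated as 0 here even though a missing mandatory metric is itself worth alerting on. The watchtower heartbeat and the P3 items are left out because this page does not name their metrics.

```python
def classify_alerts(m: dict[str, float]) -> dict[str, list[str]]:
    """Map current metric samples to the P1/P2 conditions from the alerting list."""
    alerts: dict[str, list[str]] = {"P1": [], "P2": []}
    lag = m.get("decdn_blacklist_sync_lag_seconds", 0.0)
    used = m.get("decdn_probe_hold_slots_used", 0.0)
    cap = m.get("decdn_probe_hold_slots_max", 0.0)

    if m.get("decdn_slash_evidence_exposure_total", 0.0) > 0:
        alerts["P1"].append("slash evidence exposure")
    if lag > 600:
        alerts["P1"].append("blacklist sync lag > 10 min")
    elif lag > 300:
        alerts["P2"].append("blacklist sync lag > 5 min")
    if cap > 0 and used / cap > 0.95:  # guard against division by zero
        alerts["P1"].append("probe-hold slots > 95% used")
    if m.get("decdn_rate_bounds_clamp_events_total", 0.0) > 0:
        alerts["P2"].append("rate bounds clamped")
    return alerts
```

Blacklist lag is reported at one tier only: once it crosses the 10-minute P1 line, the 5-minute P2 alert is suppressed to avoid paging twice for the same condition.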

Structured logging

Use tracing with the JSON layer:

[log]
format = "json"
level  = "info"

Standardize on these key log fields: node_id, channel_id, hash, peer_id, offset, nonce. Consistent fields make it tractable to correlate events across the delivery, channel, and gossip subsystems.
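Companion tooling (watchdogs, sidecars, scripts) correlates more easily with node logs if it emits the same field names. The sketch below shows one way to do that with Python's standard logging module; it is not the node's own logging code, and `JsonFormatter` plus the "level" and "message" keys are assumptions of this example.

```python
import json
import logging

# The standard correlation fields listed on this page.
FIELDS = ("node_id", "channel_id", "hash", "peer_id", "offset", "nonce")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the standard correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        out = {"level": record.levelname.lower(), "message": record.getMessage()}
        for field in FIELDS:
            # Fields arrive via the `extra` argument of the logging call.
            value = getattr(record, field, None)
            if value is not None:
                out[field] = value
        return json.dumps(out)
```

Attach the formatter to a handler, then pass fields through `extra`, for example `logger.info("voucher accepted", extra={"node_id": "n1", "channel_id": "c7", "nonce": 42})`. Fields that are not supplied are simply omitted from the output line.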
