OC · Monitoring Ops
Post-deploy observability — uptime, errors, alerts, incidents. v1.2 opens PM incident tickets when alerts fire.
Drop the bundle into .claude/skills/ and Claude Code auto-discovers it on the next session — or point Codex / any MCP agent at the hosted opchain.dev/mcp endpoint.
How you'll use it
Post-deployment observability: uptime monitoring, error tracking, structured logging, alerting pipelines, and incident response runbooks. Sits after oc-deploy-ops in the pipeline — oc-deploy-ops ships it, oc-monitoring-ops watches it. Use for /oc-monitor, "set up monitoring", "error tracking", "uptime check", "alerting", "incident response", "observability", "what's happening in prod", "set up Sentry", "logging strategy", "on-call", "runbook", "SLO", "SLI", "is prod healthy", "why is it slow", "error rate", "status page". Trigger liberally.
Monitoring Ops
On first invocation, read references/orchestrator.md and follow its welcome protocol.
Post-deployment observability skill. Deploy-ops ships the code; oc-monitoring-ops watches it run. Covers five domains: uptime monitoring, error tracking, structured logging, alerting pipelines, and incident response.
This skill does NOT build features or deploy code — it instruments what’s already deployed and establishes the feedback loop that catches problems before users report them.
/oc-monitor — Command Reference
MONITORING OPS COMMANDS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SETUP
  /oc-monitor setup       Guided observability stack setup for a project
  /oc-monitor stack       Recommend monitoring tools for the stack
  /oc-monitor instrument  Add structured logging + error capture to codebase
OBSERVE
  /oc-monitor health      Live health check — hit endpoints, report status
  /oc-monitor errors      Check error tracking service for recent issues
  /oc-monitor uptime      Show uptime status and recent incidents
  /oc-monitor metrics     Key metrics snapshot (latency, error rate, throughput)
  /oc-monitor status      Current monitoring state from checkpoint
ALERT
  /oc-monitor alerts      Design or audit alerting rules
  /oc-monitor oncall      Set up on-call rotation and escalation
  /oc-monitor slo         Define or review SLOs/SLIs/error budgets
RESPOND
  /oc-monitor incident    Start or review an incident response
  /oc-monitor runbook     Generate or update operational runbooks
  /oc-monitor postmortem  Structured post-incident review
REPORT
  /oc-monitor dashboard   Design an ops monitoring dashboard (routes to oc-dash-forge)
  /oc-monitor report      Generate weekly/monthly ops report
  /oc-monitor audit       Full observability maturity assessment
  /oc-monitor compare     Compare two monitoring snapshots (drift detection)
SESSION
  /checkpoint             Show checkpoint status
  /checkpoint show        Display full checkpoint JSON
  /checkpoint reset       Archive and restart
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Type any command to begin. /oc-monitor to see this again.
Session Persistence (Checkpoint Protocol)
Checkpoint: {project-dir}/.checkpoints/oc-monitoring-ops.checkpoint.json
Resume on Start
When any /oc-monitor command is invoked:
- Check for checkpoint
- If exists: show tier, tools, SLO status, active incidents, maturity grade, next action
- Ask: “Continue, restart, or show full checkpoint?”
- On continue: load context, resume from next_actions[0]
What’s Tracked
Monitoring-ops uniquely stores runtime state alongside pipeline progress — the last health check result, active incident count, and SLO budget consumption. This means the checkpoint serves double duty: session persistence AND operational snapshot.
How This Skill Fits the Pipeline
oc-reverse-spec → oc-app-architect → oc-git-ops → oc-deploy-ops → MONITORING-OPS
                                                                        │
                                                          ┌─────────────┤
                                                          │             │
                                                 oc-security-auditor  oc-scale-ops
                                                 (detection/          (perf
                                                  response input)      budgets)
oc-deploy-ops ships it, oc-monitoring-ops watches it. The handoff:
- oc-deploy-ops completes production promotion
- oc-deploy-ops runs health check (basic HTTP 200 verification)
- If oc-monitoring-ops checkpoint exists: oc-monitoring-ops takes over ongoing observation
- If not: oc-deploy-ops suggests /oc-monitor setup for the project
oc-deploy-ops’s health check is a one-shot verification. oc-monitoring-ops provides continuous observation, error aggregation, alerting, and incident coordination.
Cross-Skill Connections
| Skill | Relationship |
|---|---|
| oc-deploy-ops | Upstream. oc-deploy-ops ships → oc-monitoring-ops watches. Shares health check URLs, environment config. |
| oc-security-auditor | Peer. oc-security-auditor’s Pillar 3 (detection/response) maps directly to oc-monitoring-ops’s alerting + incident response. oc-security-auditor defines WHAT to detect; oc-monitoring-ops implements HOW to detect it. |
| oc-scale-ops | Peer. oc-scale-ops sets performance budgets; oc-monitoring-ops enforces them via alerting. Latency SLOs from oc-scale-ops become oc-monitoring-ops alert thresholds. |
| oc-code-auditor | Upstream consumer. oc-code-auditor’s /oc-audit pre-deploy findings can include “missing error handling” — oc-monitoring-ops’s /oc-monitor instrument addresses the gap at the observability layer. |
| oc-app-architect | Upstream. Reads spec for expected behaviors, user flows, and error handling strategy to inform what to monitor. |
| oc-dash-forge | Downstream for visualization. /oc-monitor dashboard routes to oc-dash-forge with an ops archetype context for monitoring UI design. |
Observability Maturity Model
Every project maps to a maturity tier. The setup wizard determines the tier from the project’s scale, sensitivity, and infrastructure.
| Tier | Name | Who | What’s Covered | Example |
|---|---|---|---|---|
| T0 | Bare Minimum | Solo dev, personal app | Health endpoint + basic logging + crash alerts | aidops apps on free tier |
| T1 | Foundations | Small team, internal tool | T0 + error tracking + uptime monitoring + structured logs | PenThreshold |
| T2 | Production | Multi-user product, SLA exists | T1 + SLOs + alerting pipeline + runbooks + dashboards | SaaS MVP |
| T3 | Operational | Revenue-bearing, on-call required | T2 + incident response + post-mortems + distributed tracing | Scaled product |
Default for aidops-scale apps: T0 or T1. Don’t overengineer monitoring for a 2-user app. The setup wizard auto-detects the appropriate tier.
Phase 1: Setup (/oc-monitor setup)
Guided setup that instruments a project for observability. Reads existing config (oc-deploy-ops checkpoint, wrangler.toml, package.json) to avoid asking questions with known answers.
Setup Wizard
Use ask_user_input for remaining unknowns:
- What tier? (auto-detect from oc-scale-ops/oc-deploy-ops context, confirm)
- Error tracking provider? (Sentry, LogRocket, Highlight, BetterStack, or console-only)
- Uptime monitoring? (BetterStack, UptimeRobot, Checkly, or CF Health Checks)
- Alerting channel? (Telegram, Slack, Discord, email, PagerDuty)
- Status page? (BetterStack, Instatus, or none)
Setup Deliverables
| Artifact | Description |
|---|---|
| monitoring/config.md | Monitoring strategy document — what’s monitored, thresholds, tools |
| monitoring/runbooks/ | Operational runbooks (at least: deploy failure, error spike, downtime) |
| src/lib/logger.ts (or equivalent) | Structured logging utility |
| src/middleware/monitoring.ts | Request tracking, error capture, latency measurement |
| .monitoring.json | Tool configuration (URLs, check IDs, alert channels) |
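The structured logging utility in the table above could start from something like the following. This is a minimal sketch, not the pattern from references/instrumentation.md: the function name `createLogger`, the field names (`ts`, `level`, `msg`, `requestId`, `error`), and the injectable `sink` and `now` parameters are all assumptions made for illustration.

```typescript
// Hypothetical sketch of src/lib/logger.ts. All names are illustrative.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogContext {
  requestId?: string;
  [key: string]: unknown;
}

// `sink` receives one JSON line per entry (console.log in production, an
// array push in tests). `now` is injectable so timestamps are testable.
function createLogger(
  sink: (line: string) => void,
  now: () => Date = () => new Date(),
) {
  const emit = (level: LogLevel, msg: string, ctx: LogContext = {}): void => {
    // Core fields are spread last so caller context cannot overwrite them.
    sink(JSON.stringify({ ...ctx, ts: now().toISOString(), level, msg }));
  };
  return {
    debug: (msg: string, ctx?: LogContext) => emit("debug", msg, ctx),
    info: (msg: string, ctx?: LogContext) => emit("info", msg, ctx),
    warn: (msg: string, ctx?: LogContext) => emit("warn", msg, ctx),
    // Errors are flattened to a message string so the line stays valid JSON.
    error: (msg: string, err: unknown, ctx: LogContext = {}) =>
      emit("error", msg, {
        ...ctx,
        error: err instanceof Error ? err.message : String(err),
      }),
  };
}
```

Passing the sink in, rather than calling console.log directly, keeps the same logger usable in a Worker, a Node process, or a unit test.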
Health Endpoint Contract
Every monitored app needs a GET /api/health endpoint. This is the shared contract
between oc-deploy-ops (one-shot check), oc-monitoring-ops (continuous check), and external
uptime monitors.
Response shape:
{
  "status": "ok | degraded | unhealthy",
  "checks": { "database": "ok", "kv": "ok" },
  "version": "1.2.3",
  "timestamp": "2026-04-23T10:00:00Z"
}
Rules:
- Returns 200 when all checks pass, 503 when any check is “down”
- Each dependency (DB, KV, external API) gets its own named check
- Include deployed version for deploy-correlation
- Include per-check latency at T2+ (for oc-scale-ops integration)
Read references/instrumentation.md for full implementation patterns per stack
(Workers, Next.js, FastAPI), including structured logger, request monitoring middleware,
and Sentry transport for Workers.
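The contract above can be sketched as a pure function that turns named dependency checks into the response body and HTTP status. This is an illustration under stated assumptions: the per-check vocabulary (`ok`, `degraded`, `down`) and the mapping from a `degraded` check to an overall `degraded` status with HTTP 200 are guesses, since the rules only specify 200 when all checks pass and 503 when any check is down. The name `buildHealthResult` is hypothetical.

```typescript
// Assumed per-check vocabulary; the source only names "ok" and "down".
type DependencyState = "ok" | "degraded" | "down";
type OverallState = "ok" | "degraded" | "unhealthy";

interface HealthBody {
  status: OverallState;
  checks: Record<string, DependencyState>;
  version: string;
  timestamp: string;
}

// Builds the /api/health payload from already-executed dependency checks.
function buildHealthResult(
  checks: Record<string, DependencyState>,
  version: string,
  now: Date,
): { httpStatus: 200 | 503; body: HealthBody } {
  let anyDown = false;
  let anyDegraded = false;
  for (const key of Object.keys(checks)) {
    if (checks[key] === "down") anyDown = true;
    if (checks[key] === "degraded") anyDegraded = true;
  }
  // Assumption: "down" dominates "degraded", which dominates "ok".
  const overall: OverallState = anyDown
    ? "unhealthy"
    : anyDegraded
      ? "degraded"
      : "ok";
  return {
    httpStatus: anyDown ? 503 : 200,
    body: { status: overall, checks, version, timestamp: now.toISOString() },
  };
}
```

Keeping the function free of I/O means the route handler only has to run the checks and serialize the result, and the status logic can be tested without a live database or KV store.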
Phase 2: Observe
Health Check (/oc-monitor health)
Goes deeper than oc-deploy-ops’s one-shot 200 check:
- HTTP status — hit /api/health, verify 200
- Per-dependency status — parse response body, check each named service (DB, KV, etc.)
- Latency profile — 5 samples, report avg and max
- Certificate expiry — TLS cert days remaining
- Version correlation — compare deployed version to last-known-good
Output uses the Health Check Report format from references/output-templates.md.
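Two of the checks above reduce to small calculations: summarizing the latency samples and counting days until certificate expiry. The helpers below are a sketch; the names `summarizeLatency` and `daysUntilExpiry` are hypothetical, and collecting the samples or reading the certificate is left to the caller.

```typescript
interface LatencySummary {
  samples: number;
  avgMs: number;
  maxMs: number;
}

// Reduces raw latency samples (milliseconds) to the avg and max reported
// by the health check. The check above takes 5 samples.
function summarizeLatency(samplesMs: number[]): LatencySummary {
  if (samplesMs.length === 0) {
    throw new Error("at least one latency sample is required");
  }
  const total = samplesMs.reduce((sum, ms) => sum + ms, 0);
  return {
    samples: samplesMs.length,
    avgMs: total / samplesMs.length,
    maxMs: Math.max(...samplesMs),
  };
}

// Whole days remaining before the certificate's notAfter date.
// A negative result means the certificate has already expired.
function daysUntilExpiry(notAfter: Date, now: Date): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.floor((notAfter.getTime() - now.getTime()) / msPerDay);
}
```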
Error Tracking (/oc-monitor errors)
Checks the configured error tracking service. Adapts to what’s available:
| Service | Method | What’s Reported |
|---|---|---|
| Sentry | Sentry Issues API | Unresolved count, top 10 by recency, severity breakdown |
| Console-only | wrangler tail structured log grep | Recent error-level entries, deduplication by message |
| BetterStack / Highlight | Service-specific API | Unresolved issues + error trends |
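For the console-only row, deduplication by message could look like the sketch below. It assumes each log line is a JSON object with `level`, `msg`, and an ISO-8601 `ts` field; those field names and the function name `groupRecentErrors` are illustrative assumptions, and capturing the lines from wrangler tail is left to the caller.

```typescript
interface ErrorGroup {
  msg: string;
  count: number;
  lastSeen: string;
}

// Groups error-level structured log lines by message, newest group first.
function groupRecentErrors(lines: string[]): ErrorGroup[] {
  const groups: Record<string, ErrorGroup> = Object.create(null);
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip non-JSON output mixed into the log stream
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const entry = parsed as { level?: unknown; msg?: unknown; ts?: unknown };
    if (entry.level !== "error" || typeof entry.msg !== "string") continue;
    const ts = typeof entry.ts === "string" ? entry.ts : "";
    const existing = groups[entry.msg];
    if (existing) {
      existing.count += 1;
      // ISO-8601 UTC timestamps in one format sort lexicographically.
      if (ts > existing.lastSeen) existing.lastSeen = ts;
    } else {
      groups[entry.msg] = { msg: entry.msg, count: 1, lastSeen: ts };
    }
  }
  return Object.keys(groups)
    .map((key) => groups[key])
    .sort((a, b) => (a.lastSeen < b.lastSeen ? 1 : a.lastSeen > b.lastSeen ? -1 : 0));
}
```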
Metrics Snapshot (/oc-monitor metrics)
Aggregate key metrics from available sources:
PRODUCTION METRICS — [project]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Uptime (30d): 99.87% (56m total downtime)
Error rate: 0.3% (12 errors / 4,200 requests today)
p50 latency: 42ms
p95 latency: 187ms
p99 latency: 890ms
Active errors: 3 unresolved (1 HIGH, 2 LOW)
Last deploy: 2 days ago (v1.2.3)
Next cert expiry: 47 days
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Status (/oc-monitor status)
Reads the checkpoint and displays the current monitoring state without hitting any live endpoints. This is a checkpoint read, not a health check.
MONITORING STATUS — [project]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tier: T1 (Foundations)
Maturity: C (50%)
Tools: Sentry + BetterStack + Telegram
SLOs: 99% avail / 1s p95 / 1% errors
Budget: 86% remaining (24 days left in window)
Last health: 2 min ago — ✅ ok (42ms)
Incidents: 0 active
Runbooks: 3 of 6 alert-mapped
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
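Because this command only reads the checkpoint, most of the output can be derived from the skill_state object shown under Checkpoint Integration. The sketch below renders a subset of the lines; the function name `renderStatusLines` and the exact line wording are assumptions. Fields that skill_state does not carry in the example (maturity percentage, budget remaining, alert-mapped runbook total) are left out rather than guessed.

```typescript
// Mirrors the skill_state example in this document.
interface MonitoringSkillState {
  tier: "T0" | "T1" | "T2" | "T3";
  tools: { error_tracking: string; uptime: string; alerting: string; logging: string };
  slos: { availability: string; latency_p95: string; error_rate: string };
  last_health_check: { at: string; status: string; latency_ms: number };
  active_incidents: number;
  runbook_count: number;
  maturity_grade: string;
}

// Tier names from the Observability Maturity Model table.
const TIER_NAMES: Record<MonitoringSkillState["tier"], string> = {
  T0: "Bare Minimum",
  T1: "Foundations",
  T2: "Production",
  T3: "Operational",
};

// Pure formatting: no network calls, only checkpoint data.
function renderStatusLines(s: MonitoringSkillState): string[] {
  const h = s.last_health_check;
  return [
    `Tier: ${s.tier} (${TIER_NAMES[s.tier]})`,
    `Maturity: ${s.maturity_grade}`,
    `Tools: ${s.tools.error_tracking} + ${s.tools.uptime} + ${s.tools.alerting}`,
    `SLOs: ${s.slos.availability} avail / ${s.slos.latency_p95} p95 / ${s.slos.error_rate} errors`,
    `Last health: ${h.at} (${h.status}, ${h.latency_ms}ms)`,
    `Incidents: ${s.active_incidents} active`,
    `Runbooks: ${s.runbook_count}`,
  ];
}
```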
Phase 3: Alert (/oc-monitor alerts)
Alert Design Principles
- Every alert must have an action. If you can’t do anything when it fires, it’s noise, not an alert. Remove it.
- Two thresholds: warn and critical. Warn = investigate soon. Critical = act now.
- Alert on symptoms, not causes. “Error rate > 5%” is better than “database connection pool exhausted” — the symptom catches more failure modes.
- Pair every alert with a runbook. The alert says “something’s wrong.” The runbook says “here’s what to do.”
Standard Alert Set (T1+)
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| Endpoint down | Health check fails 2 consecutive times | CRITICAL | runbooks/downtime.md |
| High error rate | >5% of requests return 5xx in 5-min window | CRITICAL | runbooks/error-spike.md |
| Elevated error rate | >1% of requests return 5xx in 15-min window | WARN | runbooks/error-spike.md |
| High latency | p95 > 2s for 5 minutes | WARN | runbooks/latency.md |
| Certificate expiring | SSL cert expires within 14 days | WARN | runbooks/cert-renewal.md |
| Error budget burn | >50% of monthly error budget consumed | WARN | runbooks/error-budget.md |
| Deploy failure | CI/CD pipeline fails on production | CRITICAL | runbooks/deploy-failure.md |
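Three rows of the table above translate directly into threshold checks. The sketch below evaluates the two error-rate rules and the consecutive-failure rule; the type and function names are hypothetical, and how request counts and health results are collected is outside its scope.

```typescript
type AlertSeverity = "CRITICAL" | "WARN";

interface FiredAlert {
  alert: string;
  severity: AlertSeverity;
  runbook: string;
}

interface WindowStats {
  requests: number;
  serverErrors: number; // count of 5xx responses in the window
}

function errorRatePercent(w: WindowStats): number {
  return w.requests === 0 ? 0 : (w.serverErrors / w.requests) * 100;
}

// CRITICAL takes precedence over WARN so one incident yields one alert.
function evaluateErrorRate(
  fiveMin: WindowStats,
  fifteenMin: WindowStats,
): FiredAlert | null {
  if (errorRatePercent(fiveMin) > 5) {
    return { alert: "High error rate", severity: "CRITICAL", runbook: "runbooks/error-spike.md" };
  }
  if (errorRatePercent(fifteenMin) > 1) {
    return { alert: "Elevated error rate", severity: "WARN", runbook: "runbooks/error-spike.md" };
  }
  return null;
}

// `results` is ordered oldest to newest; true means the check passed.
function evaluateEndpointDown(results: boolean[]): FiredAlert | null {
  const n = results.length;
  if (n >= 2 && !results[n - 1] && !results[n - 2]) {
    return { alert: "Endpoint down", severity: "CRITICAL", runbook: "runbooks/downtime.md" };
  }
  return null;
}
```

Requiring two consecutive failures before paging trades a slightly slower detection time for fewer false alarms from a single dropped probe.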
SLO Definition (/oc-monitor slo)
Service Level Objectives quantify “good enough.” They prevent over-engineering reliability for apps that don’t need five nines.
| Metric (SLI) | T0 Target | T1 Target | T2 Target | T3 Target |
|---|---|---|---|---|
| Availability | 95% (36h down/mo) | 99% (7h down/mo) | 99.9% (43m down/mo) | 99.95% (22m down/mo) |
| Latency (p95) | <2s | <1s | <500ms | <200ms |
| Error rate | <5% | <1% | <0.1% | <0.05% |
| Data freshness | Best effort | <5 min | <1 min | <30s |
Error budget = 100% − SLO. If your SLO is 99% availability, you have 7.3 hours of downtime per month before you violate it. Spend that budget on deploys and feature velocity, not perfection.
Read references/alerting-patterns.md for channel setup (Telegram, Slack, PagerDuty),
alert transport code, and SLO burn-rate alerting math.
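The error budget arithmetic above is easy to check in code. The sketch below assumes a fixed window measured in days; with a 30-day window a 99% SLO yields 432 minutes (7.2 hours), and the 7.3-hour figure corresponds to an average month of about 730 hours. The function names are hypothetical, and this is plain budget accounting, not the burn-rate alerting math from references/alerting-patterns.md.

```typescript
// Allowed downtime, in minutes, for an availability SLO over a window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  if (!(sloPercent > 0 && sloPercent < 100)) {
    throw new RangeError("sloPercent must be between 0 and 100, exclusive");
  }
  return ((100 - sloPercent) / 100) * windowDays * 24 * 60;
}

// Fraction of the budget already spent (0.5 means half is gone).
// Values above 1 mean the SLO has been violated for this window.
function budgetConsumedFraction(
  downtimeMinutes: number,
  sloPercent: number,
  windowDays: number,
): number {
  return downtimeMinutes / errorBudgetMinutes(sloPercent, windowDays);
}
```

With these helpers, the "Error budget burn" alert from the standard alert set amounts to checking whether `budgetConsumedFraction(...)` exceeds 0.5.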
Phase 4: Respond (/oc-monitor incident)
Read references/incident-response.md for full incident template, runbook template,
and post-mortem template.
Incident Lifecycle
ALERT FIRES
│
▼
ACKNOWLEDGE (who's looking at it?)
│
▼
DIAGNOSE (what's broken? use runbook decision tree)
│
├──► QUICK FIX (restart, rollback, scale) ─────┐
│ │
├──► FIX FORWARD (patch + deploy) ──────────────┤
│ │
└──► ESCALATE (beyond current expertise) ───────┤
▼
MITIGATE (service restored)
│
▼
RESOLVE (root cause addressed)
│
▼
POST-MORTEM (blameless, action items)
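The lifecycle diagram can be read as a small state machine. The transition table below is one possible encoding of it; the stage identifiers and the `canAdvance` helper are hypothetical, and the table allows only the forward paths drawn in the diagram.

```typescript
type IncidentStage =
  | "alert_fired"
  | "acknowledged"
  | "diagnosing"
  | "quick_fix"
  | "fix_forward"
  | "escalated"
  | "mitigated"
  | "resolved"
  | "post_mortem";

// Each stage maps to the stages the diagram lets it move to next.
const NEXT_STAGES: Record<IncidentStage, IncidentStage[]> = {
  alert_fired: ["acknowledged"],
  acknowledged: ["diagnosing"],
  diagnosing: ["quick_fix", "fix_forward", "escalated"],
  quick_fix: ["mitigated"],
  fix_forward: ["mitigated"],
  escalated: ["mitigated"],
  mitigated: ["resolved"],
  resolved: ["post_mortem"],
  post_mortem: [],
};

function canAdvance(from: IncidentStage, to: IncidentStage): boolean {
  return NEXT_STAGES[from].indexOf(to) !== -1;
}
```

Keeping mitigated and resolved as separate stages reflects the diagram's distinction between restoring service and addressing the root cause.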
Key Files
| File | Purpose |
|---|---|
| monitoring/runbooks/downtime.md | Endpoint unreachable — diagnosis + recovery |
| monitoring/runbooks/error-spike.md | Error rate elevated — triage + mitigation |
| monitoring/runbooks/deploy-failure.md | Deploy broke prod — rollback procedure |
| monitoring/runbooks/latency.md | Response times degraded — diagnosis |
| monitoring/incidents/ | Incident logs (one file per incident) |
| monitoring/postmortems/ | Post-incident reviews |
Observability Audit (/oc-monitor audit)
Full maturity assessment. Scores each domain and recommends the next upgrade.
Audit Domains
| Domain | Weight | What it measures |
|---|---|---|
| Logging | 20% | Structured logs, context propagation, retention, searchability |
| Error Tracking | 20% | Capture rate, grouping, alerting, triage workflow |
| Uptime | 20% | Monitoring coverage, check frequency, historical availability |
| Alerting | 20% | Alert quality (action per alert), channel reliability, runbook coverage |
| Incident Response | 20% | Runbooks exist, post-mortem practice, on-call defined |
Tier-adaptive grading: At T0, incident response and alerting carry reduced weight (a solo dev doesn’t need PagerDuty). At T2+, all five domains are equally critical. The audit grades against the project’s declared tier, not against T3 universally.
Maturity Scoring
| Score | Meaning |
|---|---|
| A (90%+) | Production-ready operations — you’d sleep through deploys |
| B (70-89%) | Solid foundations — most issues caught automatically |
| C (50-69%) | Basic coverage — you find out about problems, eventually |
| D (30-49%) | Reactive — users report problems before you see them |
| F (<30%) | No observability — flying blind |
See references/output-templates.md for full audit report format, ops report template,
and metrics snapshot format.
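The scoring above is a weighted average followed by a threshold lookup. The sketch below uses the equal 20% weights from the Audit Domains table and accepts alternative weights as a parameter, since the reduced T0 weights are not specified here. The identifiers are hypothetical, and domain scores are assumed to be on a 0 to 100 scale.

```typescript
const AUDIT_DOMAINS = [
  "logging",
  "errorTracking",
  "uptime",
  "alerting",
  "incidentResponse",
] as const;
type AuditDomain = typeof AUDIT_DOMAINS[number];

// Integer percentage weights avoid floating-point drift at grade boundaries.
const EQUAL_WEIGHTS: Record<AuditDomain, number> = {
  logging: 20,
  errorTracking: 20,
  uptime: 20,
  alerting: 20,
  incidentResponse: 20,
};

// Weighted average of 0-100 domain scores, normalized by total weight so
// tier-adaptive weights do not need to sum to 100.
function maturityScore(
  scores: Record<AuditDomain, number>,
  weights: Record<AuditDomain, number> = EQUAL_WEIGHTS,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const domain of AUDIT_DOMAINS) {
    weighted += scores[domain] * weights[domain];
    totalWeight += weights[domain];
  }
  if (totalWeight <= 0) throw new RangeError("weights must sum to a positive number");
  return weighted / totalWeight;
}

// Thresholds from the Maturity Scoring table.
function maturityGrade(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 90) return "A";
  if (score >= 70) return "B";
  if (score >= 50) return "C";
  if (score >= 30) return "D";
  return "F";
}
```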
Dashboard Routing (/oc-monitor dashboard)
When the user wants a monitoring dashboard, route to oc-dash-forge with ops archetype pre-selected:
Detected monitoring dashboard request. Routing to oc-dash-forge with ops archetype.
oc-dash-forge will produce:
- High-density ops layout (dark mode, tight tiles)
- Real-time KPI tiles (error rate, latency, throughput, uptime)
- Incident table, event feed, throughput chart
- Working React prototype with mock data
Proceeding to oc-dash-forge with ops context...
Package the following context for oc-dash-forge:
- Archetype: ops (pre-selected, skip interview)
- Data sources: health endpoint, error tracking, uptime service, structured logs
- KPIs: error rate, p95 latency, uptime %, active incidents, deploy recency
- Refresh cadence: near-real-time (30s-2min)
Ops Report (/oc-monitor report)
Generate a periodic operations summary covering availability, performance, errors, SLO status, deploys, and incidents. Cadence depends on tier: T1 = monthly, T2+ = weekly.
The report pulls from: oc-monitoring-ops checkpoint (SLO budget, incident count), oc-deploy-ops checkpoint (deploy count, rollbacks), and error tracking service (error counts, top issues).
See references/output-templates.md for the full report template.
Snapshot Comparison (/oc-monitor compare)
Compares two monitoring snapshots to track observability posture over time. Mirrors
oc-security-auditor’s /oc-security compare pattern.
Input: two checkpoint timestamps or dates. If one argument, compares against current state.
Diffs: maturity grade, domain scores, SLO budget consumption, incident count, runbook
coverage. Highlights regressions prominently. Uses the comparison format from
references/output-templates.md.
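The diff described above can be sketched as a field-by-field comparison that also records which direction counts as a regression. The snapshot shape below is an assumption that loosely follows the listed fields; `PostureSnapshot`, `PostureChange`, and `diffSnapshots` are hypothetical names, and the real comparison format lives in references/output-templates.md.

```typescript
interface PostureSnapshot {
  maturityGrade: "A" | "B" | "C" | "D" | "F";
  domainScores: Record<string, number>;
  sloBudgetRemainingPct: number;
  activeIncidents: number;
  runbookCount: number;
}

interface PostureChange {
  field: string;
  before: string | number;
  after: string | number;
  regression: boolean;
}

const GRADE_ORDER = "FDCBA"; // index is the rank; higher is better

function diffSnapshots(before: PostureSnapshot, after: PostureSnapshot): PostureChange[] {
  const changes: PostureChange[] = [];
  // higherIsBetter decides which direction of change is a regression.
  const compareNumber = (field: string, b: number, a: number, higherIsBetter: boolean): void => {
    if (a === b) return;
    changes.push({ field, before: b, after: a, regression: higherIsBetter ? a < b : a > b });
  };
  if (before.maturityGrade !== after.maturityGrade) {
    changes.push({
      field: "maturityGrade",
      before: before.maturityGrade,
      after: after.maturityGrade,
      regression:
        GRADE_ORDER.indexOf(after.maturityGrade) < GRADE_ORDER.indexOf(before.maturityGrade),
    });
  }
  // Only domains present in both snapshots are compared.
  for (const domain of Object.keys(after.domainScores)) {
    if (domain in before.domainScores) {
      compareNumber(`domainScores.${domain}`, before.domainScores[domain], after.domainScores[domain], true);
    }
  }
  compareNumber("sloBudgetRemainingPct", before.sloBudgetRemainingPct, after.sloBudgetRemainingPct, true);
  compareNumber("activeIncidents", before.activeIncidents, after.activeIncidents, false);
  compareNumber("runbookCount", before.runbookCount, after.runbookCount, true);
  return changes;
}
```

Tagging each change with a `regression` flag lets the report sort regressions to the top, as the section asks.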
File Structure
project-dir/
├── .checkpoints/
│ └── oc-monitoring-ops.checkpoint.json
├── .monitoring.json # Tool config (URLs, check IDs, channels)
├── monitoring/
│ ├── config.md # Strategy: what's monitored, thresholds, tools
│ ├── runbooks/
│ │ ├── downtime.md
│ │ ├── error-spike.md
│ │ ├── deploy-failure.md
│ │ ├── latency.md
│ │ ├── cert-renewal.md
│ │ └── error-budget.md
│ ├── incidents/
│ │ └── INC-YYYY-MM-DD-NNN.md
│ └── postmortems/
│ └── PM-YYYY-MM-DD-NNN.md
├── src/
│ ├── lib/
│ │ ├── logger.ts # Structured logging utility
│ │ ├── alerting.ts # Alert dispatch (Telegram/Slack/PagerDuty)
│ │ └── sentry.ts # Error tracking transport (Workers)
│ └── middleware/
│ └── monitoring.ts # Request tracking, timing, error capture
Checkpoint Integration
Location
{project-dir}/.checkpoints/oc-monitoring-ops.checkpoint.json
When to Write
| Event | What to Save |
|---|---|
| Setup complete | Tier, tools configured, instrumentation files created |
| Health check run | Status, latency, individual check results |
| Alert rules defined | Alert inventory, channel config |
| SLOs defined | SLI definitions, targets, error budget |
| Incident started | Incident ID, timeline, status |
| Incident resolved | Resolution, post-mortem reference |
| Audit run | Domain scores, maturity grade, recommendations |
| Runbook created | Runbook inventory, coverage map |
progress_table
[
  { "id": "setup", "label": "Observability setup", "status": "not_started" },
  { "id": "logging", "label": "Structured logging", "status": "not_started" },
  { "id": "errors", "label": "Error tracking", "status": "not_started" },
  { "id": "uptime", "label": "Uptime monitoring", "status": "not_started" },
  { "id": "alerts", "label": "Alerting pipeline", "status": "not_started" },
  { "id": "slos", "label": "SLO definition", "status": "not_started" },
  { "id": "runbooks", "label": "Operational runbooks", "status": "not_started" },
  { "id": "dashboard", "label": "Monitoring dashboard", "status": "not_started" }
]
skill_state
{
  "tier": "T1",
  "tools": {
    "error_tracking": "sentry",
    "uptime": "betterstack",
    "alerting": "telegram",
    "logging": "structured-json"
  },
  "slos": {
    "availability": "99%",
    "latency_p95": "1s",
    "error_rate": "1%"
  },
  "last_health_check": {
    "at": "2026-04-22T10:00:00Z",
    "status": "ok",
    "latency_ms": 42
  },
  "active_incidents": 0,
  "runbook_count": 3,
  "maturity_grade": "C"
}
Cross-Skill Reads
| Reads from | Why |
|---|---|
| oc-deploy-ops | Health check URLs, environment config, deploy history |
| oc-security-auditor | Detection/response requirements → what to monitor for |
| oc-scale-ops | Performance budgets → SLO/alert thresholds |
| oc-app-architect | Spec, error handling strategy → instrumentation targets |
| oc-code-auditor | Error handling gaps → logging instrumentation needs |
| Read by | Why |
|---|---|
| oc-deploy-ops | Health status → deploy confidence, post-deploy verification |
| oc-security-auditor | Detection/response maturity → Pillar 3 input |
| oc-scale-ops | Latency/error metrics → capacity planning data |
| oc-orchestrator | Active incidents, maturity grade → project health |
Tool Recommendations by Stack
Cloudflare Workers (aidops default)
| Domain | Free Tier | Paid | Notes |
|---|---|---|---|
| Logging | console.log + wrangler tail | Logpush → R2/S3 | Workers logs auto-available |
| Error tracking | Sentry (5K events/mo free) | Sentry Team ($26/mo) | Lightweight transport for Workers |
| Uptime | BetterStack (5 monitors free) | BetterStack ($24/mo) | Also does status pages |
| Alerting | Telegram bot (free) | PagerDuty | Telegram for solo; PagerDuty for teams |
| Dashboards | CF Analytics (built-in) | Grafana Cloud (free tier) | CF analytics covers most T0-T1 needs |
Vercel / Next.js
| Domain | Free Tier | Paid |
|---|---|---|
| Logging | Vercel Logs | Axiom (Vercel integration) |
| Error tracking | Sentry | Sentry |
| Uptime | BetterStack | Checkly |
| Alerting | Vercel Alerts | PagerDuty |
General (any stack)
| Domain | Recommended |
|---|---|
| Logging | Structured JSON → centralized log service |
| Error tracking | Sentry (dominant, best-in-class grouping) |
| Uptime | BetterStack or UptimeRobot |
| Alerting | Start with Telegram/Slack, graduate to PagerDuty |
| Status page | BetterStack (free tier includes status page) |
PM-Tool MCP Integration (v1.3+)
oc-monitoring-ops opens incident tickets in the PM tool when alerts fire, and back-references them through resolution.
The runtime contract — concrete tool names, retry policy, idempotency
markers, the pm_deferred_actions[] schema, and the extended state
vocabulary (resolved-pending-postmortem) — lives in
oc-integrations-engineer/references/pm-mcp-protocol.md.
All MCP calls below honour that contract; this section says only how
oc-monitoring-ops shapes incidents and per-event updates.
On alert fire
1. Look up the alert in the runbook registry (configured per alert id; see runbooks/).
2. Compose incident description, prefixed with the idempotency marker per protocol §3:
   <!-- opchain:oc-monitoring-ops:incident-fired:<alert-event-id> -->
   Alert: {alert-name} ({severity})
   Fired at: {iso-timestamp}
   Service: {service}
   Symptoms: {first-three-symptoms-from-alert-payload}
   Runbook: {runbook-url}
   On-call: {on-call-engineer-from-pagerduty}
   Recent deploys: {list-of-deploys-in-last-2h-from-deploy-ops-checkpoint}
3. Pre-create check: call the registry-resolved list_issues tool (Linear: mcp__claude_ai_Linear__list_issues; GitHub: mcp__mcp-server-github__list_issues) filtered to the configured project + the incident issue type from pm.yaml.issue_types, description-text query for the marker. If a match exists, reuse the existing incident id (mid-burst alert dedupe).
4. Otherwise call the registry-resolved create_issue tool (Linear: mcp__claude_ai_Linear__save_issue with no id; GitHub: mcp__mcp-server-github__issue_write action=create) with:
   - issue_type from pm.yaml.issue_types.incident (default “Incident”).
   - priority from alert severity mapping.
   - labels from runbook (incident, service:<name>, severity:<level>), merged with pm.yaml.labels_default.
   - parent / blocked-by relation to the most recent deploy ticket if one is open (likely culprit) — read from oc-deploy-ops.checkpoint.json skill_state.pm.deploy_tickets[].
5. Record incident id in oc-monitoring-ops.checkpoint.json skill_state.pm.incidents[] with the correlating Sentry / PagerDuty event id.
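The marker and pre-create check in the steps above can be sketched as pure string handling over whatever the list_issues tool returns. This is an illustration only: the `PmIssue` and `IncidentAlert` shapes and the function names are hypothetical, no MCP call is made, and the description covers only a subset of the template fields. The authoritative contract is in oc-integrations-engineer/references/pm-mcp-protocol.md.

```typescript
// Assumed minimal shape of an issue returned by the list_issues tool.
interface PmIssue {
  id: string;
  description: string;
}

// Assumed alert payload; field names are illustrative.
interface IncidentAlert {
  eventId: string;
  alertName: string;
  severity: string;
  firedAt: string;
  service: string;
  runbookUrl: string;
}

function incidentMarker(alertEventId: string): string {
  return `<!-- opchain:oc-monitoring-ops:incident-fired:${alertEventId} -->`;
}

// The marker comes first so a description-text query can find it.
function composeIncidentDescription(a: IncidentAlert): string {
  return [
    incidentMarker(a.eventId),
    `Alert: ${a.alertName} (${a.severity})`,
    `Fired at: ${a.firedAt}`,
    `Service: ${a.service}`,
    `Runbook: ${a.runbookUrl}`,
  ].join("\n");
}

// Pre-create check: reuse an incident whose description carries the marker.
function findExistingIncident(
  issues: PmIssue[],
  alertEventId: string,
): PmIssue | undefined {
  const marker = incidentMarker(alertEventId);
  for (const issue of issues) {
    if (issue.description.indexOf(marker) !== -1) return issue;
  }
  return undefined;
}
```

Because the marker ends with the closing comment delimiter, an event id such as `evt-1` does not accidentally match a ticket created for `evt-12`.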
Per-event updates during the incident
Each row uses a unique idempotency marker. Pre-write check via
list_comments (Linear) or issue_read (GitHub) before every
add_comment. State strings come from pm.yaml.states /
pm.yaml.states.extended — never hard-coded.
| Event | Marker | Action |
|---|---|---|
| Alert auto-resolves (back to baseline) | <!-- opchain:oc-monitoring-ops:auto-resolved:<incident-id> --> | add_comment: “Auto-resolved — {duration}”; transition → resolved-pending-postmortem (resolved from pm.yaml.states.extended). Do not close. |
| Engineer ack via PagerDuty | <!-- opchain:oc-monitoring-ops:acked:<incident-id>:<engineer-id> --> | add_comment: “{engineer} acknowledged”; transition → in_progress. |
| Status page update | <!-- opchain:oc-monitoring-ops:status-update:<incident-id>:<update-id> --> | add_comment mirroring the public update. |
| Postmortem published | <!-- opchain:oc-monitoring-ops:postmortem:<incident-id> --> | add_comment with link; transition → done. |
Postmortem back-reference
When the postmortem is written, oc-monitoring-ops appends an
action-item sub-ticket per remediation item, parent = the incident
ticket. Each remediation sub-ticket carries marker
<!-- opchain:oc-monitoring-ops:remediation:<incident-id>:<item-n> -->
in its description and is created via the create_issue tool with
the pre-create check pattern above. Each remediation sub-ticket is
assigned to the owning team’s default assignee (from
pm.yaml.remediation_owners map), with a target close date.
Alert hygiene
If an alert fires more than N times in 24h (default 5; configurable
per alert), oc-monitoring-ops adds a noisy-alert label (per protocol
Appendix A — labels, not state) to the incident and surfaces a
tuning recommendation rather than spamming new tickets — same
incident gets new comments, not new tickets. The pre-create check
in step 3 above ensures this naturally for in-window alerts.
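The noisy-alert rule above is a count over a sliding 24-hour window. A minimal sketch, assuming fire times are kept as epoch milliseconds; the function name `isNoisyAlert` is hypothetical:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// True when the alert fired more than `maxFires` times in the 24 hours
// ending at `nowMs`. The default of 5 matches the documented default N.
function isNoisyAlert(
  fireTimesMs: number[],
  nowMs: number,
  maxFires: number = 5,
): boolean {
  const windowStart = nowMs - DAY_MS;
  const recent = fireTimesMs.filter((t) => t > windowStart && t <= nowMs);
  return recent.length > maxFires;
}
```

The comparison is strict, so exactly N fires in the window does not trigger the noisy-alert label; only the next fire does.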
/oc-monitor --retry-pm flush
Invokes the protocol §4 flush against
oc-monitoring-ops.checkpoint.json pm_deferred_actions[]. Filter to
skill: "oc-monitoring-ops" and retriable: true. Critical alerts
should NOT depend on flush succeeding — Telegram / PagerDuty fire
regardless; the flush is reconciliation only.
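The filter step of the flush can be sketched as follows. Only the `skill` and `retriable` fields are taken from the text above; the rest of the pm_deferred_actions[] schema is defined in the protocol document and is left open here. The interface and function names are hypothetical.

```typescript
// Only `skill` and `retriable` are known from this document; other
// fields of the deferred-action schema are left as an open index.
interface DeferredPmAction {
  skill: string;
  retriable: boolean;
  [key: string]: unknown;
}

// Selects the deferred actions this skill is allowed to retry.
function selectFlushable(actions: DeferredPmAction[]): DeferredPmAction[] {
  return actions.filter(
    (a) => a.skill === "oc-monitoring-ops" && a.retriable === true,
  );
}
```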
Failure modes
- MCP unavailable → alert fires through Telegram / PagerDuty as always; intended PM-MCP write is deferred per protocol §4. Operator visibility is unaffected.
- PM provider rate-limits during a major incident burst → batch into a single rollup ticket via the marker dedupe in step 3 (the same alert-event-id naturally folds into one incident); separate comments per alert event with marker <!-- opchain:oc-monitoring-ops:burst-event:<incident-id>:<event-n> -->.
- 403 on incident creation → defer with retriable: false; surface to the on-call channel as a side message; never silently widen scope.
Principles
- Monitor symptoms, not implementations. Alert on error rate, not on which specific database query failed. Symptoms catch failures you didn’t predict.
- Every alert needs a runbook. An alert without instructions is just noise.
- Error budgets over perfection. 99% uptime is fine for aidops-scale apps. That’s 7 hours of acceptable downtime per month. Use that budget for shipping.
- Structured logs are non-negotiable. console.log("error") tells you nothing in production. JSON with request ID, user context, and timestamps tells you everything.
- Proportional investment. T0 for personal apps, T3 for revenue-bearing products. Don’t build PagerDuty rotation for a 2-user fitness tracker.
- Post-mortems are investments, not punishments. Every incident you learn from is one that won’t repeat. Blameless or don’t bother.
- Observability is continuous. Setup is phase 1, not the finish line. Audit quarterly, refine alerts monthly, run health checks daily.
Use OC · Monitoring Ops in your project
Drop the SKILL.md into .claude/skills/ or
.codex/skills/, download the bundle, or reach
it over the hosted MCP endpoint.