
OC · Monitoring Ops

Post-deploy observability — uptime, errors, alerts, incidents. v1.2 opens PM incident tickets when alerts fire.

build v1.5.0
Commands
/oc-monitor, /oc-monitor setup, /oc-monitor stack, /oc-monitor instrument, /oc-monitor health, /oc-monitor errors, /oc-monitor uptime, /oc-monitor metrics, /oc-monitor alerts, /oc-monitor oncall, /oc-monitor slo, /oc-monitor incident, /oc-monitor runbook, /oc-monitor postmortem, /oc-monitor dashboard, /oc-monitor report, /oc-monitor audit, /oc-monitor compare, /oc-monitor status
Pipeline phase

build

Get this skill

Drop the bundle into .claude/skills/ and Claude Code auto-discovers it on the next session — or point Codex / any MCP agent at the hosted opchain.dev/mcp endpoint.

How you'll use it

Post-deployment observability: uptime monitoring, error tracking, structured logging, alerting pipelines, and incident response runbooks. Sits after oc-deploy-ops in the pipeline — oc-deploy-ops ships it, oc-monitoring-ops watches it. Use for /oc-monitor, "set up monitoring", "error tracking", "uptime check", "alerting", "incident response", "observability", "what's happening in prod", "set up Sentry", "logging strategy", "on-call", "runbook", "SLO", "SLI", "is prod healthy", "why is it slow", "error rate", "status page". Trigger liberally.

Trigger with natural language or a slash command:

/oc-monitor, /oc-monitor setup, /oc-monitor stack, /oc-monitor instrument, /oc-monitor health, /oc-monitor errors, and 13 more
Below is the file Claude reads on invocation. It's written in the model's voice — "read this", "do that" — not a user guide. The How you'll use it section above is the one for you.

    Monitoring Ops

    On first invocation, read references/orchestrator.md and follow its welcome protocol.

    Post-deployment observability skill. oc-deploy-ops ships the code; oc-monitoring-ops watches it run. Covers five domains: uptime monitoring, error tracking, structured logging, alerting pipelines, and incident response.

    This skill does NOT build features or deploy code — it instruments what’s already deployed and establishes the feedback loop that catches problems before users report them.

    /oc-monitor — Command Reference

    MONITORING OPS COMMANDS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
      SETUP
      /oc-monitor setup         Guided observability stack setup for a project
      /oc-monitor stack         Recommend monitoring tools for the stack
      /oc-monitor instrument    Add structured logging + error capture to codebase
    
      OBSERVE
      /oc-monitor health        Live health check — hit endpoints, report status
      /oc-monitor errors        Check error tracking service for recent issues
      /oc-monitor uptime        Show uptime status and recent incidents
      /oc-monitor metrics       Key metrics snapshot (latency, error rate, throughput)
      /oc-monitor status        Current monitoring state from checkpoint
    
      ALERT
      /oc-monitor alerts        Design or audit alerting rules
      /oc-monitor oncall        Set up on-call rotation and escalation
      /oc-monitor slo           Define or review SLOs/SLIs/error budgets
    
      RESPOND
      /oc-monitor incident      Start or review an incident response
      /oc-monitor runbook       Generate or update operational runbooks
      /oc-monitor postmortem    Structured post-incident review
    
      REPORT
      /oc-monitor dashboard     Design an ops monitoring dashboard (routes to oc-dash-forge)
      /oc-monitor report        Generate weekly/monthly ops report
      /oc-monitor audit         Full observability maturity assessment
      /oc-monitor compare       Compare two monitoring snapshots (drift detection)
    
      SESSION
      /checkpoint            Show checkpoint status
      /checkpoint show       Display full checkpoint JSON
      /checkpoint reset      Archive and restart
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
      Type any command to begin. /oc-monitor to see this again.

    Session Persistence (Checkpoint Protocol)

    Checkpoint: {project-dir}/.checkpoints/oc-monitoring-ops.checkpoint.json

    Resume on Start

    When any /oc-monitor command is invoked:

    1. Check for checkpoint
    2. If exists: show tier, tools, SLO status, active incidents, maturity grade, next action
    3. Ask: “Continue, restart, or show full checkpoint?”
    4. On continue: load context, resume from next_actions[0]
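
    The resume steps above can be sketched as a pure function over an already-parsed checkpoint. This is a minimal sketch, not the skill's actual implementation: the Checkpoint shape, the top-level position of next_actions, and the fallback message for a missing checkpoint are all assumptions made for illustration.

```typescript
// Hypothetical subset of the checkpoint: only what the resume summary needs.
interface Checkpoint {
  skill_state: {
    tier: string;
    tools: Record<string, string>;
    slos: Record<string, string>;
    active_incidents: number;
    maturity_grade: string;
  };
  next_actions: string[]; // assumed to sit at the top level
}

// Builds the lines shown to the user before asking
// "Continue, restart, or show full checkpoint?".
function resumeSummary(cp: Checkpoint | null): string[] {
  if (cp === null) {
    // Assumption: with no checkpoint there is nothing to resume.
    return ["No checkpoint found. Suggest /oc-monitor setup."];
  }
  const s = cp.skill_state;
  const tools = Object.keys(s.tools).map((k) => s.tools[k]);
  const slos = Object.keys(s.slos).map((k) => `${k}=${s.slos[k]}`);
  return [
    `Tier: ${s.tier}`,
    `Tools: ${tools.join(" + ")}`,
    `SLOs: ${slos.join(", ")}`,
    `Active incidents: ${s.active_incidents}`,
    `Maturity: ${s.maturity_grade}`,
    `Next action: ${cp.next_actions[0] ?? "none recorded"}`,
  ];
}
```

    Keeping the summary separate from file I/O lets the same function serve both the resume prompt and /oc-monitor status.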

    What’s Tracked

    oc-monitoring-ops uniquely stores runtime state alongside pipeline progress: the last health check result, the active incident count, and SLO budget consumption. The checkpoint therefore serves double duty as session persistence and operational snapshot.


    How This Skill Fits the Pipeline

    oc-reverse-spec → oc-app-architect → oc-git-ops → oc-deploy-ops → MONITORING-OPS
                                                                            │
                                                      ┌─────────────────────┤
                                                      │                     │
                                             oc-security-auditor      oc-scale-ops
                                             (detection/              (perf
                                              response input)          budgets)

    oc-deploy-ops ships it, oc-monitoring-ops watches it. The handoff:

    1. oc-deploy-ops completes production promotion
    2. oc-deploy-ops runs health check (basic HTTP 200 verification)
    3. If oc-monitoring-ops checkpoint exists: oc-monitoring-ops takes over ongoing observation
    4. If not: oc-deploy-ops suggests /oc-monitor setup for the project

    oc-deploy-ops’s health check is a one-shot verification. oc-monitoring-ops provides continuous observation, error aggregation, alerting, and incident coordination.

    Cross-Skill Connections

    | Skill | Relationship |
    | --- | --- |
    | oc-deploy-ops | Upstream. oc-deploy-ops ships → oc-monitoring-ops watches. Shares health check URLs, environment config. |
    | oc-security-auditor | Peer. oc-security-auditor’s Pillar 3 (detection/response) maps directly to oc-monitoring-ops’s alerting + incident response. oc-security-auditor defines WHAT to detect; oc-monitoring-ops implements HOW to detect it. |
    | oc-scale-ops | Peer. oc-scale-ops sets performance budgets; oc-monitoring-ops enforces them via alerting. Latency SLOs from oc-scale-ops become oc-monitoring-ops alert thresholds. |
    | oc-code-auditor | Upstream consumer. oc-code-auditor’s /oc-audit pre-deploy findings can include “missing error handling”; oc-monitoring-ops’s /oc-monitor instrument addresses the gap at the observability layer. |
    | oc-app-architect | Upstream. Reads spec for expected behaviors, user flows, and error handling strategy to inform what to monitor. |
    | oc-dash-forge | Downstream for visualization. /oc-monitor dashboard routes to oc-dash-forge with an ops archetype context for monitoring UI design. |

    Observability Maturity Model

    Every project maps to a maturity tier. The setup wizard determines the tier from the project’s scale, sensitivity, and infrastructure.

    | Tier | Name | Who | What’s Covered | Example |
    | --- | --- | --- | --- | --- |
    | T0 | Bare Minimum | Solo dev, personal app | Health endpoint + basic logging + crash alerts | aidops apps on free tier |
    | T1 | Foundations | Small team, internal tool | T0 + error tracking + uptime monitoring + structured logs | PenThreshold |
    | T2 | Production | Multi-user product, SLA exists | T1 + SLOs + alerting pipeline + runbooks + dashboards | SaaS MVP |
    | T3 | Operational | Revenue-bearing, on-call required | T2 + incident response + post-mortems + distributed tracing | Scaled product |

    Default for aidops-scale apps: T0 or T1. Don’t overengineer monitoring for a 2-user app. The setup wizard auto-detects the appropriate tier.


    Phase 1: Setup (/oc-monitor setup)

    Guided setup that instruments a project for observability. Reads existing config (oc-deploy-ops checkpoint, wrangler.toml, package.json) to avoid asking questions with known answers.

    Setup Wizard

    Use ask_user_input for remaining unknowns:

    1. What tier? (auto-detect from oc-scale-ops/oc-deploy-ops context, confirm)
    2. Error tracking provider? (Sentry, LogRocket, Highlight, BetterStack, or console-only)
    3. Uptime monitoring? (BetterStack, UptimeRobot, Checkly, or CF Health Checks)
    4. Alerting channel? (Telegram, Slack, Discord, email, PagerDuty)
    5. Status page? (BetterStack, Instatus, or none)

    Setup Deliverables

    | Artifact | Description |
    | --- | --- |
    | monitoring/config.md | Monitoring strategy document: what’s monitored, thresholds, tools |
    | monitoring/runbooks/ | Operational runbooks (at least: deploy failure, error spike, downtime) |
    | src/lib/logger.ts (or equivalent) | Structured logging utility |
    | src/middleware/monitoring.ts | Request tracking, error capture, latency measurement |
    | .monitoring.json | Tool configuration (URLs, check IDs, alert channels) |

    Health Endpoint Contract

    Every monitored app needs a GET /api/health endpoint. This is the shared contract between oc-deploy-ops (one-shot check), oc-monitoring-ops (continuous check), and external uptime monitors.

    Response shape:

    {
      "status": "ok | degraded | unhealthy",
      "checks": { "database": "ok", "kv": "ok" },
      "version": "1.2.3",
      "timestamp": "2026-04-23T10:00:00Z"
    }

    Rules:

    • Returns 200 when all checks pass, 503 when any check is “down”
    • Each dependency (DB, KV, external API) gets its own named check
    • Include deployed version for deploy-correlation
    • Include per-check latency at T2+ (for oc-scale-ops integration)
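
    A minimal sketch of how a handler might derive the response body and HTTP status from per-dependency results. It is not the pattern from references/instrumentation.md. Two assumptions are baked in: each check reports "ok", "degraded", or "down", and a "degraded" overall status still returns 200 (the rules above only fix the all-pass and any-down cases).

```typescript
type CheckState = "ok" | "degraded" | "down"; // assumed per-check vocabulary
type OverallStatus = "ok" | "degraded" | "unhealthy";

interface HealthBody {
  status: OverallStatus;
  checks: Record<string, CheckState>;
  version: string;
  timestamp: string;
}

// Aggregates named dependency checks into the contract's response shape.
function buildHealthResponse(
  checks: Record<string, CheckState>,
  version: string,
  now: Date = new Date(),
): { httpStatus: 200 | 503; body: HealthBody } {
  const states = Object.keys(checks).map((name) => checks[name]);
  let status: OverallStatus = "ok";
  if (states.some((s) => s === "down")) {
    status = "unhealthy";
  } else if (states.some((s) => s === "degraded")) {
    status = "degraded";
  }
  return {
    // Assumption: only an "unhealthy" result maps to 503.
    httpStatus: status === "unhealthy" ? 503 : 200,
    body: { status, checks, version, timestamp: now.toISOString() },
  };
}
```

    For example, buildHealthResponse({ database: "ok", kv: "down" }, "1.2.3") yields a 503 with status "unhealthy", which is what external uptime monitors key on.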

    Read references/instrumentation.md for full implementation patterns per stack (Workers, Next.js, FastAPI), including structured logger, request monitoring middleware, and Sentry transport for Workers.


    Phase 2: Observe

    Health Check (/oc-monitor health)

    Goes deeper than oc-deploy-ops’s one-shot 200 check:

    1. HTTP status — hit /api/health, verify 200
    2. Per-dependency status — parse response body, check each named service (DB, KV, etc.)
    3. Latency profile — 5 samples, report avg and max
    4. Certificate expiry — TLS cert days remaining
    5. Version correlation — compare deployed version to last-known-good
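
    Steps 3 and 4 reduce to small calculations once the samples and the certificate's expiry date have been collected. The sketch below assumes both are gathered elsewhere; the function names are illustrative, not part of the skill.

```typescript
// Step 3: summarize latency samples (the skill takes 5) as avg and max.
function latencyProfile(samplesMs: number[]): { avgMs: number; maxMs: number } {
  if (samplesMs.length === 0) {
    throw new Error("need at least one latency sample");
  }
  let sum = 0;
  let max = samplesMs[0];
  for (const ms of samplesMs) {
    sum += ms;
    if (ms > max) max = ms;
  }
  return { avgMs: sum / samplesMs.length, maxMs: max };
}

// Step 4: whole days until the TLS certificate's notAfter date.
function certDaysRemaining(notAfter: Date, now: Date): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.floor((notAfter.getTime() - now.getTime()) / msPerDay);
}
```

    A negative result from certDaysRemaining means the certificate has already expired.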

    Output uses the Health Check Report format from references/output-templates.md.

    Error Tracking (/oc-monitor errors)

    Checks the configured error tracking service. Adapts to what’s available:

    | Service | Method | What’s Reported |
    | --- | --- | --- |
    | Sentry | Sentry Issues API | Unresolved count, top 10 by recency, severity breakdown |
    | Console-only | wrangler tail structured log grep | Recent error-level entries, deduplication by message |
    | BetterStack / Highlight | Service-specific API | Unresolved issues + error trends |
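
    For the console-only path, deduplication by message could look like the sketch below. It assumes each log line is a JSON object with level, message, and timestamp fields; those field names are illustrative and should match whatever the project's structured logger emits.

```typescript
interface ErrorGroup {
  message: string;
  count: number;
  lastSeen: string; // ISO timestamp of the most recent occurrence
}

// Groups error-level JSON log lines by message, most frequent first.
function dedupeErrors(lines: string[]): ErrorGroup[] {
  // Object.create(null) avoids collisions with keys like "constructor".
  const groups: Record<string, ErrorGroup> = Object.create(null);
  for (const line of lines) {
    let entry: any;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not JSON
    }
    if (entry === null || typeof entry !== "object") continue;
    if (entry.level !== "error" || typeof entry.message !== "string") continue;
    const message: string = entry.message;
    const ts: string = typeof entry.timestamp === "string" ? entry.timestamp : "";
    const existing = groups[message];
    if (existing) {
      existing.count += 1;
      // ISO 8601 UTC strings in one format compare correctly as text.
      if (ts > existing.lastSeen) existing.lastSeen = ts;
    } else {
      groups[message] = { message, count: 1, lastSeen: ts };
    }
  }
  return Object.keys(groups)
    .map((k) => groups[k])
    .sort((a, b) => b.count - a.count);
}
```

    Grouping on the exact message string is the simplest rule; messages that embed IDs or timestamps would need normalizing first.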

    Metrics Snapshot (/oc-monitor metrics)

    Aggregate key metrics from available sources:

    PRODUCTION METRICS — [project]
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
      Uptime (30d):    99.87%  (56m total downtime)
      Error rate:      0.3%    (12 errors / 4,200 requests today)
      p50 latency:     42ms
      p95 latency:     187ms
      p99 latency:     890ms
      Active errors:   3 unresolved (1 HIGH, 2 LOW)
      Last deploy:     2 days ago (v1.2.3)
      Next cert expiry: 47 days
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Status (/oc-monitor status)

    Reads the checkpoint and displays the current monitoring state without hitting any live endpoints. This is a checkpoint read, not a health check.

    MONITORING STATUS — [project]
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
      Tier:          T1 (Foundations)
      Maturity:      C (50%)
      Tools:         Sentry + BetterStack + Telegram
      SLOs:          99% avail / 1s p95 / 1% errors
      Budget:        86% remaining (24 days left in window)
      Last health:   2 min ago — ✅ ok (42ms)
      Incidents:     0 active
      Runbooks:      3 of 6 alert-mapped
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Phase 3: Alert (/oc-monitor alerts)

    Alert Design Principles

    1. Every alert must have an action. If you can’t do anything when it fires, it’s noise, not an alert. Remove it.
    2. Two thresholds: warn and critical. Warn = investigate soon. Critical = act now.
    3. Alert on symptoms, not causes. “Error rate > 5%” is better than “database connection pool exhausted” — the symptom catches more failure modes.
    4. Pair every alert with a runbook. The alert says “something’s wrong.” The runbook says “here’s what to do.”

    Standard Alert Set (T1+)

    | Alert | Condition | Severity | Runbook |
    | --- | --- | --- | --- |
    | Endpoint down | Health check fails 2 consecutive times | CRITICAL | runbooks/downtime.md |
    | High error rate | >5% of requests return 5xx in 5-min window | CRITICAL | runbooks/error-spike.md |
    | Elevated error rate | >1% of requests return 5xx in 15-min window | WARN | runbooks/error-spike.md |
    | High latency | p95 > 2s for 5 minutes | WARN | runbooks/latency.md |
    | Certificate expiring | SSL cert expires within 14 days | WARN | runbooks/cert-renewal.md |
    | Error budget burn | >50% of monthly error budget consumed | WARN | runbooks/error-budget.md |
    | Deploy failure | CI/CD pipeline fails on production | CRITICAL | runbooks/deploy-failure.md |
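
    The two error-rate rules and the endpoint-down rule from the standard set can be written as plain predicates. This is a sketch under assumptions: the caller supplies request counts per window and a list of recent health check outcomes, and the names used here are hypothetical.

```typescript
type Severity = "WARN" | "CRITICAL";

interface WindowCounts {
  total: number; // requests seen in the window
  errors5xx: number; // of those, how many returned 5xx
}

// CRITICAL if >5% 5xx over 5 minutes, else WARN if >1% over 15 minutes.
function errorRateSeverity(
  fiveMin: WindowCounts,
  fifteenMin: WindowCounts,
): Severity | null {
  const rate = (w: WindowCounts): number =>
    w.total === 0 ? 0 : w.errors5xx / w.total;
  if (rate(fiveMin) > 0.05) return "CRITICAL";
  if (rate(fifteenMin) > 0.01) return "WARN";
  return null;
}

// Endpoint down: the two most recent checks both failed.
// recentChecks is ordered oldest first; true means the check passed.
function endpointDown(recentChecks: boolean[]): boolean {
  const n = recentChecks.length;
  return n >= 2 && !recentChecks[n - 1] && !recentChecks[n - 2];
}
```

    Checking the critical threshold first means one burst produces a single CRITICAL rather than a WARN followed by a CRITICAL.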

    SLO Definition (/oc-monitor slo)

    Service Level Objectives quantify “good enough.” They prevent over-engineering reliability for apps that don’t need five nines.

    | Metric (SLI) | T0 Target | T1 Target | T2 Target | T3 Target |
    | --- | --- | --- | --- | --- |
    | Availability | 95% (36h down/mo) | 99% (7h down/mo) | 99.9% (43m down/mo) | 99.95% (22m down/mo) |
    | Latency (p95) | <2s | <1s | <500ms | <200ms |
    | Error rate | <5% | <1% | <0.1% | <0.05% |
    | Data freshness | Best effort | <5 min | <1 min | <30s |

    Error budget = 100% − SLO. If your SLO is 99% availability, you have 7.3 hours of downtime per month before you violate it. Spend that budget on deploys and feature velocity, not perfection.
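
    The arithmetic behind those figures, as a small sketch. It assumes an average month of 730 hours (365 × 24 / 12), which is what reproduces the 7.3-hour figure for a 99% SLO; the function names are illustrative.

```typescript
const HOURS_PER_MONTH = 730; // 365 * 24 / 12, an average month

// Downtime allowed per month before the availability SLO is violated.
function allowedDowntimeMinutes(sloPercent: number): number {
  return ((100 - sloPercent) / 100) * HOURS_PER_MONTH * 60;
}

// Share of the monthly error budget still unspent, clamped at 0.
function budgetRemainingPercent(
  sloPercent: number,
  downtimeMinutes: number,
): number {
  const allowed = allowedDowntimeMinutes(sloPercent);
  if (allowed === 0) return downtimeMinutes > 0 ? 0 : 100;
  return Math.max(0, (1 - downtimeMinutes / allowed) * 100);
}
```

    At 99% the budget is about 438 minutes per month; at 99.9% it shrinks to about 43.8 minutes, which is why each extra nine costs roughly ten times more reliability work.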

    Read references/alerting-patterns.md for channel setup (Telegram, Slack, PagerDuty), alert transport code, and SLO burn-rate alerting math.


    Phase 4: Respond (/oc-monitor incident)

    Read references/incident-response.md for full incident template, runbook template, and post-mortem template.

    Incident Lifecycle

    ALERT FIRES
        │
        ▼
    ACKNOWLEDGE (who's looking at it?)
        │
        ▼
    DIAGNOSE (what's broken? use runbook decision tree)
        │
        ├──► QUICK FIX (restart, rollback, scale) ──────┐
        │                                               │
        ├──► FIX FORWARD (patch + deploy) ──────────────┤
        │                                               │
        └──► ESCALATE (beyond current expertise) ───────┤
                                                        ▼
                                          MITIGATE (service restored)
                                                        │
                                                        ▼
                                          RESOLVE (root cause addressed)
                                                        │
                                                        ▼
                                          POST-MORTEM (blameless, action items)

    Key Files

    | File | Purpose |
    | --- | --- |
    | monitoring/runbooks/downtime.md | Endpoint unreachable: diagnosis + recovery |
    | monitoring/runbooks/error-spike.md | Error rate elevated: triage + mitigation |
    | monitoring/runbooks/deploy-failure.md | Deploy broke prod: rollback procedure |
    | monitoring/runbooks/latency.md | Response times degraded: diagnosis |
    | monitoring/incidents/ | Incident logs (one file per incident) |
    | monitoring/postmortems/ | Post-incident reviews |

    Observability Audit (/oc-monitor audit)

    Full maturity assessment. Scores each domain and recommends the next upgrade.

    Audit Domains

    | Domain | Weight | What it measures |
    | --- | --- | --- |
    | Logging | 20% | Structured logs, context propagation, retention, searchability |
    | Error Tracking | 20% | Capture rate, grouping, alerting, triage workflow |
    | Uptime | 20% | Monitoring coverage, check frequency, historical availability |
    | Alerting | 20% | Alert quality (action per alert), channel reliability, runbook coverage |
    | Incident Response | 20% | Runbooks exist, post-mortem practice, on-call defined |

    Tier-adaptive grading: At T0, incident response and alerting carry reduced weight (a solo dev doesn’t need PagerDuty). At T2+, all five domains are equally critical. The audit grades against the project’s declared tier, not against T3 universally.

    Maturity Scoring

    | Score | Meaning |
    | --- | --- |
    | A (90%+) | Production-ready operations: you’d sleep through deploys |
    | B (70-89%) | Solid foundations: most issues caught automatically |
    | C (50-69%) | Basic coverage: you find out about problems, eventually |
    | D (30-49%) | Reactive: users report problems before you see them |
    | F (<30%) | No observability: flying blind |
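
    One way to turn domain scores into a grade is a weighted average followed by the band lookup above. This sketch assumes each domain is scored 0 to 100. The default weights are the 20% split from the audit table; the source does not give the reduced T0 weights, so any tier-specific weights passed in are the caller's assumption.

```typescript
type Domain =
  | "logging"
  | "errorTracking"
  | "uptime"
  | "alerting"
  | "incidentResponse";
type PerDomain = Record<Domain, number>;

// Default: all five domains weighted equally (20% each).
const EQUAL_WEIGHTS: PerDomain = {
  logging: 20,
  errorTracking: 20,
  uptime: 20,
  alerting: 20,
  incidentResponse: 20,
};

// Weighted average of 0..100 domain scores, mapped to a letter grade.
function maturity(
  scores: PerDomain,
  weights: PerDomain = EQUAL_WEIGHTS,
): { percent: number; grade: string } {
  const domains = Object.keys(weights) as Domain[];
  let weighted = 0;
  let totalWeight = 0;
  for (const d of domains) {
    weighted += scores[d] * weights[d];
    totalWeight += weights[d];
  }
  const percent = totalWeight === 0 ? 0 : weighted / totalWeight;
  let grade = "F";
  if (percent >= 90) grade = "A";
  else if (percent >= 70) grade = "B";
  else if (percent >= 50) grade = "C";
  else if (percent >= 30) grade = "D";
  return { percent, grade };
}
```

    Dividing by the total weight keeps the result on a 0 to 100 scale even when tier-adjusted weights do not sum to 100.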

    See references/output-templates.md for full audit report format, ops report template, and metrics snapshot format.


    Dashboard Routing (/oc-monitor dashboard)

    When the user wants a monitoring dashboard, route to oc-dash-forge with ops archetype pre-selected:

    Detected monitoring dashboard request. Routing to oc-dash-forge with ops archetype.
    
    oc-dash-forge will produce:
      - High-density ops layout (dark mode, tight tiles)
      - Real-time KPI tiles (error rate, latency, throughput, uptime)
      - Incident table, event feed, throughput chart
      - Working React prototype with mock data
    
    Proceeding to oc-dash-forge with ops context...

    Package the following context for oc-dash-forge:

    • Archetype: ops (pre-selected, skip interview)
    • Data sources: health endpoint, error tracking, uptime service, structured logs
    • KPIs: error rate, p95 latency, uptime %, active incidents, deploy recency
    • Refresh cadence: near-real-time (30s-2min)

    Ops Report (/oc-monitor report)

    Generate a periodic operations summary covering availability, performance, errors, SLO status, deploys, and incidents. Cadence depends on tier: T1 = monthly, T2+ = weekly.

    The report pulls from: oc-monitoring-ops checkpoint (SLO budget, incident count), oc-deploy-ops checkpoint (deploy count, rollbacks), and error tracking service (error counts, top issues).

    See references/output-templates.md for the full report template.


    Snapshot Comparison (/oc-monitor compare)

    Compares two monitoring snapshots to track observability posture over time. Mirrors oc-security-auditor’s /oc-security compare pattern.

    Input: two checkpoint timestamps or dates. If one argument, compares against current state.

    Diffs: maturity grade, domain scores, SLO budget consumption, incident count, runbook coverage. Highlights regressions prominently. Uses the comparison format from references/output-templates.md.
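
    A sketch of the diff step, assuming each snapshot has already been reduced to a handful of numeric fields. The Snapshot shape and field names are hypothetical; the real comparison format lives in references/output-templates.md.

```typescript
interface Snapshot {
  maturityPercent: number;
  sloBudgetRemainingPercent: number;
  activeIncidents: number;
  runbookCount: number;
}

interface Diff {
  field: keyof Snapshot;
  before: number;
  after: number;
  regression: boolean;
}

// For each field: true when a higher value is the better outcome.
const HIGHER_IS_BETTER: Record<keyof Snapshot, boolean> = {
  maturityPercent: true,
  sloBudgetRemainingPercent: true,
  activeIncidents: false,
  runbookCount: true,
};

// Returns only the fields that changed, regressions listed first.
function compareSnapshots(before: Snapshot, after: Snapshot): Diff[] {
  const fields = Object.keys(HIGHER_IS_BETTER) as (keyof Snapshot)[];
  const diffs: Diff[] = [];
  for (const field of fields) {
    const b = before[field];
    const a = after[field];
    if (a === b) continue;
    const improved = HIGHER_IS_BETTER[field] ? a > b : a < b;
    diffs.push({ field, before: b, after: a, regression: !improved });
  }
  return diffs.sort((x, y) => Number(y.regression) - Number(x.regression));
}
```

    Sorting regressions to the top is one simple way to honor "highlights regressions prominently" before any formatting is applied.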


    File Structure

    project-dir/
    ├── .checkpoints/
    │   └── oc-monitoring-ops.checkpoint.json
    ├── .monitoring.json                 # Tool config (URLs, check IDs, channels)
    ├── monitoring/
    │   ├── config.md                    # Strategy: what's monitored, thresholds, tools
    │   ├── runbooks/
    │   │   ├── downtime.md
    │   │   ├── error-spike.md
    │   │   ├── deploy-failure.md
    │   │   ├── latency.md
    │   │   ├── cert-renewal.md
    │   │   └── error-budget.md
    │   ├── incidents/
    │   │   └── INC-YYYY-MM-DD-NNN.md
    │   └── postmortems/
    │       └── PM-YYYY-MM-DD-NNN.md
    ├── src/
    │   ├── lib/
    │   │   ├── logger.ts                # Structured logging utility
    │   │   ├── alerting.ts              # Alert dispatch (Telegram/Slack/PagerDuty)
    │   │   └── sentry.ts               # Error tracking transport (Workers)
    │   └── middleware/
    │       └── monitoring.ts            # Request tracking, timing, error capture

    Checkpoint Integration

    Location

    {project-dir}/.checkpoints/oc-monitoring-ops.checkpoint.json

    When to Write

    | Event | What to Save |
    | --- | --- |
    | Setup complete | Tier, tools configured, instrumentation files created |
    | Health check run | Status, latency, individual check results |
    | Alert rules defined | Alert inventory, channel config |
    | SLOs defined | SLI definitions, targets, error budget |
    | Incident started | Incident ID, timeline, status |
    | Incident resolved | Resolution, post-mortem reference |
    | Audit run | Domain scores, maturity grade, recommendations |
    | Runbook created | Runbook inventory, coverage map |

    progress_table

    [
      { "id": "setup",      "label": "Observability setup",    "status": "not_started" },
      { "id": "logging",    "label": "Structured logging",     "status": "not_started" },
      { "id": "errors",     "label": "Error tracking",         "status": "not_started" },
      { "id": "uptime",     "label": "Uptime monitoring",      "status": "not_started" },
      { "id": "alerts",     "label": "Alerting pipeline",      "status": "not_started" },
      { "id": "slos",       "label": "SLO definition",         "status": "not_started" },
      { "id": "runbooks",   "label": "Operational runbooks",   "status": "not_started" },
      { "id": "dashboard",  "label": "Monitoring dashboard",   "status": "not_started" }
    ]

    skill_state

    {
      "tier": "T1",
      "tools": {
        "error_tracking": "sentry",
        "uptime": "betterstack",
        "alerting": "telegram",
        "logging": "structured-json"
      },
      "slos": {
        "availability": "99%",
        "latency_p95": "1s",
        "error_rate": "1%"
      },
      "last_health_check": {
        "at": "2026-04-22T10:00:00Z",
        "status": "ok",
        "latency_ms": 42
      },
      "active_incidents": 0,
      "runbook_count": 3,
      "maturity_grade": "C"
    }

    Cross-Skill Reads

    | Reads from | Why |
    | --- | --- |
    | oc-deploy-ops | Health check URLs, environment config, deploy history |
    | oc-security-auditor | Detection/response requirements → what to monitor for |
    | oc-scale-ops | Performance budgets → SLO/alert thresholds |
    | oc-app-architect | Spec, error handling strategy → instrumentation targets |
    | oc-code-auditor | Error handling gaps → logging instrumentation needs |

    | Read by | Why |
    | --- | --- |
    | oc-deploy-ops | Health status → deploy confidence, post-deploy verification |
    | oc-security-auditor | Detection/response maturity → Pillar 3 input |
    | oc-scale-ops | Latency/error metrics → capacity planning data |
    | oc-orchestrator | Active incidents, maturity grade → project health |

    Tool Recommendations by Stack

    Cloudflare Workers (aidops default)

    | Domain | Free Tier | Paid | Notes |
    | --- | --- | --- | --- |
    | Logging | console.log + wrangler tail | Logpush → R2/S3 | Workers logs auto-available |
    | Error tracking | Sentry (5K events/mo free) | Sentry Team ($26/mo) | Lightweight transport for Workers |
    | Uptime | BetterStack (5 monitors free) | BetterStack ($24/mo) | Also does status pages |
    | Alerting | Telegram bot (free) | PagerDuty | Telegram for solo; PagerDuty for teams |
    | Dashboards | CF Analytics (built-in) | Grafana Cloud (free tier) | CF analytics covers most T0-T1 needs |

    Vercel / Next.js

    | Domain | Free Tier | Paid |
    | --- | --- | --- |
    | Logging | Vercel Logs | Axiom (Vercel integration) |
    | Error tracking | Sentry | Sentry |
    | Uptime | BetterStack | Checkly |
    | Alerting | Vercel Alerts | PagerDuty |

    General (any stack)

    | Domain | Recommended |
    | --- | --- |
    | Logging | Structured JSON → centralized log service |
    | Error tracking | Sentry (dominant, best-in-class grouping) |
    | Uptime | BetterStack or UptimeRobot |
    | Alerting | Start with Telegram/Slack, graduate to PagerDuty |
    | Status page | BetterStack (free tier includes status page) |

    PM-Tool MCP Integration (v1.3+)

    oc-monitoring-ops opens incident tickets in the PM tool when alerts fire, and back-references them through resolution.

    The runtime contract — concrete tool names, retry policy, idempotency markers, the pm_deferred_actions[] schema, and the extended state vocabulary (resolved-pending-postmortem) — lives in oc-integrations-engineer/references/pm-mcp-protocol.md. All MCP calls below honour that contract; this section says only how oc-monitoring-ops shapes incidents and per-event updates.

    On alert fire

    1. Look up the alert in the runbook registry (configured per alert id; see runbooks/).

    2. Compose incident description, prefixed with the idempotency marker per protocol §3:

      <!-- opchain:oc-monitoring-ops:incident-fired:<alert-event-id> -->
      
      Alert: {alert-name} ({severity})
      Fired at: {iso-timestamp}
      Service: {service}
      Symptoms: {first-three-symptoms-from-alert-payload}
      Runbook: {runbook-url}
      On-call: {on-call-engineer-from-pagerduty}
      Recent deploys: {list-of-deploys-in-last-2h-from-deploy-ops-checkpoint}
    3. Pre-create check: call the registry-resolved list_issues tool (Linear: mcp__claude_ai_Linear__list_issues; GitHub: mcp__mcp-server-github__list_issues) filtered to the configured project + the incident issue type from pm.yaml.issue_types, description-text query for the marker. If a match exists, reuse the existing incident id (mid-burst alert dedupe).

    4. Otherwise call the registry-resolved create_issue tool (Linear: mcp__claude_ai_Linear__save_issue with no id; GitHub: mcp__mcp-server-github__issue_write action=create) with:

      • issue_type from pm.yaml.issue_types.incident (default “Incident”).
      • priority from alert severity mapping.
      • labels from runbook (incident, service:<name>, severity:<level>), merged with pm.yaml.labels_default.
      • parent / blocked-by relation to the most recent deploy ticket if one is open (likely culprit) — read from oc-deploy-ops.checkpoint.json skill_state.pm.deploy_tickets[].
    5. Record incident id in oc-monitoring-ops.checkpoint.json skill_state.pm.incidents[] with the correlating Sentry / PagerDuty event id.
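
    Steps 2 to 4 hinge on the idempotency marker. The sketch below shows the marker and the reuse-or-create decision only. It does not call any MCP tool: the issues array stands in for whatever the registry-resolved list_issues call returned, and the Issue shape and function names are hypothetical.

```typescript
// Hypothetical minimal view of an issue returned by list_issues.
interface Issue {
  id: string;
  description: string;
}

type IncidentPlan =
  | { action: "reuse"; issueId: string }
  | { action: "create"; description: string };

// Marker format taken from step 2 above.
function incidentMarker(alertEventId: string): string {
  return `<!-- opchain:oc-monitoring-ops:incident-fired:${alertEventId} -->`;
}

// Pre-create check: reuse an issue that already carries the marker,
// otherwise plan a create with the marker prefixed to the description.
function planIncidentWrite(
  issues: Issue[],
  alertEventId: string,
  body: string,
): IncidentPlan {
  const marker = incidentMarker(alertEventId);
  const match = issues.filter((i) => i.description.indexOf(marker) !== -1)[0];
  if (match) {
    return { action: "reuse", issueId: match.id };
  }
  return { action: "create", description: `${marker}\n\n${body}` };
}
```

    Because the marker ends with " -->", an event id such as evt-1 does not accidentally match an issue created for evt-12.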

    Per-event updates during the incident

    Each row uses a unique idempotency marker. Pre-write check via list_comments (Linear) or issue_read (GitHub) before every add_comment. State strings come from pm.yaml.states / pm.yaml.states.extended — never hard-coded.

    | Event | Marker | Action |
    | --- | --- | --- |
    | Alert auto-resolves (back to baseline) | <!-- opchain:oc-monitoring-ops:auto-resolved:<incident-id> --> | add_comment: “Auto-resolved — {duration}”; transition → resolved-pending-postmortem (resolved from pm.yaml.states.extended). Do not close. |
    | Engineer ack via PagerDuty | <!-- opchain:oc-monitoring-ops:acked:<incident-id>:<engineer-id> --> | add_comment: “{engineer} acknowledged”; transition → in_progress. |
    | Status page update | <!-- opchain:oc-monitoring-ops:status-update:<incident-id>:<update-id> --> | add_comment mirroring the public update. |
    | Postmortem published | <!-- opchain:oc-monitoring-ops:postmortem:<incident-id> --> | add_comment with link; transition → done. |

    Postmortem back-reference

    When the postmortem is written, oc-monitoring-ops appends an action-item sub-ticket per remediation item, parent = the incident ticket. Each remediation sub-ticket carries marker <!-- opchain:oc-monitoring-ops:remediation:<incident-id>:<item-n> --> in its description and is created via the create_issue tool with the pre-create check pattern above. Each remediation sub-ticket is assigned to the owning team’s default assignee (from pm.yaml.remediation_owners map), with a target close date.

    Alert hygiene

    If an alert fires more than N times in 24h (default 5; configurable per alert), oc-monitoring-ops adds a noisy-alert label (per protocol Appendix A — labels, not state) to the incident and surfaces a tuning recommendation rather than spamming new tickets — same incident gets new comments, not new tickets. The pre-create check in step 3 above ensures this naturally for in-window alerts.
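
    The noisy-alert rule is a sliding-window count. A minimal sketch, assuming fire times are kept as epoch milliseconds per alert id; the storage format and helper names are not specified by the source.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// True when the alert fired more than maxFires times in the last 24h.
function isNoisy(
  fireTimesMs: number[],
  nowMs: number,
  maxFires: number = 5, // default N from the rule above
): boolean {
  const windowStart = nowMs - DAY_MS;
  const recent = fireTimesMs.filter((t) => t > windowStart && t <= nowMs);
  return recent.length > maxFires;
}

// Adds the noisy-alert label once; never duplicates it.
function withNoisyLabel(labels: string[], noisy: boolean): string[] {
  if (!noisy || labels.indexOf("noisy-alert") !== -1) return labels.slice();
  return labels.concat(["noisy-alert"]);
}
```

    "More than N" is strict: with the default of 5, the label appears on the sixth fire inside the window.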

    /oc-monitor --retry-pm flush

    Invokes the protocol §4 flush against oc-monitoring-ops.checkpoint.json pm_deferred_actions[]. Filter to skill: "oc-monitoring-ops" and retriable: true. Critical alerts should NOT depend on flush succeeding — Telegram / PagerDuty fire regardless; the flush is reconciliation only.
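
    The selection step of the flush can be sketched as a filter. Only the skill and retriable fields are named in this section; the rest of the pm_deferred_actions[] schema lives in the protocol document, so the interface below is a deliberately loose assumption.

```typescript
// Loose, assumed view of one deferred action; other fields pass through.
interface DeferredAction {
  skill: string;
  retriable: boolean;
  [key: string]: unknown;
}

// Picks the deferred PM writes this skill is allowed to retry.
function actionsToFlush(actions: DeferredAction[]): DeferredAction[] {
  return actions.filter(
    (a) => a.skill === "oc-monitoring-ops" && a.retriable === true,
  );
}
```

    Entries deferred with retriable: false, such as a 403 on incident creation, stay in the list for an operator to resolve.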

    Failure modes

    • MCP unavailable → alert fires through Telegram / PagerDuty as always; intended PM-MCP write is deferred per protocol §4. Operator visibility is unaffected.
    • PM provider rate-limits during a major incident burst → batch into a single rollup ticket via the marker dedupe in step 3 (the same alert-event-id naturally folds into one incident); separate comments per alert event with marker <!-- opchain:oc-monitoring-ops:burst-event:<incident-id>:<event-n> -->.
    • 403 on incident creation → defer with retriable: false; surface to the on-call channel as a side message; never silently widen scope.

    Principles

    1. Monitor symptoms, not implementations. Alert on error rate, not on which specific database query failed. Symptoms catch failures you didn’t predict.
    2. Every alert needs a runbook. An alert without instructions is just noise.
    3. Error budgets over perfection. 99% uptime is fine for aidops-scale apps. That’s 7 hours of acceptable downtime per month. Use that budget for shipping.
    4. Structured logs are non-negotiable. console.log("error") tells you nothing in production. JSON with request ID, user context, and timestamps tells you everything.
    5. Proportional investment. T0 for personal apps, T3 for revenue-bearing products. Don’t build PagerDuty rotation for a 2-user fitness tracker.
    6. Post-mortems are investments, not punishments. Every incident you learn from is one that won’t repeat. Blameless or don’t bother.
    7. Observability is continuous. Setup is phase 1, not the finish line. Audit quarterly, refine alerts monthly, run health checks daily.
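
    To make principle 4 concrete, here is a minimal sketch of a structured log line. It is not the logger from references/instrumentation.md; the field names (timestamp, level, message, requestId, userId) are assumptions chosen to match the principle's wording.

```typescript
type Level = "debug" | "info" | "warn" | "error";

interface LogContext {
  requestId?: string;
  userId?: string;
  [key: string]: unknown;
}

// Serializes one log entry as a single JSON line.
// Core fields are written last so context keys cannot overwrite them.
function formatLog(
  level: Level,
  message: string,
  ctx: LogContext = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...ctx,
    timestamp: now.toISOString(),
    level,
    message,
  });
}
```

    A call such as formatLog("error", "db timeout", { requestId: "req-1" }) produces one line that can be filtered by level and correlated by request ID, which a bare console.log("error") cannot offer.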

    Use OC · Monitoring Ops in your project

    Drop the SKILL.md into .claude/skills/ or .codex/skills/, download the bundle, or reach it over the hosted MCP endpoint.