# Metrics & OpenTelemetry Instrumentation: Numbers That Tell You Why At 3 AM
If you're shipping a SaaS in 2026 and only have logs (no metrics, no traces), debugging production is ten times harder than it needs to be. Logs answer "what happened on this specific request." Metrics answer "is the system healthy in aggregate." Traces answer "where exactly is the slowdown happening across services." Most indie SaaS ships with logs only, gets paged at 3 AM, spends an hour grepping logs for a pattern, and then realizes a metrics dashboard would have shown the issue in seconds. The fix is a deliberate observability stack: OpenTelemetry (the 2026 vendor-neutral standard) for tracing + metrics, paired with a backend (Datadog / Honeycomb / Grafana / New Relic / Vercel Observability) that stores and queries the data.
A working metrics + tracing strategy answers: what to instrument (USE / RED / Golden Signals), how to instrument (OpenTelemetry SDK; auto-instrumentation), what backend (Datadog / Honeycomb / Grafana / Vercel Observability), how to define SLIs and SLOs, how to alert (without alert fatigue), and how to make this affordable at indie scale (sample wisely; don't pay $5K/mo for traces).
This guide is the implementation playbook for metrics + traces. Companion to Logging Strategy & Structured Logs, Performance Optimization, Service Level Agreements, Incident Response, and HTTP Retry & Backoff.
## Why Metrics & Traces Matter
Get the failure modes clear first.
Help me understand what logs alone miss.
The 6 categories logs can't answer:
**1. Aggregate trends**
"Is p95 latency rising?" Logs: scrape thousands; computed by you. Metric: graph in seconds.
**2. Cross-request patterns**
"Is THIS endpoint slowing down for ALL users?" Logs: grep every request. Metric: dashboard.
**3. Distributed traces**
"User reports slow checkout. Where in the path?" Logs: grep across 4 services; correlate. Trace: visual span tree shows exact slow piece.
**4. Capacity planning**
"How busy is our DB?" Logs: not really. Metric: connection-pool utilization graph.
**5. Alerts on patterns**
"Alert if error rate > 5%." Logs: hard. Metric: alert rule.
**6. SLI / SLO tracking**
"What % of requests are < 200ms?" Logs: aggregate manually. Metric: built-in.
For my system:
- Top 5 production debugging pains
- Time spent grepping logs
Output:
1. Pains addressed by metrics
2. Pains addressed by traces
3. Priority order
The biggest unforced error: logs-only observability past 5 services. Distributed systems need traces; aggregate health needs metrics; debugging just gets harder without them.
## The Three Pillars: Logs, Metrics, Traces
Help me understand the layers.
The 3 pillars:
**1. Logs** (you have these)
- Discrete events with context
- Use for: specific request investigation; auditing; error context
- Volume: high; cost: high if retained long
- Tools: Datadog Logs / Loki / CloudWatch / Vercel Logs
**2. Metrics**
- Time-series numbers
- Use for: aggregate health; alerting; trends
- Volume: low; cost: low
- Tools: Prometheus / VictoriaMetrics / Datadog / Grafana / Honeycomb
**3. Traces**
- Request paths across services with timing
- Use for: latency debugging; distributed-system understanding
- Volume: medium; cost: medium (sampling)
- Tools: Jaeger / Zipkin / Tempo / Datadog APM / Honeycomb / Vercel Observability
**The 2026 reality**:
OpenTelemetry (OTel) is the standard:
- Single SDK instruments your code once
- Outputs traces + metrics + logs in standard format
- Send to any backend (Datadog / Honeycomb / Grafana Cloud / New Relic / Tempo / etc.)
- Vendor-neutral; switch backends without re-instrumenting
**Backend options**:
Cloud-managed:
- **Datadog** — most popular; expensive
- **New Relic** — solid alternative; free tier includes 100 GB/mo of ingest
- **Honeycomb** — observability-2.0; great for traces
- **Grafana Cloud** — open-source-friendly; cost-effective
- **Vercel Observability** — bundled if Vercel-hosted
- **AWS CloudWatch** — bundled if AWS-locked
Self-host:
- **Grafana + Tempo + Loki + Mimir** — full open-source stack
- **SigNoz** — OSS Datadog alternative
- **Prometheus + Tempo + Loki** — composable
For my stack: [pick]
Output:
1. Three pillars status
2. Backend pick
3. Migration plan
The 2026 default: OpenTelemetry SDK + Honeycomb / Grafana Cloud / Vercel Observability for indie / mid-market. Datadog when budget allows or enterprise procurement requires.
## OpenTelemetry: Instrument Once, Send Anywhere
Help me set up OpenTelemetry.
The basic setup:
```typescript
// Node.js: bootstrap the SDK with auto-instrumentation and an OTLP/HTTP
// trace exporter (Honeycomb shown as the example backend).
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'my-saas',
  traceExporter: new OTLPTraceExporter({
    url: 'https://api.honeycomb.io/v1/traces',
    headers: { 'x-honeycomb-team': process.env.HONEYCOMB_API_KEY ?? '' },
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```
Auto-instrumentation captures:
- HTTP server / client
- Express / Fastify / Hono routes
- Postgres / MySQL / MongoDB queries
- Redis commands
- gRPC calls
- AWS SDK calls
You get traces immediately for free.
Custom spans:
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-saas');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      const result = await doWork();
      span.setAttribute('result.size', result.length);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // always close the span, success or failure
    }
  });
}
```
Custom metrics:
```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('my-saas');
const checkoutCounter = meter.createCounter('checkouts_completed_total');
const checkoutDuration = meter.createHistogram('checkout_duration_ms');

async function checkout() {
  const start = Date.now();
  const result = await doCheckout();
  checkoutCounter.add(1, { status: result.status });
  checkoutDuration.record(Date.now() - start, { plan: result.plan });
  return result;
}
```
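One wiring note: the NodeSDK setup above configures only a trace exporter, so custom metrics need a metric reader to actually leave the process. A minimal sketch, assuming the `@opentelemetry/sdk-metrics` and `@opentelemetry/exporter-metrics-otlp-http` packages and Honeycomb's OTLP metrics endpoint:

```typescript
// Sketch: wire a periodic OTLP metric reader into the same NodeSDK so
// meter.createCounter() / createHistogram() data exports alongside traces.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'my-saas',
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'https://api.honeycomb.io/v1/metrics', // assumed endpoint; check your backend's docs
      headers: { 'x-honeycomb-team': process.env.HONEYCOMB_API_KEY ?? '' },
    }),
    exportIntervalMillis: 60_000, // push aggregated metrics every 60s
  }),
});
sdk.start();
```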
Logs (via OTel):
OpenTelemetry Logs are still maturing. Most stacks keep structured logs in a dedicated library (pino / winston) and use OTel for traces + metrics.
For my stack: [language]
Output:
- SDK setup
- Auto-instrumentation
- Custom spans / metrics
The win that compounds: **OpenTelemetry's vendor-neutrality**. Today: send to Honeycomb. Tomorrow: switch to Datadog without re-instrumenting. Lock-in avoided.
## Metric Frameworks: USE, RED, Golden Signals
Help me decide what to measure.
The 3 frameworks:
USE (Brendan Gregg): per resource (CPU / memory / disk / network)
- Utilization: % busy
- Saturation: queue depth
- Errors: error count
For: infrastructure monitoring (server CPU; disk).
RED (Tom Wilkie): per service / endpoint
- Rate: requests / second
- Errors: error count or rate
- Duration: latency distribution
For: service-level monitoring (API endpoints).
Golden Signals (Google SRE): per service
- Latency: response time distribution
- Traffic: request rate
- Errors: error rate
- Saturation: how full the system is
For: holistic service monitoring (combines RED + USE).
For most SaaS in 2026 (see the middleware sketch after these lists):
Per HTTP endpoint, track:
- Request rate (req/s)
- Error rate (% errors)
- p50 / p95 / p99 latency
Per background job:
- Job rate (jobs/s)
- Failure rate
- Duration distribution
Per critical resource:
- DB connection pool utilization
- Redis memory
- Queue depth
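Recording RED per endpoint is a few lines of middleware. A hypothetical Express sketch, reusing the OTel meter from the setup earlier (the metric names and the `redMetrics` helper are illustrative):

```typescript
import { metrics } from '@opentelemetry/api';
import type { Request, Response, NextFunction } from 'express';

const meter = metrics.getMeter('my-saas');
const requests = meter.createCounter('http_requests_total');
const duration = meter.createHistogram('http_request_duration_ms');

// RED middleware: one counter covers Rate and Errors (via status_class),
// one histogram covers Duration.
export function redMetrics(req: Request, res: Response, next: NextFunction) {
  const start = Date.now();
  res.on('finish', () => {
    const labels = {
      route: req.route?.path ?? req.path, // prefer the route template to keep cardinality low
      method: req.method,
      status_class: `${Math.floor(res.statusCode / 100)}xx`, // 2xx / 4xx / 5xx
    };
    requests.add(1, labels);
    duration.record(Date.now() - start, labels);
  });
  next();
}
```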
For my service: [endpoints]
Output:
- Framework pick
- Metrics per endpoint
- Per-resource metrics
The pragmatic 2026 default: **RED for services + USE for infrastructure**. Golden Signals is roughly the union; same ideas, different naming.
## SLIs and SLOs
Help me set SLIs and SLOs.
SLI (Service Level Indicator): a metric measuring service quality.
Examples:
- "% of requests completing in <200ms"
- "% of requests succeeding (no 5xx)"
- "% of background jobs completing within SLA"
SLO (Service Level Objective): the target for an SLI.
Examples:
- "99% of requests in <200ms (over 30 days)"
- "99.9% success rate (over 30 days)"
- "95% of background jobs in <5 min"
Error budget:
100% - SLO = error budget. A 99% SLO → 1% error budget ≈ 7 hours/month of allowed SLO misses.
When you exhaust budget: stop releasing risky changes; focus on reliability.
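The arithmetic is worth making concrete; a quick sketch (the `errorBudget` helper is illustrative):

```typescript
// Error budget = (100% - SLO) of the window. For a 30-day window:
function errorBudget(sloPercent: number, windowDays = 30) {
  const budgetFraction = (100 - sloPercent) / 100; // e.g. 0.01 for a 99% SLO
  const windowMinutes = windowDays * 24 * 60;      // 43,200 minutes in 30 days
  return {
    budgetFraction,                                    // share of requests allowed to fail
    allowedBadMinutes: budgetFraction * windowMinutes, // minutes of full SLO miss allowed
  };
}

console.log(errorBudget(99));   // ~432 bad minutes ≈ 7.2 hours/month
console.log(errorBudget(99.9)); // ~43 bad minutes/month
```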
Common SLOs for SaaS:
- API: 99.9% success; 99% < 200ms
- Login flow: 99.99% success
- Webhook delivery: 95% within 60s; 99% within 5 min
- Email send: 99% within 5 min
Implementing:
Define SLI in observability tool:
```yaml
sli:
  request_success:
    metric: http_requests_total{status!~"5.."}
    threshold: 99.9% over 30d
  request_latency:
    metric: http_request_duration_ms
    threshold: 95% < 200ms over 30d
```
Tools:
- Honeycomb / Datadog / Grafana have SLO features
- Manual via metric query + alert
For my service:
- Critical endpoints
- SLO targets
Output:
- SLI list
- SLOs per SLI
- Error-budget tracking
The discipline: **SLOs drive prioritization**. When error budget is depleted, ship reliability work; don't ship features. This converts "should we improve quality?" debates into objective decisions.
## Tracing in Practice
Help me use traces.
Trace = full request path with timing per span.
Example: customer reports slow checkout.
Without traces:
- Grep logs for their request
- Correlate timestamps across services
- Estimate which piece is slow
- Guess
With traces:
- Search Honeycomb / Datadog for slow checkouts
- Click a slow trace
- Visual: API gateway 5ms → auth 10ms → checkout 4500ms → payment 100ms
- → Checkout function is the slow piece; investigate
Auto-instrumentation traces:
OpenTelemetry auto-instrumentation gives you:
- HTTP request spans
- DB query spans (with SQL)
- HTTP client spans (calls to external APIs)
- Redis command spans
For free.
Adding custom spans:
For business logic worth tracing:
```typescript
// Illustrative fragment: `rows` comes from the surrounding parse logic.
await tracer.startActiveSpan('parse_csv', async (span) => {
  try {
    span.setAttribute('csv.row_count', rows.length);
    // ...
  } finally {
    span.end(); // end the span even if parsing throws
  }
});
```
Sampling:
Traces are voluminous. Most apps sample.
Strategies:
- Head-based: decide at request start (e.g. 10% of requests)
- Tail-based: decide at request end (e.g. all errors + 1% of normal)
- Adaptive: more sampling on errors / slow requests
Most tools (Honeycomb / Tempo / Vercel) support tail-based.
Cost management:
100K req/day × all traces × 30 days = expensive. 100K req/day × 1% sampled + all errors = affordable.
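Head-based sampling can be configured in the SDK itself; a minimal sketch follows (tail-based "all errors + 1%" needs a collector-side decision, e.g. the OpenTelemetry Collector's tail-sampling processor, because errors aren't known until the request ends):

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  serviceName: 'my-saas',
  // Respect upstream sampling decisions; sample ~1% of new root traces.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.01),
  }),
});
sdk.start();
```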
For my product:
- Trace volume estimate
- Sampling strategy
Output:
- Trace coverage
- Custom spans
- Sampling
The single highest-leverage observability investment: **tail-based sampling with all errors traced + 1% normal**. Catches all anomalies; costs 99% less.
## Backends: Cost vs Features
Help me pick a backend.
The 2026 landscape:
Datadog:
- Most popular; comprehensive
- Pricing: $15-50/host + $1-5 per million metrics + $1-3 per GB logs + $1-3 per million traces
- Total: $500-5000/mo at indie scale; $20K-200K/yr at mid-market
- Pros: best UX; most integrations
- Cons: very expensive
Honeycomb:
- Observability-2.0 (events-based; queryable)
- Pricing: Free 20M events/mo; $130/mo for 100M events; tiered
- Pros: best for traces; great query language; reasonable price
- Cons: events model different from traditional metrics
Grafana Cloud:
- Tempo (traces) + Mimir (metrics) + Loki (logs)
- Pricing: free 10K series metrics; $99/mo for 100GB logs; etc.
- Pros: cost-effective; OSS-aligned; powerful
- Cons: setup complexity
New Relic:
- All-in-one observability
- Pricing: 100GB free; $0.30/GB after; user-based seats
- Pros: free tier is real (most usable free tier)
- Cons: dated UX in places
Vercel Observability:
- Bundled with Vercel deployments
- Free tier; usage-based
- Pros: zero-config for Vercel apps
- Cons: less powerful than dedicated tools
SigNoz (OSS Datadog alternative):
- Self-host; cloud option
- Free OSS; cloud $200/mo+
- Pros: cost-effective; modern
- Cons: ops burden if self-host
Sentry:
- Errors-focused; performance monitoring layered
- Pricing: $26-89/mo team; usage-based
- Pros: best error tracking
- Cons: not full APM
The 2026 default for indie: Vercel Observability (if Vercel-hosted) OR Honeycomb Free OR Grafana Cloud Free.
For mid-market: Honeycomb OR Grafana Cloud OR New Relic.
For enterprise: Datadog if budget OK; New Relic alternative.
For my stack: [pick]
Output:
- Backend pick
- Cost estimate
- Migration plan
The 2026 cost reality: **Datadog at indie scale = $500-2000/mo; same observability via Honeycomb / Grafana Cloud = $50-300/mo**. Pick by cost; performance at indie scale is not the differentiator.
## Alerting Without Fatigue
Help me alert correctly.
The 4 alert principles:
1. Alert on symptoms, not causes
Bad: "DB CPU > 80%" (cause; might not affect users) Good: "p95 latency > 500ms for 5 min" (symptom; affects users)
2. Alert on what wakes someone
If alert fires at 3 AM, on-call wakes up. Make it worth waking.
Per alert: "Is this worth paging at 3 AM?" If no: lower severity (Slack channel; not page).
3. Hierarchy of severity
- Critical: page immediately (production-down; SLO-breach)
- High: page during business hours
- Medium: Slack alert; team reviews
- Low: dashboard only
4. Reduce false-positive rate
False alerts → alert fatigue → real alerts ignored.
- Tune thresholds based on baselines
- Use rate-of-change instead of absolute (% of normal)
- Multi-condition alerts (e.g. elevated latency AND elevated error rate, sustained for a duration)
Standard alerts for SaaS (a Prometheus-style sketch of the first one follows the list):
- Production 5xx rate > 1% for 5 min → critical
- p95 latency > [SLO threshold] for 10 min → critical
- DB connection pool > 90% for 5 min → high
- Disk usage > 80% → high; > 95% → critical
- Background job queue depth growing for 10 min → high
- Auth failure rate > 5% for 5 min → critical
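As a concrete example, the first alert might look like this as a hypothetical Prometheus-style rule (the metric names and runbook URL are illustrative):

```yaml
groups:
  - name: saas-symptom-alerts
    rules:
      - alert: High5xxRate
        # Symptom-based: share of requests returning 5xx over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m # must be sustained before paging
        labels:
          severity: critical
        annotations:
          summary: "Production 5xx rate above 1% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-5xx"
```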
Runbook per alert:
Each alert links to runbook (Notion / wiki) with:
- What's the alert
- Likely causes
- Diagnostic steps
- Mitigation steps
- Escalation
For my alerts: [audit]
Output:
- Alert list
- Severity tiers
- Runbooks
The discipline: **runbook per alert**. On-call wakes; alert fires; runbook gives steps. Without: 30 minutes of fumbling. With: action in 5 min.
## Common Observability Mistakes
Help me avoid mistakes.
The 10 mistakes:
1. Logs only; no metrics or traces. Debugging takes 10x longer.
2. No OpenTelemetry; vendor-locked. Switching backends means re-instrumenting.
3. Sampling 100% of traces at scale. Cost explodes.
4. Alerting on causes (CPU), not symptoms (latency). Pages on the irrelevant; misses the real.
5. No SLOs. Reliability-vs-features debates have no objective resolution.
6. Every alert is "critical". Alert fatigue; real alerts get missed.
7. No runbooks. On-call wakes up and doesn't know what to do.
8. Datadog at indie scale. $2K/mo when $200/mo would do.
9. No custom metrics for business logic. "Is checkout working?" isn't measurable.
10. Observability as an afterthought. Bolted on in year 3, after technical debt and production pain have accumulated.
For my system: [risks]
Output:
- Top 3 risks
- Mitigations
- Audit
The single most-painful mistake: **shipping without observability and adding it after a major outage**. Outage happens; you can't see what's wrong; recovery is slow; trust damage. Instrument from Day 1.
## What Done Looks Like
A working observability stack:
- OpenTelemetry SDK in all services
- Auto-instrumentation for HTTP / DB / Redis
- Custom spans for business logic
- Custom metrics for business KPIs (checkouts / signups / etc.)
- Backend: Honeycomb / Grafana Cloud / Vercel Observability / Datadog
- Tail-based sampling (all errors + 1% normal)
- 5-10 SLOs with error-budget tracking
- 5-15 alerts (mostly symptoms, not causes)
- Runbook per alert
- Cost: <2% of revenue at indie scale
The proof you got it right: when production breaks at 3 AM, on-call gets paged with a specific alert, opens the runbook, identifies the cause via traces in 10 minutes, and fixes it in 30. Compare with logs-only: a 2-hour debugging session.
## See Also
- [Logging Strategy & Structured Logs](logging-strategy-structured-logs-chat.md) — companion log layer
- [Performance Optimization](performance-optimization-chat.md) — metrics inform optimization
- [Service Level Agreements](service-level-agreements-chat.md) — SLAs depend on SLOs
- [Incident Response](incident-response-chat.md) — alerts → incident response
- [HTTP Retry & Backoff](http-retry-backoff-chat.md) — instrument retries
- [Multi-region Deployment](multi-region-deployment-chat.md) — region-tagged metrics
- [Database Indexing Strategy](database-indexing-strategy-chat.md) — query metrics drive indexing
- [Customer Analytics Dashboards](customer-analytics-dashboards-chat.md) — adjacent product analytics
- [Status Page (vibeweek)](status-page-chat.md) — SLO breaches trigger status updates
- [Backups & Disaster Recovery](backups-disaster-recovery-chat.md) — observability of backups
- [VibeReference: Observability Providers](https://vibereference.dev/devops-and-tools/observability-providers) — Datadog / New Relic / Honeycomb / etc.
- [VibeReference: Time-Series Database Providers](https://vibereference.dev/backend-and-data/time-series-database-providers) — metrics storage
- [VibeReference: Error Monitoring Providers](https://vibereference.dev/devops-and-tools/error-monitoring-providers) — Sentry / Bugsnag