VibeWeek

Metrics & OpenTelemetry Instrumentation: Numbers That Tell You Why At 3 AM

⬅️ Day 6: Grow Overview

If you're shipping a SaaS in 2026 and only have logs (no metrics, no traces), debugging production is ten times harder than it needs to be. Logs answer "what happened on this specific request." Metrics answer "is the system healthy in aggregate." Traces answer "where exactly is the slowdown happening across services." Most indie SaaS ships with logs only, gets paged at 3 AM, spends an hour grepping logs to find a pattern, and realizes a metric dashboard would have shown the issue in seconds. The fix is a deliberate observability stack — OpenTelemetry (the 2026 vendor-neutral standard) for tracing + metrics, paired with a backend (Datadog / Honeycomb / Grafana / New Relic / Vercel Observability) that stores and queries them.

A working metrics + tracing strategy answers: what to instrument (USE / RED / Golden Signals), how to instrument (OpenTelemetry SDK; auto-instrumentation), what backend (Datadog / Honeycomb / Grafana / Vercel Observability), how to define SLIs and SLOs, how to alert (without alert fatigue), and how to make this affordable at indie scale (sample wisely; don't pay $5K/mo for traces).

This guide is the implementation playbook for metrics + traces. Companion to Logging Strategy & Structured Logs, Performance Optimization, Service Level Agreements, Incident Response, and HTTP Retry & Backoff.

## Why Metrics & Traces Matter

Get the failure modes clear first.

Help me understand what logs alone miss.

The 6 categories logs can't answer:

**1. Aggregate trends**
"Is p95 latency rising?" Logs: grep thousands of lines and compute the aggregate yourself. Metric: a graph in seconds.

**2. Cross-request patterns**
"Is THIS endpoint slowing down for ALL users?" Logs: grep every request. Metric: dashboard.

**3. Distributed traces**
"User reports slow checkout. Where in the path?" Logs: grep across 4 services; correlate. Trace: visual span tree shows exact slow piece.

**4. Capacity planning**
"How busy is our DB?" Logs: not really. Metric: connection-pool utilization graph.

**5. Alerts on patterns**
"Alert if error rate > 5%." Logs: hard. Metric: alert rule.

**6. SLI / SLO tracking**
"What % of requests are < 200ms?" Logs: aggregate manually. Metric: built-in.

For my system:
- Top 5 production debugging pains
- Time spent grepping logs

Output:
1. Pains addressed by metrics
2. Pains addressed by traces
3. Priority order

The biggest unforced error: logs-only observability past 5 services. Distributed systems need traces; aggregate health needs metrics; debugging just gets harder without them.

## The Three Pillars: Logs, Metrics, Traces

Help me understand the layers.

The 3 pillars:

**1. Logs** (you have these)
- Discrete events with context
- Use for: specific request investigation; auditing; error context
- Volume: high; cost: high if retained long
- Tools: Datadog Logs / Loki / CloudWatch / Vercel Logs

**2. Metrics**
- Time-series numbers
- Use for: aggregate health; alerting; trends
- Volume: low; cost: low
- Tools: Prometheus / VictoriaMetrics / Datadog / Grafana / Honeycomb

**3. Traces**
- Request paths across services with timing
- Use for: latency debugging; distributed-system understanding
- Volume: medium; cost: medium (sampling)
- Tools: Jaeger / Zipkin / Tempo / Datadog APM / Honeycomb / Vercel Observability

**The 2026 reality**:

OpenTelemetry (OTel) is the standard:
- Single SDK instruments your code once
- Outputs traces + metrics + logs in standard format
- Send to any backend (Datadog / Honeycomb / Grafana Cloud / New Relic / Tempo / etc.)
- Vendor-neutral; switch backends without re-instrumenting

**Backend options**:

Cloud-managed:
- **Datadog** — most popular; expensive
- **New Relic** — alternative; free tier includes 100 GB/mo of ingest
- **Honeycomb** — observability-2.0; great for traces
- **Grafana Cloud** — open-source-friendly; cost-effective
- **Vercel Observability** — bundled if Vercel-hosted
- **AWS CloudWatch** — bundled if AWS-locked

Self-host:
- **Grafana + Tempo + Loki + Mimir** — full open-source stack
- **SigNoz** — OSS Datadog alternative
- **Prometheus + Tempo + Loki** — composable

For my stack: [pick]

Output:
1. Three pillars status
2. Backend pick
3. Migration plan

The 2026 default: OpenTelemetry SDK + Honeycomb / Grafana Cloud / Vercel Observability for indie / mid-market. Datadog when budget allows or enterprise procurement requires.

## OpenTelemetry: Instrument Once, Send Anywhere

Help me set up OpenTelemetry.

The basic setup:

```typescript
// Node.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'my-saas',
  traceExporter: new OTLPTraceExporter({
    url: 'https://api.honeycomb.io/v1/traces',
    headers: { 'x-honeycomb-team': process.env.HONEYCOMB_API_KEY },
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Auto-instrumentation captures:

  • HTTP server / client
  • Express / Fastify / Hono routes
  • Postgres / MySQL / MongoDB queries
  • Redis commands
  • gRPC calls
  • AWS SDK calls

You get traces immediately for free.

Custom spans:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-saas');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      const result = await doWork();
      span.setAttribute('result.size', result.length);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

Custom metrics:

```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('my-saas');
const checkoutCounter = meter.createCounter('checkouts_completed_total');
const checkoutDuration = meter.createHistogram('checkout_duration_ms');

async function checkout() {
  const start = Date.now();
  const result = await doCheckout();
  checkoutCounter.add(1, { status: result.status });
  checkoutDuration.record(Date.now() - start, { plan: result.plan });
  return result;
}
```

Logs (via OTel):

OpenTelemetry Logs are still maturing. Most stacks keep structured logs in a dedicated library (pino / winston) and use OTel for traces + metrics.

For my stack: [language]

Output:

  1. SDK setup
  2. Auto-instrumentation
  3. Custom spans / metrics

The win that compounds: **OpenTelemetry's vendor-neutrality**. Today: send to Honeycomb. Tomorrow: switch to Datadog without re-instrumenting. Lock-in avoided.

## Metric Frameworks: USE, RED, Golden Signals

Help me decide what to measure.

The 3 frameworks:

USE (Brendan Gregg): per resource (CPU / memory / disk / network)

  • Utilization: % busy
  • Saturation: queue depth
  • Errors: error count

For: infrastructure monitoring (server CPU; disk).

RED (Tom Wilkie): per service / endpoint

  • Rate: requests / second
  • Errors: error count or rate
  • Duration: latency distribution

For: service-level monitoring (API endpoints).

Golden Signals (Google SRE): per service

  • Latency: response time distribution
  • Traffic: request rate
  • Errors: error rate
  • Saturation: how full the system is

For: holistic service monitoring (combines RED + USE).

For most SaaS in 2026:

Per HTTP endpoint, track:

  • Request rate (req/s)
  • Error rate (% errors)
  • p50 / p95 / p99 latency

Per background job:

  • Job rate (jobs/s)
  • Failure rate
  • Duration distribution

Per critical resource:

  • DB connection pool utilization
  • Redis memory
  • Queue depth
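A note on the latency quantiles above: p50/p95/p99 describe the distribution, not the average. A naive sketch of the underlying math (hypothetical helper; real metric backends compute quantiles from bucketed histograms rather than sorting raw samples):

```typescript
// Hypothetical helper to illustrate what p50/p95/p99 mean.
// Production metric backends use bucketed histograms; this is
// only the underlying math, shown on raw samples.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.min(
    sorted.length - 1,
    Math.ceil((p / 100) * sorted.length) - 1,
  );
  return sorted[idx];
}

const latencies = [12, 15, 18, 22, 30, 45, 60, 120, 300, 900]; // ms
percentile(latencies, 50); // → 30 (the typical request)
percentile(latencies, 95); // → 900 (the tail your slowest users feel)
```

p95 is what the slowest 1-in-20 requests experience; it is exactly the number an average hides, which is why the tail quantiles belong on every endpoint dashboard.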

For my service: [endpoints]

Output:

  1. Framework pick
  2. Metrics per endpoint
  3. Per-resource metrics

The pragmatic 2026 default: **RED for services + USE for infrastructure**. Golden Signals is roughly the union of the two; same ideas, different naming.

## SLIs and SLOs

Help me set SLIs and SLOs.

SLI (Service Level Indicator): a metric measuring service quality.

Examples:

  • "% of requests completing in <200ms"
  • "% of requests succeeding (no 5xx)"
  • "% of background jobs completing within SLA"

SLO (Service Level Objective): the target for an SLI.

Examples:

  • "99% of requests in <200ms (over 30 days)"
  • "99.9% success rate (over 30 days)"
  • "95% of background jobs in <5 min"

Error budget:

100% - SLO = error budget. A 99% SLO → 1% error budget ≈ 7.2 hours/month of allowed missed-SLO time.

When you exhaust budget: stop releasing risky changes; focus on reliability.
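The arithmetic is worth making concrete. A quick sketch (hypothetical helper, not part of any SDK):

```typescript
// Error budget over a rolling 30-day window, expressed in minutes.
const WINDOW_MINUTES = 30 * 24 * 60; // 43,200 minutes

function errorBudgetMinutes(sloPercent: number): number {
  return WINDOW_MINUTES * (1 - sloPercent / 100);
}

errorBudgetMinutes(99);   // ≈ 432 min ≈ 7.2 hours of allowed unreliability
errorBudgetMinutes(99.9); // ≈ 43 min: a single bad deploy can spend it
```

The jump from 99% to 99.9% shrinks the budget 10x, which is why each extra nine costs disproportionately more engineering effort.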

Common SLOs for SaaS:

  • API: 99.9% success; 99% < 200ms
  • Login flow: 99.99% success
  • Webhook delivery: 95% within 60s; 99% within 5 min
  • Email send: 99% within 5 min

Implementing:

Define SLI in observability tool:

```yaml
sli:
  request_success:
    metric: http_requests_total{status!~"5.."}
    threshold: 99.9% over 30d
  request_latency:
    metric: http_request_duration_ms
    threshold: 95% < 200ms over 30d
```

Tools:

  • Honeycomb / Datadog / Grafana have SLO features
  • Manual via metric query + alert

For my service:

  • Critical endpoints
  • SLO targets

Output:

  1. SLI list
  2. SLOs per SLI
  3. Error-budget tracking

The discipline: **SLOs drive prioritization**. When error budget is depleted, ship reliability work; don't ship features. This converts "should we improve quality?" debates into objective decisions.

## Tracing in Practice

Help me use traces.

Trace = full request path with timing per span.

Example: customer reports slow checkout.

Without traces:

  • Grep logs for their request
  • Correlate timestamps across services
  • Estimate which piece is slow
  • Guess

With traces:

  • Search Honeycomb / Datadog for slow checkouts
  • Click a slow trace
  • Visual: API gateway 5ms → auth 10ms → checkout 4500ms → payment 100ms
  • → Checkout function is the slow piece; investigate

Auto-instrumentation traces:

OpenTelemetry auto-instrumentation gives you:

  • HTTP request spans
  • DB query spans (with SQL)
  • HTTP client spans (calls to external APIs)
  • Redis command spans

For free.

Adding custom spans:

For business logic worth tracing:

```typescript
await tracer.startActiveSpan('parse_csv', async (span) => {
  span.setAttribute('csv.row_count', rows.length);
  // ...
  span.end();
});
```

Sampling:

Traces are voluminous. Most apps sample.

Strategies:

  • Head-based: decide at request start (e.g. 10% of requests)
  • Tail-based: decide at request end (e.g. all errors + 1% of normal)
  • Adaptive: more sampling on errors / slow requests

Most tools (Honeycomb / Tempo / Vercel) support tail-based.

Cost management:

100K req/day × all traces × 30 days = expensive. 100K req/day × 1% sampled + all errors = affordable.
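One common way to implement tail-based sampling is the OpenTelemetry Collector's `tail_sampling` processor (from the collector-contrib distribution). A sketch of the "all errors + 1% of normal traffic" policy; this assumes traces flow through a Collector, and `decision_wait` should cover your longest trace:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # buffer the full trace before deciding
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-1-percent-baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```

Policies are OR-ed: a trace is kept if any policy matches, so every error trace survives while normal traffic is thinned to 1%.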

For my product:

  • Trace volume estimate
  • Sampling strategy

Output:

  1. Trace coverage
  2. Custom spans
  3. Sampling

The single highest-leverage observability investment: **tail-based sampling with all errors traced + 1% of normal traffic**. Catches every error trace; costs ~99% less.

## Backends: Cost vs Features

Help me pick a backend.

The 2026 landscape:

Datadog:

  • Most popular; comprehensive
  • Pricing: $15-50/host + $1-5 per million metrics + $1-3 per GB logs + $1-3 per million traces
  • Total: $500-5000/mo at indie scale; $20K-200K/yr at mid-market
  • Pros: best UX; most integrations
  • Cons: very expensive

Honeycomb:

  • Observability-2.0 (events-based; queryable)
  • Pricing: Free 20M events/mo; $130/mo for 100M events; tiered
  • Pros: best for traces; great query language; reasonable price
  • Cons: events model different from traditional metrics

Grafana Cloud:

  • Tempo (traces) + Mimir (metrics) + Loki (logs)
  • Pricing: free 10K series metrics; $99/mo for 100GB logs; etc.
  • Pros: cost-effective; OSS-aligned; powerful
  • Cons: setup complexity

New Relic:

  • All-in-one observability
  • Pricing: 100GB free; $0.30/GB after; user-based seats
  • Pros: free tier is real (most usable free tier)
  • Cons: dated UX in places

Vercel Observability:

  • Bundled with Vercel deployments
  • Free tier; usage-based
  • Pros: zero-config for Vercel apps
  • Cons: less powerful than dedicated tools

SigNoz (OSS Datadog alternative):

  • Self-host; cloud option
  • Free OSS; cloud $200/mo+
  • Pros: cost-effective; modern
  • Cons: ops burden if self-host

Sentry:

  • Errors-focused; performance monitoring layered
  • Pricing: $26-89/mo team; usage-based
  • Pros: best error tracking
  • Cons: not full APM

The 2026 default for indie: Vercel Observability (if Vercel-hosted) OR Honeycomb Free OR Grafana Cloud Free.

For mid-market: Honeycomb OR Grafana Cloud OR New Relic.

For enterprise: Datadog if budget OK; New Relic alternative.

For my stack: [pick]

Output:

  1. Backend pick
  2. Cost estimate
  3. Migration plan

The 2026 cost reality: **Datadog at indie scale = $500-2000/mo; same observability via Honeycomb / Grafana Cloud = $50-300/mo**. Pick by cost; performance at indie scale is not the differentiator.

## Alerting Without Fatigue

Help me alert correctly.

The 4 alert principles:

1. Alert on symptoms, not causes

Bad: "DB CPU > 80%" (cause; might not affect users)
Good: "p95 latency > 500ms for 5 min" (symptom; affects users)

2. Alert on what wakes someone

If an alert fires at 3 AM, on-call wakes up. Make it worth waking for.

Per alert: "Is this worth paging at 3 AM?" If no: lower severity (Slack channel; not page).

3. Hierarchy of severity

  • Critical: page immediately (production-down; SLO-breach)
  • High: page during business hours
  • Medium: Slack alert; team reviews
  • Low: dashboard only

4. Reduce false-positive rate

False alerts → alert fatigue → real alerts ignored.

  • Tune thresholds based on baselines
  • Use rate-of-change instead of absolute (% of normal)
  • Multi-condition alerts (e.g. elevated latency AND elevated error rate, sustained for N minutes)

Standard alerts for SaaS:

  • Production 5xx rate > 1% for 5 min → critical
  • p95 latency > [SLO threshold] for 10 min → critical
  • DB connection pool > 90% for 5 min → high
  • Disk usage > 80% → high; > 95% → critical
  • Background job queue depth growing for 10 min → high
  • Auth failure rate > 5% for 5 min → critical
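As a sketch, the first alert above could be expressed as a Prometheus-style rule (metric names, thresholds, and the runbook URL are illustrative; adapt to your backend's alert syntax):

```yaml
groups:
  - name: saas-symptoms
    rules:
      - alert: High5xxRate
        # Symptom: >1% of requests failing, sustained for 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          runbook: https://wiki.example.com/runbooks/high-5xx-rate
```

The `for: 5m` clause is the anti-flap guard: a transient spike never pages; a sustained symptom does.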

Runbook per alert:

Each alert links to runbook (Notion / wiki) with:

  • What's the alert
  • Likely causes
  • Diagnostic steps
  • Mitigation steps
  • Escalation

For my alerts: [audit]

Output:

  1. Alert list
  2. Severity tiers
  3. Runbooks

The discipline: **runbook per alert**. On-call wakes; alert fires; runbook gives steps. Without: 30 minutes of fumbling. With: action in 5 min.

## Common Observability Mistakes

Help me avoid mistakes.

The 10 mistakes:

1. **Logs only; no metrics or traces.** Debugging takes 10x longer.

2. **No OpenTelemetry; vendor-locked.** Switching backends means re-instrumenting.

3. **Sampling 100% of traces at scale.** Cost explodes.

4. **Alerting on causes (CPU), not symptoms (latency).** Pages on the irrelevant; misses the real problems.

5. **No SLOs.** Reliability-vs-feature debates have no objective resolution.

6. **Every alert is "critical".** Alert fatigue; real alerts get missed.

7. **No runbooks.** On-call wakes up and doesn't know what to do.

8. **Datadog at indie scale.** $2K/mo when $200/mo would do.

9. **No custom metrics for business logic.** "Is checkout working?" isn't measurable.

10. **Observability as an afterthought.** Bolted on at year 3, after technical debt and production pain have accumulated.

For my system: [risks]

Output:

  1. Top 3 risks
  2. Mitigations
  3. Audit

The single most-painful mistake: **shipping without observability and adding it after a major outage**. Outage happens; you can't see what's wrong; recovery is slow; trust damage. Instrument from Day 1.

## What Done Looks Like

A working observability stack:
- OpenTelemetry SDK in all services
- Auto-instrumentation for HTTP / DB / Redis
- Custom spans for business logic
- Custom metrics for business KPIs (checkouts / signups / etc.)
- Backend: Honeycomb / Grafana Cloud / Vercel Observability / Datadog
- Tail-based sampling (all errors + 1% normal)
- 5-10 SLOs with error-budget tracking
- 5-15 alerts (mostly symptoms, not causes)
- Runbook per alert
- Cost: <2% of revenue at indie scale

The proof you got it right: when production breaks at 3 AM, on-call gets paged with a specific alert, opens the runbook, identifies the cause via traces in 10 min, and fixes it in 30. Compare to logs-only: a 2-hour debugging session.

## See Also

- [Logging Strategy & Structured Logs](logging-strategy-structured-logs-chat.md) — companion log layer
- [Performance Optimization](performance-optimization-chat.md) — metrics inform optimization
- [Service Level Agreements](service-level-agreements-chat.md) — SLAs depend on SLOs
- [Incident Response](incident-response-chat.md) — alerts → incident response
- [HTTP Retry & Backoff](http-retry-backoff-chat.md) — instrument retries
- [Multi-region Deployment](multi-region-deployment-chat.md) — region-tagged metrics
- [Database Indexing Strategy](database-indexing-strategy-chat.md) — query metrics drive indexing
- [Customer Analytics Dashboards](customer-analytics-dashboards-chat.md) — adjacent product analytics
- [Status Page (vibeweek)](status-page-chat.md) — SLO breaches trigger status updates
- [Backups & Disaster Recovery](backups-disaster-recovery-chat.md) — observability of backups
- [VibeReference: Observability Providers](https://vibereference.dev/devops-and-tools/observability-providers) — Datadog / New Relic / Honeycomb / etc.
- [VibeReference: Time-Series Database Providers](https://vibereference.dev/backend-and-data/time-series-database-providers) — metrics storage
- [VibeReference: Error Monitoring Providers](https://vibereference.dev/devops-and-tools/error-monitoring-providers) — Sentry / Bugsnag