# Metrics & OpenTelemetry Instrumentation: Numbers That Tell You Why At 3 AM
If you're shipping a SaaS in 2026 and only have logs (no metrics, no traces), debugging production is ten times harder than it needs to be. Logs answer "what happened on this specific request." Metrics answer "is the system healthy in aggregate." Traces answer "where exactly is the slowdown happening across services." Most indie SaaS ships with logs only, gets paged at 3 AM, spends an hour grepping logs for a pattern, and then realizes a metrics dashboard would have shown the issue in seconds. The fix is a deliberate observability stack: OpenTelemetry (the 2026 vendor-neutral standard) for tracing + metrics, paired with a backend (Datadog / Honeycomb / Grafana / New Relic / Vercel Observability) that stores and queries the data.
A working metrics + tracing strategy answers: what to instrument (USE / RED / Golden Signals), how to instrument (OpenTelemetry SDK; auto-instrumentation), what backend (Datadog / Honeycomb / Grafana / Vercel Observability), how to define SLIs and SLOs, how to alert (without alert fatigue), and how to make this affordable at indie scale (sample wisely; don't pay $5K/mo for traces).
This guide is the implementation playbook for metrics + traces. Companion to Logging Strategy & Structured Logs, Performance Optimization, Service Level Agreements, Incident Response, and HTTP Retry & Backoff.
## Why Metrics & Traces Matter
Get the failure modes clear first.
Help me understand what logs alone miss.
The 6 categories logs can't answer:
**1. Aggregate trends**
"Is p95 latency rising?" Logs: scrape thousands; computed by you. Metric: graph in seconds.
**2. Cross-request patterns**
"Is THIS endpoint slowing down for ALL users?" Logs: grep every request. Metric: dashboard.
**3. Distributed traces**
"User reports slow checkout. Where in the path?" Logs: grep across 4 services; correlate. Trace: visual span tree shows exact slow piece.
**4. Capacity planning**
"How busy is our DB?" Logs: not really. Metric: connection-pool utilization graph.
**5. Alerts on patterns**
"Alert if error rate > 5%." Logs: hard. Metric: alert rule.
**6. SLI / SLO tracking**
"What % of requests are < 200ms?" Logs: aggregate manually. Metric: built-in.
For my system:
- Top 5 production debugging pains
- Time spent grepping logs
Output:
1. Pains addressed by metrics
2. Pains addressed by traces
3. Priority order
The biggest unforced error: logs-only observability past 5 services. Distributed systems need traces; aggregate health needs metrics; debugging just gets harder without them.
## The Three Pillars: Logs, Metrics, Traces
Help me understand the layers.
The 3 pillars:
**1. Logs** (you have these)
- Discrete events with context
- Use for: specific request investigation; auditing; error context
- Volume: high; cost: high if retained long
- Tools: Datadog Logs / Loki / CloudWatch / Vercel Logs
**2. Metrics**
- Time-series numbers
- Use for: aggregate health; alerting; trends
- Volume: low; cost: low
- Tools: Prometheus / VictoriaMetrics / Datadog / Grafana / Honeycomb
**3. Traces**
- Request paths across services with timing
- Use for: latency debugging; distributed-system understanding
- Volume: medium; cost: medium (sampling)
- Tools: Jaeger / Zipkin / Tempo / Datadog APM / Honeycomb / Vercel Observability
**The 2026 reality**:
OpenTelemetry (OTel) is the standard:
- Single SDK instruments your code once
- Outputs traces + metrics + logs in standard format
- Send to any backend (Datadog / Honeycomb / Grafana Cloud / New Relic / Tempo / etc.)
- Vendor-neutral; switch backends without re-instrumenting
**Backend options**:
Cloud-managed:
- **Datadog** — most popular; expensive
- **New Relic** — solid alternative; free tier includes 100 GB/mo of ingest
- **Honeycomb** — observability-2.0; great for traces
- **Grafana Cloud** — open-source-friendly; cost-effective
- **Vercel Observability** — bundled if Vercel-hosted
- **AWS CloudWatch** — bundled if AWS-locked
Self-host:
- **Grafana + Tempo + Loki + Mimir** — full open-source stack
- **SigNoz** — OSS Datadog alternative
- **Prometheus + Tempo + Loki** — composable
For my stack: [pick]
Output:
1. Three pillars status
2. Backend pick
3. Migration plan
The 2026 default: OpenTelemetry SDK + Honeycomb / Grafana Cloud / Vercel Observability for indie / mid-market. Datadog when budget allows or enterprise procurement requires.
## OpenTelemetry: Instrument Once, Send Anywhere
Help me set up OpenTelemetry.
The basic setup:
```typescript
// Node.js: bootstrap the SDK with auto-instrumentation and an OTLP/HTTP
// trace exporter (Honeycomb shown as the example backend).
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'my-saas',
  traceExporter: new OTLPTraceExporter({
    url: 'https://api.honeycomb.io/v1/traces',
    headers: { 'x-honeycomb-team': process.env.HONEYCOMB_API_KEY ?? '' },
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```
Auto-instrumentation captures:
- HTTP server / client
- Express / Fastify / Hono routes
- Postgres / MySQL / MongoDB queries
- Redis commands
- gRPC calls
- AWS SDK calls
You get traces immediately for free.
Custom spans:
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-saas');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      const result = await doWork();
      span.setAttribute('result.size', result.length);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // always close the span, success or failure
    }
  });
}
```
Custom metrics:
```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('my-saas');
const checkoutCounter = meter.createCounter('checkouts_completed_total');
const checkoutDuration = meter.createHistogram('checkout_duration_ms');

async function checkout() {
  const start = Date.now();
  const result = await doCheckout();
  checkoutCounter.add(1, { status: result.status });
  checkoutDuration.record(Date.now() - start, { plan: result.plan });
  return result;
}
```
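One wiring note: the NodeSDK setup above configures only a trace exporter, so custom metrics need a metric reader to actually leave the process. A minimal sketch, assuming the `@opentelemetry/sdk-metrics` and `@opentelemetry/exporter-metrics-otlp-http` packages and Honeycomb's OTLP metrics endpoint:

```typescript
// Sketch: wire a periodic OTLP metric reader into the same NodeSDK so
// meter.createCounter() / createHistogram() data exports alongside traces.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'my-saas',
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'https://api.honeycomb.io/v1/metrics', // assumed endpoint; check your backend's docs
      headers: { 'x-honeycomb-team': process.env.HONEYCOMB_API_KEY ?? '' },
    }),
    exportIntervalMillis: 60_000, // push aggregated metrics every 60s
  }),
});
sdk.start();
```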
Logs (via OTel):
OpenTelemetry Logs are still maturing. Most stacks keep structured logs in a dedicated library (pino / winston) and use OTel for traces + metrics.
For my stack: [language]
Output:
- SDK setup
- Auto-instrumentation
- Custom spans / metrics
The win that compounds: **OpenTelemetry's vendor-neutrality**. Today: send to Honeycomb. Tomorrow: switch to Datadog without re-instrumenting. Lock-in avoided.
## Metric Frameworks: USE, RED, Golden Signals
Help me decide what to measure.
The 3 frameworks:
USE (Brendan Gregg): per resource (CPU / memory / disk / network)
- Utilization: % busy
- Saturation: queue depth
- Errors: error count
For: infrastructure monitoring (server CPU; disk).
RED (Tom Wilkie): per service / endpoint
- Rate: requests / second
- Errors: error count or rate
- Duration: latency distribution
For: service-level monitoring (API endpoints).
Golden Signals (Google SRE): per service
- Latency: response time distribution
- Traffic: request rate
- Errors: error rate
- Saturation: how full the system is
For: holistic service monitoring (combines RED + USE).
For most SaaS in 2026 (see the middleware sketch after these lists):
Per HTTP endpoint, track:
- Request rate (req/s)
- Error rate (% errors)
- p50 / p95 / p99 latency
Per background job:
- Job rate (jobs/s)
- Failure rate
- Duration distribution
Per critical resource:
- DB connection pool utilization
- Redis memory
- Queue depth
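Recording RED per endpoint is a few lines of middleware. A hypothetical Express sketch, reusing the OTel meter from the setup earlier (the metric names and the `redMetrics` helper are illustrative):

```typescript
import { metrics } from '@opentelemetry/api';
import type { Request, Response, NextFunction } from 'express';

const meter = metrics.getMeter('my-saas');
const requests = meter.createCounter('http_requests_total');
const duration = meter.createHistogram('http_request_duration_ms');

// RED middleware: one counter covers Rate and Errors (via status_class),
// one histogram covers Duration.
export function redMetrics(req: Request, res: Response, next: NextFunction) {
  const start = Date.now();
  res.on('finish', () => {
    const labels = {
      route: req.route?.path ?? req.path, // prefer the route template to keep cardinality low
      method: req.method,
      status_class: `${Math.floor(res.statusCode / 100)}xx`, // 2xx / 4xx / 5xx
    };
    requests.add(1, labels);
    duration.record(Date.now() - start, labels);
  });
  next();
}
```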
For my service: [endpoints]
Output:
- Framework pick
- Metrics per endpoint
- Per-resource metrics
The pragmatic 2026 default: **RED for services + USE for infrastructure**. Golden Signals is roughly the union; same ideas, different naming.
## SLIs and SLOs
Help me set SLIs and SLOs.
SLI (Service Level Indicator): a metric measuring service quality.
Examples:
- "% of requests completing in <200ms"
- "% of requests succeeding (no 5xx)"
- "% of background jobs completing within SLA"
SLO (Service Level Objective): the target for an SLI.
Examples:
- "99% of requests in <200ms (over 30 days)"
- "99.9% success rate (over 30 days)"
- "95% of background jobs in <5 min"
Error budget:
100% - SLO = error budget. A 99% SLO → 1% error budget ≈ 7 hours/month of allowed SLO misses.
When you exhaust budget: stop releasing risky changes; focus on reliability.
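The arithmetic is worth making concrete; a quick sketch (the `errorBudget` helper is illustrative):

```typescript
// Error budget = (100% - SLO) of the window. For a 30-day window:
function errorBudget(sloPercent: number, windowDays = 30) {
  const budgetFraction = (100 - sloPercent) / 100; // e.g. 0.01 for a 99% SLO
  const windowMinutes = windowDays * 24 * 60;      // 43,200 minutes in 30 days
  return {
    budgetFraction,                                    // share of requests allowed to fail
    allowedBadMinutes: budgetFraction * windowMinutes, // minutes of full SLO miss allowed
  };
}

console.log(errorBudget(99));   // ~432 bad minutes ≈ 7.2 hours/month
console.log(errorBudget(99.9)); // ~43 bad minutes/month
```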
Common SLOs for SaaS:
- API: 99.9% success; 99% < 200ms
- Login flow: 99.99% success
- Webhook delivery: 95% within 60s; 99% within 5 min
- Email send: 99% within 5 min
Implementing:
Define SLI in observability tool:
```yaml
sli:
  request_success:
    metric: http_requests_total{status!~"5.."}
    threshold: 99.9% over 30d
  request_latency:
    metric: http_request_duration_ms
    threshold: 95% < 200ms over 30d
```
Tools:
- Honeycomb / Datadog / Grafana have SLO features
- Manual via metric query + alert
For my service:
- Critical endpoints
- SLO targets
Output:
- SLI list
- SLOs per SLI
- Error-budget tracking
The discipline: **SLOs drive prioritization**. When error budget is depleted, ship reliability work; don't ship features. This converts "should we improve quality?" debates into objective decisions.
## Tracing in Practice
Help me use traces.
Trace = full request path with timing per span.
Example: customer reports slow checkout.
Without traces:
- Grep logs for their request
- Correlate timestamps across services
- Estimate which piece is slow
- Guess
With traces:
- Search Honeycomb / Datadog for slow checkouts
- Click a slow trace
- Visual: API gateway 5ms → auth 10ms → checkout 4500ms → payment 100ms
- → Checkout function is the slow piece; investigate
Auto-instrumentation traces:
OpenTelemetry auto-instrumentation gives you:
- HTTP request spans
- DB query spans (with SQL)
- HTTP client spans (calls to external APIs)
- Redis command spans
For free.
Adding custom spans:
For business logic worth tracing:
```typescript
// Illustrative fragment: `rows` comes from the surrounding parse logic.
await tracer.startActiveSpan('parse_csv', async (span) => {
  try {
    span.setAttribute('csv.row_count', rows.length);
    // ...
  } finally {
    span.end(); // end the span even if parsing throws
  }
});
```
Sampling:
Traces are voluminous. Most apps sample.
Strategies:
- Head-based: decide at request start (e.g. 10% of requests)
- Tail-based: decide at request end (e.g. all errors + 1% of normal)
- Adaptive: more sampling on errors / slow requests
Most tools (Honeycomb / Tempo / Vercel) support tail-based.
Cost management:
100K req/day × all traces × 30 days = expensive. 100K req/day × 1% sampled + all errors = affordable.
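Head-based sampling can be configured in the SDK itself; a minimal sketch follows (tail-based "all errors + 1%" needs a collector-side decision, e.g. the OpenTelemetry Collector's tail-sampling processor, because errors aren't known until the request ends):

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  serviceName: 'my-saas',
  // Respect upstream sampling decisions; sample ~1% of new root traces.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.01),
  }),
});
sdk.start();
```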
For my product:
- Trace volume estimate
- Sampling strategy
Output:
- Trace coverage
- Custom spans
- Sampling
The single highest-leverage observability investment: **tail-based sampling with all errors traced + 1% normal**. Catches all anomalies; costs 99% less.
## Backends: Cost vs Features
Help me pick a backend.
The 2026 landscape:
Datadog:
- Most popular; comprehensive
- Pricing: $15-50/host + $1-5 per million metrics + $1-3 per GB logs + $1-3 per million traces
- Total: $500-5000/mo at indie scale; $20K-200K/yr at mid-market
- Pros: best UX; most integrations
- Cons: very expensive
Honeycomb:
- Observability-2.0 (events-based; queryable)
- Pricing: Free 20M events/mo; $130/mo for 100M events; tiered
- Pros: best for traces; great query language; reasonable price
- Cons: events model different from traditional metrics
Grafana Cloud:
- Tempo (traces) + Mimir (metrics) + Loki (logs)
- Pricing: free 10K series metrics; $99/mo for 100GB logs; etc.
- Pros: cost-effective; OSS-aligned; powerful
- Cons: setup complexity
New Relic:
- All-in-one observability
- Pricing: 100GB free; $0.30/GB after; user-based seats
- Pros: free tier is real (most usable free tier)
- Cons: dated UX in places
Vercel Observability:
- Bundled with Vercel deployments
- Free tier; usage-based
- Pros: zero-config for Vercel apps
- Cons: less powerful than dedicated tools
SigNoz (OSS Datadog alternative):
- Self-host; cloud option
- Free OSS; cloud $200/mo+
- Pros: cost-effective; modern
- Cons: ops burden if self-host
Sentry:
- Errors-focused; performance monitoring layered
- Pricing: $26-89/mo team; usage-based
- Pros: best error tracking
- Cons: not full APM
The 2026 default for indie: Vercel Observability (if Vercel-hosted) OR Honeycomb Free OR Grafana Cloud Free.
For mid-market: Honeycomb OR Grafana Cloud OR New Relic.
For enterprise: Datadog if budget OK; New Relic alternative.
For my stack: [pick]
Output:
- Backend pick
- Cost estimate
- Migration plan
The 2026 cost reality: **Datadog at indie scale = $500-2000/mo; same observability via Honeycomb / Grafana Cloud = $50-300/mo**. Pick by cost; performance at indie scale is not the differentiator.
## Alerting Without Fatigue
Help me alert correctly.
The 4 alert principles:
1. Alert on symptoms, not causes
Bad: "DB CPU > 80%" (cause; might not affect users) Good: "p95 latency > 500ms for 5 min" (symptom; affects users)
2. Alert on what wakes someone
If alert fires at 3 AM, on-call wakes up. Make it worth waking.
Per alert: "Is this worth paging at 3 AM?" If no: lower severity (Slack channel; not page).
3. Hierarchy of severity
- Critical: page immediately (production-down; SLO-breach)
- High: page during business hours
- Medium: Slack alert; team reviews
- Low: dashboard only
4. Reduce false-positive rate
False alerts → alert fatigue → real alerts ignored.
- Tune thresholds based on baselines
- Use rate-of-change instead of absolute (% of normal)
- Multi-condition alerts (e.g. elevated latency AND elevated error rate, sustained for a duration)
Standard alerts for SaaS (a Prometheus-style sketch of the first one follows the list):
- Production 5xx rate > 1% for 5 min → critical
- p95 latency > [SLO threshold] for 10 min → critical
- DB connection pool > 90% for 5 min → high
- Disk usage > 80% → high; > 95% → critical
- Background job queue depth growing for 10 min → high
- Auth failure rate > 5% for 5 min → critical
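As a concrete example, the first alert might look like this as a hypothetical Prometheus-style rule (the metric names and runbook URL are illustrative):

```yaml
groups:
  - name: saas-symptom-alerts
    rules:
      - alert: High5xxRate
        # Symptom-based: share of requests returning 5xx over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m # must be sustained before paging
        labels:
          severity: critical
        annotations:
          summary: "Production 5xx rate above 1% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-5xx"
```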
Runbook per alert:
Each alert links to runbook (Notion / wiki) with:
- What's the alert
- Likely causes
- Diagnostic steps
- Mitigation steps
- Escalation
For my alerts: [audit]
Output:
- Alert list
- Severity tiers
- Runbooks
The discipline: **runbook per alert**. On-call wakes; alert fires; runbook gives steps. Without: 30 minutes of fumbling. With: action in 5 min.
## Common Observability Mistakes
Help me avoid mistakes.
The 10 mistakes:
1. Logs only; no metrics or traces. Debugging takes 10x longer.
2. No OpenTelemetry; vendor-locked. Switching backends means re-instrumenting.
3. Sampling 100% of traces at scale. Cost explodes.
4. Alerting on causes (CPU), not symptoms (latency). Pages on the irrelevant; misses the real.
5. No SLOs. Reliability-vs-features debates have no objective resolution.
6. Every alert is "critical". Alert fatigue; real alerts get missed.
7. No runbooks. On-call wakes up and doesn't know what to do.
8. Datadog at indie scale. $2K/mo when $200/mo would do.
9. No custom metrics for business logic. "Is checkout working?" isn't measurable.
10. Observability as an afterthought. Bolted on in year 3, after technical debt and production pain have accumulated.
For my system: [risks]
Output:
- Top 3 risks
- Mitigations
- Audit
The single most-painful mistake: **shipping without observability and adding it after a major outage**. Outage happens; you can't see what's wrong; recovery is slow; trust damage. Instrument from Day 1.
## What Done Looks Like
A working observability stack:
- OpenTelemetry SDK in all services
- Auto-instrumentation for HTTP / DB / Redis
- Custom spans for business logic
- Custom metrics for business KPIs (checkouts / signups / etc.)
- Backend: Honeycomb / Grafana Cloud / Vercel Observability / Datadog
- Tail-based sampling (all errors + 1% normal)
- 5-10 SLOs with error-budget tracking
- 5-15 alerts (mostly symptoms, not causes)
- Runbook per alert
- Cost: <2% of revenue at indie scale
The proof you got it right: when production breaks at 3 AM, on-call gets paged with a specific alert, opens the runbook, identifies the cause via traces in 10 minutes, and fixes it in 30. Compare with logs-only: a 2-hour debugging session.
## See Also
- [Logging Strategy & Structured Logs](logging-strategy-structured-logs-chat.md) — companion log layer
- [Performance Optimization](performance-optimization-chat.md) — metrics inform optimization
- [Service Level Agreements](service-level-agreements-chat.md) — SLAs depend on SLOs
- [Incident Response](incident-response-chat.md) — alerts → incident response
- [HTTP Retry & Backoff](http-retry-backoff-chat.md) — instrument retries
- [Multi-region Deployment](multi-region-deployment-chat.md) — region-tagged metrics
- [Database Indexing Strategy](database-indexing-strategy-chat.md) — query metrics drive indexing
- [Customer Analytics Dashboards](customer-analytics-dashboards-chat.md) — adjacent product analytics
- [Status Page (vibeweek)](status-page-chat.md) — SLO breaches trigger status updates
- [Backups & Disaster Recovery](backups-disaster-recovery-chat.md) — observability of backups
- [VibeReference: Observability Providers](https://vibereference.dev/devops-and-tools/observability-providers) — Datadog / New Relic / Honeycomb / etc.
- [VibeReference: Time-Series Database Providers](https://vibereference.dev/backend-and-data/time-series-database-providers) — metrics storage
- [VibeReference: Error Monitoring Providers](https://vibereference.dev/devops-and-tools/error-monitoring-providers) — Sentry / Bugsnag