
In-Product AI Agent Implementation — Chat Prompts


If you're a B2B SaaS in 2026 and customers are asking "can it just do this for me?" — you're hitting the moment where in-product AI agents become a differentiator. Not a chatbot ("ask me a question"); not a copilot ("help me write this"); a real agent — autonomous, multi-step, takes actions in your product, executes work that previously required a human. Linear's auto-triage, Notion's AI agents, Stripe's Optimized Checkout, HubSpot's Breeze Agents, Cursor's Background Agents, Anthropic's Claude Code — every modern SaaS in 2026 is shipping this shape. The competitive frontier moved from "we have AI features" to "we have an agent that does X for the customer."

The naive shape: "we'll just call OpenAI and it'll figure it out." This works for a 2-step demo and breaks at production scale: agents loop forever, hit rate limits, take destructive actions you didn't expect, blow through budget, fail mysteriously, and produce inconsistent results. The right shape: an agent loop with bounded tool calls, durable execution, structured outputs, evaluation harness, cost controls, safety review for destructive actions, audit logging, and a graceful degradation path. This is hard engineering — the gap between "demo agent" and "production agent customers trust" is months of work.

This chat walks through implementing a production-grade in-product AI agent: the loop architecture, tool design, durable execution, safety/sandboxing, evaluations, cost controls, observability, and the operational realities of running customer-defined agents at scale.

What you're building

  • An agent loop (LLM + tool-call + iterate) with bounded depth
  • A tool catalog (the actions the agent can take in your product)
  • Tool implementations with proper auth + scope + safety
  • Durable execution (resume on crash; long-running)
  • Streaming output to UI (visible progress for the customer)
  • Cost controls (per-customer, per-task budget)
  • Observability (every step logged + replayable)
  • Evaluations (regression-test agent behavior)
  • Safety / approval flows for destructive actions
  • Customer-facing analytics (this agent ran X times, succeeded Y%, cost $Z)

1. Decide the agent shape FIRST

Help me decide what shape of agent I'm building.

Five common in-product agent shapes — pick consciously:

SHAPE 1: COPILOT (suggestion-only; human approves)
- Agent suggests; user clicks accept
- Examples: Cursor in-line suggestions, Notion AI, Linear copy AI
- Pros: low risk; high transparency; trust builds gradually
- Cons: incremental productivity gain; not autonomous

SHAPE 2: SCOPED EXECUTOR (one-shot task; defined boundary)
- User says "do this specific thing"; agent does it; reports back
- Examples: "Generate report"; "Summarize this thread"; "Draft this email"
- Pros: bounded scope; low risk
- Cons: customer must invoke each time; not proactive

SHAPE 3: AUTONOMOUS LOOP (multi-step; takes actions; reports)
- Customer assigns goal; agent plans + acts + adjusts; reports outcome
- Examples: Linear Triage, HubSpot Breeze Agents, Stripe payment optimization
- Pros: real productivity; "set and forget"
- Cons: bigger surface area; safety / cost / loop risks

SHAPE 4: BACKGROUND AGENT (continuous; monitors + acts)
- Agent runs in background watching for triggers
- Examples: Cursor background agents, Vercel Agent investigations
- Pros: proactive; finds issues humans miss
- Cons: ongoing cost; harder to debug

SHAPE 5: AGENT-AS-USER (full proxy)
- Agent has its own identity in the product; takes actions as itself
- Examples: Devin AI assigned to repo; Claude Code as PR reviewer
- Pros: ultimate autonomy
- Cons: most safety-critical; legal/audit complexity

DEFAULT FOR MOST SaaS IN 2026:
- v1: Shape 1 (Copilot) for first AI feature
- v2: Shape 2 (Scoped Executor) for specific high-value tasks
- v3: Shape 3 (Autonomous Loop) when v2 has earned customer trust
- Don't pre-build Shape 4-5 without specific buyer demand

Which to build FIRST:
- New to AI features: Shape 1
- Already have AI features; ready to expand: Shape 2 → Shape 3
- Power users asking for autonomy: Shape 3 (with strong safety controls)

Output: explicit shape choice + scope statement (preventing scope creep) + what's NOT in v1.

2. Design the agent loop architecture

For Shape 3 (autonomous loop) — the most common 2026 implementation — here's the architecture.

Core loop:

while !done && step_count < max_steps:
  1. Build context window (system prompt + tools + history + state)
  2. Call LLM
  3. Receive response (text + optional tool_calls)
  4. If tool_calls:
     a. Validate each tool call (schema, permissions, safety)
     b. Execute tool calls (in parallel if independent; serially if dependent)
     c. Append results to history
  5. If text response with no tool calls: treat it as the final answer; mark the run complete
  6. Increment step counter
  7. Check stopping condition: success / max-steps / budget-exceeded / user-cancelled

Implementation building blocks:

Type definitions:

type AgentStep = {
  step_number: number
  thinking?: string  // model's reasoning if extended-thinking enabled
  tool_calls?: ToolCall[]
  tool_results?: ToolResult[]
  text_output?: string
  cost: { input_tokens: number; output_tokens: number; usd: number }
  duration_ms: number
}

type AgentRun = {
  id: string
  customer_id: string
  task: string
  status: 'pending' | 'running' | 'awaiting_approval' | 'completed' | 'failed' | 'cancelled' | 'budget_exceeded'
  steps: AgentStep[]
  total_cost_usd: number
  budget_usd: number
  max_steps: number
  started_at: Date
  completed_at?: Date
  result?: any
}

Loop pseudo-code (TypeScript / Vercel AI SDK / your stack):

async function runAgent(run: AgentRun, options: AgentOptions): Promise<AgentRun> {
  while (run.status === 'running' && run.steps.length < options.max_steps) {
    if (run.total_cost_usd >= run.budget_usd) {
      run.status = 'budget_exceeded'
      break
    }
    
    const stepNumber = run.steps.length + 1
    const startTime = Date.now()
    
    // NB: with streamText, toolCalls / text / usage resolve only after the
    // stream finishes; await them in real code (elided here for brevity)
    const model = 'anthropic/claude-sonnet-4-7'  // via Vercel AI Gateway
    const result = await streamText({
      model,
      system: buildSystemPrompt(run),
      messages: buildMessageHistory(run),
      tools: getToolsForCustomer(run.customer_id),
      maxOutputTokens: 4096,
    })
    
    const step: AgentStep = {
      step_number: stepNumber,
      tool_calls: result.toolCalls,
      tool_results: [],
      text_output: result.text,
      cost: computeCost(result.usage, model),
      duration_ms: Date.now() - startTime,
    }
    
    run.total_cost_usd += step.cost.usd
    
    // Execute tool calls (with safety + auth checks)
    if (result.toolCalls.length > 0) {
      step.tool_results = await executeToolCalls(result.toolCalls, run.customer_id)
    }
    
    run.steps.push(step)
    await persistRun(run)  // durable; resumable
    
    // Stop conditions
    if (!result.toolCalls.length && result.text) {
      run.status = 'completed'
      run.result = result.text
      break
    }
    
    // Check for "approval required" tool results
    if (step.tool_results.some(r => r.requires_approval)) {
      run.status = 'awaiting_approval'
      break
    }
  }
  
  if (run.status === 'running' && run.steps.length >= options.max_steps) {  // don't overwrite a terminal status
    run.status = 'failed'
    run.result = 'max_steps_exceeded'
  }
  
  return run
}

Key design decisions:

1. Max steps (typical 10-30)
- Prevents infinite loops
- Set per agent type (simple agents: 5; complex: 20)
- Customer-facing message when hit

2. Budget per run (typical $0.10 - $5.00)
- Prevents runaway cost
- Track per-customer monthly aggregate too
- Hard stop when exceeded

3. Persistence (every step)
- Save state to DB at each iteration
- Allows resume on crash
- Allows replay for debugging

4. Streaming UI
- Stream tokens to UI for visible progress
- Update task status as steps complete
- Customer sees the agent "thinking"

5. Approval flow
- Some tools require human approval (delete, send email, charge card)
- Pause agent; notify user; resume on approval
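
The budget check in the loop leans on the computeCost helper. A minimal sketch, assuming a per-million-token pricing table; the model names and rates below are placeholders for illustration, not real prices:

```typescript
// Per-step cost computation sketch. Rates are hypothetical; look up your
// provider's current pricing (it changes often) and treat cached input
// tokens at the discounted cache-read rate.
type Usage = { input_tokens: number; output_tokens: number; cached_tokens?: number }

const PRICING_USD_PER_MTOK: Record<string, { input: number; output: number; cachedInput: number }> = {
  // Placeholder rates for illustration only.
  'fast-model': { input: 0.25, output: 1.25, cachedInput: 0.025 },
  'smart-model': { input: 3.0, output: 15.0, cachedInput: 0.3 },
}

function computeCost(usage: Usage, model: string) {
  const p = PRICING_USD_PER_MTOK[model]
  if (!p) throw new Error(`unknown model: ${model}`)
  const cached = usage.cached_tokens ?? 0
  const uncachedInput = usage.input_tokens - cached
  const usd =
    (uncachedInput * p.input + cached * p.cachedInput + usage.output_tokens * p.output) / 1_000_000
  return { input_tokens: usage.input_tokens, output_tokens: usage.output_tokens, usd }
}
```

Accumulate this per step into run.total_cost_usd so the budget check fires mid-run, not after.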

Walk me through:
1. The full agent loop implementation
2. The persistence layer (how state survives crashes)
3. The streaming-to-UI integration
4. The cost tracking + budget enforcement
5. The max-steps + termination handling
6. The approval pause/resume flow

Output: an agent loop that survives production conditions.

3. Design the tool catalog

Tools are how your agent ACTS. Designing them is the most important part.

Tool design principles:

1. Each tool does ONE thing (single responsibility)
- Bad: "manage_task" (too vague; agent confused about what it does)
- Good: "create_task", "update_task_status", "assign_task" (each clear)

2. Tools have STRUCTURED outputs
- Bad: tool returns "Task created successfully"
- Good: tool returns { task_id, title, status, assignee }

3. Tools VALIDATE input
- Use Zod / Pydantic / your schema lib
- Reject invalid input with clear error message
- Don't crash on malformed input

4. Tools enforce PERMISSIONS
- The agent acts AS A specific user
- Tool checks: "can this user do this?"
- If no: return error; don't execute

5. Tools are IDEMPOTENT where possible
- Re-running same tool with same input → same result
- Use idempotency keys for create operations

6. Tools have CLEAR descriptions
- LLM reads description to decide when to call
- Bad: "Creates things"
- Good: "Create a new task in a project. Returns the new task. Use this when the user wants to add work to a project."

7. Tools handle ERRORS gracefully
- Don't throw exceptions to the agent loop (will loop trying the same thing)
- Return structured error: { error: "permission_denied", message: "User cannot create tasks in this project" }

Sample tool catalog for a project-management SaaS:

tool_catalog = [
  {
    name: 'list_projects',
    description: 'List the user\'s active projects. Returns up to 50.',
    input_schema: { workspace_id: 'uuid' },
    output_schema: { projects: [{ id, name, status, member_count }] },
    side_effect: false,
    requires_approval: false,
  },
  {
    name: 'list_tasks_in_project',
    description: 'List tasks in a specific project. Filters: status, assignee, due-date.',
    input_schema: { project_id, filters?: {status?, assignee?, due_before?} },
    output_schema: { tasks: [{...}] },
    side_effect: false,
    requires_approval: false,
  },
  {
    name: 'create_task',
    description: 'Create a new task in a project.',
    input_schema: { project_id, title, description?, assignee_id?, due_date?, priority? },
    output_schema: { task: {...} },
    side_effect: true,
    requires_approval: false,
  },
  {
    name: 'update_task_status',
    description: 'Change the status of an existing task (e.g., todo → in_progress → done).',
    input_schema: { task_id, new_status },
    output_schema: { task: {...} },
    side_effect: true,
    requires_approval: false,
  },
  {
    name: 'assign_task',
    description: 'Assign a task to a user.',
    input_schema: { task_id, user_id },
    output_schema: { task: {...} },
    side_effect: true,
    requires_approval: false,
  },
  {
    name: 'delete_task',
    description: 'Delete a task. DESTRUCTIVE - requires user approval.',
    input_schema: { task_id },
    output_schema: { task_id, deleted: true },
    side_effect: true,
    requires_approval: true,
  },
  {
    name: 'send_message_to_user',
    description: 'Send a direct message to a user in the workspace. DESTRUCTIVE - requires approval.',
    input_schema: { user_id, message },
    output_schema: { message_id },
    side_effect: true,
    requires_approval: true,
  },
  // ... etc
]

Read tools (no side effects): never require approval; agent should call freely.
Write tools: case-by-case; non-destructive can be auto; destructive should require approval.

Tool authoring guide:

A. Permissions check first:
async function executeListTasksInProject(input, context) {
  const project = await db.projects.findById(input.project_id)
  if (!project) return { error: 'not_found' }
  if (!await canRead(context.user_id, project)) {
    return { error: 'permission_denied', message: 'No access to this project' }
  }
  // ... actual logic
}

B. Validation:
const schema = z.object({
  project_id: z.string().uuid(),
  filters: z.object({
    status: z.enum(['todo', 'in_progress', 'done']).optional(),
    assignee: z.string().uuid().optional(),
  }).optional(),
})
const parsed = schema.safeParse(input)
if (!parsed.success) return { error: 'invalid_input', details: parsed.error }

C. Logging:
Every tool execution logged with input + output + duration + agent_run_id
For debugging + audit + replay
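
Principles 3, 4, and 7 compose into a single executor entry point. A sketch of a registry plus dispatcher; the Ctx and ToolDef shapes are assumptions (a real implementation would use Zod schemas for validate and your authz layer for authorize):

```typescript
// Tool registry + executor sketch. Every failure path returns a structured
// error to the agent loop rather than throwing.
type Ctx = { user_id: string; run_id: string }
type ToolResult = { ok: true; data: unknown } | { ok: false; error: string; message?: string }

type ToolDef = {
  name: string
  sideEffect: boolean
  requiresApproval: boolean
  validate: (input: any) => string | null          // error message, or null if valid
  authorize: (input: any, ctx: Ctx) => boolean     // "can this user do this?"
  execute: (input: any, ctx: Ctx) => Promise<unknown>
}

const registry = new Map<string, ToolDef>()
const register = (t: ToolDef) => registry.set(t.name, t)

async function executeToolCall(name: string, input: any, ctx: Ctx): Promise<ToolResult> {
  const tool = registry.get(name)
  // Hallucinated tool name: structured error lets the LLM self-correct.
  if (!tool) return { ok: false, error: 'unknown_tool', message: `No tool named ${name}` }
  const invalid = tool.validate(input)
  if (invalid) return { ok: false, error: 'invalid_input', message: invalid }
  if (!tool.authorize(input, ctx)) return { ok: false, error: 'permission_denied' }
  try {
    return { ok: true, data: await tool.execute(input, ctx) }
  } catch (e) {
    // Never throw into the agent loop; return a structured error instead.
    return { ok: false, error: 'tool_failed', message: String(e) }
  }
}
```

Log each invocation (input, output, duration, agent_run_id) inside executeToolCall so nothing bypasses the audit trail.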

Implement:
1. The tool definition format + registry
2. Schema validation per tool
3. Permission checks per tool
4. The tool executor (dispatches by name)
5. Tool result types (success / error structures)
6. The tool documentation page (for customer transparency: "your agent can do these things")

Output: a tool catalog the agent can compose into useful workflows.

4. Implement durability (resumable; crash-safe)

Real agents run for minutes (sometimes hours). Your worker may crash. Your DB may hiccup. The customer may close their browser. Agent must resume.

Durability strategy: every step persists to DB; agent worker is stateless.

Schema:

agent_runs (
  id              uuid pk
  customer_id     uuid
  task            text
  status          text
  budget_usd      numeric
  total_cost_usd  numeric
  max_steps       int
  config          jsonb  -- model, system prompt, tool subset
  result          jsonb
  error           text
  created_at      timestamptz
  started_at      timestamptz
  completed_at    timestamptz
  -- updated each step
  current_step    int default 0
  paused_reason   text  -- 'awaiting_approval' | 'budget_warning' | etc.
)

agent_run_steps (
  run_id          uuid not null
  step_number     int not null
  thinking        text
  tool_calls      jsonb
  tool_results    jsonb
  text_output     text
  input_tokens    int
  output_tokens   int
  cost_usd        numeric
  duration_ms     int
  occurred_at     timestamptz
  PRIMARY KEY (run_id, step_number)
)

agent_run_tool_invocations (  -- normalized; for searchability
  id              uuid pk
  run_id          uuid
  step_number     int
  tool_name       text
  input           jsonb
  output          jsonb
  status          text  -- 'success' | 'error' | 'pending_approval'
  duration_ms     int
  occurred_at     timestamptz
)

Worker design:

Use a durable workflow runtime: Inngest, Trigger.dev, Temporal, or Vercel Workflow (in 2026 — Workflow DevKit is GA).

Vercel Workflow example:

import { workflow } from '@vercel/workflow-devkit'

export const runAgentWorkflow = workflow({
  name: 'run-agent',
  run: async (ctx, { run_id }) => {
    const run = await ctx.step('load', () => db.agentRuns.findById(run_id))
    
    while (run.status === 'running' && run.current_step < run.max_steps) {
      // Each step is a durable atom
      const step = await ctx.step(`step-${run.current_step + 1}`, async () => {
        return await executeAgentStep(run)
      })
      
      run.current_step += 1
      run.total_cost_usd += step.cost_usd
      
      await ctx.step(`persist-${run.current_step}`, () => db.agentRuns.update(run))  // unique step name per iteration
      
      if (step.requires_approval) {
        await ctx.waitForEvent(`approval-${run_id}`, { timeout: '7 days' })
        // Resumed when user approves
      }
      
      if (run.total_cost_usd >= run.budget_usd) {
        run.status = 'budget_exceeded'
        break
      }
      
      if (step.is_final) {
        run.status = 'completed'
        break
      }
    }
    
    return run
  },
})

Key durability patterns:

1. Each step is a durable atom
- Worker crashes mid-step → re-execute that step (idempotent)
- Each step result persisted before continuing

2. Tool calls are NOT idempotent by default
- create_task called twice creates two tasks
- Use idempotency keys at tool layer (e.g., create_task_idempotency_key = hash(run_id, step_number))
- Tool implementations check idempotency key before creating

3. Approval pauses don't cost compute
- ctx.waitForEvent durably suspends execution
- Resume on event receipt
- 7-day timeout (configurable)

4. Cost tracking persists per step
- Even if final result fails, you have the cost trail
- Helpful for debugging "why did this cost $40?"

5. Long-running workflows: use cron + chunking
- Don't run a 4-hour agent in one workflow
- Chunk: workflow runs 30 min; persists state; cron resumes
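
Pattern 2 (tool-level idempotency keys) can be sketched like this; the in-memory Map stands in for a DB table keyed by idempotency key:

```typescript
// Idempotency sketch: key = hash(run_id, step_number, tool_name), so
// re-executing a crashed step replays the stored result instead of
// creating a second task.
import { createHash } from 'node:crypto'

const idempotencyStore = new Map<string, unknown>()  // stand-in for a DB table

function idempotencyKey(runId: string, stepNumber: number, toolName: string): string {
  return createHash('sha256').update(`${runId}:${stepNumber}:${toolName}`).digest('hex')
}

async function withIdempotency<T>(key: string, create: () => Promise<T>): Promise<T> {
  if (idempotencyStore.has(key)) return idempotencyStore.get(key) as T  // replay; don't re-create
  const result = await create()
  idempotencyStore.set(key, result)
  return result
}
```

In production the check-then-set must be atomic (unique constraint on the key column), otherwise two concurrent retries can both pass the has() check.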

Implement:
1. The agent_runs schema
2. The durable workflow runner (Vercel Workflow / Inngest / Temporal)
3. Tool-level idempotency keys
4. The approval-waiting pattern
5. The chunking pattern for long agents
6. The crash-recovery testing

Output: agents that survive production.

5. Implement safety + approval flow

For Shape 3+ agents, customer trust depends on safety. Get this right.

Safety layers:

A. Tool-level approval requirement:
- Each tool flagged: requires_approval = true | false
- Read tools: never approval
- Non-destructive writes: usually no approval
- Destructive operations: ALWAYS approval

B. Risk classification at tool definition:
type RiskLevel = 'safe' | 'cautious' | 'destructive' | 'irreversible'

- safe: read-only, idempotent
- cautious: write but reversible (create_task can be deleted)
- destructive: write that's hard to undo (delete_task; can be undeleted within 30 days)
- irreversible: cannot be undone (send_email externally; charge_card; production_deploy)

C. Approval UX:

When tool requires approval:
- Agent pauses
- UI shows: "Agent wants to do X" with description of action + impact
- Buttons: Approve / Deny / Approve All Future Similar
- On approve: tool executes; agent resumes
- On deny: agent receives "denied" result; can adjust

D. Batch approval pattern:
"Approve once" vs "Approve all future create_task calls in this run"
"Approve all future create_task calls forever" (per-customer setting)

E. Rate-limit approvals:
- If agent asks for >10 destructive approvals in 1 hour: alert (something's wrong)

F. Customer-set policies:
Settings page where customer admin defines policies:
- "My agent can never send emails to external addresses"
- "My agent can never delete more than 5 items in 1 hour"
- "My agent must ask approval for any action affecting >$10 of resources"

G. Sandboxing destructive actions:
- Pre-flight: simulate the action without committing
- Show customer: "Here's what would happen"
- Customer confirms; THEN commit

H. Audit log:
Every tool invocation logged with:
- Customer + agent_run_id
- Tool + input + output
- Approval state + approver
- Timestamp

Audit log is queryable + exportable (compliance).

I. "Pause-on-anomaly":
- Detect: agent making 100x normal tool calls
- Auto-pause; alert customer + your ops
- Human reviews before resume
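
Layer I (pause-on-anomaly) reduces to a sliding-window counter per customer. A sketch; the one-hour window and the multiplier are assumptions to tune against your own traffic:

```typescript
// Sliding-window anomaly check: signal a pause when tool-call volume in the
// last hour exceeds a multiple of the customer's normal baseline.
class ToolCallAnomalyDetector {
  private timestamps: number[] = []
  constructor(private baselinePerHour: number, private multiplier = 100) {}

  // Returns true when the run should be auto-paused and ops alerted.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs)
    const windowStart = nowMs - 3_600_000  // 1-hour window
    this.timestamps = this.timestamps.filter((t) => t >= windowStart)
    return this.timestamps.length > this.baselinePerHour * this.multiplier
  }
}
```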

Implement:
1. Tool-level requires_approval flagging
2. Approval UI (modal + queue page)
3. Batch approval logic
4. Per-customer safety policies
5. Pre-flight simulation for destructive actions
6. Audit log + export
7. Anomaly detection + auto-pause

Output: safety customers actually trust.

6. Build evaluations (regression tests for agents)

Without evals, every prompt change is a roll of the dice. Build evals from day one.

Eval types:

A. Golden-task tests:
- Curated set of 20-50 representative tasks
- Each task has: input + expected outcome (or grading rubric)
- Run nightly against staging
- Score: % of golden tasks passing
- Alert when score drops

B. Trajectory evals:
- For multi-step agents, evaluate the PATH not just the destination
- Did agent take the most efficient path?
- Did it make unnecessary tool calls?
- Score: cost efficiency, step count, redundancy

C. Failure analysis:
- Capture every failed run
- Cluster by failure type
- Track categories like "infinite loops", "wrong tool chosen", "hallucinated parameters"

D. A/B testing in production:
- Run new agent prompt vs old on subset of traffic
- Measure: customer satisfaction, completion rate, cost
- Promote winner

E. Human evaluation:
- Sample 50 runs/week; humans review
- Quality bar: 90%+ "good"; below that, fix prompts

Implementation:

evals_test_cases (
  id              uuid pk
  category        text  -- 'planning', 'tool-use', 'common-task'
  task_text       text
  expected_outcome jsonb  -- structured; may include specific tool calls expected, fields
  grading_rubric  jsonb   -- for LLM-as-judge grading
  difficulty      text    -- 'easy' | 'medium' | 'hard'
  enabled         bool default true
)

evals_run_results (
  id              uuid pk
  test_case_id    uuid
  agent_version   text
  model           text
  passed          bool
  score           numeric
  details         jsonb
  ran_at          timestamptz
)

LLM-as-judge for unstructured outcomes:

async function gradeRunWithLLM(run: AgentRun, expected: ExpectedOutcome): Promise<{ passed: boolean; score: number }> {
  const prompt = `
    Task: ${run.task}
    Expected outcome: ${JSON.stringify(expected)}
    Actual agent output: ${run.result}
    
    Grade on a scale of 0-10:
    - 10: perfectly accomplished the task
    - 5: partially accomplished but with errors
    - 0: completely failed
    
    Return JSON: { score: number, reasoning: string }
  `
  const result = await llm.generateObject({ prompt, schema: gradingSchema })
  return { passed: result.score >= 7, score: result.score }
}

Eval cadence:
- Nightly: full golden-task suite
- Per-PR: subset of fast tests (5 mins)
- Per release: full + LLM-judge + human spot-check
- Weekly: failure analysis + cluster review
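
The per-PR CI gate comes down to comparing pass rates between the baseline and candidate agent versions. A sketch; the 2-point tolerance is an assumption to loosen or tighten as the suite grows:

```typescript
// Regression gate sketch: block a release when the candidate's golden-task
// pass rate drops more than `tolerance` below the current baseline.
type EvalResult = { test_case_id: string; passed: boolean }

function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0
  return results.filter((r) => r.passed).length / results.length
}

function shouldBlockRelease(
  baseline: EvalResult[],
  candidate: EvalResult[],
  tolerance = 0.02,
): boolean {
  return passRate(candidate) < passRate(baseline) - tolerance
}
```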

Implement:
1. The evals test cases schema
2. The eval runner
3. LLM-as-judge integration
4. CI integration (block PRs that drop eval score)
5. Dashboards + alerts
6. Weekly failure-cluster review process

Output: confidence that prompt changes don't regress.

7. Cost controls + customer billing

Agents can burn money fast. Customers want predictability.

Cost layers:

A. Per-run budget:
- Customer-set or default ($0.50 default per run)
- Hard stop when exceeded
- "Budget exceeded" message in UI

B. Per-customer monthly budget:
- Customer plan tier sets monthly LLM budget
- Free tier: $5/mo
- Pro: $50/mo
- Enterprise: custom
- Soft alert at 75%, 90%, 100%

C. Per-task daily budget:
- Some agents run frequently (background); cap daily
- Free tier: 5 runs/day; Pro: 50; Enterprise: unlimited

D. Cost attribution:
- Each run tagged: which feature, which customer
- Surface in customer-facing analytics
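
The soft alerts at 75/90/100% reduce to a pure threshold check: given what's spent and which alerts already fired, which fire now. A sketch:

```typescript
// Budget alert sketch: returns the thresholds (as fractions of budget) that
// have just been crossed and not yet alerted. Caller sends the alerts and
// appends the returned values to `alreadySent`.
function budgetAlerts(spentUsd: number, budgetUsd: number, alreadySent: number[]): number[] {
  const thresholds = [0.75, 0.9, 1.0]
  return thresholds.filter((t) => spentUsd >= budgetUsd * t && !alreadySent.includes(t))
}
```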

Pricing models for agents:

1. INCLUDED IN PLAN
- Agent usage bundled in subscription
- Predictable cost
- You absorb LLM cost; price plans accordingly

2. METERED USAGE
- Customer pays per agent run or per token
- Stripe usage-based billing
- Cost tracks actual usage; less predictable for the customer

3. CREDIT-BASED
- Customer buys credits; agent runs consume credits
- Easier to communicate ("you have 100 agent runs left")
- Pre-paid revenue

4. HYBRID
- Plans include N runs free; overage metered
- Most common in 2026

Cost tracking:

agent_run_costs (
  run_id          uuid
  step_number     int
  model           text  -- 'claude-sonnet-4-7', 'gpt-5-mini'
  input_tokens    int
  output_tokens   int
  cached_tokens   int  -- prompt-cache hit
  unit_cost_input numeric  -- USD per token
  unit_cost_output numeric
  total_cost_usd  numeric
  PRIMARY KEY (run_id, step_number)
)

Cost optimization:
- Prompt caching (Anthropic / OpenAI): 50-90% savings on repeated context
- Use cheaper model for simple steps (Haiku for tool decisions; Sonnet for hard reasoning)
- Aggressive context pruning (don't pass full history to every step)
- Reuse tool results across runs (cache)
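
Tiered model selection can be a small routing function. The model names and heuristics below are placeholders; the useful property is that a step that has already failed escalates to the stronger model:

```typescript
// Model routing sketch: cheap model for routine steps, expensive model for
// reasoning-heavy steps, and automatic escalation after a failure.
type StepKind = 'tool_choice' | 'summarize' | 'plan' | 'final_answer'

function pickModel(kind: StepKind, priorFailures: number): string {
  if (priorFailures > 0) return 'smart-model'  // escalate after any failure
  switch (kind) {
    case 'tool_choice':
    case 'summarize':
      return 'cheap-model'   // routine; latency + cost matter most
    case 'plan':
    case 'final_answer':
      return 'smart-model'   // reasoning quality matters most
  }
}
```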

Customer-facing analytics:
- "This agent has run X times this month"
- "Cost: $Y of $Z budget"
- "Average cost per run: $W"

Implement:
1. The cost-tracking schema + ingestion
2. Per-run + per-customer + per-tier budget enforcement
3. Cost attribution to customer/feature
4. Customer-facing usage analytics
5. Pricing model + Stripe usage-based billing integration (if metered)
6. Cost optimization: prompt caching + tiered models

Output: agents that don't blow up the bill.

8. Observability + debugging

When agents fail, customers ask "why?" — give them a great answer.

Run-detail UI:

Per agent run, show:
- Task + timestamp
- Status (in progress / completed / failed / awaiting approval)
- Total duration
- Total cost ($X of $Y budget)
- Step-by-step expandable timeline:
  - Step 1: thinking + tool call + tool result
  - Step 2: thinking + tool call + tool result
  - ...
- Failure reason if applicable
- Retry button (re-run from current state OR from scratch)

Filtering + search:
- Filter by status, customer, date
- Search by task text
- Filter by tool used

Debugging tools:
- Step-through replay (animate through the agent's decisions)
- Compare runs (run A vs B)
- "Why didn't the agent do X?" — semantic search through reasoning

Internal admin tools:
- Top failing tasks (cluster by error type)
- Cost outliers (which runs cost > $5?)
- Slowest runs
- Customer-facing-error rate

Operational alerts:
- High failure rate per agent type (>10% in 1 hour)
- Cost spike (10x normal in 1 hour)
- Approval queue backing up (50+ pending)
- Specific tool calling > 10x normal rate

Implement:
1. The run-detail UI for customers
2. The internal admin tools
3. The replay / compare functionality
4. The alerting infrastructure
5. Slack/email/PagerDuty integration for alerts

Output: customers can self-debug; reduces support load.

9. Edge cases + operational realities

Walk me through:

1. Agent loops forever
- Detection: same tool called 5x in a row with same input
- Action: break out; mark failed; alert
- Customer message: "Agent appears to be stuck"
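
The "same tool 5x in a row with same input" detection is a short function over the run's recent tool calls:

```typescript
// Loop-detection sketch: a run is considered stuck when its last n tool
// calls share the same tool name and identical input. n=5 matches the
// heuristic above; tune it per agent type.
type Call = { tool: string; input: unknown }

function isStuck(calls: Call[], n = 5): boolean {
  if (calls.length < n) return false
  const tail = calls.slice(-n)
  const fingerprint = (c: Call) => `${c.tool}:${JSON.stringify(c.input)}`
  const first = fingerprint(tail[0])
  return tail.every((c) => fingerprint(c) === first)
}
```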

2. Agent calls non-existent tool
- LLM hallucinates a tool name
- Detection: tool_name not in catalog
- Action: return error to LLM; let it retry
- After 2 such errors: terminate run

3. LLM rate limited
- Anthropic / OpenAI returns 429
- Retry with exponential backoff
- Use Vercel AI Gateway for fallback to second model
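
Exponential backoff with jitter, as a generic wrapper; the attempt cap, base delay, and jitter range are assumptions:

```typescript
// Retry-with-backoff sketch: delay = base * 2^attempt + jitter. Errors the
// caller marks non-retryable (anything that isn't a 429/5xx) rethrow at once.
async function withBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (e: unknown) => boolean,
  maxAttempts = 5,
  baseMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (e) {
      if (!isRetryable(e) || attempt + 1 >= maxAttempts) throw e
      const delay = baseMs * 2 ** attempt + Math.random() * 100  // jitter avoids retry stampedes
      await new Promise((res) => setTimeout(res, delay))
    }
  }
}
```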

4. LLM provider outage
- Detect: > 3 consecutive errors
- Fallback to second provider (Anthropic → OpenAI or vice versa)
- AI Gateway makes this transparent

5. Customer cancels mid-run
- Mark status='cancelled'
- Workflow runtime stops (durable cancellation)
- Refund any in-progress cost (per pricing model)

6. Tool execution fails (e.g., DB down)
- Don't retry within agent loop (LLM doesn't know better)
- Retry in tool implementation with backoff
- After exhausted: return structured error to agent

7. Agent produces unexpected destructive plan
- Pre-flight detection: tool calls would delete > N items
- Auto-require approval for batch destructive ops
- Customer can preview impact before approving

8. Agent leaks information across customers
- CRITICAL: enforce customer_id at every tool call
- Tools verify caller has access
- Audit-test: simulate agent for customer A trying to access customer B's data

9. Customer's agent budget runs out mid-day
- Agent runs queued; status='budget_exceeded'
- Customer notified; option to top-up
- Don't auto-charge unless plan supports it

10. Customer requests agent run history for compliance
- Export all runs + steps + costs for date range
- CSV / JSON export
- 90-day retention default; 1-year for paid; 7 years for compliance-tier customers

11. New LLM model released
- Don't auto-switch (regression risk)
- Run evals against new model
- Promote if eval score equal or better
- Document model change in customer-facing changelog

12. Customer wants to extend agent with custom tools
- Don't enable arbitrary code execution (security)
- Provide: webhook tool (custom URL); workflow tool (compose existing tools)
- Sandbox via Vercel Sandbox / E2B for true custom code

13. Agent produces output that violates customer's policy
- Customer-specific content rules (e.g. "never mention competitor X")
- Post-generation filter
- Re-run if filter triggers

14. Agent run takes too long (3+ hours)
- Hard timeout; mark failed
- Refund partial cost
- Alert customer + ops

For each: code change + customer comms + ops impact.

Output: agents that survive real-world conditions.

10. Recap

What you've built:

  • Agent loop with bounded steps + budget
  • Tool catalog with permissions + validation
  • Durable execution (resume on crash)
  • Streaming UI (visible progress)
  • Approval flow for destructive actions
  • Customer-set safety policies
  • Evaluation harness (regression tests)
  • Cost tracking + budgets (per-run, per-customer)
  • Run-detail UI for customer self-debug
  • Operational alerts + admin tools
  • Audit log + export

What you're explicitly NOT shipping in v1:

  • Multi-agent orchestration (agents calling other agents)
  • Code execution sandbox in agent (defer to Vercel Sandbox / E2B)
  • Persistent agent memory across runs (defer)
  • Customer-defined custom tools beyond webhook (defer)
  • Agent-as-user (full identity proxy) (defer)
  • Federated agents across multi-tenant boundaries (anti-pattern)
  • Reinforcement learning fine-tuning of agent behavior (defer)

Ship v1: Shape 1 or 2 (copilot or scoped executor) with strong safety. Earn customer trust. Add Shape 3 (autonomous loop) for SPECIFIC high-value tasks. Don't pre-build Shape 4-5.

The biggest mistake teams make: shipping autonomous agents (Shape 3+) before earning trust with copilot/scoped (Shape 1-2). Customers panic when an agent "did the wrong thing" without approval.

The second mistake: skipping evaluations. Every prompt change is a regression risk. Build evals from day one.

The third mistake: ignoring cost. Agents at scale can cost $10-100/customer/month if unbounded. Per-run budget + per-customer monthly budget are non-negotiable.

The fourth mistake: tools without permission checks. The agent acts AS A user; tools must enforce that user's permissions. One leak across tenants = career-ending.
