In-Product AI Agent Implementation — Chat Prompts
If you're a B2B SaaS in 2026 and customers are asking "can it just do this for me?" — you're hitting the moment where in-product AI agents become a differentiator. Not a chatbot ("ask me a question"); not a copilot ("help me write this"); a real agent — autonomous, multi-step, takes actions in your product, executes work that previously required a human. Linear's auto-triage, Notion's AI agents, Stripe's Optimized Checkout, HubSpot's Breeze Agents, Cursor's Background Agents, Anthropic's Claude Code — every modern SaaS in 2026 is shipping this shape. The competitive frontier moved from "we have AI features" to "we have an agent that does X for the customer."
The naive shape: "we'll just call OpenAI and it'll figure it out." This works for a 2-step demo and breaks at production scale: agents loop forever, hit rate limits, take destructive actions you didn't expect, blow through budget, fail mysteriously, and produce inconsistent results. The right shape: an agent loop with bounded tool calls, durable execution, structured outputs, evaluation harness, cost controls, safety review for destructive actions, audit logging, and a graceful degradation path. This is hard engineering — the gap between "demo agent" and "production agent customers trust" is months of work.
This chat walks through implementing a production-grade in-product AI agent: the loop architecture, tool design, durable execution, safety/sandboxing, evaluations, cost controls, observability, and the operational realities of running customer-defined agents at scale.
What you're building
- An agent loop (LLM + tool-call + iterate) with bounded depth
- A tool catalog (the actions the agent can take in your product)
- Tool implementations with proper auth + scope + safety
- Durable execution (resume on crash; long-running)
- Streaming output to UI (visible progress for the customer)
- Cost controls (per-customer, per-task budget)
- Observability (every step logged + replayable)
- Evaluations (regression-test agent behavior)
- Safety / approval flows for destructive actions
- Customer-facing analytics (this agent ran X times, succeeded Y%, cost $Z)
1. Decide the agent shape FIRST
Help me decide what shape of agent I'm building.
Five common in-product agent shapes — pick consciously:
SHAPE 1: COPILOT (suggestion-only; human approves)
- Agent suggests; user clicks accept
- Examples: Cursor in-line suggestions, Notion AI, Linear copy AI
- Pros: low risk; high transparency; trust builds gradually
- Cons: incremental productivity gain; not autonomous
SHAPE 2: SCOPED EXECUTOR (one-shot task; defined boundary)
- User says "do this specific thing"; agent does it; reports back
- Examples: "Generate report"; "Summarize this thread"; "Draft this email"
- Pros: bounded scope; low risk
- Cons: customer must invoke each time; not proactive
SHAPE 3: AUTONOMOUS LOOP (multi-step; takes actions; reports)
- Customer assigns goal; agent plans + acts + adjusts; reports outcome
- Examples: Linear Triage, HubSpot Breeze Agents, Stripe payment optimization
- Pros: real productivity; "set and forget"
- Cons: bigger surface area; safety / cost / loop risks
SHAPE 4: BACKGROUND AGENT (continuous; monitors + acts)
- Agent runs in background watching for triggers
- Examples: Cursor background agents, Vercel Agent investigations
- Pros: proactive; finds issues humans miss
- Cons: ongoing cost; harder to debug
SHAPE 5: AGENT-AS-USER (full proxy)
- Agent has its own identity in the product; takes actions as itself
- Examples: Devin AI assigned to repo; Claude Code as PR reviewer
- Pros: ultimate autonomy
- Cons: most safety-critical; legal/audit complexity
DEFAULT FOR MOST SaaS IN 2026:
- v1: Shape 1 (Copilot) for first AI feature
- v2: Shape 2 (Scoped Executor) for specific high-value tasks
- v3: Shape 3 (Autonomous Loop) when v2 has earned customer trust
- Don't pre-build Shape 4-5 without specific buyer demand
Which to build FIRST:
- New to AI features: Shape 1
- Already have AI features; ready to expand: Shape 2 → Shape 3
- Power users asking for autonomy: Shape 3 (with strong safety controls)
Output: explicit shape choice + scope statement (including what's NOT in v1) to prevent scope creep.
2. Design the agent loop architecture
For Shape 3 (autonomous loop) — the most common 2026 implementation — here's the architecture.
Core loop:
while !done && step_count < max_steps:
1. Build context window (system prompt + tools + history + state)
2. Call LLM
3. Receive response (text + optional tool_calls)
4. If tool_calls:
a. Validate each tool call (schema, permissions, safety)
b. Execute tool calls (in parallel if independent; serially if dependent)
c. Append results to history
5. If text-only response (no tool calls): the run is complete; the text is the final answer
6. Increment step counter
7. Check stopping condition: success / max-steps / budget-exceeded / user-cancelled
Implementation building blocks:
Type definitions:
type AgentStep = {
step_number: number
thinking?: string // model's reasoning if extended-thinking enabled
tool_calls?: ToolCall[]
tool_results?: ToolResult[]
text_output?: string
cost: { input_tokens: number; output_tokens: number; usd: number }
duration_ms: number
}
type AgentRun = {
id: string
customer_id: string
task: string
status: 'pending' | 'running' | 'awaiting_approval' | 'completed' | 'failed' | 'cancelled' | 'budget_exceeded'
steps: AgentStep[]
total_cost_usd: number
budget_usd: number
max_steps: number
started_at: Date
completed_at?: Date
result?: any
}
Loop pseudo-code (TypeScript / Vercel AI SDK / your stack):
import { streamText } from 'ai'

const MODEL = 'anthropic/claude-sonnet-4-7' // routed via Vercel AI Gateway

async function runAgent(run: AgentRun, options: AgentOptions): Promise<AgentRun> {
while (run.status === 'running' && run.steps.length < options.max_steps) {
if (run.total_cost_usd >= run.budget_usd) {
run.status = 'budget_exceeded'
break
}
const stepNumber = run.steps.length + 1
const startTime = Date.now()
    const result = streamText({
      model: MODEL,
      system: buildSystemPrompt(run),
      messages: buildMessageHistory(run),
      tools: getToolsForCustomer(run.customer_id),
      maxOutputTokens: 4096,
    })
    // streamText returns immediately -- pipe result.textStream to the UI here;
    // text / toolCalls / usage are promises that resolve when the step finishes
    const [text, toolCalls, usage] = await Promise.all([
      result.text,
      result.toolCalls,
      result.usage,
    ])
    const step: AgentStep = {
      step_number: stepNumber,
      tool_calls: toolCalls,
      tool_results: [],
      text_output: text,
      cost: computeCost(usage, MODEL),
      duration_ms: Date.now() - startTime,
    }
run.total_cost_usd += step.cost.usd
// Execute tool calls (with safety + auth checks)
    if (toolCalls.length > 0) {
      step.tool_results = await executeToolCalls(toolCalls, run.customer_id)
}
run.steps.push(step)
await persistRun(run) // durable; resumable
// Stop conditions
    if (!toolCalls.length && text) {
      run.status = 'completed'
      run.result = text
break
}
// Check for "approval required" tool results
if (step.tool_results.some(r => r.requires_approval)) {
run.status = 'awaiting_approval'
break
}
}
  if (run.status === 'running' && run.steps.length >= options.max_steps) {
    run.status = 'failed'
    run.result = 'max_steps_exceeded'
  }
  await persistRun(run) // persist the terminal status before returning
  return run
}
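The loop above leans on persistRun for durability. A minimal sketch, assuming a db client with transactions and table accessors matching the schema in section 4 (the accessor names are illustrative, not a prescribed API):

// Sketch: write the run header and the newest step atomically, so a crash
// between writes can't leave run status and step history disagreeing.
async function persistRun(run: AgentRun): Promise<void> {
  await db.transaction(async (tx) => {
    await tx.agentRuns.update(run.id, {
      status: run.status,
      total_cost_usd: run.total_cost_usd,
      current_step: run.steps.length,
      result: run.result ?? null,
    })
    const latest = run.steps[run.steps.length - 1]
    // Upsert keyed on (run_id, step_number) so crash-retries don't duplicate rows
    if (latest) await tx.agentRunSteps.upsert({ run_id: run.id, ...latest })
  })
}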
Key design decisions:
1. Max steps (typical 10-30)
- Prevents infinite loops
- Set per agent type (simple agents: 5; complex: 20)
- Customer-facing message when hit
2. Budget per run (typical $0.10 - $5.00)
- Prevents runaway cost
- Track per-customer monthly aggregate too
- Hard stop when exceeded
3. Persistence (every step)
- Save state to DB at each iteration
- Allows resume on crash
- Allows replay for debugging
4. Streaming UI
- Stream tokens to UI for visible progress
- Update task status as steps complete
- Customer sees the agent "thinking"
5. Approval flow
- Some tools require human approval (delete, send email, charge card)
- Pause agent; notify user; resume on approval
Walk me through:
1. The full agent loop implementation
2. The persistence layer (how state survives crashes)
3. The streaming-to-UI integration
4. The cost tracking + budget enforcement
5. The max-steps + termination handling
6. The approval pause/resume flow
Output: an agent loop that survives production conditions.
3. Design the tool catalog
Tools are how your agent ACTS. Designing them is the most important part.
Tool design principles:
1. Each tool does ONE thing (single responsibility)
- Bad: "manage_task" (too vague; agent confused about what it does)
- Good: "create_task", "update_task_status", "assign_task" (each clear)
2. Tools have STRUCTURED outputs
- Bad: tool returns "Task created successfully"
- Good: tool returns { task_id, title, status, assignee }
3. Tools VALIDATE input
- Use Zod / Pydantic / your schema lib
- Reject invalid input with clear error message
- Don't crash on malformed input
4. Tools enforce PERMISSIONS
- The agent acts AS A specific user
- Tool checks: "can this user do this?"
- If no: return error; don't execute
5. Tools are IDEMPOTENT where possible
- Re-running same tool with same input → same result
- Use idempotency keys for create operations
6. Tools have CLEAR descriptions
- LLM reads description to decide when to call
- Bad: "Creates things"
- Good: "Create a new task in a project. Returns the new task. Use this when the user wants to add work to a project."
7. Tools handle ERRORS gracefully
- Don't throw exceptions to the agent loop (will loop trying the same thing)
- Return structured error: { error: "permission_denied", message: "User cannot create tasks in this project" }
Sample tool catalog for a project-management SaaS:
tool_catalog = [
{
name: 'list_projects',
description: 'List the user\'s active projects. Returns up to 50.',
input_schema: { workspace_id: 'uuid' },
output_schema: { projects: [{ id, name, status, member_count }] },
side_effect: false,
requires_approval: false,
},
{
name: 'list_tasks_in_project',
description: 'List tasks in a specific project. Filters: status, assignee, due-date.',
input_schema: { project_id, filters?: {status?, assignee?, due_before?} },
output_schema: { tasks: [{...}] },
side_effect: false,
requires_approval: false,
},
{
name: 'create_task',
description: 'Create a new task in a project.',
input_schema: { project_id, title, description?, assignee_id?, due_date?, priority? },
output_schema: { task: {...} },
side_effect: true,
requires_approval: false,
},
{
name: 'update_task_status',
description: 'Change the status of an existing task (e.g., todo → in_progress → done).',
input_schema: { task_id, new_status },
output_schema: { task: {...} },
side_effect: true,
requires_approval: false,
},
{
name: 'assign_task',
description: 'Assign a task to a user.',
input_schema: { task_id, user_id },
output_schema: { task: {...} },
side_effect: true,
requires_approval: false,
},
{
name: 'delete_task',
description: 'Delete a task. DESTRUCTIVE - requires user approval.',
input_schema: { task_id },
output_schema: { task_id, deleted: true },
side_effect: true,
requires_approval: true,
},
{
name: 'send_message_to_user',
description: 'Send a direct message to a user in the workspace. IRREVERSIBLE - requires approval.',
input_schema: { user_id, message },
output_schema: { message_id },
side_effect: true,
requires_approval: true,
},
// ... etc
]
Read tools (no side effects): never require approval; agent should call freely.
Write tools: case-by-case; non-destructive can be auto; destructive should require approval.
Tool authoring guide:
A. Permissions check first:
async function executeListTasksInProject(input, context) {
const project = await db.projects.findById(input.project_id)
if (!project) return { error: 'not_found' }
if (!await canRead(context.user_id, project)) {
return { error: 'permission_denied', message: 'No access to this project' }
}
// ... actual logic
}
B. Validation:
import { z } from 'zod'

const schema = z.object({
project_id: z.string().uuid(),
filters: z.object({
status: z.enum(['todo', 'in_progress', 'done']).optional(),
assignee: z.string().uuid().optional(),
}).optional(),
})
const parsed = schema.safeParse(input)
if (!parsed.success) return { error: 'invalid_input', details: parsed.error }
C. Logging:
Every tool execution logged with input + output + duration + agent_run_id
For debugging + audit + replay
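D. Dispatch: putting A-C together, a sketch of the executor that dispatches by name. The toolRegistry shape and the logToolInvocation helper are illustrative assumptions, not a prescribed API:

import { z } from 'zod'

type ToolContext = { user_id: string; run_id: string }
type ToolResult =
  | { ok: true; data: unknown }
  | { ok: false; error: string; message?: string }
type ToolDef = {
  description: string
  schema: z.ZodTypeAny
  side_effect: boolean
  requires_approval: boolean
  execute: (input: unknown, ctx: ToolContext) => Promise<ToolResult>
}

// Registry maps tool name -> definition (create_task, list_projects, ...)
const toolRegistry: Record<string, ToolDef> = {}

async function executeToolCall(name: string, rawInput: unknown, ctx: ToolContext): Promise<ToolResult> {
  const tool = toolRegistry[name]
  // Hallucinated tool name: return a structured error so the model can self-correct
  if (!tool) return { ok: false, error: 'unknown_tool', message: `No tool named ${name}` }
  const parsed = tool.schema.safeParse(rawInput)
  if (!parsed.success) return { ok: false, error: 'invalid_input', message: parsed.error.message }
  const started = Date.now()
  const result = await tool.execute(parsed.data, ctx) // each handler does its own permission check
  // logToolInvocation is a hypothetical audit helper (input + output + duration + run id)
  await logToolInvocation({ ...ctx, tool_name: name, input: parsed.data, output: result, duration_ms: Date.now() - started })
  return result
}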
Implement:
1. The tool definition format + registry
2. Schema validation per tool
3. Permission checks per tool
4. The tool executor (dispatches by name)
5. Tool result types (success / error structures)
6. The tool documentation page (for customer transparency: "your agent can do these things")
Output: a tool catalog the agent can compose into useful workflows.
4. Implement durability (resumable; crash-safe)
Real agents run for minutes (sometimes hours). Your worker may crash. Your DB may hiccup. The customer may close their browser. The agent must resume where it left off.
Durability strategy: every step persists to DB; agent worker is stateless.
Schema:
agent_runs (
id uuid pk
customer_id uuid
task text
status text
budget_usd numeric
total_cost_usd numeric
max_steps int
config jsonb -- model, system prompt, tool subset
result jsonb
error text
created_at timestamptz
started_at timestamptz
completed_at timestamptz
-- updated each step
current_step int default 0
paused_reason text -- 'awaiting_approval' | 'budget_warning' | etc.
)
agent_run_steps (
run_id uuid not null
step_number int not null
thinking text
tool_calls jsonb
tool_results jsonb
text_output text
input_tokens int
output_tokens int
cost_usd numeric
duration_ms int
occurred_at timestamptz
PRIMARY KEY (run_id, step_number)
)
agent_run_tool_invocations ( -- normalized; for searchability
id uuid pk
run_id uuid
step_number int
tool_name text
input jsonb
output jsonb
status text -- 'success' | 'error' | 'pending_approval'
duration_ms int
occurred_at timestamptz
)
Worker design:
Use a durable workflow runtime: Inngest, Trigger.dev, Temporal, or Vercel Workflow (in 2026 — Workflow DevKit is GA).
Vercel Workflow example:
import { workflow } from '@vercel/workflow-devkit'
export const runAgentWorkflow = workflow({
name: 'run-agent',
run: async (ctx, { run_id }) => {
const run = await ctx.step('load', () => db.agentRuns.findById(run_id))
while (run.status === 'running' && run.current_step < run.max_steps) {
// Each step is a durable atom
const step = await ctx.step(`step-${run.current_step + 1}`, async () => {
return await executeAgentStep(run)
})
run.current_step += 1
run.total_cost_usd += step.cost_usd
await ctx.step(`persist-${run.current_step}`, () => db.agentRuns.update(run)) // unique step name per iteration so replays don't collide
if (step.requires_approval) {
await ctx.waitForEvent(`approval-${run_id}-${run.current_step}`, { timeout: '7 days' }) // one event per pause
// Resumed when user approves
}
if (run.total_cost_usd >= run.budget_usd) {
run.status = 'budget_exceeded'
break
}
if (step.is_final) {
run.status = 'completed'
break
}
}
return run
},
})
Key durability patterns:
1. Each step is a durable atom
- Worker crashes mid-step → re-execute that step (idempotent)
- Each step result persisted before continuing
2. Tool calls are NOT idempotent by default
- create_task called twice creates two tasks
- Use idempotency keys at tool layer (e.g., create_task_idempotency_key = hash(run_id, step_number))
- Tool implementations check the idempotency key before creating (see the sketch after this list)
3. Approval pauses don't cost compute
- ctx.waitForEvent durably suspends execution
- Resume on event receipt
- 7-day timeout (configurable)
4. Cost tracking persists per step
- Even if final result fails, you have the cost trail
- Helpful for debugging "why did this cost $40?"
5. Long-running workflows: use cron + chunking
- Don't run a 4-hour agent in one workflow
- Chunk: workflow runs 30 min; persists state; cron resumes
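A sketch of tool-level idempotency (pattern 2 above). The idempotencyKeys table and its accessors are illustrative assumptions:

import { createHash } from 'node:crypto'

// Derive a stable key so a re-executed step replays the same create exactly once
function idempotencyKey(runId: string, stepNumber: number, toolName: string): string {
  return createHash('sha256').update(`${runId}:${stepNumber}:${toolName}`).digest('hex')
}

async function createTaskIdempotent(
  input: Record<string, unknown>,
  runId: string,
  stepNumber: number,
) {
  const key = idempotencyKey(runId, stepNumber, 'create_task')
  // Hypothetical table with a UNIQUE constraint on key; the constraint closes
  // the race between two crashed-and-retried executions of the same step
  const existing = await db.idempotencyKeys.find(key)
  if (existing) return existing.result // step was re-executed after a crash; replay the old result
  const task = await db.tasks.insert(input)
  await db.idempotencyKeys.insert({ key, result: { task } })
  return { task }
}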
Implement:
1. The agent_runs schema
2. The durable workflow runner (Vercel Workflow / Inngest / Temporal)
3. Tool-level idempotency keys
4. The approval-waiting pattern
5. The chunking pattern for long agents
6. The crash-recovery testing
Output: agents that survive production.
5. Implement safety + approval flow
For Shape 3+ agents, customer trust depends on safety. Get this right.
Safety layers:
A. Tool-level approval requirement:
- Each tool flagged: requires_approval = true | false
- Read tools: never approval
- Non-destructive writes: usually no approval
- Destructive operations: ALWAYS approval
B. Risk classification at tool definition:
type RiskLevel = 'safe' | 'cautious' | 'destructive' | 'irreversible'
- safe: read-only, idempotent
- cautious: write but reversible (create_task can be deleted)
- destructive: write that's hard to undo (delete_task; can be undeleted within 30 days)
- irreversible: cannot be undone (send_email externally; charge_card; production_deploy)
C. Approval UX:
When tool requires approval:
- Agent pauses
- UI shows: "Agent wants to do X" with description of action + impact
- Buttons: Approve / Deny / Approve All Future Similar
- On approve: tool executes; agent resumes
- On deny: agent receives "denied" result; can adjust
D. Batch approval pattern:
"Approve once" vs "Approve all future create_task calls in this run"
"Approve all future create_task calls forever" (per-customer setting)
E. Rate-limit approvals:
- If agent asks for >10 destructive approvals in 1 hour: alert (something's wrong)
F. Customer-set policies:
Settings page where customer admin defines policies:
- "My agent can never send emails to external addresses"
- "My agent can never delete more than 5 items in 1 hour"
- "My agent must ask approval for any action affecting >$10 of resources"
G. Sandboxing destructive actions:
- Pre-flight: simulate the action without committing
- Show customer: "Here's what would happen"
- Customer confirms; THEN commit
H. Audit log:
Every tool invocation logged with:
- Customer + agent_run_id
- Tool + input + output
- Approval state + approver
- Timestamp
Audit log is queryable + exportable (compliance).
I. "Pause-on-anomaly":
- Detect: agent making 100x normal tool calls
- Auto-pause; alert customer + your ops
- Human reviews before resume
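A sketch of the pause-on-anomaly check (I). The 100x threshold and the helper functions are illustrative assumptions:

async function pauseIfAnomalous(run: AgentRun): Promise<void> {
  // Count this run's tool invocations in the trailing hour (agent_run_tool_invocations)
  const callsLastHour = await countToolInvocations(run.id, { since: new Date(Date.now() - 3_600_000) })
  const baseline = await hourlyToolCallBaseline(run.customer_id) // hypothetical: median calls/hour for this customer
  if (baseline > 0 && callsLastHour > baseline * 100) {
    // Reuse awaiting_approval so a human must explicitly resume the run
    await db.agentRuns.update(run.id, {
      status: 'awaiting_approval',
      paused_reason: 'anomaly_detected',
    })
    await notifyCustomerAndOps(run.id, `tool-call volume ${callsLastHour}/hr vs baseline ${baseline}/hr`) // hypothetical alert helper
  }
}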
Implement:
1. Tool-level requires_approval flagging
2. Approval UI (modal + queue page)
3. Batch approval logic
4. Per-customer safety policies
5. Pre-flight simulation for destructive actions
6. Audit log + export
7. Anomaly detection + auto-pause
Output: safety customers actually trust.
6. Build evaluations (regression tests for agents)
Without evals, every prompt change is a roll of the dice. Build evals from day one.
Eval types:
A. Golden-task tests:
- Curated set of 20-50 representative tasks
- Each task has: input + expected outcome (or grading rubric)
- Run nightly against staging
- Score: % of golden tasks passing
- Alert when score drops
B. Trajectory evals:
- For multi-step agents, evaluate the PATH not just the destination
- Did agent take the most efficient path?
- Did it make unnecessary tool calls?
- Score: cost efficiency, step count, redundancy
C. Failure analysis:
- Capture every failed run
- Cluster by failure type
- Track categories like: "infinite loops", "wrong tool chosen", "hallucinated parameters"
D. A/B testing in production:
- Run new agent prompt vs old on subset of traffic
- Measure: customer satisfaction, completion rate, cost
- Promote winner
E. Human evaluation:
- Sample 50 runs/week; humans review
- Quality bar: 90%+ "good"; below that, fix prompts
Implementation:
evals_test_cases (
id uuid pk
category text -- 'planning', 'tool-use', 'common-task'
task_text text
expected_outcome jsonb -- structured; may include specific tool calls expected, fields
grading_rubric jsonb -- for LLM-as-judge grading
difficulty text -- 'easy' | 'medium' | 'hard'
enabled bool default true
)
evals_run_results (
id uuid pk
test_case_id uuid
agent_version text
model text
passed bool
score numeric
details jsonb
ran_at timestamptz
)
LLM-as-judge for unstructured outcomes:
import { generateObject } from 'ai'
import { z } from 'zod'

const gradingSchema = z.object({
  score: z.number().min(0).max(10),
  reasoning: z.string(),
})

async function gradeRunWithLLM(run: AgentRun, expected: ExpectedOutcome): Promise<{ passed: boolean; score: number }> {
  const prompt = `
Task: ${run.task}
Expected outcome: ${JSON.stringify(expected)}
Actual agent output: ${JSON.stringify(run.result)}
Grade on a scale of 0-10:
- 10: perfectly accomplished the task
- 5: partially accomplished but with errors
- 0: completely failed
`
  // generateObject enforces gradingSchema, so no "return JSON" instruction is needed
  const { object } = await generateObject({ model: MODEL, prompt, schema: gradingSchema })
  return { passed: object.score >= 7, score: object.score }
}
Eval cadence:
- Nightly: full golden-task suite
- Per-PR: subset of fast tests (5 mins)
- Per release: full + LLM-judge + human spot-check
- Weekly: failure analysis + cluster review
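A sketch of the nightly runner tying this together (uses runAgent from section 2 and gradeRunWithLLM above; newRunForTask, alertOps, and the db accessors are hypothetical):

async function runGoldenSuite(agentVersion: string): Promise<void> {
  const cases = await db.evalsTestCases.findMany({ enabled: true })
  let passedCount = 0
  for (const tc of cases) {
    // Run each golden task against staging with the normal step cap
    const run = await runAgent(newRunForTask(tc.task_text), { max_steps: 20 })
    const grade = await gradeRunWithLLM(run, tc.expected_outcome)
    await db.evalsRunResults.insert({
      test_case_id: tc.id,
      agent_version: agentVersion,
      model: MODEL,
      passed: grade.passed,
      score: grade.score,
      ran_at: new Date(),
    })
    if (grade.passed) passedCount++
  }
  const passRate = passedCount / Math.max(cases.length, 1)
  if (passRate < 0.9) await alertOps(`golden-suite pass rate dropped to ${(passRate * 100).toFixed(0)}%`)
}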
Implement:
1. The evals test cases schema
2. The eval runner
3. LLM-as-judge integration
4. CI integration (block PRs that drop eval score)
5. Dashboards + alerts
6. Weekly failure-cluster review process
Output: confidence that prompt changes don't regress.
7. Cost controls + customer billing
Agents can burn money fast. Customers want predictability.
Cost layers:
A. Per-run budget:
- Customer-set or default ($0.50 default per run)
- Hard stop when exceeded
- "Budget exceeded" message in UI
B. Per-customer monthly budget:
- Customer plan tier sets monthly LLM budget
- Free tier: $5/mo
- Pro: $50/mo
- Enterprise: custom
- Soft alert at 75%, 90%, 100%
C. Per-task daily budget:
- Some agents run frequently (background); cap daily
- Free tier: 5 runs/day; Pro: 50; Enterprise: unlimited
D. Cost attribution:
- Each run tagged: which feature, which customer
- Surface in customer-facing analytics
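A sketch of a pre-run gate combining layers B and C above (all helper names and thresholds are hypothetical):

async function canStartRun(customerId: string, perRunBudgetUsd: number): Promise<{ ok: boolean; reason?: string }> {
  const spent = await monthToDateSpendUsd(customerId) // hypothetical: SUM(total_cost_usd) this month
  const monthly = await monthlyBudgetUsdFor(customerId) // from plan tier
  if (spent + perRunBudgetUsd > monthly) return { ok: false, reason: 'monthly_budget_exceeded' }
  if (spent >= monthly * 0.9) await sendBudgetAlert(customerId, spent / monthly) // soft alert at 90%
  const runsToday = await runCountToday(customerId)
  if (runsToday >= dailyRunLimitFor(customerId)) return { ok: false, reason: 'daily_run_limit' }
  return { ok: true }
}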
Pricing models for agents:
1. INCLUDED IN PLAN
- Agent usage bundled in subscription
- Predictable cost
- You absorb LLM cost; price plans accordingly
2. METERED USAGE
- Customer pays per agent run or per token
- Stripe usage-based billing
- Cost scales with usage; less predictable for the customer
3. CREDIT-BASED
- Customer buys credits; agent runs consume credits
- Easier to communicate ("you have 100 agent runs left")
- Pre-paid revenue
4. HYBRID
- Plans include N runs free; overage metered
- Most common in 2026
Cost tracking:
agent_run_costs (
run_id uuid
step_number int
model text -- 'claude-sonnet-4-7', 'gpt-5-mini'
input_tokens int
output_tokens int
cached_tokens int -- prompt-cache hit
unit_cost_input numeric -- USD per token
unit_cost_output numeric
total_cost_usd numeric
PRIMARY KEY (run_id, step_number)
)
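The loop in section 2 calls computeCost; a minimal sketch (the per-million-token prices are placeholders, not published rates; substitute your real ones):

// Placeholder prices in USD per million tokens -- substitute current rates
const PRICES: Record<string, { input: number; output: number }> = {
  'anthropic/claude-sonnet-4-7': { input: 3, output: 15 },
  'gpt-5-mini': { input: 0.25, output: 2 },
}

function computeCost(
  usage: { inputTokens: number; outputTokens: number },
  model: string,
): { input_tokens: number; output_tokens: number; usd: number } {
  const price = PRICES[model]
  if (!price) throw new Error(`no price configured for model ${model}`)
  const usd =
    (usage.inputTokens / 1_000_000) * price.input +
    (usage.outputTokens / 1_000_000) * price.output
  return { input_tokens: usage.inputTokens, output_tokens: usage.outputTokens, usd }
}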
Cost optimization:
- Prompt caching (Anthropic / OpenAI): 50-90% savings on repeated context
- Use cheaper model for simple steps (Haiku for tool decisions; Sonnet for hard reasoning)
- Aggressive context pruning (don't pass full history to every step)
- Reuse tool results across runs (cache)
Customer-facing analytics:
- "This agent has run X times this month"
- "Cost: $Y of $Z budget"
- "Average cost per run: $W"
Implement:
1. The cost-tracking schema + ingestion
2. Per-run + per-customer + per-tier budget enforcement
3. Cost attribution to customer/feature
4. Customer-facing usage analytics
5. Pricing model + Stripe usage-based billing integration (if metered)
6. Cost optimization: prompt caching + tiered models
Output: agents that don't blow up the bill.
8. Observability + debugging
When agents fail, customers ask "why?" — give them a great answer.
Run-detail UI:
Per agent run, show:
- Task + timestamp
- Status (in progress / completed / failed / awaiting approval)
- Total duration
- Total cost ($X of $Y budget)
- Step-by-step expandable timeline:
- Step 1: thinking + tool call + tool result
- Step 2: thinking + tool call + tool result
- ...
- Failure reason if applicable
- Retry button (re-run from current state OR from scratch)
Filtering + search:
- Filter by status, customer, date
- Search by task text
- Filter by tool used
Debugging tools:
- Step-through replay (animate through the agent's decision)
- Compare runs (run A vs B)
- "Why didn't the agent do X?" — semantic search through reasoning
Internal admin tools:
- Top failing tasks (cluster by error type)
- Cost outliers (which runs cost > $5?)
- Slowest runs
- Customer-facing-error rate
Operational alerts:
- High failure rate per agent type (>10% in 1 hour)
- Cost spike (10x normal in 1 hour)
- Approval queue backing up (50+ pending)
- Specific tool calling > 10x normal rate
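A sketch of one such check, run on a schedule (the aggregate query and paging helper are hypothetical; thresholds from the list above):

// Evaluate the failure-rate alert, e.g. every 5 minutes per agent type
async function checkFailureRateAlert(agentType: string): Promise<void> {
  const { total, failed } = await runCountsLastHour(agentType) // hypothetical aggregate over agent_runs
  if (total >= 20 && failed / total > 0.1) {
    await pageOnCall(`${agentType}: failure rate ${((100 * failed) / total).toFixed(0)}% over the last hour`)
  }
}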
Implement:
1. The run-detail UI for customers
2. The internal admin tools
3. The replay / compare functionality
4. The alerting infrastructure
5. Slack/email/PagerDuty integration for alerts
Output: customers can self-debug; reduces support load.
9. Edge cases + operational realities
Walk me through:
1. Agent loops forever
- Detection: same tool called 5x in a row with same input
- Action: break out; mark failed; alert
- Customer message: "Agent appears to be stuck"
2. Agent calls non-existent tool
- LLM hallucinates a tool name
- Detection: tool_name not in catalog
- Action: return error to LLM; let it retry
- After 2 such errors: terminate run
3. LLM rate limited
- Anthropic / OpenAI returns 429
- Retry with exponential backoff
- Use Vercel AI Gateway for fallback to second model
4. LLM provider outage
- Detect: > 3 consecutive errors
- Fallback to second provider (Anthropic → OpenAI or vice versa)
- AI Gateway makes this transparent
5. Customer cancels mid-run
- Mark status='cancelled'
- Workflow runtime stops (durable cancellation)
- Refund any in-progress cost (per pricing model)
6. Tool execution fails (e.g., DB down)
- Don't retry via the agent loop (the LLM will just repeat the same call)
- Retry in tool implementation with backoff
- After exhausted: return structured error to agent
7. Agent produces unexpected destructive plan
- Pre-flight detection: tool calls would delete > N items
- Auto-require approval for batch destructive ops
- Customer can preview impact before approving
8. Agent leaks information across customers
- CRITICAL: enforce customer_id at every tool call
- Tools verify caller has access
- Audit-test: simulate agent for customer A trying to access customer B's data
9. Customer's agent budget runs out mid-day
- Agent runs queued; status='budget_exceeded'
- Customer notified; option to top-up
- Don't auto-charge unless plan supports it
10. Customer requests agent run history for compliance
- Export all runs + steps + costs for date range
- CSV / JSON export
- 90-day retention default; 1-year for paid; 7 years for compliance-tier customers
11. New LLM model released
- Don't auto-switch (regression risk)
- Run evals against new model
- Promote if eval score equal or better
- Document model change in customer-facing changelog
12. Customer wants to extend agent with custom tools
- Don't enable arbitrary code execution (security)
- Provide: webhook tool (custom URL); workflow tool (compose existing tools)
- Sandbox via [Vercel Sandbox / E2B] for true custom code
13. Agent produces output that violates customer's policy
- Customer-specific content rules (e.g. "never mention competitor X")
- Post-generation filter
- Re-run if filter triggers
14. Agent run takes too long (3+ hours)
- Hard timeout; mark failed
- Refund partial cost
- Alert customer + ops
For each: code change + customer comms + ops impact.
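For edge case 1, a sketch of stuck-loop detection over the persisted steps (ToolCall is assumed to carry a name plus input, matching the section-2 types):

// Flag N identical consecutive tool calls as a stuck loop (N = 5 per the note above)
function isStuckLoop(steps: AgentStep[], n = 5): boolean {
  const calls = steps.flatMap((s) => s.tool_calls ?? [])
  if (calls.length < n) return false
  const tail = calls.slice(-n)
  const key = (c: { name: string; input: unknown }) => `${c.name}:${JSON.stringify(c.input)}`
  return tail.every((c) => key(c) === key(tail[0]))
}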
Output: agents that survive real-world conditions.
10. Recap
What you've built:
- Agent loop with bounded steps + budget
- Tool catalog with permissions + validation
- Durable execution (resume on crash)
- Streaming UI (visible progress)
- Approval flow for destructive actions
- Customer-set safety policies
- Evaluation harness (regression tests)
- Cost tracking + budgets (per-run, per-customer)
- Run-detail UI for customer self-debug
- Operational alerts + admin tools
- Audit log + export
What you're explicitly NOT shipping in v1:
- Multi-agent orchestration (agents calling other agents)
- Code execution sandbox in agent (defer to Vercel Sandbox / E2B)
- Persistent agent memory across runs (defer)
- Customer-defined custom tools beyond webhook (defer)
- Agent-as-user (full identity proxy) (defer)
- Federated agents across multi-tenant boundaries (anti-pattern)
- Reinforcement learning fine-tuning of agent behavior (defer)
Ship v1: Shape 1 or 2 (copilot or scoped executor) with strong safety. Earn customer trust. Add Shape 3 (autonomous loop) for SPECIFIC high-value tasks. Don't pre-build Shape 4-5.
The biggest mistake teams make: shipping autonomous agents (Shape 3+) before earning trust with copilot/scoped (Shape 1-2). Customers panic when an agent "did the wrong thing" without approval.
The second mistake: skipping evaluations. Every prompt change is a regression risk. Build evals from day one.
The third mistake: ignoring cost. Agents at scale can cost $10-100/customer/month if unbounded. Per-run budget + per-customer monthly budget are non-negotiable.
The fourth mistake: tools without permission checks. The agent acts AS A user; tools must enforce that user's permissions. One leak across tenants = career-ending.
See Also
- AI Features Implementation — broader AI feature framework
- AI Streaming Chat UI — streaming pattern this builds on
- LLM Cost Optimization — cost-control patterns
- LLM Quality Monitoring — quality patterns
- In-Product Workflow & Automation Builder — adjacent (rule-based vs agent-based)
- Background Jobs & Queue Management — durable execution
- Audit Logs — pairs for agent audit
- Activity Feed & Timeline — events agents emit
- Roles & Permissions — pairs for tool permissions
- Quotas, Limits & Plan Enforcement — pairs for agent budgets
- Public API — adjacent surface
- OAuth Provider Implementation — adjacent identity
- Schema Validation (Zod) — tool input validation
- Idempotency Patterns — depended-upon discipline
- HTTP Retry & Backoff — depended-upon discipline
- AI Customer Support Agents (Reference) — adjacent (off-the-shelf agents)
- AI Agent Frameworks (Reference) — frameworks to build agents
- AI SDK (Reference) — pairs for streaming + tools
- Vercel AI Gateway (Reference) — model routing + observability
- Vercel Sandbox (Reference) — code execution sandbox