Why Agent Loops Fail in Production

Agent loops fail in production for reasons that have little to do with the model, and everything to do with what happens to their state between iterations.

Agent loops are repeatable workflows in which an AI agent:

observes state
decides what to do next
takes an action
evaluates the result
repeats that cycle until a task is complete

They matter because agent loops move AI from one-off assistance into operational execution that impacts your business: updating records, triggering workflows, approving outputs, and making decisions that affect customers, revenue, compliance, and production systems. As organizations deploy more production AI agents, reliability becomes an architectural requirement, not just a model-quality concern. An agent that reasons correctly can still cause costly business failures if the database can’t preserve consistent state, recover from interruptions, or provide an auditable record of what happened.

This article maps each common loop failure to where it starts in the database and the pattern that prevents it, with working code for both PostgreSQL and CockroachDB.

Why does AI agent reliability depend on the data layer?

On June 7, 2026, Peter Steinberger, the engineer behind OpenClaw, posted two sentences that drew millions of views within a day: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

Boris Cherny, who leads Claude Code at Anthropic, had made the same point onstage days earlier: "I don't prompt Claude anymore. I have loops running. They're the ones prompting Claude and figuring out what to do. My job is to write loops."

Andrej Karpathy made it concrete with autoresearch, a Python loop that edits a training script, runs a short training job on a single GPU, reads the result, and commits the change when the metric improves, with no one in the seat.

Latent Space covered the moment under the headline "Loopcraft: The Art of Stacking Loops."

LangChain's writing on loop engineering laid out a useful way to see the shape: four loops that build on each other:

The agent loop at the foundation, where the model reads state and calls tools until the task is done.
A verification loop above it that grades output against a rubric and sends failing results back with feedback.
An event-driven loop that fires the agent in response to an external signal.
A hill-climbing loop that reads accumulated history to improve future runs

Every one of those pieces covers how to design and stack loops. This article explains what happens when those loops hit a production database.

The Future of Databases for AI Agents (with Cockroach Labs CEO & Co-Founder Spencer Kimball) – A fireside chat exploring what happens to database infrastructure when autonomous agents, not humans, become the primary traffic source.

Why do agent loops break in real deployments but work in a demo?

A demo runs once: The state is clean, the load is light, nothing else is touching the same rows, and nothing crashes in the middle of a write. The loop completes cleanly because the conditions are ideal. Production is the opposite of all of that.

At scale, the same loop runs thousands of times across hundreds of concurrent sessions. Each iteration reads state that another iteration may have just written. Retries replay steps that already partially landed, and crashes interrupt writes halfway through. Approvals sit in pending state while the process restarts around them. Memory that was accurate six hours ago is now stale, and the loop keeps reasoning from it.

None of these failure modes surface in a proof of concept. Every one of them surfaces in live systems , and every one of them starts at the state layer.

The observe-act-evaluate cycle gets all the attention. LangGraph, the OpenAI Agents SDK, and Mastra are all built around it. A loop's reliability, however, is set by what happens to the state it reads and writes on every pass:

A model that reasons correctly over bad state produces bad outcomes.
A loop that retries cleanly over a database without transaction boundaries produces duplicate state.
A hill-climbing loop that accumulates corrupted memory gets worse with each iteration, not better.

How to solve this is a database question, and most production agent architectures are not asking it yet.

Which agent loop failures are actually database problems?

LangChain's four-loop model reveals where databases fail:

At the agent loop level, the model reads state and writes results on every iteration.
The verification loop reads that state, grades it, and writes the grade back, adding another read-and-write cycle on top of the first.
Event-driven loops start agents when external signals arrive, creating bursts where many agents read and write the same database tables at once. .
The hill-climbing loop reads accumulated history to improve future runs, so the quality of every future iteration depends on the integrity of every past write.

7 database failure modes that break production AI agents

Once agent loops move from isolated demos to production systems, their failures stop looking like model-quality problems and start looking like state-management problems. The risk isn’t just that one task fails: A loop can double-apply writes, act on stale memory, lose approval context, or erase the audit trail teams need to recover.

At business scale, those failures can affect customer records, financial workflows, compliance evidence, and the trust teams need to automate high-value work. They also increase operational costs by forcing teams to investigate inconsistent state, replay failed workflows, and manually reconstruct events after an incident. As AI agents take on more business-critical work, preventing these failures becomes as important as improving model quality.

Here are the seven failure modes to design against:

Writes without transaction management. When an agent's tool call fails midway through a sequence of writes, the writes that already landed stay. The loop retries on the next pass and double-applies them. A task is marked complete with no result attached, a balance gets updated twice, a record is created twice. The model reasoned correctly, but the state it wrote into is wrong.
Cascading degradation from bad reads. When an agent reads its own state or another agent's output mid-flight, an inconsistent read feeds the next step with bad input. Because each step treats the prior output as ground truth, the error propagates down the chain instead of surfacing. The further it travels, the more expensive it is to unwind.
Blast radius. A loop with write authority and access to its own backups can delete everything in a single pass. In one documented incident in April 2026, a single agent action deleted a production database and its co-located backups in seconds. The blast radius of a loop is a direct function of what its database role can reach, not of how well the model reasons.
Memory drift. The hill-climbing loop reads accumulated memory to improve, but memory written six hours ago may not reflect the source system now. A verification loop using a stale rubric can keep approving the wrong behavior, compounding the error with each iteration.
The recovery gap. When a loop crashes, a database restore can return data to a clean point. But the loop's position in the workflow, its working memory, and the side effects it already sent downstream do not restore with the database. Bringing all four layers back to a coherent operating state is a highly delayed recovery operation, not a restore.
Approval loss. When approval state is stored in application memory, a restart can erase the review context and leave the workflow stuck, abandoned, or resumed incorrectly.
The audit gap. Application logs are not enough to reconstruct thousands of loop actions; reviewers need a tamper-evident record of what changed, when, by which agent, and under which credentials.

These failures look different on the surface, but they have a common flaw: The loop can’t rely on the data layer to preserve state, constrain access, or reconstruct what happened. A reliable agent loop needs the database to provide those guarantees by design.

What does a database need to support reliable agent loops?

This table maps each loop failure to the database capability needed to prevent it, and to the CockroachDB feature that provides it.

The Architect's Playbook for Building AI-Ready Systems – This guide provides practical architectures and design patterns for building AI-ready systems on distributed SQL, covering vector workloads, real-time consistency, and global scalability.

When does a loop need a transaction?

The test is straightforward: If an interrupted tool call could leave behind valid-looking but incorrect data, wrap the write sequence in a transaction.

A task status update and the result record that explains it
A balance change and the ledger entry that records why
A memory write and the provenance record that makes it verifiable.
A grade from the verification loop and the output it evaluated.

The verification loop makes this especially important, because its output becomes input for future iterations. If the grade and the output it evaluates don’t land together, the loop’s feedback history becomes unreliable. A grade can approve an output that was never saved, or an output can be saved without the grade that approved it.

To the database, those fragmented writes may still look valid. To the loop, they become broken state that gets reused on the next pass. A failed transaction is visible and recoverable, but a partial write that passes schema validation can silently corrupt the loop’s state.

What happens when a loop pauses for human review?

LangChain describes this across three of the four loops:

in the verification loop, a human can act as the grader for sensitive workflows
in the application loop, a human can approve outputs before they return to the end user
in the hill-climbing loop, human review gates improvements before deployment.

Human review only works if the loop can pause at a known checkpoint and resume after a decision arrives. If that checkpoint or approval context lives in application memory, a restart can erase it, causing the loop to resume without context or abandon the task.

The fix is to store pause state in the database. Workflow tables record where the loop paused, approval queues persist the decision being awaited, and changefeeds can notify reviewers when a new request arrives. With that state stored durably, the loop can restart, find its place, and resume safely. The full schema is in the code section below.

How do you audit what a loop did across thousands of iterations?

Application logs aren’t an audit trail: They are mutable, distributed across systems, and built for debugging rather than accountability. Agent traces can help teams understand how a loop behaved, but they don’t prove what changed, which data version the loop used, or which credentials authorized the action.

A production audit trail needs three properties:

Append-only. Once a loop action is recorded, the agent should not be able to modify or delete it. Enforce that constraint with database privileges, not application logic an agent could bypass.

Tamper-evident. Audit records should survive even if the source data is changed, rolled back, or deleted. For stronger separation, use a changefeed to copy committed writes into object-locked storage the agent can’t access.

Action-specific. Each record should tie a specific loop action to the agent, credentials, data version, target row, and timestamp involved. Generic query logs and change history aren’t enough to reconstruct why a loop acted.

The EU AI Act's record-keeping requirements under Articles 12 and 26 point toward this kind of durable, accountable record. The full schema and privilege setup are in the code section below.

Built for AI: Scaling IAM, Metadata Management, and Vector Search on One Database — How leading AI companies unify transactional data, vector embeddings, and agent state on a single distributed SQL platform for governance, identity, and metadata at scale.

The following code examples map each loop failure to the database pattern that addresses it. The standard SQL blocks are tested on PostgreSQL 16. Blocks marked [CockroachDB] use CockroachDB v24.3 syntax that is not part of standard PostgreSQL. Replace the placeholder ids before running the examples, and keep any AS OF SYSTEM TIME timestamp inside the garbage-collection window. Together, these patterns form the operational foundation for reliable AI agents; the following examples show how to implement them.

How do you improve these database patterns? Code 1: Safe retries

The agent writes two rows that must stay consistent: the task status and the result that explains it. Without a transaction, a failure between the two writes leaves a state that looks valid but is not. The idempotent retry runs on a separate task so its result row starts empty.

-- The two tables
CREATE TABLE tasks (
  id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  status     TEXT NOT NULL DEFAULT 'pending'
             CHECK (status IN ('pending','running','complete','failed')),
  agent_id   TEXT NOT NULL,
  updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE task_results (
  task_id      UUID PRIMARY KEY REFERENCES tasks(id),
  output       JSONB NOT NULL,
  completed_at TIMESTAMPTZ DEFAULT now()
);

-- UNSAFE: two independent writes. If the INSERT fails, the task is
-- already 'complete' with no result, and a retry runs the UPDATE again.
UPDATE tasks SET status='complete', updated_at=now() WHERE id='<task-id>';
INSERT INTO task_results (task_id, output) VALUES ('<task-id>','{"summary":"done"}');

-- SAFE: one transaction. Both writes land or neither does.
BEGIN;
  UPDATE tasks SET status='complete', updated_at=now() WHERE id='<task-id>';
  INSERT INTO task_results (task_id, output) VALUES ('<task-id>','{"summary":"done"}');
COMMIT;

-- IDEMPOTENT RETRY: a replayed insert is a no-op. Shown on a fresh task
-- so its result row starts empty and the output below is what you get.
INSERT INTO task_results (task_id, output)
  VALUES ('<task-id-2>','{"summary":"first write"}')
  ON CONFLICT (task_id) DO NOTHING;            -- lands
INSERT INTO task_results (task_id, output)
  VALUES ('<task-id-2>','{"summary":"retry should not land"}')
  ON CONFLICT (task_id) DO NOTHING;            -- no-op
SELECT output FROM task_results WHERE task_id='<task-id-2>';
-- Returns: {"summary": "first write"}

Code 2: Role scoping (blast radius)

The blast radius of a loop is the scope of damage a single iteration can cause, and it’s set by database permissions, not model reasoning. Scope the role to the minimum, and verify the grant set before the agent goes near production. The agent role never receives credentials for backup storage.

CREATE ROLE agent_worker;

-- Grant only what the agent needs
GRANT SELECT, INSERT, UPDATE ON TABLE tasks TO agent_worker;
GRANT SELECT, INSERT ON TABLE task_results TO agent_worker;

-- Deny delete on the tables it writes to
REVOKE DELETE ON TABLE tasks FROM agent_worker;
REVOKE DELETE ON TABLE task_results FROM agent_worker;

-- Verify the grant set
-- [CockroachDB]:
SHOW GRANTS ON TABLE tasks FOR agent_worker;
-- PostgreSQL:
SELECT grantee, table_name, privilege_type
FROM information_schema.role_table_grants
WHERE grantee='agent_worker'
ORDER BY table_name, privilege_type;
-- Expected: SELECT, INSERT, UPDATE on tasks. No DELETE.

Code 3: Temporal reads [CockroachDB]

Read the database as it existed at a past timestamp without running a restore, while the production database stays live. A top-level AS OF SYSTEM TIME on a single table is valid. Comparing current state to past state cannot be one statement, because CockroachDB does not allow AS OF SYSTEM TIME only in a subquery and will not mix a live read and a historical read in the same statement. Stage the snapshot first, then join against it.

-- Read one table as it existed 30 minutes ago. Keep the timestamp
-- inside the GC window (gc.ttlseconds, about 25 hours by default).
SELECT id, status, updated_at
FROM tasks
AS OF SYSTEM TIME '-30m'
WHERE agent_id='agent-001';

-- "Which rows changed since the incident" is two statements.
-- 1) Stage the snapshot at a top-level AS OF SYSTEM TIME.
CREATE TABLE snap_before_incident AS
  SELECT id, status
  FROM tasks
  AS OF SYSTEM TIME '2026-06-23 14:22:00+00:00';

-- 2) Join the live table against the staged snapshot at current time.
SELECT cur.id,
       cur.status  AS current_status,
       snap.status AS status_before_incident,
       cur.updated_at
FROM tasks AS cur
JOIN snap_before_incident AS snap ON cur.id = snap.id
WHERE cur.status != snap.status;

-- Rebuild memory from a known-good point (single-table top-level AOST).
SELECT id, content, version, updated_at
FROM source_documents
AS OF SYSTEM TIME '2026-06-23 14:22:00+00:00'
WHERE updated_at < '2026-06-23 14:23:00+00:00';

Code 4: Checkpoint tables (durable pause and resume)

A loop that stores its pause point in the database finds its position again after any restart. A loop that stores it in application memory does not. The checkpoint row exists before the loop pauses it.

CREATE TABLE workflow_checkpoints (
  workflow_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  agent_id     TEXT NOT NULL,
  current_step TEXT NOT NULL,
  step_input   JSONB,
  paused_at    TIMESTAMPTZ,
  resumed_at   TIMESTAMPTZ,
  status       TEXT NOT NULL DEFAULT 'running'
               CHECK (status IN ('running','paused','approved','complete','failed')),
  created_at   TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE approval_queue (
  approval_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  workflow_id  UUID NOT NULL REFERENCES workflow_checkpoints(workflow_id),
  requested_by TEXT NOT NULL,
  decision     TEXT CHECK (decision IN ('approved','rejected')),
  decided_by   TEXT,
  requested_at TIMESTAMPTZ DEFAULT now(),
  decided_at   TIMESTAMPTZ
);
CREATE INDEX ON approval_queue (workflow_id, decision) WHERE decision IS NULL;

-- The checkpoint row exists before the loop pauses it.
INSERT INTO workflow_checkpoints (workflow_id, agent_id, current_step)
  VALUES ('<workflow-id>','agent-001','step_1');

-- Pause at a checkpoint
UPDATE workflow_checkpoints
  SET status='paused', paused_at=now(), current_step='awaiting_approval'
  WHERE workflow_id='<workflow-id>';
INSERT INTO approval_queue (workflow_id, requested_by)
  VALUES ('<workflow-id>','agent-001');

-- On restart: find exactly where the loop left off
SELECT wc.workflow_id, wc.current_step, wc.step_input, aq.approval_id, aq.requested_at
FROM workflow_checkpoints wc
JOIN approval_queue aq ON wc.workflow_id = aq.workflow_id
WHERE wc.status='paused' AND aq.decision IS NULL AND wc.agent_id='agent-001';

Code 5: Append-only audit trail

The difference between application logging and an audit trail is enforceability. The agent can write records but cannot modify or delete them, and the constraint lives in the database, not the application code. Tested: as agent_worker, both UPDATE and DELETE on this table are denied.

CREATE TABLE agent_audit_log (
  log_id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  agent_id       TEXT NOT NULL,
  workflow_id    UUID,
  action         TEXT NOT NULL,
  target_table   TEXT,
  target_row_id  TEXT,
  data_before    JSONB,
  data_after     JSONB,
  data_timestamp TIMESTAMPTZ,
  credentials    TEXT NOT NULL,
  logged_at      TIMESTAMPTZ DEFAULT now()
);

-- Insert allowed, modification denied. Enforced by privilege.
GRANT INSERT ON agent_audit_log TO agent_worker;
REVOKE UPDATE, DELETE ON agent_audit_log FROM agent_worker;

-- Write an audit record alongside the action it records
INSERT INTO agent_audit_log
  (agent_id, workflow_id, action, target_table, target_row_id,
   data_before, data_after, data_timestamp, credentials)
VALUES
  ('agent-001','<workflow-id>','UPDATE','tasks','<task-id>',
   '{"status":"pending"}','{"status":"complete"}', now(), 'agent_worker');

-- Reconstruct what the agent did during an incident window
SELECT log_id, agent_id, action, target_table, target_row_id, data_timestamp, logged_at
FROM agent_audit_log
WHERE agent_id='agent-001'
  AND logged_at BETWEEN '2026-06-23 14:20:00+00:00' AND '2026-06-23 14:30:00+00:00'
ORDER BY logged_at ASC;

Continue the Resilience and Recovery series

This article introduces the database failure patterns that make agent loops unreliable in production. Each failure has a dedicated deep dive in the “Resilience and Recovery” series, with full schema and tested code, publishing through August 2026. Links will be added here as each goes live: blast radius and backup separation, transaction integrity, rollback versus recovery, durable human-in-the-loop, memory integrity, and the agent audit trail.

Add these database rules to your AGENTS.md

The five patterns in this article map to agent configuration rules you can paste into your project's AGENTS.md. That file is read natively by Codex, Cursor, GitHub Copilot, and Windsurf, and by Claude Code via @import in your CLAUDE.md; Gemini CLI supports it through configuration. All SQL uses the schemas from the code examples above. Replace the placeholder values before deploying. Every SQL pattern is verified on PostgreSQL 16. The AS OF SYSTEM TIME block requires CockroachDB v24.3. Review the file before committing it: agents follow AGENTS.md instructions without validating their source, which makes it both a productivity tool and an injection surface worth auditing.

## Database rules for agent loops (CockroachDB)
### Transaction integrity
Always wrap related writes in one transaction. Never write two rows that must stay consistent as separate statements. If any step fails, all steps roll back.
    BEGIN;
      UPDATE tasks SET status='complete', updated_at=now() WHERE id='<task-id>';
      INSERT INTO task_results (task_id, output) VALUES ('<task-id>', '<output-json>');
    COMMIT;
Back every INSERT that may be replayed with ON CONFLICT DO NOTHING. A replayed step is a no-op, not a duplicate.
    INSERT INTO task_results (task_id, output)
      VALUES ('<task-id>', '<output-json>')
      ON CONFLICT (task_id) DO NOTHING;
### Role constraints
This agent runs as agent_worker: SELECT, INSERT, UPDATE only. No DELETE. No access to backup storage or backup credentials. Do not attempt any operation outside these permissions.
### Pause state
Create the checkpoint row before the loop may need to pause. On pause, write current position to the database. On restart, read from it before doing anything else.
    -- Create when the loop starts
    INSERT INTO workflow_checkpoints (workflow_id, agent_id, current_step, status)
      VALUES ('<workflow-id>', '<agent-id>', 'step_1', 'running');
    -- Pause
    UPDATE workflow_checkpoints
      SET status='paused', paused_at=now(), current_step='awaiting_approval'
      WHERE workflow_id='<workflow-id>';
    INSERT INTO approval_queue (workflow_id, requested_by)
      VALUES ('<workflow-id>', '<agent-id>');
    -- On restart: run this before doing anything else
    SELECT wc.workflow_id, wc.current_step, wc.step_input, aq.approval_id
    FROM workflow_checkpoints wc
    JOIN approval_queue aq ON wc.workflow_id = aq.workflow_id
    WHERE wc.status='paused' AND aq.decision IS NULL AND wc.agent_id='<agent-id>';
### Audit log
Write one record to agent_audit_log before completing any tool call that modifies a row. INSERT is permitted. UPDATE and DELETE are denied at the database privilege level.
    INSERT INTO agent_audit_log
      (agent_id, workflow_id, action, target_table, target_row_id,
       data_before, data_after, data_timestamp, credentials)
    VALUES
      ('<agent-id>', '<workflow-id>', '<action>', '<table>', '<row-id>',
       '<before-json>', '<after-json>', now(), 'agent_worker');
### Temporal reads [CockroachDB v24.3 -- not available in standard PostgreSQL]
Read the database as it existed at a past timestamp without running a restore.
    SELECT id, status, updated_at
    FROM tasks
    AS OF SYSTEM TIME '-30m';
Keep the timestamp inside the garbage-collection window (gc.ttlseconds, 25 hours by default). Do not use AS OF SYSTEM TIME in a subquery to compare current state against past state. CockroachDB will not mix a historical read and a live read in the same statement. Stage the snapshot first.
    -- 1) Stage the snapshot
    CREATE TABLE snap_before_incident AS
      SELECT id, status FROM tasks
      AS OF SYSTEM TIME '<incident-timestamp>';
    -- 2) Join the live table against it
    SELECT cur.id, cur.status AS current_status, snap.status AS status_before
    FROM tasks AS cur
    JOIN snap_before_incident AS snap ON cur.id = snap.id
    WHERE cur.status != snap.status;

Sources

Peter Steinberger (@steipete) on X, June 7, 2026

Boris Cherny (Anthropic), via Latent Space AINews, June 2026

swyx, "Loopcraft: The Art of Stacking Loops," Latent Space, June 2026

LangChain (Sydney Runkle), "The Art of Loop Engineering," June 2026

Addy Osmani, "Loop Engineering," June 2026

Andrej Karpathy, autoresearch

EU AI Act, Articles 12 and 26

CockroachDB docs, AS OF SYSTEM TIME

Built for AI-driven scale

Unify operational data, vector search, and durable agent state in one resilient, distributed SQL database. Start with $400 in free credits. Trusted by Fortune 50 financial institutions and teams in 40+ countries.

Quentin Packard is VP of Americas Sales at Cockroach Labs, where he works with engineering and infrastructure leaders building production-grade agentic AI systems. He previously helped build Splunk’s observability business and has worked across infrastructure automation, secrets management, and real-time data governance at HashiCorp and early stage startups. His writing draws on direct conversations with enterprise teams navigating AI and data architecture in production.

Why Agent Loops Fail in Production (and the Database Patterns That Fix Them)

Why does AI agent reliability depend on the data layer?

Related

Why do agent loops break in real deployments but work in a demo?

Which agent loop failures are actually database problems?

7 database failure modes that break production AI agents

What does a database need to support reliable agent loops?

Related

When does a loop need a transaction?

What happens when a loop pauses for human review?

How do you audit what a loop did across thousands of iterations?

Related

How do you improve these database patterns? Code 1: Safe retries

Code 2: Role scoping (blast radius)

Code 3: Temporal reads [CockroachDB]

Code 4: Checkpoint tables (durable pause and resume)

Code 5: Append-only audit trail

Continue the Resilience and Recovery series

Add these database rules to your AGENTS.md

Sources

Built for AI-driven scale