The coverage matrix

All 43. Where we actually stand.

Why these forty-three failures are one problem and not forty-three is the argument we make on the problem page. This page is the accounting that backs it.

Below is every problem the industry lists, where we stand on each, and what is actually solving it. We mark a row Shipped only when there is a receipt you can check. Everything else we call what it is.

This list is not ours. Every problem on it is documented in the public record: see the sources.

Jump to a category

AI agent security · Reliability and control · Cost and FinOps · Adoption and ROI · Shadow AI and governance · Context and memory · Workforce · Vendor and lock-in · Trust, evaluation, and quality · Where it shows up in the work · The bigger picture

Watch the rust chips

Watch the highlighted chips as you read. One system keeps doing the work: the brain we built. It is the answer on 18 separate rows, including every single problem under Context and memory, and it is what keeps the AI honest, traceable, and cheap across the rest. That is what "one problem in 43 costumes" looks like in practice. Not 43 fixes bolted on one at a time. A handful of systems, one of them dissolving a whole cluster at once.

Shipped

In progress

Roadmap

Outside the operating model

43 named problems. The brain appears on 18 of them. Every Shipped chip links to its receipt on /proof. In progress, Roadmap, and Outside rows do not, by design.

The 44th problem

There is a 44th problem nobody lists: overclaiming.

Calling a demo production, a plan shipped, a mitigation a cure. It is the most common problem in AI right now, and the easiest one to commit. So we mark a row Shipped only when there is a receipt you can check, and we call everything else what it is. This page updates as we close the gaps. That is the whole point.

AI agent security

The new hire should not get the keys to everything on day one.

Over-permissioned agents

Keystone Shipped

The AI carries no standing credentials. It pulls keys from encrypted storage only when it needs them, and its file access is fenced to an allowlist that hard-denies everything else. A security audit found and closed a dormant admin login and a redirect vector.
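A minimal sketch of what a deny-by-default file allowlist can look like. The roots, function names, and paths here are hypothetical, chosen for illustration. This is not the production code.

```python
from pathlib import Path

# Hypothetical allowlist roots; the real list is not published on this page.
ALLOWED_ROOTS = [Path("/srv/brain/knowledge"), Path("/srv/brain/drafts")]


def is_allowed(requested: str, roots=ALLOWED_ROOTS) -> bool:
    """Return True only if the resolved path sits inside an allowlisted root."""
    # resolve() collapses ".." segments, so traversal cannot escape a root.
    target = Path(requested).resolve()
    return any(target.is_relative_to(root.resolve()) for root in roots)


def read_file(requested: str) -> str:
    """Deny by default: anything not explicitly allowed raises."""
    if not is_allowed(requested):
        raise PermissionError(f"denied: {requested} is outside the allowlist")
    return Path(requested).read_text()
```

The check runs on the resolved path, not the string the caller sent, so a request like `/srv/brain/knowledge/../secrets/key.pem` is refused. Requires Python 3.9 or later for `is_relative_to`.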

Credentials in code or prompts

Keystone Shipped

Secrets never live in the source. The one time a key showed up in a chat, it was caught and rotated in about thirty minutes, and logged as a mistake instead of buried.

Connector and tool security (MCP)

Keystone Shipped

Every tool the AI can reach is scoped to an allowlist, and the server refuses anything outside it.

Prompt injection

Keystone The Brain The operating model In progress

Nobody has solved this, us included. We bound the damage instead: no standing credentials in the model path, least privilege, and a human gate on anything irreversible.

The lethal trifecta (private data, untrusted input, a way out)

Keystone The Brain The operating model In progress

The architecture is built to never have all three at once. The model path holds no credentials and no built-in way to send data out, and actions are human-gated.
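One way to make that rule checkable is to declare what each execution path can do and refuse any path that holds all three. A sketch under assumptions: the class and field names are hypothetical, not taken from the runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathCapabilities:
    """What one execution path is able to do (hypothetical descriptor)."""
    reads_private_data: bool
    takes_untrusted_input: bool
    can_send_data_out: bool


def has_lethal_trifecta(caps: PathCapabilities) -> bool:
    """True only when all three legs are present on the same path."""
    return (
        caps.reads_private_data
        and caps.takes_untrusted_input
        and caps.can_send_data_out
    )


def check_path(name: str, caps: PathCapabilities) -> None:
    """Raise at configuration time rather than discover the gap in production."""
    if has_lethal_trifecta(caps):
        raise ValueError(f"path '{name}' combines all three trifecta legs")
```

Any two legs pass. The third is what the check refuses.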

Agent blast radius on your live systems

Keystone Roadmap

Containing what an agent can break across your live environment is where we are headed. We run it on our own systems today. We do not claim it on yours yet.

Reliability and control

A brilliant junior who will confidently do the wrong thing, fast.

No human gate on irreversible actions

The Brain The operating model Shipped

Reading and routine work run on their own. Anything you cannot undo waits for a human to approve it.
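The shape of a gate like this is small. A sketch, assuming a hypothetical set of irreversible action kinds and an `approve` callback that stands in for the human decision:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical set of action kinds treated as irreversible.
IRREVERSIBLE = {"publish", "deploy", "hand_off"}


@dataclass
class Action:
    kind: str
    run: Callable[[], str]


def execute(action: Action, approve: Callable[[Action], bool]) -> str:
    """Run routine actions directly; irreversible ones wait on a human."""
    if action.kind in IRREVERSIBLE and not approve(action):
        return f"blocked: {action.kind} was not approved"
    return action.run()
```

A read goes straight through. A deploy runs only when `approve` returns True, and the gate is in the execution path, not in the prompt.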

Guardrails that govern actions, not just words

The Brain The operating model Shipped

Our guardrails stop unsafe actions like publishing, deploying, and handing off work. They are not just text filters.

"Set it loose and walk away"

The operating model The Brain Shipped

We published the opposite as our position: you lead the AI like a team, set direction, and gate what cannot be undone.

Reliability breaking down over many steps

The Brain The operating model In progress

We instrument drift and force checks between steps. That reduces multi-step failure. It does not eliminate it, and we say so.

Hallucination and context poisoning

The Brain In progress

Every claim is tagged as verified or inferred, and the failure mode is named and guarded. Mitigation, not a cure.
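Tagging can be as simple as refusing to call a claim verified unless it carries a source. A sketch with hypothetical names; the real tagging rules are not published on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    VERIFIED = "verified"
    INFERRED = "inferred"


@dataclass(frozen=True)
class Claim:
    text: str
    source: Optional[str] = None  # hypothetical field: where the claim was checked

    @property
    def status(self) -> Status:
        # A claim counts as verified only when it carries a source.
        return Status.VERIFIED if self.source else Status.INFERRED


def render(claim: Claim) -> str:
    """Prefix every claim with its tag so the reader sees which kind it is."""
    return f"[{claim.status.value}] {claim.text}"
```

The default is inferred. A claim has to earn the other tag.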

Cost and FinOps

The bill is where ungoverned AI shows up first.

Runaway token cost

The Brain Shipped

Our whole company ran on under $50 of compute from nothing to launched. The cost receipts are public. The brain is why: it feeds the AI only what each task needs.

Cost that climbs as context piles up

The Brain Shipped

We fetch only what a task needs and point to the rest, instead of resending everything at every step. That is the fix for the runaway curve, and it is how the brain works.
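In code, "fetch what the task needs and point to the rest" looks roughly like this. The store contents, the `ref:` pointer format, and the function name are assumptions made for the example.

```python
# Hypothetical knowledge store: document id -> full text.
STORE = {
    "pricing": "Full pricing notes ...",
    "security": "Full security audit notes ...",
    "roadmap": "Full roadmap notes ...",
}


def build_context(needed: set[str], store: dict[str, str] = STORE) -> str:
    """Inline only the documents the task needs; list the rest as pointers."""
    parts = []
    for doc_id in sorted(store):
        if doc_id in needed:
            parts.append(f"## {doc_id}\n{store[doc_id]}")
        else:
            # A pointer costs a few tokens; the full text is fetched only if asked for.
            parts.append(f"## {doc_id}\n(available on request: ref:{doc_id})")
    return "\n\n".join(parts)
```

The context stays roughly the size of what the task uses, so the cost per step does not grow with everything the system knows.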

Nobody owns the bill

The operating model Shipped

One person owns spend, with nightly instrumented numbers behind the public receipts.

Agentic fan-out and runaway loops

Keystone In progress

The agentic loop is capped by a hard, coded ceiling in the live path, plus a per-IP limit on public chat. That is deployed and verified today, a coded counter rather than a concurrency reservation. Broader fan-out across many agents is the part still hardening.
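A coded ceiling and a per-IP counter are both a few lines. A sketch only: the limits below are hypothetical, and the counter has no time window, for brevity.

```python
from collections import defaultdict

MAX_STEPS = 8    # hypothetical ceiling; the real value is not published
MAX_PER_IP = 20  # hypothetical per-IP request cap


class LoopCeilingReached(RuntimeError):
    """Raised when the agent loop hits its hard step limit."""


def run_agent(step, max_steps: int = MAX_STEPS):
    """Call step(i) until it returns a result, or stop at the ceiling."""
    for i in range(max_steps):
        result = step(i)
        if result is not None:
            return result
    raise LoopCeilingReached(f"stopped after {max_steps} steps")


class PerIpLimiter:
    """Plain in-memory counter per client IP."""

    def __init__(self, limit: int = MAX_PER_IP):
        self.limit = limit
        self.counts = defaultdict(int)

    def allow(self, ip: str) -> bool:
        if self.counts[ip] >= self.limit:
            return False
        self.counts[ip] += 1
        return True
```

The loop cannot run past `max_steps` no matter what the model asks for. That is the difference between a ceiling and an alert.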

Budget alerts that do not actually stop anything

Keystone In progress

We bound the work, not just alert on the bill: a hard loop ceiling and a per-IP cap stop runaway runs before they spend. A direct spend threshold is on the list, not wired yet.

The wrong model for the job

Keystone In progress

We match the model class to each workload instead of forcing one model on everything. Prompt caching is wired into the live call, so a repeat conversation reuses its cached prefix at a fraction of the cost instead of paying for it again. Dynamic per-task routing is still on the build list.

Adoption and ROI

Where the 95% actually fails.

"95% of pilots return nothing" and pilots that never ship

The operating model Shipped

Our system is in production, not a pilot, with a public production receipt. We didn't build it for a demo. We built it for ourselves and run the whole company on it.

Projects cancelled for cost, unclear value, or weak controls

The operating model Shipped

Those three are exactly what the model governs: cost, measured value, and gated risk.

Agent washing

The operating model Shipped

We forbid overclaiming in writing. Zero paying tenants is stated plainly. We do not dress up a demo as production.

Bolting AI onto a broken process

The operating model Shipped

We redesign how the work runs first. AI on a broken workflow just makes the mess faster.

No measurable ROI

The operating model The Brain Shipped

Instrumented metrics and public cost and proof pages, with every claim sourced.

Shadow AI and governance

If you cannot prove the controls ran, you do not have controls.

Audit-trail gap

Keystone Shipped

Every action writes to an audit trail that actually runs: multi-region logging with validation on, written to a locked and versioned store.

Provenance and traceability

The Brain Shipped

Every claim is sourced, every change is in history, and the cross-instance record is kept.

Governance and compliance (NIST, EU AI Act)

The operating model Keystone In progress

We keep a documented NIST self-assessment, honestly hedged. We are not certified and do not claim to be.

Shadow AI

The Brain Keystone The operating model In progress

One canonical brain, gated publishing, and auditable use are the structural answer. We are not selling a shadow-AI detector.

Context and memory

AI forgets, and quality rots as the window fills.

Context rot as the window fills

The Brain Shipped

We curate context per step and work in deltas, so quality does not degrade as the session grows.

Context engineering as a real discipline

The Brain Shipped

It is a named method here, not an afterthought.

No persistent memory across sessions

The Brain Shipped

A living knowledge base and session protocols carry context across sessions, and shared memory connects the AI team.

"Your data is not AI-ready"

Outside the operating model

Cleaning and structuring your data is a discipline of its own, with specialists who do it well. We run on top of that. We do not replace it. Happy to point you to the people who do.

Workforce

Do not fire the people who should be leading the AI.

Layoffs the AI cannot actually back up

The operating model Shipped

Our stated position: stop firing the humans who should be managing these new teams. The model is human-led by design.

Nobody owns AI, and no AI-literate leaders

The curriculum The operating model Shipped

We name the owner and teach the leadership skill the model depends on.

Deskilling and atrophied judgment

The curriculum In progress

We teach your people to run it, so they become the trainers. Skill-building by construction.

Vendor and lock-in

A wrapper with no differentiator is one update from irrelevance.

Wrapper startups with no differentiator

The operating model Shipped

The differentiator is the way of working and the 27 years behind it, not the software. Products get copied. A method does not.

Vendor and model lock-in

Keystone In progress

The model identity is swappable at the boundary, on cloud infrastructure we control, not welded to one provider.

Platform risk, one update from irrelevance

The operating model In progress

If a provider ships a feature natively, the operating method still stands. We plan for it.

Trust, evaluation, and quality

Output is easy. Judgment is the hard part.

No evaluation discipline

The Brain Shipped

The brain runs a stack of checks on the work: a scope gate, a drift linter in fail mode, session-health tracking, and source tagging on every claim.

Last-mile quality and judgment

The operating model Shipped

Our published position: judgment over raw output, with a human owning the last stretch.

Model drift and silent degradation

The Brain In progress

The brain tracks the health of each session and flags when one starts to drift, catching silent metric failures before they compound. Mitigation that runs, not a promise it can't happen.

Where it shows up in the work

The same problem, in the day-to-day.

AI-written software (the "70% there" problem, insecure code, churn)

Refactory The Brain The operating model In progress

Code passes through gates, review, and the security posture, with a human owning the last stretch and a dedicated checker on legacy modernization.

Text-to-SQL and analytics

Outside the operating model

Turning plain questions into database queries is a crowded space with good tools already in it. It is not our lane. We will point you to the ones who own it.

A specific industry vertical

Outside the operating model

We have not committed to one industry. A specialist who has lived in your domain for years beats a generalist who claims to. Our edge is the way we operate, and that travels across domains. Where you need deep vertical depth, that is someone else's lane.

The bigger picture

The "AI bubble" and the trough of disillusionment

The operating model Shipped

The whole posture is the answer: governed, cost-honest, ROI-proven, anti-hype. When the cull comes, that is where trust goes.

The legend

Status

Shipped: Built and running, with a receipt you can check.
In progress: Real and working, still being hardened, or partly unverified in production. We will not call it done early.
Roadmap: Where we are going. Not built for a customer yet.
Outside the operating model: These are real problems. They just sit outside what the operating model is built for. Someone else's lane, and we say so.

Solution (what is doing the work)

The Brain: The living system the operator trains. It holds your knowledge, holds the rules for the gates that flag a decision for a human, and remembers across every session.
Keystone: The production runtime it all runs on.
The operating model: The human-led way we run it, including the call a person makes when a gate fires.
The curriculum: How your people learn to lead it.
Refactory: Modernizing legacy code.

Where this list comes from

Not ours. Here is where each cluster comes from. Where a cluster reflects broad practitioner consensus rather than one named study, we say so. Same rule as the 44th problem.

AI agent security

OWASP · 2025: Top 10 for LLM Applications: prompt injection is LLM01; excessive agency a named category
OWASP · 2025: Top 10 for Agentic Applications (Black Hat Europe): autonomy, memory, tool/credential access, blast radius
Simon Willison · 2025: The lethal trifecta for AI agents (Willison also coined the term "prompt injection")

Reliability and control

OWASP · 2025: Top 10 for Agentic Applications: unsafe autonomous actions, cascading failures, loss of containment

Cost and FinOps

Gartner · 2025: Escalating cost named a top driver of agentic-project cancellation

Adoption and ROI

MIT NANDA · 2025: The GenAI Divide: State of AI in Business 2025 (Challapally, Pease, Raskar, Chari). 95% see no measurable return
Gartner · 2025: Predicts over 40% of agentic AI projects will be canceled by end of 2027 (cost, unclear value, weak controls); agent washing, with only ~130 of thousands of vendors genuinely agentic

Shadow AI and governance

MIT NANDA · 2025: The GenAI Divide: the "shadow AI economy"
NIST · 2023/2024: AI Risk Management Framework 1.0 (AI 100-1); SP 800-171 Rev 3
European Union · 2024: EU AI Act: Regulation (EU) 2024/1689
Stanford HAI · 2025: AI Index 2025: AI incidents up 56% to 233; responsible-AI governance gaps

Context and memory

Chroma · 2025: Context Rot (Hong et al.): 18 frontier models degrade as context grows, before the window limit
MIT NANDA · 2025: The GenAI Divide: enterprise persistent-memory gap

Workforce

Stanford HAI · 2025: AI Index 2025: economy and labor analysis

Vendor and lock-in

Practitioner consensus: Industry and VC practitioner discourse on thin wrappers with no differentiator and platform/model risk. Widely discussed, no single authority. Gartner's agent-washing finding (~130 real vendors of thousands) partially backs the no-differentiator concern.

Trust, evaluation, and quality

Stanford HAI · 2025: AI Index 2025: evaluation/benchmarks (SWE-bench, MMMU, GPQA) and responsible-AI measurement gaps

Where it shows up in the work

Addy Osmani (Google) · 2024: The 70% problem: Hard truths about AI-assisted coding. Last 30% (edge cases, security, prod) stays hard; frames AI as "a very eager junior developer"

The bigger picture

Gartner · 2025: Hype Cycle: the Trough of Disillusionment framework; plus the 40%-cancellation / agent-washing data

The math

Cost

Price your own AI bill, and see the under-$50 receipt from nothing to launched.

The receipts

Proof

The build priced against a team. $1.4M–6.4M of equivalent build.

Wolfberg LLC. AI is your newest team, not your newest tool.