AI SLOPCAST #7: A New Way to Launch the Calculator
Today we're going to talk about agent security. There's going to be a great deal about security. If that's not your scene, by all means—skip ahead to the next video.
But first, a timeline. Monday, May 4th: Shanghai Jiao Tong publishes ARIS, an open-source research harness that, structurally speaking, competes with what Anthropic happens to sell for money. Tuesday, May 6th: AWS announces the general availability of its MCP Server. The same day, Anthropic ships its managed runtime, and Cursor adds per-agent context observability—as we say in the Russian villages. Wednesday, the 7th: the Microsoft Security Blog publishes two CVEs in Semantic Kernel, demonstrating that if an agent is running on your machine, then in the default configuration an attacker can, with a single prompt, launch the calculator on your computer. No clever browser sleight of hand, no elaborate setup, no out-of-bounds memory hacks. You simply type a string into a chat window—and a calculator appears on the screen. No confirmations, no guardrails.
Part One: A New Way to Launch the Calculator
We'll start with Microsoft, because it's the most vivid case. Microsoft has this thing called Semantic Kernel. It's a neural-network kernel and a set of tools for building neural-network agents on Windows.
And of course it got cracked open right away (CVE-2026-26030). The attack mechanism: prompt injection in the default Semantic Kernel configuration. You don't even have to tweak anything. Say you've got an agent with a plugin for searching a vector store—the standard setup, the very one laid out in Microsoft's tutorials. The attacker tucks an inconspicuous payload into some text, and the agent obligingly reads it and executes it. The text might live on a webpage, in some Word document, in an email. In theory, the agent has defenses against bad prompts, but against this particular prompt the defenses don't fire. The malicious snippet reaches the command-execution machinery, where there's something resembling a strategy for running code—something along the lines of eval on an arbitrary string. And that's it: arbitrary code execution on the host where the agent is running.
Microsoft itself made the calculator demo. The calculator is something close to an archaeological artifact at this point. A calculator in a security demonstration is the standard way of showing that you've successfully gotten something to run on the machine you've attacked. The tradition goes back to Stuxnet. When you see a calculator in an exploit demo, you should read it as: "this is a demonstration of arbitrary code execution on Windows, it really works, and therefore it deserves to be taken seriously." Microsoft didn't pick that demo by accident. What they're saying, in effect, is: this is not just another chatbot bug. This is a new way of doing arbitrary-code-execution attacks.
And of course, this is not a bug peculiar to Microsoft's semantic kernel. The pattern is clear: you need a search plugin in your system, plus a vector store, plus the ability to eval—to execute—an arbitrary string. And it happens that the foundational architecture of every modern agentic framework looks like this. LangChain. CrewAI. AutoGen. LangGraph. It's everywhere. Microsoft simply chose to be the first to disclose all of this, because it would be discovered anyway—Semantic Kernel is open-source, after all; outsiders read the code, and sooner or later they'll dig it all up. If Microsoft doesn't go first, outside security researchers will, and they'll turn it into a scandal, and Microsoft will lose money.
I'd wager that within a couple of months, equivalent vulnerabilities will be found and published for LangChain, AutoGen, and the rest of the field. It's a direct consequence of the nature of the problem. And if no one writes about them, that's worse. It doesn't mean LangChain is clean. It means vulnerabilities like this get found, get sat on, and then get used for one purpose or another.
Aside: A Little About Science
In parallel, three academic releases happened. ARIS came out of Shanghai. PARNESS appeared on arXiv. MinerU2.5 showed up on Hugging Face.
All three are open-source, and all three are, in essence, improvements to AI infrastructure for scientists and researchers. ARIS does what Anthropic Outcomes does—only more openly. PARNESS does what Anthropic Managed Agents does—only it was published as a paper, free and in the open.
And MinerU is a thing that turns various document formats into Markdown. Which is to say, if you've got a photograph of some text, or some inhumane Excel file produced by your savant colleague—the guy who is, in his way, a genius but who communicates with the outside world exclusively in Excel files—all of that detritus can be transformed into proper Markdown documents and then, for instance, fed to an AI agent.
A coincidence? I don't think so. It feels as if the AI-agent industry has finally roused itself and is starting to take seriously the more serious applications, beyond merely generating code. Lately everyone has been losing their minds over building agents that replace programmers. But to my mind, the real power of AI lies in scientific and research applications.
Neural networks are not going to start writing other neural networks on their own. Skynet is not going to take over the world by itself. We're the ones who have to make it happen. And to do that, we need tools that let neural networks set in motion a process of continuous self-improvement. And at long last, some reasonably serious companies and researchers have started to think about this.
Part Two: The Kubernetes Story Repeats Itself
Let me tell you something that clarifies a great deal. If you worked in DevOps in the mid-2010s—around 2014 or 2018—you'll recognize it immediately.
In 2014, Google open-sources Kubernetes. Docker Inc. ships Swarm. Mesosphere ships Mesos. Academia publishes papers on Borg, Omega, Quasar. Alternatives appear—rkt with its pod-as-capability model, for instance. Only a couple of years later do the first major vulnerability disclosures and CVEs arrive. Privilege escalation in Kubernetes, for one.
What happened next, we all know. Kubernetes won, of course. The alternatives, which mostly copied its features, all but died out. The big vulnerabilities kept getting disclosed for years—well past 2020. The industry, by and large, came around to the idea known in English as "assume breach": we read the writing on the wall, conclude that compromise is inevitable, and defend ourselves with namespaces, RBAC, and a prayer.
Now look at this. Right now, with agentic runtimes, exactly the same thing is happening that happened to Kubernetes in 2014. A swarm of vendors are shipping runtimes: Anthropic Managed Agents, AWS Agent Toolkit, Microsoft Azure Agents, Cursor SDK. Academic alternatives start to appear—ARIS, PARNESS, just now. The first major vulnerabilities start to surface—just as they have, in Microsoft Semantic Kernel. Solutions that would actually constrain what agents can do already exist—you can, for instance, exert tight control over WebAssembly's security, and the Bytecode Alliance proposed a special profile for agent runtimes back in March of this year. And—mark this—not one of the major vendors is using any of it.
If the Kubernetes story repeats—and every signal suggests it will—then a year or two from now we're going to be looking at:
One or two dominant commercial runtimes for agents. The alternatives will exist, but they'll be deeply marginal. Vulnerabilities will keep arriving in an endless stream. And the industry as a whole will simply make its peace with the assume-breach model—we'll always assume a breach is inevitable. There'll be a whole new category of security tools, just being born now—Microsoft Agent 365 Shadow AI Detection, AWS IAM, Anthropic Cyber Verification—and they'll provide the outer perimeter of defense, because no one has yet figured out how to defend the inside of these runtimes.
You might call this a deeply pessimistic forecast. To me, it's just reality. It's the reality we already live in, even if the wider world is, on the whole, ignoring it. Technical planning has to start from that fact. If you're planning to deploy agents in production within the next year, build assume-breach into the plan, and pick an architecture that can work inside that paradigm. Don't pin your hopes on security patches. Security patches for agents are an oxymoron. You have to layer security on through architectural choices. Network segmentation, say. Mandatory rotation of passwords and other credentials. Isolation at the level of specific capabilities and privileges—wherever it can be done.
But that, of course, is only for production. If you're just running Claude Code on your own machine, well—you flip on bypass permissions and hope for the best. Worst case, a compromised computer can always be wiped.
Part Three: The Least Frightening Demonstration
The calculator demo is, deliberately, designed to look harmless.
Consider. An agent in production has plenty of privileges. Filesystem access—it reads and writes your documents and your code. Network access—it makes API calls to internal services. Process-execution access—it runs build tools and tests. Access to passwords and other credentials—it has API keys and OAuth tokens, almost certainly sitting in environment variables. Database access—often the agent has write access (!) to any of your application databases. Access to cloud-provider SDKs—as a rule, the agent can manipulate authentication and IAM across your entire AWS account.
What the calculator scenario means, on the broadest level, is this: a single agent has every one of those capabilities at once. Each one was dangerous on its own; now you've got them all together. What's the realistic consequence?
For starters, exfiltration of every password you've got—the agent reads your environment variables and ships them off to some endpoint on the open internet. Database tampering—via the agent's connection to your database, an attacker can plant fraudulent data; in an e-commerce shop, say, drop the price of an expensive item to nothing and have the agent buy it on the spot. Manipulation of cloud resources—an attacker stands up bitcoin mining inside your AWS cluster, drags trade secrets out of your S3 buckets, modifies IAM policies so that you'll never know the attacker now owns a perfectly legal, perfectly untraceable admin account. Beyond that, the agent holds credentials to every other service you've worked with—which means it can keep propagating and breach those services too. And, of course, supply-chain compromise—if the agent has write access to your repository, it adds backdoors directly to your code, and from there your code starts breaching itself.
Microsoft only showed the calculator. They could have shown any of the above. They wisely declined to walk the general public through the full catalogue of horrors—their share price would have ticked downward. But to the experts, the message was unambiguous: launching the calculator is a perfectly competent demonstration of arbitrary code execution, just one without the reputational fallout. They picked the least frightening demo.
What follows from this. If you're running Semantic Kernel in production, then yes, patching to version 1.71 is necessary—but insufficient. The patch closes a specific known hole. Architectural defense—privilege isolation, segmentation, credential rotation—architectural defense closes off the entire category. And until you've closed off the entire category, your agent in production has a level of privilege Microsoft was too polite to put on screen.
Remember this. Patches fix specific problems. Picking the right architecture fixes a whole class of problems at once. To the non-specialist, the two are easy to confuse. But the confusion is expensive.
Part Four: The CVE Trap
Microsoft Semantic Kernel is open-source. Microsoft published two vulnerabilities. On the page, two. In the CVE database, also two. A year from now, perhaps ten. Someone will look at that number and conclude: "Semantic Kernel is an unsafe framework—look at all the CVEs."
Now consider Anthropic Managed Agents. Not open-source; closed-source. AWS Agent Toolkit. At least in part, also closed-source. Cursor SDK. Closed too. How many publicly documented vulnerabilities do they have? Zero. One. Perhaps two. Someone will look at that number and conclude: "Anthropic is safe—they hardly have any CVEs."
This is a dangerously wrong conclusion. There is nothing further from the truth.
Open-source frameworks have lots of CVEs because the machinery for finding vulnerabilities is well-oiled. Researchers find them, report them, vendors patch the products, and in the end the project discloses and publishes the vulnerabilities. Closed-source frameworks have few CVEs because researchers can't get at the code. The vulnerabilities that are found, via black-box testing, are first routed into private bug-bounty programs. Some are patched quietly, so that no one finds out. Others never make it to the public CVE database at all. Internal discoveries by the vendor's own security people don't, as a rule, end up in the public CVE database.
The upshot is that the public CVE metric is systematically biased in favor of closed-source products. The customer, in turn, picks a product on a completely false premise, and concludes that the whole menagerie of opaque proprietary vendor utilities is somehow safer. When in fact the picture may well be the opposite.
It's like the old joke about the man who lost his keys at night and looked for them not where he dropped them but under the streetlamp. You search where it's lit, not where you lost the thing. With AI agents, this has only just begun to be visible. The entire category of vulnerabilities is very young, and the CVE metric is only just starting to produce any data at all. But this early period is a critical one. It's right now that users will start deciding which products they consider safe and which they don't. And that reputation will stick to a vendor for years.
So here's another practical takeaway from this episode. Don't use raw CVE counts as the headline metric for assessing the security of agentic frameworks. Ask the vendor for a document describing their vulnerability-disclosure policy. Ask for data on how quickly they patch. Ask for the percentage of vulnerabilities found internally by the vendor's own staff—versus the percentage found outside (by customers, by researchers). If the vendor can't answer these questions, that's its own answer: things are bad. Just picture the vendor's face when you ask. If things are really bad, they'll burst on the spot. In short: CVE counts are not the basis. Disclosure policy is the basis.
Part Five: WASM to the Rescue
Now let me play Cassandra and offer a prediction. Within a year, give or take, one of the major vendors will release a runtime for agents that runs inside an isolated WebAssembly sandbox. A WASM-isolated agent runtime as a standalone product. My money's on Cloudflare. Or, alternatively, I'll do it myself—it's that obvious. Allow me to explain.
If we get WebAssembly with capability-based security—that is, security organized around explicit privileges and capabilities—we get a technical solution to the category of problem we saw with the calculator. The principle is simple: the agent does not have full access to operating-system resources. At launch, it's explicitly granted specific privileges. It wants the filesystem? It gets it via an explicit request. It wants the network? Via an explicit request. If the agent has the capacity to execute an arbitrary string, that eval cannot escape the cage that's been built around it. Because it has no way to launch arbitrary processes. Ever. No matter what the prompt says.
This architecturally rules out the vulnerabilities we've just been discussing. It's not a patch—it's defense at the semantic level.
The implementations already exist. Wasmtime, Wasmer, Cloudflare Workers' WASM runtime. None of the major agent vendors are using them. Vendors typically argue that the issue is performance—WebAssembly isn't especially fast, and shaves off at least fifteen percent or so on typical workloads.
But, in my humble opinion, the real reason isn't performance. The real reason is vendor lock-in. Vendors very much want to chain every customer to their infrastructure. WebAssembly is portable by its very nature. An application like this would run on any WASM runtime. The vendor does not want portability. The vendor wants you in indentured servitude, forever.
Which is where Cloudflare stands out. Their whole business model is built on edge compute with portable workloads. Their Workers AI already supports inference on the user's own devices. They already have AI agents in development, announced at Birthday Week 2025. Their competitive position requires portability—otherwise they're no different from Amazon. Or rather, they're different, but worse off. If they ship an isolated runtime for agents that runs on WebAssembly, it sabotages every other vendor's lock-in racket. Because the rest of them, Microsoft included, have no answer. They're not going to ship a product that breaks the very vendor lock-in they've spent years constructing.
I'm not a hundred percent certain it'll be Cloudflare. It might be Vercel, Fly.io, Modal. It might, surprise of surprises, be Replit. Or it might just be Oleg Chirukhin, because I have a perfectly clear idea of what could and should be done. But that a portable runtime is coming—that's plain. And when it does, it'll be the first moment at which all these proprietary, lousy, leaky solutions get the kick they deserve. And it will be a beautiful thing.
In Closing
Let's bring it together.
This past week saw the birth of agentic runtimes as a category of infrastructure. The category was always there, technically; but this was its moment in the sun. And in that same moment, Microsoft publicly demonstrated that the foundational mechanisms of this infrastructure are a new vector for arbitrary-code-execution attacks.
It's a signal that agents are becoming a grown-up industry. The same thing happened to containers in 2018. The industry survived by accepting the idea of inevitable compromise as the new normal. Agent runtimes will go the same way. But that means the corresponding architecture isn't optional—it's mandatory. And it isn't something to roll out later; it has to be rolled out now.
Two previous episodes of the podcast.
First came the episode about memory and evolution. About the dreams that neural networks can now have. About Anthropic dreaming. AlphaEvolve. For the first time, agents have what every living organism has.
The next episode was about meme-worthy figures. Musk, Schneier, Zitron, Tao. The industry behaves like a TV series running for many seasons, with favorite characters and a recognizable cast.
And this episode—this one was about infrastructure and security. About the runtimes being assembled right now, before your very eyes, with all their congenital flaws and vulnerabilities.
And those three episodes together—they are, in a sense, about constructing a reality that didn't exist a week ago. About memory. About characters. About infrastructure. It's a new space, one that's only going to get larger, and one that we now have to live in. A year from now, when we look back on this week, we'll be amazed at how much of it has become routine and ordinary.
Thank you for being with us. Until next time.