The Agentic Engineer
- GPT-5.5 is here. 82.7% on Terminal-Bench 2.0, fewer tokens per task than 5.4, and it beats Opus 4.7 on most agentic benchmarks. OpenAI just reclaimed the top of the leaderboard.
- Stanford analyzed 6,000 real coding agent sessions. Only 44% of agent-produced code survives into commits. 41% of sessions are pure vibe coding. The first empirical reality check on how agents are actually used.
- ToolSimulator in the Strands Evals SDK lets you test agents against realistic mock tools without live API calls. `pip install strands-evals` and stop choosing between brittle mocks and production risk.
GPT-5.5: OpenAI Reclaims the Agentic Crown
OpenAI dropped GPT-5.5 on April 23. The benchmark numbers tell the story: 82.7% on Terminal-Bench 2.0, 78.7% on OSWorld-Verified, 84.9% on GDPval. It beats Claude Opus 4.7 on nearly every agentic benchmark we track.
The efficiency angle matters more than the raw scores. GPT-5.5 uses fewer tokens to complete the same Codex tasks as GPT-5.4 while matching its per-token latency. Cheaper and better is a rare combination in this space.
Context: a new study this week found that agentic coding tasks consume 1,000x more tokens than code chat. The same task can vary 30x between runs. Models that burn fewer tokens per task aren't just saving money. They're reducing the variance that makes agent behavior unpredictable.
GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise tiers in ChatGPT and Codex. API access is imminent. For teams running Codex in production, the upgrade path is straightforward: same API, fewer tokens, better results.
The competitive picture shifted fast. Two weeks ago, Opus 4.7 was the clear leader on coding benchmarks. GPT-5.5 reclaims that position while also being more token-efficient. Anthropic's advantage now rests on differential capability reduction (the cyber safeguards from Opus 4.7) and the managed agents platform. OpenAI's advantage is raw performance per token.
For builders, the practical question is whether your agent harness is model-agnostic. If you're locked into one provider's SDK, you can't chase the frontier as it moves. If you're using LangChain, LangGraph, Strands, or any framework that abstracts the model layer, swapping GPT-5.5 in is a config change.
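What "a config change" looks like in practice, sketched with LangChain's init_chat_model, which resolves a provider-prefixed model string at runtime. The openai:gpt-5.5 identifier is our assumption; substitute whatever model ID OpenAI actually publishes:

```python
import os

from langchain.chat_models import init_chat_model

# Keep the model in config/env so chasing the frontier never touches
# application code. "openai:gpt-5.5" is an assumed ID; swap in the
# identifier OpenAI actually ships.
MODEL_ID = os.environ.get("AGENT_MODEL", "openai:gpt-5.5")

llm = init_chat_model(MODEL_ID, temperature=0)
print(llm.invoke("Summarize this diff in one sentence.").content)
```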
The model race looked settled two weeks ago. It wasn't. Expect Google to respond. Gemini 3.1 Pro just launched Deep Research Max (see Quick Hits), and the next frontier model announcement is probably weeks away.
Source: OpenAI Blog
Amazon Bedrock AgentCore: From Idea to Running Agent in Minutes
The new managed agent harness in AgentCore lets you declare an agent and run it in three API calls. Define your model, tools, and instructions. AgentCore handles compute, memory, identity, and security. The new AgentCore CLI covers the full lifecycle from one terminal. Pre-built skills give coding agents curated knowledge of AgentCore patterns. Kiro ships with it today. Claude Code, Codex, and Cursor support drops next week.
Source: AWS Blog
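We haven't verified the new API surface, so treat the sketch below as purely hypothetical: the client name, operations, and parameters are placeholders for the three-call shape described above, not AgentCore's documented API.

```python
import boto3

# Hypothetical sketch only. Client name, operations, and parameters are
# placeholders for the declared three-call flow, not the documented API.
client = boto3.client("bedrock-agentcore")  # placeholder service name

# 1. Declare the agent: model, tools, instructions
agent = client.create_agent(                # placeholder operation
    model="anthropic.claude-opus-4-7",      # placeholder model ID
    tools=[{"name": "search_runbooks"}],    # placeholder tool spec
    instructions="Answer questions about our internal runbooks.",
)

# 2. Start a session (AgentCore handles compute, memory, identity)
session = client.create_session(agentId=agent["agentId"])  # placeholder

# 3. Invoke and read the result
response = client.invoke_agent(             # placeholder operation
    sessionId=session["sessionId"],
    inputText="How do I rotate the API keys?",
)
print(response["outputText"])
```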
Google Deep Research Max: Autonomous Research on Gemini 3.1 Pro
Google ships two tiers: Deep Research (fast, interactive) and Deep Research Max (max quality, async). Built on Gemini 3.1 Pro. First research agent with native MCP support for connecting proprietary data streams. Max uses extended test-time compute for iterative reasoning. Designed for background workflows like nightly due diligence reports.
Source: Google Blog
context-mode: 98% Context Window Reduction for 14 Coding Agents
context-mode hit 10,536 stars (+2,504 this week). An MCP server that sandboxes tool output so raw data never hits your context window. A Playwright snapshot drops from 56KB to under 2KB. Tracks session state in SQLite with FTS5 search, so when the conversation compacts, the agent picks up where it left off. Forces agents to write scripts instead of reading files into context. Supports Claude Code, Cursor, Copilot, and 11 more.
Source: GitHub Trending
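The core pattern in miniature, hand-rolled: persist raw tool output out-of-band and hand the model a tiny handle plus a search interface. This is our illustration of the idea, not context-mode's actual code.

```python
import sqlite3

# Our illustration of the pattern, not context-mode's code: raw tool
# output goes to SQLite, the model gets a 2-line handle instead of 56KB.
db = sqlite3.connect("session.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS outputs USING fts5(call_id, body)")

def record(call_id: str, raw_output: str) -> str:
    """Store raw output out-of-band; return only a small handle."""
    db.execute("INSERT INTO outputs VALUES (?, ?)", (call_id, raw_output))
    db.commit()
    return f"[{call_id}] stored {len(raw_output)} bytes; use search() to inspect"

def search(query: str, limit: int = 5) -> list[str]:
    """FTS5 search over stored output: the agent pulls only what it needs."""
    rows = db.execute(
        "SELECT snippet(outputs, 1, '>>', '<<', '...', 12) "
        "FROM outputs WHERE outputs MATCH ?",
        (f"body: {query}",),
    ).fetchmany(limit)
    return [r[0] for r in rows]

print(record("playwright-1", "<56KB of accessibility tree> " * 400))
print(search("accessibility"))
```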
Claude Managed Agents Get Persistent Memory (Public Beta)
Anthropic adds persistent memory to managed agents. Agents can now retain context across sessions. This was the key missing piece for long-running autonomous workflows. Follow-up to the managed agents launch we covered in Issue #8.
Source: OpenTools / Anthropic
Google Antigravity Sandbox Escape: Prompt Injection Gets RCE
Pillar Security found that a native file-search tool in Google's Antigravity executes before Secure Mode can evaluate it. Combined with prompt injection, attackers got full remote code execution under the exact configuration security-conscious users rely on. Since patched. The pattern is the story: native tools that bypass sandbox evaluation are a class of vulnerability, not a one-off bug.
Source: CyberScoop / Pillar Security
SWE-chat: What 6,000 Real Coding Agent Sessions Actually Look Like
Core insight: Stanford built the first large-scale dataset of real human-agent coding interactions. 6,000 sessions. 63,000 prompts. 355,000 tool calls. The findings challenge the narrative that coding agents are reliable co-pilots.
The numbers: Only 44% of agent-produced code survives into commits. Usage is bimodal: 41% of sessions are pure "vibe coding" where the agent writes everything, 23% are human-only with the agent as a search tool. Users push back on agent suggestions in 44% of all turns.
Security finding: Agent-produced code introduces more security vulnerabilities than human-written code. The study doesn't quantify the exact ratio, but the pattern is consistent across session types. Agents optimize for "code that works" over "code that's safe."
Why builders should care: If you're measuring agent productivity by lines of code generated, you're measuring the wrong thing. More than half of those lines get deleted. The real metric is lines that survive review and ship. SWE-chat gives you the methodology to measure that in your own codebase.
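If you want to approximate the survival metric in your own repo, here's a minimal sketch under loud assumptions: agent-authored commits are identified by a Co-Authored-By trailer (adjust the string for your tools), and survival is measured by blame attribution at HEAD. Our back-of-envelope method, not the paper's.

```python
import subprocess

# Assumption: agent commits carry a trailer like this (adjust for your setup)
AGENT_TRAILER = "Co-Authored-By: Claude"

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

# Commits the agent (co-)authored, per the trailer convention
agent_shas = set(git("log", f"--grep={AGENT_TRAILER}", "--format=%H").split())

# Lines those commits inserted, summed from --numstat's "added" column
inserted = 0
for sha in agent_shas:
    for row in git("show", "--numstat", "--format=", sha).splitlines():
        added = row.split("\t")[0]
        if added.isdigit():          # "-" marks binary files; skip them
            inserted += int(added)

# Agent lines still attributed at HEAD, via blame
surviving = 0
for path in git("ls-files", "*.py").splitlines():
    for line in git("blame", "-l", "HEAD", "--", path).splitlines():
        if line.split()[0].lstrip("^") in agent_shas:
            surviving += 1

print(f"agent code survival: {surviving}/{inserted} = {surviving / max(inserted, 1):.0%}")
```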
Practical takeaway: The 44% pushback rate means agents are wrong often enough that human review isn't optional. Teams that treat coding agents as autonomous contributors (assign a ticket, merge the PR) are shipping code that humans would have rejected nearly half the time. The Amazon senior-engineer sign-off policy from Issue #4 looks more justified every week.
Time saved: 5 min read vs 42 min paper. 8.4x compression.
ToolSimulator: Test Your Agents Without Breaking Production
AWS ML Blog · Strands Evals SDK
Testing AI agents that call external tools has always meant choosing between two bad options. Live APIs risk PII exposure, unintended side effects, and unpredictable costs. Static mocks break the moment your agent runs a multi-turn workflow. ToolSimulator is the middle ground.
Part of the Strands Evals SDK, ToolSimulator intercepts your agent's tool calls and routes them to an LLM-based response generator. The responses are realistic, context-aware, stateful across turns, and validated against Pydantic schemas. No handwritten fixtures required.
Install and run:
```bash
pip install strands-evals
```
Define a test scenario:
```python
from strands_evals import ToolSimulator

# `my_agent_tools` and `agent` come from your own test setup
sim = ToolSimulator(
    tools=my_agent_tools,
    scenario="User requests a refund for order #12345",
)

# Run your agent against simulated tools
result = sim.evaluate(agent, prompt="Process the refund")

# Check: did the agent call the right tools
# in the right order with valid parameters?
assert result.tool_sequence == ["lookup_order", "process_refund"]
assert result.schema_valid  # All calls matched Pydantic schemas
```
The stateful simulation is the key feature. If your agent calls lookup_order and then process_refund, the simulator remembers the order details from the first call and uses them to generate a consistent response for the second. Static mocks can't do this without manual wiring.
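What "stateful" buys you, in miniature: the manual wiring a static mock needs, hand-rolled in plain Python. Our illustration, not Strands Evals code.

```python
# Hand-rolled version of what the simulator automates: a mock that
# remembers earlier calls so later responses stay consistent.
# (Our illustration, not Strands Evals code.)
class StatefulToolMock:
    def __init__(self) -> None:
        self.orders: dict[str, dict] = {}

    def lookup_order(self, order_id: str) -> dict:
        order = {"id": order_id, "total": 49.99, "status": "delivered"}
        self.orders[order_id] = order            # remember across turns
        return order

    def process_refund(self, order_id: str) -> dict:
        order = self.orders[order_id]            # reuse remembered state
        return {"refunded": order["total"], "order_id": order["id"]}

mock = StatefulToolMock()
mock.lookup_order("12345")
print(mock.process_refund("12345"))  # {'refunded': 49.99, 'order_id': '12345'}
```

Every new tool pair means more wiring like this; the simulator's LLM generates the consistent second response for free.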
Schema validation catches a class of bugs that integration tests miss. Your agent might call the right tool with the wrong parameter types, and a live API might silently coerce the input. ToolSimulator rejects it, surfacing the bug before production does.
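The failure mode in miniature, using plain Pydantic (the schema and values are ours): a live API might coerce these inputs, strict validation refuses.

```python
from pydantic import BaseModel, ValidationError

class ProcessRefundArgs(BaseModel):   # hypothetical tool schema, ours
    order_id: str
    amount: float

# Wrong parameter types an agent might emit: int where str is expected,
# a string where a float is expected. Strict validation surfaces both.
try:
    ProcessRefundArgs.model_validate(
        {"order_id": 12345, "amount": "49.99"}, strict=True
    )
except ValidationError as err:
    print(err)  # two errors: order_id not a string, amount not a float
```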
Works with any agent framework. The simulator sits at the tool layer, not the orchestration layer. Strands, LangChain, CrewAI, or your custom harness. If your agent calls tools, ToolSimulator can intercept them.
Status: GA in Strands Evals SDK
Framework Star Tracker
Weekly star tracker, April 29, 2026. Deltas vs. Issue #9 (April 22, 2026).
| Framework | Stars | Weekly Δ |
|---|---|---|
| OpenClaw | 365,061 | +4,189 |
| n8n | 185,764 | +962 |
| Dify | 139,308 | +881 |
| LangChain | 135,082 | +933 |
| AutoGen | 57,482 | +255 |
| Flowise | 52,309 | +231 |
| CrewAI | 50,048 | +749 |
| LlamaIndex | 48,964 | +258 |
| LangGraph | 30,535 | +821 |
| Semantic Kernel | 27,789 | +41 |
| OpenAI Agents SDK | 25,385 | +1,777 |
| Haystack | 24,996 | +85 |
| Vercel AI SDK | 23,818 | +184 |
| Mastra | 23,352 | +188 |
| MS Agent Framework | 9,860 | +243 |
| Strands Agents | 5,712 | +42 |
Notable moves: OpenAI Agents SDK (+1,777) had another massive week, blowing past Vercel AI SDK (25,385 vs 23,818). Last issue they were 26 stars apart. Now the gap is 1,567. The sandbox and long-horizon harness launch is driving sustained adoption, not just a spike. CrewAI crossed 50K stars for the first time, extending its lead over LlamaIndex to 1,084 (50,048 vs 48,964). LangGraph (+821) added fewer raw stars than parent LangChain (+933) but keeps outgrowing it in percentage terms: 2.76% vs 0.70% for the week. OpenClaw's delta dropped again to +4,189, a new low, but at 365K total it's still nearly 2x n8n. Semantic Kernel (+41) and Strands Agents (+42) are both flat. MS Agent Framework (+243) is the only Microsoft repo showing real momentum.
The Supply Chain Attack Hiding in Plain Sight
NVIDIA's security team published a detailed analysis of AGENTS.md injection this week. The attack is simple: put malicious instructions in an AGENTS.md or CLAUDE.md file in a repo. When a coding agent clones the repo, it reads and follows those instructions automatically. No exploit needed. The agent is doing exactly what it was designed to do.
This is the supply chain attack vector that's hiding in plain sight. Every coding agent that reads project config files is vulnerable. Claude Code reads CLAUDE.md. Cursor reads .cursorrules. Codex reads AGENTS.md. These files are trusted by default because they're supposed to contain project context. An attacker who gets a PR merged with a modified config file now controls every coding agent that touches that repo.
The Antigravity sandbox escape this week proved the same pattern from a different angle: native tools that execute before security checks can evaluate them. We keep building agents that trust their environment, then acting surprised when the environment is hostile.
The fix is the same one we learned with CI/CD pipelines a decade ago: treat every input as untrusted, including your own config files. Sandbox the config parser. Diff the instructions. Show the developer what the agent is about to follow before it follows it. Until that's standard, every git clone is a potential prompt injection.
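"Diff the instructions" can be as simple as a pre-flight check before the agent runs. A minimal sketch: the file list and blocking policy are ours, and origin/main stands in for whatever ref a human has actually reviewed.

```python
import subprocess
import sys

# Files coding agents read and follow automatically; extend for your stack
AGENT_CONFIG_FILES = {"AGENTS.md", "CLAUDE.md", ".cursorrules"}
TRUSTED_REF = "origin/main"   # assumption: the ref a human has reviewed

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

changed = [
    f for f in git("diff", "--name-only", TRUSTED_REF, "HEAD").splitlines()
    if f in AGENT_CONFIG_FILES
]

for path in changed:
    # Show the developer exactly what the agent is about to follow
    print(f"--- agent instructions changed: {path} ---")
    print(git("diff", TRUSTED_REF, "HEAD", "--", path))

if changed:
    sys.exit(1)   # block the agent run until a human signs off
```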
How much of your AI-generated code survives into production?
- ✅ 75%+ ships as-is
- 🔄 About half survives after edits
- ✏️ I rewrite most of it
- 🏗️ I use it as scaffolding only
- 🚫 I don't use AI for code
CertPrep
32,000+ practice questions for 106 certification exams from 22 vendors. AWS, Azure, GCP, CISSP, CCNA, Security+, CompTIA A+, Fortinet, Juniper, Kubernetes, Salesforce, SAP, Databricks and more. Timed practice tests, verified answers with detailed explanations for every option, bookmarks, progress tracking. Free tier for every exam. One-time purchase per exam, no subscriptions.
Download Free
Want to sponsor this newsletter? Get in touch
Like what you read?
Forward this to a friend who's building with agents.
Subscribe to The Agentic Engineer