The Agentic Engineer
I read the repos so you don't have to.
Issue #10 | April 29, 2026
TL;DR
- GPT-5.5 is here. 82.7% on Terminal-Bench 2.0, fewer tokens per task than 5.4, and it beats Opus 4.7 on most agentic benchmarks. OpenAI just reclaimed the top of the leaderboard.
- Stanford analyzed 6,000 real coding agent sessions. 44% of agent-produced code gets thrown away. 41% of sessions are pure vibe coding. The first empirical reality check on how agents are actually used.
- ToolSimulator in the Strands Evals SDK lets you test agents against realistic mock tools without live API calls. Run pip install strands-evals and stop choosing between brittle mocks and production risk.
The Big One
GPT-5.5: OpenAI Reclaims the Agentic Crown
OpenAI dropped GPT-5.5 on April 23. The benchmark numbers tell the story: 82.7% on Terminal-Bench 2.0, 78.7% on OSWorld-Verified, 84.9% on GDPval. It beats Claude Opus 4.7 on nearly every agentic benchmark we track.
The efficiency angle matters more than the raw scores. GPT-5.5 uses fewer tokens to complete the same Codex tasks as GPT-5.4 while matching its per-token latency. Cheaper and better is a rare combination in this space.