The Agentic Engineer

I read the repos so you don't have to.
Issue #10 | April 29, 2026
  • GPT-5.5 is here. 82.7% on Terminal-Bench 2.0, fewer tokens per task than 5.4, and it beats Opus 4.7 on most agentic benchmarks. OpenAI just reclaimed the top of the leaderboard.
  • Stanford analyzed 6,000 real coding agent sessions. 44% of agent-produced code gets thrown away. 41% of sessions are pure vibe coding. The first empirical reality check on how agents are actually used.
  • ToolSimulator in the Strands Evals SDK lets you test agents against realistic mock tools without live API calls. pip install strands-evals and stop choosing between brittle mocks and production risk.

GPT-5.5: OpenAI Reclaims the Agentic Crown

OpenAI dropped GPT-5.5 on April 23. The benchmark numbers tell the story: 82.7% on Terminal-Bench 2.0, 78.7% on OSWorld-Verified, 84.9% on GDPval. It beats Claude Opus 4.7 on nearly every agentic benchmark we track.

The efficiency angle matters more than the raw scores. GPT-5.5 completes the same Codex tasks with fewer tokens than GPT-5.4 while matching its per-token latency. Cheaper and better is a rare combination in this space.
