AI-narrated by Amazon Polly • The Agentic Engineer

The Agentic Engineer

I read the repos so you don't have to.
Issue #8 | April 15, 2026
  • Anthropic shipped Managed Agents, a hosted service that virtualizes agent sessions, harnesses, and sandboxes into swappable interfaces. The "pet container" problem is dead. p50 time-to-first-token dropped 60%.
  • UC Berkeley broke every major AI agent benchmark without solving a single task. 100% on SWE-bench Verified with a 10-line conftest.py. The scorecard the industry uses to pick models is exploitable.
  • AWS Agent Registry (our Tool of the Week) launches in Bedrock AgentCore preview. Think npm registry for AI agents: centralized discovery, versioning, governance, and compliance metadata across your org.

Anthropic Managed Agents: Your Agent Infra Is Now an OS

Every team building production agents hits the same wall. You put the model, the harness, and the sandbox in one container. The container becomes a pet. If it dies, the session is gone. If it hangs, you can't debug it because user data lives in the same box. Anthropic just shipped the fix.

Managed Agents virtualizes three agent components into independent interfaces: a session (append-only event log), a harness (the Claude loop that routes tool calls), and a sandbox (code execution). Each can fail or be replaced without disturbing the others. The analogy is deliberate: operating systems virtualized hardware into process and file abstractions. Managed Agents does the same for agent infrastructure.

The architecture decouples the "brain" (Claude + harness) from the "hands" (sandboxes and tools). The harness calls the container the way it calls any tool: execute(name, input) → string. If a container dies, the harness catches it as a tool-call error. Claude decides whether to retry. A new container spins up from a standard recipe. No more nursing failed containers back to health.

Harness crashes are recoverable too. The session log sits outside the harness, so nothing in the harness needs to survive a crash. A new harness boots with wake(sessionId), fetches the event log via getSession(id), and resumes from the last event. Cattle, not pets, all the way down.

The security model is the most interesting part. In the old coupled design, untrusted code ran in the same container as credentials. A prompt injection only had to convince Claude to read its own environment. Managed Agents puts OAuth tokens in a vault proxy that the sandbox never touches. Git tokens get baked into the repo clone during initialization. Claude pushes and pulls without ever handling the token. MCP tools route through a dedicated proxy that fetches credentials from the vault per-session.

Sessions also solve the context window problem differently. Instead of irreversible compaction (summarize and discard), the session log stores all events durably outside Claude's context window. The harness can interrogate it with getEvents(), selecting positional slices, rewinding before specific moments, or rereading context before an action. Context engineering happens in the harness layer, not in the model's memory.

Performance results: decoupling brain from hands means containers only spin up when needed. Sessions that never touch the sandbox skip container setup entirely. Anthropic reports p50 TTFT dropped roughly 60% and p95 dropped over 90%.

What the post doesn't address: multi-tenant isolation when multiple brains share sandbox pools, and cold-start latency for the first container in a session. Builders running latency-sensitive workflows will need to benchmark the warm-up cost against the TTFT gains.

Source: Anthropic Engineering Blog

Cursor 3: The Agent-First IDE Rebuild

Cursor shipped a ground-up rebuild codenamed Glass. Not an editor with AI bolted on. A workspace where developers manage teams of coding agents. Multiple local and cloud agents appear in a single sidebar, including those triggered from mobile, Slack, GitHub, and Linear. Design Mode lets you select UI elements and describe changes in plain language. A /best-of-n command sends prompts to multiple LLMs and lets you compare outputs. The competitive context: Claude Code holds roughly 54% of the AI coding market per Menlo Ventures. Cursor is betting the next phase belongs to whoever manages agents at scale, not whoever writes the most code.

Source: Tea4Tech

Copilot "Rubber Duck": Cross-Model Second Opinions in the CLI

GitHub Copilot CLI now routes your code through a second model family before you commit. Called Rubber Duck: one model writes, another from a different AI family reviews. Claude Sonnet + GPT-5.4 Rubber Duck closes 74.7% of the performance gap between Sonnet and Opus on SWE-Bench Pro. The biggest gains show up on hard problems spanning 3+ files and 70+ steps, where the cross-model review scores 4.8% higher than baseline. Activates automatically after planning and complex implementation, or on demand. Access via /experimental in Copilot CLI.

Source: GitHub Blog

Claude Code Source Leaked via npm Source Map

Anthropic shipped a .map file in their npm package. The source map referenced the complete, unobfuscated TypeScript source on Anthropic's R2 storage. Community archived the entire codebase within hours. What it revealed: ~1,900 TypeScript files, 512,000+ lines, roughly 40 discrete tools with permission gating, a 46,000-line query engine, multi-agent coordination, and unreleased features behind flags. KAIROS is an autonomous daemon mode. BUDDY is a Tamagotchi-style AI companion. "Undercover Mode" prevents Claude from revealing internal info when contributing to open-source repos. Anthropic called it "a release packaging issue caused by human error." The fix is one line in .npmignore.

Source: InfoQ

Dependabot Alerts Now Assignable to AI Agents

GitHub now lets you route Dependabot vulnerability alerts directly to coding agents for automated remediation. Not just version bumps. Agents analyze the advisory, make code changes across the project, and open a draft PR. You can assign multiple agents to the same alert and compare their approaches. Supports Copilot, Claude, and Codex. The practical win: major version upgrades that break APIs, package downgrades when a dependency is compromised, and complex fixes outside Dependabot's rule-based engine. Requires GitHub Code Security and a Copilot plan with coding agent access.

Source: GitHub Changelog

Microsoft Agent Framework Hits 1.0 GA

Follow-up to Issue #7's RC coverage. Microsoft Agent Framework is now 1.0 GA with stable APIs for .NET and Python, native MCP tool integration, and multi-agent orchestration. AutoGen is officially in maintenance mode. The Agent Index tells the story: AutoGen (+297) and Semantic Kernel (+41) are both flat as developers consolidate onto the new framework. Microsoft's production agent bet is locked in.

Source: Visual Studio Magazine

Q+: Making Search Agents Think Before They Search

arxiv.org/abs/2604.07927

Core insight: Most deep research agents freestyle their web searches. They fire queries, scan results, and hope the right evidence shows up. Q+ makes that process deliberate by adding explicit tools for query planning, progress monitoring, and evidence extraction.

How it works: Q+ is a set of structured reasoning tools integrated into Eigent, an open-source multi-agent system. Instead of letting the agent decide when and how to search implicitly, Q+ forces three explicit steps: plan queries before executing them, track what evidence has been found vs. what's still missing, and extract specific facts from long web pages rather than summarizing entire documents. The design is inspired by Anthropic's "think" tool, which showed that making reasoning explicit improves output quality.

The numbers: Across four benchmarks (SimpleQA-Verified, FRAMES, WebWalkerQA, X-Bench DeepSearch), Q+ improved accuracy by 0.6 to 3.8 percentage points depending on the model backend. GPT-4.1 saw the largest gain (+3.8pp). GPT-5.1 gained +3.0pp. Minimax M2.5 gained +0.6pp. The smaller the model, the more structured reasoning helps.

Why builders should care: If your agent does web research, the bottleneck probably isn't the model. It's the search strategy. Agents that freestyle their queries waste tokens on redundant searches and miss evidence that a structured plan would have caught. Q+ is open-source and pluggable into any multi-agent system.

Practical gap: The paper tests on factual QA benchmarks where answers are verifiable. Real-world research often involves synthesizing conflicting sources, weighing credibility, and handling ambiguity. Q+'s structured approach helps with the first problem (finding evidence) but doesn't address the second (evaluating it).

Time saved: 6 min read vs 42 min paper. 7.0x compression.

AWS Agent Registry: npm for Your AI Agents

AWS Agent Registry in Bedrock AgentCore · Preview

You have 50 agents across 4 teams. Nobody knows what exists. Teams rebuild capabilities that already shipped. Compliance can't track what's running. AWS Agent Registry solves this with a centralized catalog for discovering, versioning, and governing AI agents across your org.

The registry stores structured metadata for every agent, tool, MCP server, and agent skill. It captures who published each record, what protocols it implements, what it exposes, and how to invoke it. Supports MCP and A2A natively, with custom schemas for org-specific resources.

Two ways to register:

# Option 1: Manual registration via AWS CLI
aws bedrock-agentcore create-registry-entry \
  --name "payment-processor-agent" \
  --resource-type AGENT \
  --protocol MCP \
  --description "Handles payment processing and refunds" \
  --owner "payments-team" \
  --compliance-status APPROVED

# Option 2: Auto-discovery from MCP/A2A endpoint
aws bedrock-agentcore register-from-endpoint \
  --endpoint-url "https://internal.example.com/mcp/payments" \
  --protocol MCP

Discovery: Hybrid search combines keyword and semantic matching. A search for "payment processing" surfaces tools tagged as "billing" or "invoicing" even if named differently. The registry is accessible through the AgentCore Console, APIs, and as an MCP server. Any MCP-compatible client can query it directly, including Kiro and Claude Code. OAuth-based access means teams can build custom discovery UIs without IAM credentials.

Governance: Approval workflows control what gets published. Compliance metadata tracks ownership, status, and usage documentation. The registry indexes agents regardless of where they run: AWS, other clouds, or on-premises. Not just an AWS-only catalog.

The practical gap: this is preview, not GA. No pricing announced. The CLI commands above are based on the blog post's described capabilities. Expect the API surface to shift before general availability. Also missing: automated drift detection between registered metadata and actual agent behavior in production.

Status: Preview in Amazon Bedrock AgentCore (us-east-1)

Framework Star Tracker

Weekly star tracker, April 15, 2026. Deltas vs. Issue #7 (April 8, 2026).

FrameworkStarsWeekly Δ
OpenClaw356,077+6,622
n8n183,800+1,155
Dify137,538+1,201
LangChain133,370+868
AutoGen57,031+297
Flowise51,849+262
CrewAI48,763+637
LlamaIndex48,540+213
LangGraph29,117+605
Semantic Kernel27,698+41
Haystack24,822+96
Vercel AI SDK23,454+174
Mastra22,940+220
OpenAI Agents SDK20,747+151
Strands Agents5,612+60

Notable moves: n8n (+1,155) broke four digits for the first time since we started tracking, likely riding the wave of no-code agent builder interest. Dify (+1,201) had another strong week, outpacing n8n on absolute gains. CrewAI (+637) continues pulling away from LlamaIndex (+213), now leading by 223 stars (48,763 vs 48,540). LangGraph (+605) keeps outpacing parent LangChain (+868) on a percentage basis: 2.1% vs 0.65%. OpenClaw's delta came in at +6,622, down from +8,316 last week but still dominant at 356K total. The Microsoft consolidation story is clear: AutoGen (+297) and Semantic Kernel (+41) are both flat as Agent Framework 1.0 GA absorbs the developer base. Strands Agents (+60) slowed after last week's +101 best-ever, settling at 5,612.

UC Berkeley broke every major AI agent benchmark this week. 100% on SWE-bench Verified with a 10-line pytest hook. 100% on Terminal-Bench by trojaning curl. ~100% on WebArena by navigating to file:// URLs that read the answer key. Zero tasks solved. Zero LLM calls in most cases. The benchmarks that companies cite in press releases, that investors use to justify valuations, that engineers use to pick models: all exploitable. A conftest.py that rewrites every test result to "passed" is not a sophisticated attack. It's a packaging oversight that nobody checked for because the evaluation harness runs in the same container as the agent's code. SWE-bench, the benchmark that launched a thousand funding rounds, can be aced by anyone who knows what pytest hooks are. The fix isn't complicated: run evaluation in a separate, hardened environment that the agent can't touch. But the deeper problem is cultural. We've been treating benchmark scores as ground truth for capability when they're actually measuring "can this system produce output that a specific grading script accepts." Those are very different things. Every agent evaluation you run should now include an adversarial audit of the grading pipeline itself. If your benchmark can be beaten by 10 lines of Python, it's not a benchmark. It's a formality.

CertPrep

CertPrep

32,000+ practice questions for 106 certification exams from 22 vendors. AWS, Azure, GCP, CISSP, CCNA, Security+, CompTIA A+, Fortinet, Juniper, Kubernetes, Salesforce, SAP, Databricks and more. Timed practice tests, verified answers with detailed explanations for every option, bookmarks, progress tracking. Free tier for every exam. One-time purchase per exam, no subscriptions.

Download Free

Want to sponsor this newsletter? Get in touch

Like what you read?

Forward this to a friend who's building with agents.

Subscribe to The Agentic Engineer
💬 Join the discussion