🎧
Listen to this issue
The Agentic Engineer | Issue #4

The Agentic Engineer

I read the repos so you don't have to.
Issue #4 | March 17, 2026

An autonomous agent hacked McKinsey's AI platform. No credentials, no human. Full read/write access to 46.5M chat messages and 57K employee accounts in 2 hours. The agent picked the target itself.

Amazon now requires senior engineer sign-off on all AI-assisted code. Production outages traced to AI-generated code forced the first formal agent governance policy at FAANG scale.

Claude gets 1M context at standard pricing. Opus 4.6 and Sonnet 4.6, no premium. 600 images per request. Fewer compactions, longer agent sessions.

An Autonomous Agent Hacked McKinsey in 2 Hours

McKinsey built an internal AI platform called Lilli. 43,000 employees use it. 500,000+ prompts a month. RAG over decades of proprietary research. The kind of system that handles strategy discussions, M&A analysis, and client financials.

CodeWall's autonomous offensive agent found a SQL injection in Lilli's API without credentials, without insider knowledge, and without a human in the loop. Two hours from cold start to full read/write access on the production database.

Here's the part that should keep you up at night: the agent autonomously selected McKinsey as a target. It read their responsible disclosure policy, confirmed it was in scope, and started mapping the attack surface on its own.

The attack surface was generous. Over 200 API endpoints were publicly documented. Most required auth. Twenty-two didn't. One of those unprotected endpoints wrote user search queries to the database. The values were parameterized correctly, but the JSON keys (the field names) were concatenated directly into SQL. Standard scanners like OWASP ZAP missed it entirely. The agent didn't.

Fifteen blind iterations later, production data started flowing back. What was inside:

  • 46.5 million chat messages, stored in plaintext
  • 728,000 files including 192K PDFs, 93K Excel spreadsheets, 93K PowerPoint decks
  • 57,000 user accounts
  • 3.68 million RAG document chunks with S3 storage paths
  • 95 AI model configurations across 12 model types, including system prompts

But the real nightmare is the write access. Lilli's system prompts were stored in the same database. An attacker could rewrite them with a single UPDATE statement. No deployment. No code change. No log trail. The AI just starts behaving differently, and 43,000 consultants keep trusting its output because it comes from their own internal tool.

Poisoned financial models. Subtly altered strategic recommendations. Data exfiltration baked into normal AI responses that users copy into client-facing documents. This is what prompt layer compromise looks like at enterprise scale.

SQL injection is one of the oldest bug classes in existence. McKinsey has world-class security teams and significant investment in the space. Lilli ran in production for over two years. Their internal scanners found nothing. An autonomous agent found it because it doesn't follow checklists. It maps, probes, chains, and escalates the way a skilled attacker would, but continuously and at machine speed.

Last week we covered Clinejection, where a GitHub issue title compromised 4,000 developer machines. That was agents as the target. This is agents as the attacker. Both sides of the equation are accelerating, and the defense side is not keeping up.

Source: CodeWall Blog · 503 points on HN

Amazon Requires Senior Engineer Sign-Off on All AI-Assisted Code

After production outages traced directly to AI-generated code, Amazon now mandates senior engineer approval for every AI-assisted change. This is the first FAANG company to formalize agent code governance at this scale. 657 points on HN, community split between "finally, someone said it" and "this kills velocity." The real question isn't whether your team needs this policy. It's whether you already needed it three months ago.

Source: Ars Technica

Claude 1M Context Window Now GA, No Premium

Opus 4.6 and Sonnet 4.6 get full 1M token context at standard pricing. No multiplier. 600 images and PDFs per request (up from 100). 78.3% on MRCR v2, highest among frontier models. Claude Code Max/Team/Enterprise users get 1M automatically. Early reports: 15% decrease in compaction events and full diffs that didn't fit in 200K now process in one pass. For anyone building agent loops, this means simpler harnesses and longer uninterrupted sessions.

Source: Claude Blog

The 8 Levels of Agentic Engineering

A practical progression framework from tab completion to full autonomy. The insight worth bookmarking: your output depends on your teammates' level. If your code reviewer operates at Level 2, your Level 7 background agents are bottlenecked at their throughput. The framework runs: tab complete, agent IDE, context engineering, compounding engineering, MCP/skills, multi-agent, background agents, full autonomy. Most teams are stuck between levels 2 and 4.

Source: bassimeledath.com · 276 points on HN

OneCLI: A Secret Vault Built for AI Agents

Rust gateway that sits between your agents and your APIs. Agents use fake keys. The gateway swaps in real credentials at request time. AES-256-GCM encrypted storage, per-agent scoped permissions, host/path matching. After reading the McKinsey story above, the "every agent has raw API keys" problem feels a lot more urgent. cargo install onecli and your agents never touch a real secret again.

Source: GitHub · 160 points on HN

RAG Document Poisoning: 3 Fake Docs Fooled the Entire Knowledge Base

A practical demo showed that 3 crafted documents injected into ChromaDB flipped a RAG system's financial report from $24.7M profit to $8.3M loss. The technique: an "authority framing" document posing as a CFO correction that references the real number as "superseded." Based on the PoisonedRAG paper (USENIX Security 2025). Full reproducible lab on GitHub. git clone && make attack1, runs locally, no GPU needed.

Source: aminrj.com

Many SWE-bench-Passing PRs Would Not Be Merged

METR · March 2026 · metr.org

The setup: METR recruited 4 active maintainers from 3 SWE-bench Verified repositories (scikit-learn, Sphinx, pytest). They reviewed 296 AI-generated pull requests that passed SWE-bench's automated grader. Maintainers were blinded to whether the PR was human or AI-written.

The finding: Roughly half of test-passing PRs would not be merged. Maintainer merge decisions averaged 24 percentage points lower than automated grader scores. The rate of improvement is also slower: 9.6 pp/yr less for maintainer merge decisions compared to automated grader scores.

Why PRs got rejected: Core functionality failures, patches that break other code, and code quality issues. The automated grader only checks if tests pass. Maintainers check if the code belongs in the codebase.

The nuance METR is careful about: They're not saying agents can't write mergeable code. Agents got one shot with no feedback. Human developers iterate through review cycles. Better prompting and elicitation could close the gap. The point is that SWE-bench scores don't map cleanly to "this agent can do X% of real engineering work."

Why it matters for builders: If you're evaluating coding agents by benchmark scores alone, you're overestimating their production readiness. The gap between "tests pass" and "a maintainer would ship this" is real and measurable. This pairs with Amazon's new sign-off policy: the industry is learning that AI code that works isn't the same as AI code that's good.

Methodology note: They normalized against a "golden baseline" of 47 original human-written PRs. Even those only got merged 68% of the time by the reviewing maintainers (different maintainer, different day, different standards). Human code review is noisy too.

Time saved: 4 min read vs 25 min paper. 6.3x compression.

Agent Browser Protocol (ABP)

github.com/theredsix/agent-browser-protocol · 155 points on HN

Web browsing is continuous and async. Agents think in discrete steps. This mismatch is why browser automation with LLMs has been painful. ABP fixes it by building the agent protocol directly into the browser engine.

ABP is a Chromium fork with MCP and REST baked in. The key innovation: it freezes JavaScript between agent steps so the page literally waits for the LLM to think. No race conditions. No stale screenshots. One HTTP request returns one settled page state, one screenshot, and one event log.

The numbers: 90.53% on Online Mind2Web (reproducible). ~100ms overhead per action including screenshots. The bottleneck is the LLM, not the browser.

Try it with Claude Code in 60 seconds:

# Add ABP as an MCP server
claude mcp add browser -- npx -y agent-browser-protocol --mcp

# Verify it's running
curl -s http://localhost:8222/api/v1/tabs

Then ask Claude: "Find me kung pao chicken near 415 Mission St, San Francisco on Doordash." Watch the page freeze between steps while Claude thinks. No Playwright. No CDP session management. Just HTTP.

Works with Codex and OpenCode too:

# Codex
codex mcp add browser -- npx -y agent-browser-protocol --mcp

# Or use the REST API directly
curl -s -X POST http://localhost:8222/api/v1/tabs/TAB_ID/navigate \
  -H 'content-type: application/json' \
  -d '{"url":"https://example.com","screenshot":{"format":"webp"}}'

If you've been duct-taping Playwright into agent loops and fighting async timing issues, ABP replaces all of that with a single deterministic protocol. The JS freeze alone is worth the switch.

Framework Star Tracker

Weekly star tracker, March 16, 2026. Deltas vs. Issue #3 (March 9).

FrameworkStarsWeekly Δ
OpenClaw316,289+29,855
n8n179,368+1,149
Dify133,005+1,292
LangChain129,704+973
AutoGen55,676+326
Flowise50,790+240
LlamaIndex47,699+193
CrewAI46,206+635
Semantic Kernel27,468+72
LangGraph26,517+569
Haystack24,520+83
Vercel AI SDK22,667+225
Mastra22,035+215
OpenAI Agents SDK20,032+581
Strands Agents5,311+26

Notable moves: OpenClaw continues its absurd climb past 316K stars, widening the gap over n8n at 179K. Mastra (22,035) has quietly overtaken OpenAI's own Agents SDK (20,032), which is notable given OpenAI's brand advantage. Strands Agents at 5,311 is still the newcomer to watch. The top 4 (OpenClaw, n8n, Dify, LangChain) are pulling away from the pack, with a 74K gap between LangChain at #4 and AutoGen at #5.

George Hotz posted "Stop Running 69 Agents" this week and it hit 715 points on HN. His core argument: "AI is not a magical game changer, it's the continuation of exponential progress." I think he's half right. The FOMO industrial complex around agents is real and toxic. But the McKinsey hack in this issue proves something geohot's post ignores: agents don't need to be magical to be dangerous. A single autonomous agent with a SQL injection scanner did more damage in 2 hours than most red teams do in 2 weeks. The hype is overblown. The capability is not. Stop running 69 agents. But make sure the one you're running can't be turned against you.

CertPrep

CertPrep

17,000+ practice questions across 49 certification exams — Azure, GCP, CISSP, CompTIA, Cisco, Kubernetes & more. Timed tests, detailed explanations for every option, progress tracking. Free tier for every exam. One-time purchase, no subscriptions.

Download Free

Want to sponsor this newsletter? Get in touch

Like what you read?

Forward this to a friend who's building with agents.

Subscribe to The Agentic Engineer
💬 Join the discussion