
23 items covered

🧠 LAUNCH

GPT-5.5 arrives: OpenAI's new intelligence class built for agents.

OpenAI's biggest model drop since GPT-5 — GPT-5.5 introduces what they're calling a new intelligence class, purpose-built for agentic coding, computer use, and scientific research. The model ships with improved tool use, self-checking, and task completion capabilities that position it squarely as an agent backbone, not just a chat model. Available immediately in ChatGPT and Codex. The 35K likes on the announcement thread tell you the developer community noticed. (35,550 likes | 4,923 RTs) Read more →

Claude connects to Tripadvisor, Spotify, Instacart, and a dozen more services.

Claude just became a personal assistant — new connectors link it to Tripadvisor, Booking.com, Resy, Instacart, Spotify, Audible, AllTrails, Thumbtack, TurboTax, and more. This isn't a gimmick: it's Anthropic's bet that the next moat isn't benchmarks but how deeply an AI integrates into your actual life. Connect your apps in Claude settings and watch the use case shift from "help me code" to "help me live." (8,222 likes | 475 RTs) Read more →

Claude Managed Agents gain persistent memory in public beta. Agents now learn from every session with an intelligence-optimized memory layer — your managed agent remembers context across conversations without you wiring up a database. SDKs already ship with support. (3,746 likes | 250 RTs) Read more →

Gemini Embedding 2 hits GA with native multimodal search. Google's first natively multimodal embedding model is generally available in the Gemini API and Vertex AI — search across text, images, and video with a single embedding space. If you're building RAG pipelines, this is your new baseline to benchmark against. (2,782 likes | 297 RTs) Read more →
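To see what a single embedding space buys a RAG pipeline: retrieval across modalities collapses into one nearest-neighbor search over one vector set. A minimal sketch with NumPy, where the vectors and the `cosine_top_k` helper are toy illustrations, not output from any real embedding API:

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]

# Toy vectors standing in for text, image, and video embeddings that a
# multimodal model would place in one shared space (illustrative only).
docs = np.array([
    [0.9, 0.1, 0.0],   # e.g. a text chunk
    [0.1, 0.9, 0.0],   # e.g. an image
    [0.8, 0.2, 0.1],   # e.g. a video frame
])
query = np.array([1.0, 0.0, 0.0])
print(cosine_top_k(query, docs))  # indices of the two nearest items
```

The point of the shared space is that `docs` can mix modalities freely; the search code never needs to know which row came from text and which from video.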

OpenAI releases free healthcare ChatGPT for clinicians. A specialized, free version of ChatGPT built on GPT-5.4 that outperformed specialty-matched physicians on hard clinical tasks. This is OpenAI's first vertical AI product — a signal that the platform play is shifting from horizontal to domain-specific. (637 likes | 57 RTs) Read more →


🔧 TOOL

Codex gets browser use: from code writer to full-stack agent.

Codex now interacts with web apps, tests user flows, clicks through pages, captures screenshots, and iterates on what it sees — all powered by GPT-5.5. This is the jump from "writes code" to "uses the thing it built." If you're still manually QA-ing your web app while your coding agent watches, that era just ended. (2,944 likes | 239 RTs) Read more →

Claude Code v2.1.119: persistent config, agent worktrees, vim visual mode. Two releases in quick succession — v2.1.118 adds vim visual mode and custom themes, v2.1.119 adds persistent /config settings, prUrlTemplate, and agent worktrees for isolated subagent work. Rapid iteration following last week's quality post-mortem. Read more → For a deeper look at Claude Code's model options and configuration, see our model options guide.

Anthropic SDKs ship same-day support for Managed Agent memory. Both Python (v0.97.0) and TypeScript (v0.91.0) SDKs landed with full support for the new memory API the same day the feature entered public beta. The speed of SDK parity here is notable — the API surface for persistent agent memory is ready to build on today. Read more →
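The core pattern persistent memory enables is easy to picture: facts survive the process that learned them. The sketch below shows that pattern with a JSON file; the `SessionMemory` class is purely illustrative and is not the Anthropic SDK's actual memory API.

```python
import json
import pathlib
import tempfile

class SessionMemory:
    """Toy persistent memory keyed by fact name.

    Illustrative pattern only, not the real SDK interface.
    """

    def __init__(self, path):
        self.path = pathlib.Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, key, value):
        self.data[key] = value
        self.path.write_text(json.dumps(self.data))  # persists across restarts

    def recall(self, key, default=None):
        return self.data.get(key, default)

store_path = pathlib.Path(tempfile.gettempdir()) / "agent_memory_demo.json"
store_path.unlink(missing_ok=True)  # start clean for the demo

# Session 1: the agent learns something about the user.
SessionMemory(store_path).remember("user_timezone", "Europe/Berlin")

# Session 2: a fresh instance recalls it without being told again.
print(SessionMemory(store_path).recall("user_timezone"))  # Europe/Berlin
```

The "without wiring up a database" claim in the announcement is exactly this: the storage layer is someone else's problem, and your agent code only ever sees remember/recall semantics.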


🔬 RESEARCH

DeepMind's Decoupled DiLoCo trains across flaky data centers without missing a beat. Frontier models are outgrowing single-cluster capacity, and DeepMind's answer is local SGD that tolerates network failures and cross-datacenter latency. If this scales, whoever has the most GPUs wins — even if they're scattered across continents. (956 likes | 129 RTs) Read more →

Vision Banana: image generators are secretly general-purpose vision models. DeepMind demonstrates that diffusion-based image generators can serve as strong vision encoders — collapsing the wall between generation and understanding. The implication: you may not need separate models for creating and analyzing images. (559 likes | 82 RTs) Read more →

Sony's autonomous ping-pong robot achieves expert-level play, published in Nature. An RL-powered robot using vision sensors plays table tennis at expert level, reacting in milliseconds. Published in Nature, this is a landmark for real-time physical AI — the kind of control loop that doesn't get to "think for 30 seconds." (510 likes | 62 RTs) Read more →

LLaDA 2.0-Uni: a diffusion-based language model goes multimodal. Instead of predicting the next token, this model denoises entire sequences — and now handles both text and images. A fundamentally different architecture from autoregressive transformers, trending on HuggingFace. Worth watching as a potential fork in the LLM evolutionary tree. (140 likes | 8 downloads) Read more →


💡 INSIGHT

Anthropic opens the post-mortem playbook on Claude Code quality regression.

Anthropic published a transparent post-mortem on recent Claude Code quality drops — and the finding is a gut punch for the benchmark industry: infrastructure configuration caused performance swings larger than the gap between top models on leaderboards. All affected users got rate limits reset. The honest accounting here is refreshing, but the implication is uncomfortable: how many "model improvements" on public leaderboards are actually measuring server load? (2,181 likes | 85 RTs) Read more →

Simon Willison: "Within two years you'll be able to prompt-inject an entire country." As AI agents gain real-world permissions — booking flights, managing finances, controlling infrastructure — the injection surface grows from "tricks a chatbot" to "manipulates systems at national scale." Willison's timeline may be aggressive, but the direction is right. (1,197 likes | 121 RTs) Read more →

Abacus AI moves production workloads to open-source Kimi 2.6. The first public report of a company migrating real production traffic from closed to open-source models. If Kimi 2.6 holds up under production load, it validates the cost thesis that's been theoretical until now. (363 likes | 25 RTs) Read more →

Redis creator Antirez calls GPT-5.4 the strongest LLM for systems programming. Antirez has been using GPT-5.4 as his primary tool for systems-level code and is testing GPT-5.5 on day one. When the person who built Redis tells you which model writes the best C, you listen. (306 likes | 7 RTs) Read more →


πŸ“ TECHNIQUE

Gemini 3.1 TTS introduces inline audio tags for vocal style control. Google's latest TTS model lets you drop tags like [whispers], [screams], and pace directives directly into your prompt text. It's a new prompting pattern for speech generation — instead of global voice settings, you control delivery at the sentence level. If you're building voice interfaces, this is the granularity you've been missing. (319 likes | 39 RTs) Read more →
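The prompting pattern itself is just sentence-level string construction. A sketch of what building such a prompt could look like, where the `tagged` helper is hypothetical and only [whispers] and [screams] are tag names confirmed by the announcement:

```python
def tagged(style, text):
    # Prefix one sentence with an inline delivery tag, e.g. "[whispers] ..."
    return f"[{style}] {text}"

# Delivery is controlled per sentence, not via a global voice setting.
prompt = " ".join([
    tagged("whispers", "The deploy is live."),
    tagged("screams", "And nothing is on fire!"),
    "This last sentence uses the voice's default delivery.",
])
print(prompt)
```

The interesting design shift is that style control travels with the text, so the same prompt string works anywhere plain text does: no side-channel configuration to keep in sync.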


πŸ—οΈ BUILD

Agent Vault: open-source credential management for AI agents. As agents gain tool access to APIs, databases, and services, credential management becomes a real attack surface. Infisical's open-source vault provides a dedicated security layer for agent authentication — filling a gap that most agent frameworks hand-wave past. (53 likes | 14 RTs) Read more →
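Why a dedicated layer matters: instead of pasting long-lived secrets into an agent's environment, the agent holds only short-lived, per-tool leases. The sketch below illustrates that general pattern; the `ScopedVault` class and its methods are invented for this example and are not Agent Vault's actual API.

```python
import secrets
import time

class ScopedVault:
    """Illustrative credential broker: agents get expiring per-tool tokens,
    never the raw secrets themselves (hypothetical, not Agent Vault's API)."""

    def __init__(self, secrets_by_tool):
        self._secrets = secrets_by_tool   # raw credentials stay in the vault
        self._leases = {}                 # token -> (tool, expiry timestamp)

    def lease(self, tool, ttl=60):
        token = secrets.token_hex(8)      # opaque, short-lived handle
        self._leases[token] = (tool, time.time() + ttl)
        return token

    def resolve(self, token):
        tool, expiry = self._leases.get(token, (None, 0))
        if time.time() > expiry:
            raise PermissionError("lease expired or unknown")
        return self._secrets[tool]

vault = ScopedVault({"search_api": "sk-demo-123"})
token = vault.lease("search_api", ttl=60)
print(vault.resolve(token))  # sk-demo-123
```

A leaked token here is worth far less than a leaked key: it names one tool and dies in a minute, which is the gap most agent frameworks currently leave to environment variables.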

1,000+ Claude Code extensions in one community registry. A community-built collection of agents, skills, commands, MCPs, and hooks — all installable with a single command. The Claude Code ecosystem is growing faster than any official marketplace could curate. Browse it for subagent patterns you haven't tried yet. (297 likes | 33 RTs) Read more →


🎓 MODEL LITERACY

Decoupled Distributed Training (DiLoCo): Training a frontier model usually means thousands of GPUs in a single data center, connected by ultra-fast networking. DeepMind's DiLoCo approach breaks that constraint — each cluster trains independently using local SGD, then periodically syncs a compressed update with the others. The "decoupled" part means if one cluster drops offline or slows down, the others keep training without it. Why this matters: as models outgrow single-cluster capacity, whoever can reliably train across multiple unreliable data centers gets to build bigger models — and DiLoCo turns geography from a bottleneck into a non-issue.
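The loop described above can be sketched on a toy one-dimensional loss. Everything here is a stand-in: plain parameter averaging replaces DiLoCo's compressed outer update, a coin flip models a cluster dropping out of a sync round, and the "model" is a single number minimizing f(x) = (x - 3)^2.

```python
import random

def grad(x):
    # Gradient of the toy loss f(x) = (x - 3)^2
    return 2 * (x - 3)

def diloco_sketch(workers=4, rounds=20, local_steps=10, lr=0.05, seed=0):
    rng = random.Random(seed)
    params = [rng.uniform(-5, 5) for _ in range(workers)]
    for _ in range(rounds):
        online = []
        for w in range(workers):
            # A flaky cluster occasionally misses a sync round entirely;
            # it keeps its stale parameters and rejoins later.
            if rng.random() < 0.2:
                continue
            for _ in range(local_steps):
                params[w] -= lr * grad(params[w])  # independent local SGD
            online.append(w)
        if online:
            # Periodic sync: average only the clusters that reported in.
            avg = sum(params[w] for w in online) / len(online)
            for w in online:
                params[w] = avg
    return params

print(diloco_sketch())  # every worker ends up near the optimum x = 3
```

Note what the sync step does not require: no per-step gradient exchange, and no waiting for a straggler. That is the whole trick that makes cross-datacenter latency and dropouts tolerable.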


⚡ QUICK LINKS

  • OpenAI's GPT-5.5 announcement thread: The numbers speak — 35K likes in hours. (35,550 likes | 4,923 RTs) Link
  • Claude Code stops over-calling Grep and Glob: After v2.1.117, noticeably faster file search behavior. (1,383 likes) Link
  • llama.cpp patches critical heap-buffer-overflow: CVSS 8.8 — if you run llama.cpp as a server, update to b8908 now. Link
  • Anthropic's Cat Wu on Claude Code velocity: Lenny's Podcast interview on how the PM role shifts in the AI era. (124 likes) Link

🎯 PICK OF THE DAY

When infrastructure noise produces benchmark swings larger than the gap between frontier models, the entire leaderboard industry is measuring config drift, not intelligence. That is the core finding of Anthropic's post-mortem on the Claude Code quality regression, and it makes the post-mortem the most important read today — not because of the bug itself, but because it should make every AI team rethink how it evaluates models. We've built an entire industry around leaderboard rankings where a 2% difference crowns a winner; if server load, network latency, and batch scheduling can swing scores by more than that, what exactly are we measuring? The transparency is commendable (rate limits reset, root cause published), but the deeper signal is uncomfortable: the model-evaluation paradigm that drives billions in API revenue may be built on noise, not signal. If you're choosing an AI provider based on benchmark deltas, you're probably choosing based on whose infrastructure had a better day. Run your own evals, on your own data, on your own infra. Everything else is astrology with GPUs. Read more →
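"Run your own evals" can start smaller than most teams assume: score each candidate on your own labeled cases and trust that number over a leaderboard delta. A minimal harness sketch, where the model functions are stubs standing in for real API calls and both the names and the cases are invented for illustration:

```python
def evaluate(model_fn, cases):
    """Accuracy of a model function over (input, expected) pairs."""
    hits = sum(1 for x, want in cases if model_fn(x) == want)
    return hits / len(cases)

# Your own data, not a public benchmark (toy cases for illustration).
cases = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

# Hypothetical stand-ins for calls to two competing providers.
def model_a(x):
    return {"2+2": "4", "capital of France": "Paris", "3*3": "9"}.get(x, "")

def model_b(x):
    return {"2+2": "4", "capital of France": "paris", "3*3": "9"}.get(x, "")

for name, fn in [("model_a", model_a), ("model_b", model_b)]:
    # model_a scores perfectly; model_b loses a point to casing.
    print(name, evaluate(fn, cases))
```

Swap the stubs for real API calls, run each model several times to see your own variance, and you'll know more about provider fit than any public leaderboard can tell you.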


Until next time ✌️