23 items covered

Cursor Unveils Origin — a Git Replacement Built for the Agent Era

🧠 LAUNCH

Cursor Unveils Origin — a Git Replacement Built for the Agent Era

Origin isn't another Git wrapper. It's a ground-up version control system designed for a world where dozens of AI agents branch, merge, and conflict simultaneously. Built by the Cursor and Graphite teams, it's extensible via API and MCP, with native merge conflict resolution for parallel agent sessions. If you've watched your agent workflows drown in merge conflicts and orphaned branches, this is the infrastructure response. Join the waitlist. (2,191 likes | 107 RTs) Read more →

Zhipu Drops GLM-5.2 Open Weights — Claims Opus 4.8 Parity With 1M Context

GLM-5.2 introduces an IS attention mechanism that reuses one indexer every 4 sparse layers for a 2.9x performance gain, and Zhipu claims parity with Opus 4.8 across major benchmarks — all with a 1M context window and open weights. China's open-weight labs continue closing the frontier gap at a pace that should make every closed-source provider uncomfortable. Download it and run your own evals before trusting the claims, but the trajectory is undeniable. (273 likes) Read more →
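For readers wondering what "reuses one indexer every 4 sparse layers" could look like, here is a toy sketch. It rests on an assumption, not on anything Zhipu has published here: that the indexer is the component choosing which tokens a sparse layer attends to, and that its choice is cached for a group of four layers. The names `select_top_k` and `run_sparse_layers` are hypothetical, and the sketch only counts indexer runs; it does not attempt to reproduce the 2.9x figure.

```python
def select_top_k(scores, k):
    """Toy indexer: return the positions of the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_sparse_layers(scores_per_layer, k, share_every=4):
    """Pick tokens once per group of `share_every` layers and reuse the pick.

    Returns the token selection used by each layer and the number of times
    the indexer actually ran. The grouping rule is an assumption.
    """
    selections, indexer_runs, current = [], 0, None
    for layer, scores in enumerate(scores_per_layer):
        if layer % share_every == 0:      # first layer of a group: run the indexer
            current = select_top_k(scores, k)
            indexer_runs += 1
        selections.append(current)        # remaining layers reuse the cached pick
    return selections, indexer_runs


# Made-up relevance scores for 8 layers over 4 tokens.
layer_scores = [[0.1, 0.9, 0.5, 0.7]] * 4 + [[0.9, 0.1, 0.7, 0.5]] * 4
selections, indexer_runs = run_sparse_layers(layer_scores, k=2)
print(indexer_runs)                    # 2 indexer runs for 8 layers
print(selections[3], selections[4])    # [1, 3] [0, 2]
```

Under this reading, the saving comes from running the selection step once per group instead of once per layer. Whether that matches GLM-5.2's actual design is something the weights and paper would have to confirm.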

OpenAI Codex Gains Browser, Computer Use, and Memory Across the EU. Codex just leveled up for European users — Chrome automation, computer use, persistent memory, and Chronicle are now live in EU/EEA/UK. This turns Codex from a code-only assistant into a full desktop automation platform, and the EU rollout signals OpenAI is done waiting on regulatory clarity. (624 likes | 30 RTs) Read more →

Alibaba's Qwen Releases a Full Foundation Model Suite for Robotics. The Qwen team ships a purpose-built model suite for physical world intelligence — perception, planning, and control in one stack. Alongside NVIDIA's ENPIRE announcement, physical AI is having a breakout week with multiple major releases landing simultaneously. (112 likes | 17 RTs) Read more →

UK Government Taps DeepMind to Fix Housing Planning With AI. Britain's most intractable policy problem meets DeepMind — the UK government is partnering directly with Google's AI lab to prototype AI-powered housing planning decisions. A concrete case of frontier AI moving from research demos to actual government operations. Read more →


🔬 RESEARCH

NVIDIA's ENPIRE: 8 Codex Agents Autonomously Running a Robot Research Lab

Jim Fan demos the first fully autonomous AI research loop in the physical world — 8 Codex agents coordinate robots, GPUs, and token budgets end-to-end with no human intervention. Each agent owns a slice of the research pipeline: hypothesis generation, experiment design, robot execution, data analysis, and paper drafting. AutoResearch just graduated from simulation to real hardware, and the implications for scaling scientific discovery are staggering. (1,977 likes | 307 RTs) Read more →

Anthropic Publishes Its Framework for Measuring Claude Code Economics at Scale

Anthropic shares the first rigorous framework for tracking how Claude Code usage scales across users and task types — rare transparency into the real-world economics of agentic coding. The methodology tracks cost-per-task, completion rates, and escalation patterns across different engineering profiles. If you're budgeting for AI-assisted development, this is the spreadsheet you've been missing. (1,481 likes | 138 RTs) Read more →
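If you want to start tracking the same three metrics on your own usage logs, a minimal sketch follows. It is not Anthropic's methodology: the `TaskRecord` fields, the profile labels, and the reading of "escalation" as a handoff to a human are all assumptions, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    profile: str      # hypothetical engineering profile label
    cost_usd: float   # spend attributed to this task
    completed: bool   # did the agent finish the task?
    escalated: bool   # assumption: task was handed off to a human


def summarize(records):
    """Group records by profile and compute the three per-profile metrics."""
    by_profile = {}
    for record in records:
        by_profile.setdefault(record.profile, []).append(record)

    summary = {}
    for profile, rows in by_profile.items():
        n = len(rows)
        summary[profile] = {
            "cost_per_task": sum(r.cost_usd for r in rows) / n,
            "completion_rate": sum(r.completed for r in rows) / n,
            "escalation_rate": sum(r.escalated for r in rows) / n,
        }
    return summary


# Made-up sample data for illustration only.
sample = [
    TaskRecord("backend", 1.0, True, False),
    TaskRecord("backend", 3.0, False, True),
    TaskRecord("frontend", 2.0, True, False),
]
report = summarize(sample)
print(report["backend"])
```

Splitting by profile before averaging matters: a blended cost-per-task hides the fact that some teams complete cheaply while others escalate often.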

OpenAI Simulates Deployment With Real User Requests Before Shipping Models. A clever pre-deployment safety method: run de-identified recent user requests through new models before release to anticipate real-world issues that lab evals miss. It's the difference between testing in a vacuum and testing in traffic. (1,483 likes | 122 RTs) Read more →

SkillsBench 1.1: The First Fully Audited Benchmark for AI Agent Skill Use. Benchmarks riddled with errors undermine every model comparison built on top of them. SkillsBench 1.1 is the first agent skill benchmark that's been audited end-to-end and verified error-free — setting a new standard for eval rigor at exactly the moment the industry needs it most. (51 likes | 16 RTs) Read more →


💡 INSIGHT

Willison: Fable 5 Export Controls Are Actively Weakening US Cyber Defense. Simon Willison makes a counterintuitive but well-sourced argument — by denying allied security researchers access to the most capable defensive AI tools, the Fable 5 export controls are creating more vulnerabilities than they prevent. Worth reading whether you agree or not. Read more →

Mollick: You Have 4–8 Months to Harden Systems Before Mythos-Class Goes Open. The reasoning: if open models trail closed models by 8–12 months and Mythos-class capabilities have already been demonstrated, part of that lag has already elapsed. By Mollick's estimate, that leaves organizations a concrete 4–8 month window to harden their systems before those capabilities ship in open weights anyone can run. This isn't hypothetical. It's a calendar deadline. (496 likes | 33 RTs) Read more →
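The arithmetic behind that window can be written out in a few lines. One input is an assumption: the item does not say how long ago Mythos-class capabilities were demonstrated, and roughly four months is simply the value that makes the quoted numbers line up. `hardening_window` is a hypothetical helper.

```python
def hardening_window(lag_months, elapsed_months):
    """Remaining months before open models catch up.

    lag_months: (low, high) estimate of how far open models trail closed ones.
    elapsed_months: time since the capability was first demonstrated
                    (an assumption; not stated in the item).
    """
    low, high = lag_months
    return (max(low - elapsed_months, 0), max(high - elapsed_months, 0))


window = hardening_window((8, 12), elapsed_months=4)
print(window)   # (4, 8)
```

Plug in your own estimate of the elapsed time. If you believe more of the lag has already passed, the window shrinks accordingly.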

Satya Nadella on Loopcraft: How Microsoft Thinks About the AI Platform Layer. Nadella's "loopcraft" essay — the art of building AI ecosystems through stacked feedback loops — gets the Latent Space deep-dive. The strategic framing reveals how Microsoft sees itself not as an AI company but as the feedback-loop infrastructure underneath every AI company. Read more →

OpenAI's Evals Lead: Current Benchmarks Are Failing and Here's What Comes Next. OpenAI's frontier evals lead explains why current benchmarks either saturate or get gamed — and what model evaluation needs to look like when every lab is past the ceiling. Timely context for today's SkillsBench launch. (1,162 likes | 75 RTs) Read more →

Reports: GPT 5.6 and Gemini 3.5 Approaching Fable Parity at Half the Cost. Multiple sources suggest GPT 5.6 and Gemini 3.5 are approaching Fable-level performance at less than half the cost. If true, the Fable export suspension matters less than it did last week: equivalent capability from multiple providers is weeks away, not months. (730 likes | 33 RTs) Read more →

OpenAI Commits $160K to Astral and Codex Open-Source Maintainers. OpenAI is funding the tools its products depend on — $160K to Astral (ruff, uv) and Codex toolchain maintainers, alongside a $1M fund for free Codex access. AI labs are now competing on developer ecosystem investment, not just model quality. (270 likes | 19 RTs) Read more →


📝 TECHNIQUE

Claude Code's Creator: "My Entire CLAUDE.md Is Two Lines." Boris Cherny — the person who built Claude Code — argues most engineers are massively over-engineering their CLAUDE.md configs. His take: with better models, you need less instruction, not more. A contrarian position from the one person who would know, and a good prompt to audit whether your 200-line config is helping or just adding noise. (112 likes | 9 RTs) Read more →

If you're curious about the different ways to run Claude Code, we recently compared the desktop vs. terminal experience.


🔧 TOOL

Claude Code v2.1.179: Mid-Stream Connection Fix and Sandbox Performance. Three quality-of-life fixes in one release — mid-stream connection drops now preserve partial responses instead of losing them, WSL2 mouse scrolling works again, and a sandbox performance issue that bloated tool descriptions on large directory trees is resolved. Update now. Read more →


🎓 MODEL LITERACY

Benchmark Saturation: When a benchmark's leaderboard clusters within a few percentage points of a perfect score, it's "saturated": the test can no longer distinguish between models or meaningfully measure improvement. Today's news illustrates this: OpenAI's evals lead publicly declares current benchmarks broken, while SkillsBench 1.1 launches as the first fully audited agent benchmark. The problem isn't just high scores. Models also learn to exploit benchmark-specific patterns (data contamination, format hacking) that don't reflect real capability. When you see a model claim "95% on X benchmark," ask two questions: what's the ceiling, and when was the test last refreshed? If the answers are "97%" and "two years ago," that number tells you almost nothing.
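The two questions translate into a quick sanity check you can run on any leaderboard. This is a rough sketch, not a standard definition: the 3-point `margin` is an assumed reading of "a few percentage points", and the function names are hypothetical.

```python
def headroom(score, ceiling):
    """Points left between a reported score and the benchmark's ceiling."""
    return ceiling - score


def is_saturated(top_scores, ceiling, margin=3.0):
    """True when every leading score sits within `margin` points of the ceiling.

    `margin` is an assumption; pick a threshold that suits the benchmark.
    """
    return all(ceiling - score <= margin for score in top_scores)


# The example from the text: a 95% score against a 97% ceiling.
print(headroom(95.0, 97.0))                        # 2.0
print(is_saturated([95.0, 96.0, 96.5], 97.0))      # True
print(is_saturated([70.0, 82.0, 95.0], 97.0))      # False
```

A check like this only covers the ceiling question. It says nothing about contamination, which is why the refresh date still matters.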


⚡ QUICK LINKS

  • Sarvam AI: India's latest AI unicorn — $234M raise at $1.5B valuation, backed by HCL Tech. (83 likes) Link
  • Ollama v0.30.9: Adds Cohere2Moe support and proper errors when messages exceed context windows. Link
  • VibeThinker-3B: Weibo enters the model game with a 3B reasoning model for edge deployment. (164 likes) Link
  • GPT-NL: The Netherlands launches a sovereign language model — post-Fable ban, European governments are building their own. (125 likes | 128 RTs) Link
  • datasette-agent 0.3a0: Willison ships agentic data exploration — point an LLM at a SQLite database and let it query conversationally. Link
  • datasette-tailscale 0.1a0: Zero-config Tailscale auth for internal datasette instances. Pairs perfectly with datasette-agent. Link

🎯 PICK OF THE DAY

Origin isn't just another dev tool — it's the first admission that AI agents have outgrown the infrastructure humans built for themselves. Cursor and Graphite unveiling a Git replacement purpose-built for agent workflows sounds like a niche developer tool story, but zoom out: version control is the bedrock of how software gets built, and the fact that a well-funded team concluded Git is fundamentally incompatible with agent-native development is a signal worth taking seriously. Git was designed for human-speed collaboration — pull, review, merge, repeat. When you have 20 agents branching simultaneously, resolving conflicts at machine speed, and needing API-level extensibility rather than CLI ergonomics, Git's 20-year-old assumptions crack. Origin's bet is that version control is merely the first domino. If agents need their own VCS, they'll need their own CI, their own code review, their own deployment pipelines. We're watching the beginning of a full-stack rebuild of developer tooling around agent-native workflows — and the companies that own that new stack will own the next era of software development. (2,191 likes | 107 RTs) Read more →


Until next time ✌️