Sakana AI Ships Marlin, Its First Paid Product: An Autonomous Research Agent

🧠 LAUNCH

Sakana AI Ships Marlin, Its First Paid Product: An Autonomous Research Agent

Sakana AI, founded by ex-Google Brain researchers, officially launches Marlin, an autonomous "Ultra Deep Research" agent built on its NeurIPS-spotlighted AB-MCTS work. Positioned as a "Virtual CSO" for deep research, it's a direct shot at Gemini Deep Research and Perplexity Pro. The pedigree is real and the timing is smart: Marlin ships while frontier labs are busy firefighting. (240 likes | 39 RTs) Read more →

Claude Code Hackathon Winners Reveal Where the Ecosystem Is Heading

Anthropic showcases the winners of its Built with Opus 4.7 Claude Code hackathon, and the projects signal where agentic development is going: not toy demos but production-grade workflows built on tool use and multi-step reasoning. If you're wondering what patterns actually work in Claude Code, start here — the winning projects are a shortcut past months of experimentation. Read more →

Zhipu Drops GLM 5.2, Claims Opus 4.7 Parity. Another week, another Chinese frontier model. Zhipu AI released GLM 5.2 just hours after announcing it and is positioning it as Opus 4.7-class. The open-weight pipeline from China keeps narrowing the gap with closed labs at startling speed. (297 likes | 7 RTs) Read more →

Microsoft Drops FastContext: A 4B Model Built for Speed. Microsoft ships FastContext-1.0-4B-SFT, a 4-billion-parameter model optimized for fast text generation on modest hardware. At that size it won't beat frontier models on reasoning, but it's an early signal that Microsoft is investing in small, efficient models alongside its OpenAI partnership. (100 likes | 13 downloads) Read more →

🔧 TOOL

OpenAI Ships a First-Party Codex Plugin — Treating Codex as a Platform. OpenAI launches an official plugin for Codex that sets up API keys, surfaces relevant docs, and debugs integrations inline. This isn't just convenience — it's OpenAI positioning Codex as a developer platform, not just a coding assistant. (602 likes | 37 RTs) Read more →

Claude Code v2.1.178: Fine-Grained Permission Rules and Monorepo Skills. New Tool(param:value) syntax lets you control exactly which tool calls get auto-approved — block Opus subagents while allowing Sonnet, for instance. Nested .claude/ directories now load skills contextually, which is a real unlock for monorepo workflows where different packages need different agent behaviors. Read more →

Codex CLI Adds Usage Tracking and One-Command Claude Code Import. Codex CLI 0.140.0 ships token usage dashboards, permanent session deletion, and — notably — one-command import of setup and project config from Claude Code. OpenAI is making it trivially easy to switch. (240 likes | 18 RTs) Read more →

tmux Plugin Manages Multiple Claude Code Sessions Across Repos. From craftzdog: a tmux plugin that lets you see which Claude Code sessions are done vs. still working and jump between them from a single popup. If you're running parallel agentic sessions across projects, this solves a real daily pain point. (102 likes | 3 RTs) Read more →

📝 TECHNIQUE

swyx: Ultracode Is Scarily Good — But Your Repo Has to Be Ready

The first detailed practitioner take on Claude Code's Ultracode multi-agent orchestration, and swyx doesn't hold back: "scarily good at burning tokens." His key insight is that the real unlock isn't just enabling it — it's structuring your repo so subagents can fan out in parallel without stepping on each other. Think of subagents as "intelligent subroutines" and design your codebase accordingly. If your monorepo is a tangled dependency graph, Ultracode will just burn cash. (350 likes | 19 RTs) Read more →
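To make "fan out without stepping on each other" concrete, here is a minimal sketch of one way to check it before launching parallel work. It is not how Ultracode schedules subagents, which the thread does not describe. The task names, the path sets, and the helpers `paths_overlap` and `plan_batches` are all hypothetical.

```python
from typing import Dict, List, Set


def paths_overlap(a: str, b: str) -> bool:
    """True if one path equals or contains the other (directory-prefix match)."""
    a_parts = a.strip("/").split("/")
    b_parts = b.strip("/").split("/")
    shorter = min(len(a_parts), len(b_parts))
    return a_parts[:shorter] == b_parts[:shorter]


def plan_batches(tasks: Dict[str, Set[str]]) -> List[List[str]]:
    """Greedily pack tasks into batches whose path sets do not overlap.

    Tasks in the same batch touch disjoint parts of the repo, so they are
    candidates for parallel subagents. Batches run one after another.
    """
    batches: List[List[str]] = []
    for name, paths in tasks.items():
        for batch in batches:
            clash = any(
                paths_overlap(p, q)
                for other in batch
                for p in paths
                for q in tasks[other]
            )
            if not clash:
                batch.append(name)
                break
        else:
            # Every existing batch conflicts with this task: start a new one.
            batches.append([name])
    return batches


# Hypothetical monorepo tasks and the paths each one is expected to edit.
tasks = {
    "fix-auth": {"packages/auth"},
    "update-ui": {"packages/ui"},
    "bump-shared-types": {"packages/shared", "packages/auth/types.ts"},
}
batches = plan_batches(tasks)
print(batches)
```

The more tasks land in the first batch, the more your repo layout is helping. A tangled dependency graph shows up here as many small sequential batches.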

MCP vs CLI: Stop Overthinking It — Use Both. A clean framing that cuts through the noise: CLI for things the model already knows (git, npm, docker — trained on man pages, cheap in context), MCP for integrations the model can't reach natively (Slack, Notion, Linear). Stop debating which is "better" and audit which of your tools belongs where. (307 likes | 16 RTs) Read more →
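If you want to run that audit, a rough sketch might look like the following. The `WELL_KNOWN_CLIS` set is an assumption you would maintain by hand, and `audit_tools` is a hypothetical helper, not part of any MCP or agent SDK.

```python
from typing import Dict, Iterable, List

# Assumption: a hand-maintained set of CLIs the model already knows well.
WELL_KNOWN_CLIS = {"git", "npm", "docker"}


def audit_tools(tools: Iterable[str]) -> Dict[str, List[str]]:
    """Sort tool names into a 'cli' bucket and an 'mcp' bucket."""
    buckets: Dict[str, List[str]] = {"cli": [], "mcp": []}
    for tool in tools:
        key = "cli" if tool.lower() in WELL_KNOWN_CLIS else "mcp"
        buckets[key].append(tool)
    return buckets


result = audit_tools(["git", "Slack", "docker", "Notion", "Linear", "npm"])
print(result)
```

The real decision involves more than a name lookup (auth, rate limits, context cost), so treat the output as a starting list to review by hand.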

Skip the Figma MCP Server — Just Give Your Agent Browser Access. A developer demonstrates that pointing a coding agent at Figma's window.figma plugin API via browser access unlocks full design automation — no dedicated Figma MCP server needed. An entire design-to-code workflow hidden behind one line of agent instruction. (88 likes | 1 RT) Read more →

🔬 RESEARCH

DeepMind Finds AI Models Inherit 'Strange Habits' From Predecessor Outputs. A Google DeepMind researcher demonstrates that models trained on outputs from previous-generation models pick up hard-to-filter behavioral quirks — explaining why models from the same family "feel" similar. The implication for synthetic data pipelines is serious: without diversity guardrails, you're compounding artifacts across generations. (333 likes | 21 RTs) Read more →

'Phantom Quantization': Why You Think Models Get Worse (Even When They Don't). A compelling thread documents what may be a novel psychological effect: users reliably perceive model quality declining with extended use, even when objective benchmarks stay flat. If this holds up, it means vibes-based model assessments are systematically biased — and every "they nerfed it" complaint needs a control group. (251 likes | 6 RTs) Read more →

💡 INSIGHT

Personality Clashes Sent Anthropic's Models Offline, Axios Reports

Simon Willison surfaces an Axios report claiming internal personality clashes at Anthropic contributed to pulling Fable and Mythos offline. If accurate, this reframes the suspension as partly an organizational failure — not purely the safety-driven decision Anthropic publicly described. The gap between the official narrative and the reported reality is widening, and trust is the casualty. Read more →

Anthropic Quietly Updated Its Privacy Policy One Day Before Fable Launched. Willison flags another data point: Anthropic added "verification data" language to its privacy policy on June 8th — one day before Fable shipped and four days before the US government action. Coincidence or anticipation? Either way, the timeline raises uncomfortable questions about what Anthropic knew and when. (207 likes | 10 RTs) Read more →

HN Reality Check: Can Local Models Actually Replace Claude for Daily Coding? A 600+ point Hacker News thread where developers share real experiences replacing cloud AI with local models. The consensus is sobering: local works for autocomplete and small tasks but still falls short on complex multi-file reasoning. A useful reality check for anyone rage-switching providers after the Fable suspension. (603 likes | 309 RTs) Read more →

Mollick: AI Is Ready for Moonshots — But They Need Public R&D, Not Just Startups. Ethan Mollick argues AI has reached the level where transformative public-good projects — universal tutoring, replication systems, remote medicine — are genuinely feasible. The catch: they require public R&D investment, consensus, and transparency, not just private-sector sprints. A timely argument when the frontier labs are consumed by competitive drama. (394 likes | 23 RTs) Read more →

🏗️ BUILD

Build a GPT-Style Transformer From Scratch — No High-Level Libraries. A repo that walks the full path from raw components — attention, multi-head attention, feed-forward blocks, embeddings, layer norm — to generated text with zero library abstractions. If you want to actually understand what's under the hood instead of just calling APIs, clone it and work through the notebooks. (354 likes | 49 RTs) Read more →
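For a taste of what "zero library abstractions" means, here is a small sketch of causal scaled dot-product attention using only the Python standard library. It is not taken from the repo. It skips the learned query, key, and value projections and feeds the same toy matrix in for all three, which is a simplification for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix, causal: bool = True) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(k[0])
    weights: Matrix = []
    for i, q_row in enumerate(q):
        scores = []
        for j, k_row in enumerate(k):
            if causal and j > i:
                # Mask future positions so token i only sees tokens 0..i.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q_row, k_row))
                scores.append(dot / math.sqrt(d_k))
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value rows.
    out = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, v)) for c in range(len(v[0]))]
        for w_row in weights
    ]
    return out, weights


# Three toy token vectors of width 2, used as Q, K, and V alike.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

With the causal mask on, the first token can only attend to itself, so its output is just its own value vector.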

fusion-fable: Fuse Opus 4.8 + GPT-5.5 to Approximate Suspended Fable. With Fable suspended, developers are building workarounds. This Claude Code skill uses Opus 4.8 as the drafter and GPT-5.5 as the checker, then fuses the results — a practical cross-vendor arbitrage pattern for anyone who misses Fable-tier output. (225 likes | 31 RTs) Read more →
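The general drafter-plus-checker shape can be sketched in a few lines. This is an assumption about the pattern, not the skill's actual code: the blurb does not say how fusion-fable merges results, so the "fuse" step below is simply a revision pass by the drafter. `Model`, `draft_check_fuse`, and the stub functions are hypothetical stand-ins for real API clients.

```python
from typing import Callable

# A "model" here is any function from prompt text to response text.
Model = Callable[[str], str]


def draft_check_fuse(task: str, drafter: Model, checker: Model) -> str:
    """Draft with one model, critique with another, revise if needed."""
    draft = drafter(f"Task: {task}\nWrite a solution.")
    critique = checker(
        f"Task: {task}\nDraft:\n{draft}\nList concrete problems, or reply OK."
    )
    if critique.strip().upper() == "OK":
        return draft
    # Assumed fusion step: hand the checker's notes back to the drafter.
    return drafter(
        f"Task: {task}\nDraft:\n{draft}\nReviewer notes:\n{critique}\nRevise the draft."
    )


# Stub models so the sketch runs without any vendor SDK or API key.
def stub_drafter(prompt: str) -> str:
    return "revised draft" if "Reviewer notes" in prompt else "first draft"


def stub_checker_ok(prompt: str) -> str:
    return "OK"


def stub_checker_picky(prompt: str) -> str:
    return "Off-by-one error in the loop bound."


accepted = draft_check_fuse("sum a list", stub_drafter, stub_checker_ok)
revised = draft_check_fuse("sum a list", stub_drafter, stub_checker_picky)
print(accepted, "|", revised)
```

Swapping the stubs for real clients is the only vendor-specific part, which is what makes the pattern portable across providers.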

Qwen 3.6 40B GGUF Hits 376K Downloads as Local Model Demand Spikes. A community GGUF quantization of Qwen 3.6 40B is one of the hottest models on Hugging Face right now. The surge in uncensored local model demand post-Fable is unmistakable. (336 likes | 376K downloads) Read more →

🎓 MODEL LITERACY

Model Collapse: Today's DeepMind research shows that AI models trained on outputs from predecessor models inherit subtle behavioral quirks, and this connects to a broader phenomenon called model collapse. When models are recursively trained on synthetic data from their own lineage, small artifacts get amplified across generations: distinctive phrasings become tics, biases become blind spots, and the output distribution narrows. This is why models from the same family often "feel" alike: they are inheriting each other's habits. For anyone building synthetic data pipelines, the takeaway is clear: diversity guardrails aren't optional. They are what keeps your training loop from converging on a hall of mirrors.
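The narrowing can be seen in a toy simulation. This is a deliberately crude sketch, not the DeepMind setup: each "model" is just the empirical distribution of its parent's outputs, and `simulate_lineage` with all its parameters is hypothetical. Because each generation can only emit phrases it saw in the previous one, any phrase that drops out is gone for good.

```python
import random
from typing import List


def simulate_lineage(
    vocab_size: int = 1000,
    samples_per_gen: int = 1000,
    generations: int = 10,
    seed: int = 0,
) -> List[int]:
    """Return the number of distinct 'phrases' surviving at each generation."""
    rng = random.Random(seed)
    # Generation 0: human-written data where every phrase appears once.
    corpus = list(range(vocab_size))
    distinct = [len(set(corpus))]
    for _ in range(generations):
        # The next model trains only on its parent's outputs, so it samples
        # (with replacement) from whatever the parent happened to produce.
        corpus = rng.choices(corpus, k=samples_per_gen)
        distinct.append(len(set(corpus)))
    return distinct


distinct = simulate_lineage()
print(distinct)
```

The distinct count can never go up from one generation to the next, which is the hall-of-mirrors effect in miniature. Mixing fresh human data back in at each step is one example of a diversity guardrail.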

⚡ QUICK LINKS

Anthropic Python SDK v0.109.2: Removes retired model IDs — audit your hardcoded strings. Link
Anthropic TypeScript SDK v0.104.2: Same retired-model cleanup, shipped same day. Link
Gemma 4 12B Coder Fine-Tune: Community model tuned for coding with Fable-style composition. (161 likes | 6.2K downloads) Link
Fata: Spaced repetition to fight the skill rot from offloading everything to AI. Link
Google Commits $1.5B to Alabama Data Center Expansion: The infrastructure arms race keeps accelerating. Link

🎯 PICK OF THE DAY

The persistent gap between measured model quality and perceived model quality isn't user error. It's a blind spot. The "phantom quantization" thread documents something that should unsettle anyone who evaluates AI models: users consistently perceive quality declining over time, even when benchmark scores stay flat. This isn't just complaining. It appears to be a systematic perceptual bias, and it likely matters more as frontier models converge. When the objective gap between Claude, GPT, and Gemini narrows to a few percentage points, subjective "vibes" become the dominant evaluation signal, and those vibes are apparently unreliable by default. The implication is stark: every "they nerfed it" post, every model comparison thread driven by feel rather than measurement, is potentially distorted by this effect. For the industry, this means we're making provider decisions, shaping public narratives about model quality, and driving developer sentiment based on an evaluation method that may be fundamentally broken. Benchmark before you complain, and if you can't benchmark, at least acknowledge the bias. Read more →
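"Benchmark before you complain" can be as simple as rerunning a fixed prompt set and comparing scores. Here is one possible sketch, assuming you already have per-prompt scores from two dates. The scores are made up, `paired_shift` is a hypothetical helper, and the sign-flip permutation test is just one reasonable choice, not a method from the thread.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_shift(
    before: List[float], after: List[float], trials: int = 2000, seed: int = 0
) -> Tuple[float, float]:
    """Return (mean score change, sign-flip permutation p-value).

    The p-value estimates how often a change at least this large would
    appear if 'before' and 'after' were interchangeable on every prompt.
    """
    if len(before) != len(after):
        raise ValueError("score lists must cover the same prompts")
    diffs = [a - b for a, b in zip(after, before)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Randomly swap the before/after labels on each prompt.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed):
            extreme += 1
    return observed, extreme / trials


# Made-up scores on the same eight prompts, one month apart.
march = [0.82, 0.75, 0.91, 0.68, 0.77, 0.85, 0.73, 0.80]
april_flat = list(march)                  # nothing changed
april_worse = [s - 0.10 for s in march]   # a real, uniform drop

flat_shift, flat_p = paired_shift(march, april_flat)
drop_shift, drop_p = paired_shift(march, april_worse)
print(flat_shift, flat_p)
print(round(drop_shift, 3), drop_p)
```

Identical scores give a shift of 0.0 and a p-value of 1.0, which is what "they didn't nerf it" looks like in numbers. A small p-value alongside a negative shift is the evidence a regression complaint actually needs.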

Until next time ✌️