When the Cache Breaks

New Model Releases & Benchmarks

The local AI ecosystem is having a moment. While the big labs jostle for position on benchmarks and ARR, this week's real action is at the infrastructure layer: llama.cpp just unlocked audio processing for Gemma 4, speculative decoding is delivering speedups of up to 50% on a consumer GPU, and a solo developer claims to have cracked 1M-token context windows on a $600 graphics card. The MiniMax M2.7 quant ecosystem is maturing fast, with early Mac benchmarks showing genuine "Sonnet at home" potential. Meanwhile, OpenAI's Spud remains the most anticipated vaporware in AI, with Polymarket giving it 78% odds of shipping by month's end.

llama.cpp Lands Audio Processing with Gemma 4

The llama.cpp project has merged support for audio processing via Gemma 4's E2A and E4A models, adding a USM-style Conformer encoder to llama-server. This means local speech-to-text is now available through the same inference stack that powers most local LLM deployments. The implementation supports automatic speech recognition and speech-to-translated-text across multiple languages, all running entirely on-device. Combined with Gemma 4's existing vision capabilities, llama-server is quietly becoming a full multimodal inference platform.

Why it matters: Local multimodal inference just crossed another threshold. Audio was the last major modality missing from the open-source stack, and its arrival in llama.cpp means developers can build voice-enabled AI applications without any cloud dependency.

Speculative Decoding Hits +50% on Code with Gemma 4 E2B Draft

A community benchmark on an RTX 5090 shows that speculative decoding with Gemma 4 E2B (4.65B) as the draft model for Gemma 4 31B delivers a 29% average speedup across tasks and a 50% speedup on code generation. The setup runs entirely on a single consumer GPU with 32GB of VRAM. The result suggests that the smaller "E" variants track the larger model's output distribution closely enough to serve as effective draft models.

Why it matters: Speculative decoding has long been theoretically appealing but practically finicky. These results suggest Google's Gemma 4 model family was designed with spec-dec in mind, making it the first open model ecosystem where the technique "just works" out of the box.
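For readers new to the technique, here is a minimal Python sketch of greedy speculative decoding. It is not the llama.cpp implementation: target_model and draft_model are hypothetical toy functions standing in for the 31B and E2B models, and the per-token verification loop stands in for what a real engine does in one batched forward pass. What it illustrates is that the output matches decoding with the large model alone, while the large model is invoked fewer times whenever the draft guesses well.

```python
from typing import Callable, List, Tuple

# A "model" here maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical large model: a fixed deterministic rule over a 50-token vocabulary.
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical small model: agrees with the target except when the
    # prefix length is a multiple of 4 (an assumption made for illustration).
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (tokens, number of target verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft proposes up to k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position. A real engine scores
        #    all positions in a single batched pass, counted once here.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's correction, drop the rest
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target's next token comes for free.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out, target_passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 20)
fast, passes = speculative_decode(target_model, draft_model, prompt, 20, k=4)
print("identical output:", fast == baseline)
print("target passes:", passes, "for 20 new tokens")
```

Real-world gains depend on how often the draft's tokens are accepted and on how cheap the draft is relative to the target, which may be why highly predictable text such as code benefits most in the benchmark above.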

KIV: 1M-Token Context on 12GB VRAM

A developer released KIV (K-Indexed V Materialization), a middleware layer that replaces HuggingFace's standard DynamicCache with a tiered retrieval system. It keeps recent tokens exact in VRAM, offloads older keys and values to system RAM and disk, and materializes them on demand. The author claims 1M-token context windows on an RTX 4070 with just 12GB VRAM, working as a drop-in replacement for any model using DynamicCache. If the accuracy claims hold up under scrutiny, this could democratize long-context inference.

Why it matters: Long context has been gated by VRAM. A working, accuracy-preserving tiered cache would let consumer hardware handle document-scale tasks currently reserved for cloud APIs or high-end workstations.
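To make the tiering idea concrete, here is a minimal sketch in plain Python. It is not KIV's code and does not use the HuggingFace cache API: the class name TieredKVCache, the hot_size and top_k parameters, and the dot-product ranking of offloaded keys are all assumptions chosen for illustration. Recent entries stay in a small "hot" window, older values move to a "cold" store standing in for system RAM or disk, and only the best-matching cold values are fetched for each query.

```python
import math
from collections import deque
from typing import Deque, Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class TieredKVCache:
    """Toy two-tier KV cache (hypothetical, for illustration only)."""

    def __init__(self, hot_size: int, top_k: int) -> None:
        self.hot_size = hot_size
        self.top_k = top_k
        self.hot: Deque[Tuple[int, Vector, Vector]] = deque()  # stands in for VRAM
        self.cold_keys: List[Tuple[int, Vector]] = []  # index over offloaded keys
        self.cold_values: Dict[int, Vector] = {}  # stands in for RAM or disk
        self.next_pos = 0
        self.materialized = 0  # how many cold values were fetched so far

    def append(self, key: Vector, value: Vector) -> None:
        self.hot.append((self.next_pos, key, value))
        self.next_pos += 1
        # Evict the oldest entries from the hot window into the cold tier.
        while len(self.hot) > self.hot_size:
            pos, old_key, old_value = self.hot.popleft()
            self.cold_keys.append((pos, old_key))
            self.cold_values[pos] = old_value

    def attend(self, query: Vector) -> Vector:
        # Rank offloaded keys against the query and fetch only the top_k values.
        ranked = sorted(
            self.cold_keys, key=lambda item: dot(query, item[1]), reverse=True
        )[: self.top_k]
        entries = [(key, self.cold_values[pos]) for pos, key in ranked]
        self.materialized += len(entries)
        # Recent tokens are always attended to exactly.
        entries += [(key, value) for _, key, value in self.hot]
        if not entries:
            raise ValueError("cache is empty")
        # Standard softmax attention over the selected entries.
        scores = [dot(query, key) for key, _ in entries]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        dim = len(entries[0][1])
        return [
            sum(w * value[i] for w, (_, value) in zip(weights, entries)) / total
            for i in range(dim)
        ]


cache = TieredKVCache(hot_size=2, top_k=1)
cache.append([20.0, 0.0], [1.0, 0.0])  # an old but highly relevant token
for _ in range(4):
    cache.append([0.0, 1.0], [0.0, 1.0])  # filler tokens
print(cache.attend([1.0, 0.0]), "cold values fetched:", cache.materialized)
```

The sketch ranks every offloaded key with a linear scan, which would not scale to a million tokens. A practical system needs a cheaper, approximate index, and that approximation is exactly where accuracy can slip.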

Update: MiniMax M2.7 Quants Go Live, Mac Benchmarks Impress

Following MiniMax M2.7's official release (covered April 12), the Unsloth team has uploaded a full range of GGUF quants from Q1 to BF16. Early Mac benchmarks are turning heads: the 63GB quant scores 88% on a 200-question MMLU subset, while the 89GB version hits 95%. Users estimate that the M5 Max should reach roughly 50 tok/s, putting this firmly in "Sonnet 4.5 at home" territory. Community members are already running parallel sub-agents locally through Opencode on Apple Silicon.

Why it matters: A 230B MoE model running locally at near-frontier quality with usable speeds represents a new high-water mark for local inference. The sub-agent use case, with multiple parallel streams, shows these models are practical for real agentic workflows.

Update: OpenAI's "Spud" Enters Final Stretch

Building on our April 10 coverage of Spud's pretraining completion, the model is now deep in safety evaluation and red-teaming. Polymarket assigns a 78% probability to release by April 30. Sam Altman has told employees it is a "very strong model" that could "really accelerate the economy," while Greg Brockman called it "two years of research" and "a significant change in the way we think about model development." Whether it ships as GPT-5.5 or GPT-6 reportedly depends on internal benchmarking against GPT-5.4. No architecture paper, parameter count, or pricing has been disclosed.

Why it matters: If Spud lands this month with the capabilities OpenAI is hinting at, it could reset the frontier benchmark landscape just as Anthropic's Mythos controversy and Gemini 3.1's strong showing have made the race feel genuinely competitive.

Research Papers & Breakthroughs

A quieter day on the research front, but two threads are worth tracking. New work on KV cache offloading reveals that the popular memory-saving technique breaks down on tasks that actually need long context, which is precisely when you'd want it most. And the Mythos skepticism cycle deepened with a detailed Tom's Hardware investigation challenging the statistical basis of Anthropic's zero-day claims. The pattern here is one of reality-checking: the community is getting better at stress-testing headline claims.

KV Cache Offloading Fails on Context-Intensive Tasks

A new paper, KV Cache Offloading for Context-Intensive Tasks (arXiv:2604.08426), evaluates modern KV-cache offloading strategies on problems that require looking up significant information from the input prompt. The authors introduce Text2JSON, a benchmark requiring structured knowledge extraction from raw text, and find "significant performance degradation" on both Llama 3 and Qwen 3 models when offloading is enabled. The analysis traces the failures to two causes: low-rank projection of keys and unreliable landmarks. The paper proposes a simpler alternative strategy that improves accuracy across multiple model families.

Why it matters: KV cache offloading is central to the long-context story for local inference (see KIV above). This paper reveals a fundamental tension: the tasks where you most need long context are exactly the tasks where current offloading techniques lose accuracy. Any practical solution will need to address this head-on.
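A toy example shows how a low-rank projection of keys can send retrieval to the wrong token. This is an illustration of the general failure mode, not the paper's method or data: the vectors, the names "needle" and "distractor", and the projection (simply keeping the first coordinates) are all assumptions made up for the sketch.

```python
from typing import Dict, List, Optional

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def project(vec: Vector, rank: int) -> Vector:
    # Keep only the first `rank` coordinates: an orthogonal projection
    # onto a low-dimensional subspace (a stand-in for a learned projection).
    return vec[:rank]


def best_match(query: Vector, keys: Dict[str, Vector], rank: Optional[int] = None) -> str:
    """Return the name of the key with the highest dot-product score."""
    if rank is not None:
        query = project(query, rank)
        keys = {name: project(key, rank) for name, key in keys.items()}
    return max(keys, key=lambda name: dot(query, keys[name]))


# Hypothetical keys: the needle matches the query mostly along the last axis,
# which the rank-1 projection throws away.
keys = {
    "needle": [1.0, 0.0, 0.9],
    "distractor": [1.1, 0.0, 0.0],
}
query = [1.0, 0.0, 1.0]

print("full keys:  ", best_match(query, keys))
print("rank-1 keys:", best_match(query, keys, rank=1))
```

With full keys the needle wins, but after projection the distractor scores higher, so an offloading scheme that ranks by projected keys would fetch the wrong entry. Lookup-heavy tasks depend on precisely these fine-grained distinctions.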

Update: Mythos Zero-Day Claims Face Detailed Statistical Challenge

Following our April 12 coverage of growing Mythos scrutiny, a Tom's Hardware investigation has laid out the full statistical argument. Anthropic's claim of "thousands" of severe zero-days rests on extrapolation from just 198 manually reviewed vulnerability reports, where expert contractors agreed with Claude's severity assessment 89% of the time. Critics point out that many flagged bugs are in older software or are impractical to exploit, and that other frontier models can likely replicate similar capability. The piece frames Mythos less as a unique breakthrough and more as a "sales pitch" for Anthropic's enterprise cybersecurity play.

Why it matters: The debate is shifting from "is Mythos powerful?" to "is the capability unique and are the claims rigorous?" This matters for Project Glasswing partners making procurement decisions based on Anthropic's system card data.

Industry News & Business

The biggest story today isn't a model release. It's that Anthropic has quietly passed OpenAI in annualized revenue, hitting $30B ARR while spending a fraction of what OpenAI spends on training. That inversion, the scrappier lab outearning the incumbent, is the kind of shift that reshapes industry narratives. Meanwhile, the physical safety of AI executives is becoming an unavoidable concern after a second attack on Sam Altman's home in days, and the three major US frontier labs have formalized their first joint intelligence-sharing operation against Chinese model distillation.

Anthropic Hits $30B ARR, Surpasses OpenAI

Anthropic's annualized revenue has reached $30 billion, more than tripling from $9 billion at year-end 2025 and surpassing OpenAI's $25 billion ARR for the first time. The growth is enterprise-driven: 80% of revenue comes from business customers, and the company has doubled its count of million-dollar-plus accounts from roughly 500 in February to over 1,000 today. SaaStr notes that Anthropic achieved this while spending roughly a quarter of what OpenAI spends on model training.

Why it matters: This is the first time a challenger has overtaken OpenAI on revenue. Anthropic's enterprise-heavy mix and capital efficiency suggest a fundamentally different business model, one that may prove more sustainable as frontier training costs continue to climb.

Update: Second Attack on Sam Altman's Home, Two Arrested

Days after a Molotov cocktail was thrown at his Russian Hill residence (covered April 11), OpenAI CEO Sam Altman's home was targeted again early Sunday morning. According to SFPD reports, a car stopped in front of the property around 2:56 AM and a passenger fired a shot through the window. Two suspects, Amanda Tom (25) and Muhamad Tarik Hussein (23), were arrested and booked on negligent discharge charges. No injuries were reported. Authorities have not confirmed whether the two incidents are connected.

Why it matters: Two attacks on a tech CEO's home in one week is unprecedented in Silicon Valley. Regardless of motive or connection, this escalation raises serious questions about the personal security of prominent AI figures and the intensity of public backlash against the industry.

Frontier Labs Unite Against Chinese Model Distillation

OpenAI, Anthropic, and Google have begun sharing intelligence through the Frontier Model Forum to detect and block adversarial distillation attempts by Chinese AI firms. According to reporting by Bloomberg and Japan Times, Anthropic alleges that DeepSeek, Moonshot AI, and MiniMax collectively generated over 16 million exchanges with Claude via roughly 24,000 fraudulent accounts. US officials estimate adversarial distillation costs American labs billions annually. OpenAI has separately submitted a formal memo to the House Select Committee on China detailing DeepSeek's "new, obfuscated methods."

Why it matters: This is the first coordinated defensive operation between competing frontier labs. It signals that model distillation has escalated from a nuisance to a strategic threat, and that geopolitical competition is now a first-order concern in AI business strategy.

Claude Code Cache TTL Regression Sparks Community Revolt

A community investigation has uncovered evidence that Anthropic silently reduced Claude Code's prompt cache TTL from 1 hour to 5 minutes around early March 2026. With the shorter TTL, any session pause longer than 5 minutes expires the cached prefix, so the next request must re-write the full context to the cache at cache-write rates rather than reading it at the much cheaper cache-read rate. The investigation attributes 20-32% cost inflation and unexplained quota exhaustion for paying subscribers to the change. A separate reverse-engineering effort found two additional caching bugs that can multiply token consumption by up to 20x. Anthropic's Claude Code product lead Lydia Hallie acknowledged the issue publicly, and a partial fix shipped in v2.1.88, but core bugs remain unpatched.

Why it matters: When a developer tool silently becomes more expensive, by 20-32% in the community's estimate and by as much as 20x in the worst reported cases, trust erodes fast. This is the kind of operational issue that can drive enterprise customers to evaluate alternatives, especially as local models become increasingly viable.
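A back-of-the-envelope model shows why the TTL matters so much for sessions with pauses. The sketch below is an assumption-laden illustration, not a reproduction of the community's 20-32% figure. It assumes a fixed context size, that each cache hit refreshes the TTL, and price multipliers relative to the base input-token rate (0.1x for a cache read, 1.25x for a 5-minute cache write, 2x for a 1-hour cache write). The function name session_cost and the gap pattern are hypothetical.

```python
from typing import List


def session_cost(
    gaps_min: List[float],
    context_tokens: int,
    ttl_min: float,
    write_mult: float,
    read_mult: float = 0.1,
) -> float:
    """Cost of a session in base-input-token equivalents.

    gaps_min holds the idle time before each follow-up request, in minutes.
    The first request always writes the context to the cache.
    """
    cost = context_tokens * write_mult
    for gap in gaps_min:
        if gap <= ttl_min:
            cost += context_tokens * read_mult  # cache hit: cheap read
        else:
            cost += context_tokens * write_mult  # cache expired: full re-write
    return cost


# Hypothetical session: mostly quick turns, with three pauses over 5 minutes.
gaps = [1, 2, 8, 1, 1, 12, 3, 1, 20, 2]
context = 100_000

short_ttl = session_cost(gaps, context, ttl_min=5, write_mult=1.25)
long_ttl = session_cost(gaps, context, ttl_min=60, write_mult=2.0)

print(f"5-minute TTL: {short_ttl:,.0f} token-equivalents")
print(f"1-hour TTL:   {long_ttl:,.0f} token-equivalents")
print(f"ratio:        {short_ttl / long_ttl:.2f}x")
```

The size of the gap depends entirely on how often a user pauses for more than 5 minutes. A session with no long pauses would actually be cheaper under the short TTL, because under the assumed multipliers its initial write costs less.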

Reddit Community Highlights

The community mood this weekend is a cocktail of excitement and frustration. Local model enthusiasts are riding high on MiniMax M2.7 benchmarks and Gemma 4 audio support, while Claude users are channeling their inner detective to explain why their subscriptions feel degraded. The cache TTL investigation on r/ClaudeAI has the energy of a grassroots audit, and the "golden age is over" sentiment is gaining traction. On the local side, the question is no longer "can you run frontier models locally?" but "which one should you run?"

r/LocalLLaMA

Audio Processing Lands in llama-server with Gemma 4. The confirmation that llama.cpp's llama-server now supports speech-to-text via Gemma 4 E2A and E4A models was one of the top posts. This represents a major milestone for the local inference ecosystem, bringing multimodal audio capabilities to the same server infrastructure already handling text and vision workloads.

Reddit thread: Audio processing landed in llama-server with Gemma-4

Speculative Decoding Works Great for Gemma 4 31B with E2B Draft. Controlled benchmarks on an RTX 5090 showed a 29% average speedup, rising to 50% on code generation, using the 4.65B E2B model as a draft for the full 31B. The community is especially excited that the draft/main model pairing works naturally within the same model family.

Reddit thread: Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

GLM 5.1 Sits Alongside Frontier Models in Social Reasoning Benchmark. A custom benchmark pitting LLMs against each other in autonomous games of Blood on the Clocktower (a complex social deduction game) shows GLM 5.1 performing competitively with frontier models. The community appreciates creative benchmarks that test capabilities standard evals miss.

Reddit thread: GLM 5.1 sits alongside frontier models in my social reasoning benchmark

r/ClaudeAI

"The Golden Age Is Over." A heavily-discussed post argues that consumer LLM access has peaked. The author, who subscribes to Claude, ChatGPT, Gemini, and Perplexity, reports that Claude's lead on text analysis tasks has evaporated over the past three weeks, with competitors catching up or surpassing it. The post reflects a broader sentiment shift in the community.

Reddit thread: The golden age is over

Claude Code Max Burns Limits 40% Faster with 20K Less Usable Context. A detailed proxy analysis claims Claude Code v2.1.100+ silently adds approximately 20K invisible tokens to every request server-side, eating limits faster and potentially degrading output quality. The author recommends downgrading to v2.1.98 for immediate relief. This post, combined with the cache TTL investigation, is fueling a narrative of silent degradation.

Reddit thread: Why Claude Code Max burns limits 40% faster with 20K less usable context. Proxy evidence inside.

Cache TTL Silently Regressed from 1h to 5m. The community is amplifying the GitHub issue documenting the cache regression, with users connecting it to their own experiences of suddenly hitting quota limits for the first time. The thread is a mix of technical analysis and user frustration at the lack of official communication.

Reddit thread: Did they just find the issue with Claude? "Cache TTL silently regressed from 1h to 5m"

r/LocalLLM

Small Local LLMs for Browser Agents. A practical demo running Qwen3:8b as planner and Gemma4:e4b as executor in an accounts payable workflow shows that small local models can handle browser-agent tasks if the runtime provides the right abstraction layer. The community is interested in hybrid local/cloud agent architectures that keep costs down.

Reddit thread: Small local LLM for browser agents: qwen3:8b + gemma4:e4b on a finance workflow

Is Gemma 4 Really Better Than Haiku 4.5? A discussion of Gemma 4 31B beating Haiku 4.5 on LiveBench's agentic coding category asks whether it is truly time to switch from cloud APIs to local models. The consensus leans toward "yes for coding, not yet for general chat," reflecting the community's nuanced view of benchmarks versus vibes.

Reddit thread: Is Gemma 4 really better than Haiku 4.5 and Gemini 3.1 Flash Lite?

Coding Agent Framework for 24/7 Local LLM Use. A user with both an RTX 4080 and a Strix Halo machine asks about frameworks that can automatically break down features, rate their complexity, and route each task to a local or cloud model. The responses highlight the growing demand for intelligent routing in hybrid inference setups.

Reddit thread: Coding agent framework for 24/7 use of local LLMs?

r/huggingface

KIV: 1M Token Context Window on RTX 4070. The standout post describes a drop-in DynamicCache replacement using tiered retrieval across VRAM, system RAM, and disk. The community is cautiously excited but waiting for independent accuracy benchmarks before declaring victory.

Reddit thread: KIV: 1M token context window on a RTX 4070 (12GB VRAM), no retraining, drop-in HuggingFace cache replacement

r/accelerate

Second Attack on Sam Altman's Home. The subreddit amplified news of the shooting incident at Altman's Russian Hill residence, coming just days after the molotov cocktail attack. Discussion focused on the escalating hostility toward AI leaders and whether this represents a broader societal backlash.

Reddit thread: Sam Altman's home targeted in second attack

South Korea Declares Internet a Basic Right. South Korea's three major carriers will provide unlimited 400 Kbps data to 7 million users after their monthly data allowances run out, saving those users roughly $219 million annually. The community sees this as a template for how connectivity policy should evolve in the AI era.

Reddit thread: South Korea's telecom giants surprise 7 million users with unlimited, universal internet

r/unsloth

MiniMax-2.7 Can Now Be Run Locally. Unsloth's official post confirms that all GGUF quants are uploaded and verified. The guide notes that the Dynamic 4-bit MoE model fits on a 128GB Mac or an equivalent RAM/VRAM setup. Community excitement is high given MiniMax M2.7's SOTA results on SWE-Pro and Terminal Bench 2.

Reddit thread: MiniMax-2.7 can now be run locally!

"Losing Hope" on AI Coding. A frustrated user with six months of experience across five stalled projects asks whether AI will ever reliably write working codebases, not just individual functions. The post resonated with the community, sparking discussion about the gap between demo-quality and production-quality AI coding.

Reddit thread: Losing hope