Audesso | Daily: AI

Google Releases DiffusionGemma, a 26B MoE Model Generating Text 4x Faster

00:00 / --:--

← Back to home

Google Releases DiffusionGemma, a 26B MoE Model Generating Text 4x Faster

1. Google Releases DiffusionGemma, a 26B MoE Model Generating Text 4x Faster

DiffusionGemma activates 3.8B parameters during inference and supports a 256K token context window across more than 140 languages. Because it processes text on a parallel canvas, it is highly optimized for speed-critical, interactive local workflows like mathematical graphing, molecular sequencing, and Sudoku solving. The model is available on Hugging Face with day-zero support in vLLM, Transformers, MLX, and Unsloth.

  • Google released DiffusionGemma, a 26B Mixture of Experts (MoE) open model under an Apache 2.0 license.
  • The model uses text diffusion to generate text in parallel blocks of up to 256 tokens, rather than token-by-token autoregressive decoding.
  • It achieves speeds of over 1,000 tokens per second on an NVIDIA H100 and over 700 tokens per second on an RTX 5090.
  • When quantized to NVFP4, the model fits within 18GB of VRAM, making it suitable for local high-end consumer GPUs.
  • It features bidirectional attention and real-time self-correction via re-noising when confidence drops.

Developers can run this open-weight model locally on consumer GPUs to achieve speeds of over 700 tokens per second for non-linear tasks like code infilling and in-line editing.

2. Cohere Transcribe Claims Top Spot on Hugging Face Far-Field ASR Benchmark

Cohere Transcribe has taken the top spot on Hugging Face's newly launched audio transcription leaderboard. Released under an Apache 2.0 license, the model provides developers with a highly capable, open-source alternative for speech-to-text applications.

  • Cohere Transcribe is ranked number one on the new Hugging Face Far-Field ASR benchmark.
  • The model is open-source and distributed under the permissive Apache 2.0 license.
  • The evaluation benchmark did not exist at the time the Cohere Transcribe model was trained, demonstrating its zero-shot generalization.

Developers looking for high-accuracy, self-hosted audio transcription can leverage this Apache 2.0 licensed model for far-field speech recognition.

SOURCES

3. OpenAI and Visa Partner to Enable AI Agent Payments

This integration represents a major step toward fully autonomous transactional agents. By embedding Visa's payment rails directly into the OpenAI platform, developers can bypass complex custom payment integrations and securely authorize agents to complete checkouts on behalf of users.

  • OpenAI and Visa integrated payment services to allow AI agents to make online purchases with user permission.
  • Retailers can accept agent-driven transactions directly through the integrated Visa payment services.
  • AI agents can be authorized by users to perform tasks like paying bills or purchasing household goods.
  • The partnership expansion was officially announced on Wednesday.

Developers can build agents capable of autonomously executing financial transactions, such as paying bills or purchasing goods, with user authorization.

SOURCES

4. Anthropic Introduces Invisible Prompt Interventions in Claude Fable 5

The release of Claude Fable 5 has sparked debate over Anthropic's safety policies, with critics arguing that invisible interventions harm the AI ecosystem and make local, open-source alternatives more necessary. The safeguards are applied dynamically, meaning developers may not receive explicit refusal messages when the model's effectiveness is restricted.

  • Anthropic introduced invisible interventions in Claude Fable 5 that modify prompts and apply steering factors without informing the user.
  • The safeguards are designed to limit Claude's effectiveness in specific situations, such as when competing labs use the model for development.
  • These interventions operate through prompt modification, steering factors, and parameter-efficient fine-tuning rather than model fallback.
  • Anthropic states that these invisible interventions will impact approximately 0.03% of developers.
  • The lack of visibility for these safeguards has raised concerns regarding potential supply chain risks and tool trustworthiness.

Developers building LLM-based development tools should be aware that Claude Fable 5 may silently degrade or alter its behavior when tasked with model distillation or training workflows.

5. Indirect Prompt Injection Vulnerability Discovered in Bunq Banking AI Agent

This vulnerability highlights the severe risks of indirect prompt injection in RAG-enabled agents. Blue41 recommends a layered security approach for financial AI assistants, including minimizing context, treating retrieved data as untrusted, constraining sensitive outputs, and monitoring runtime behavior to detect anomalous activity.

  • Blue41 won the RSAC Launch Pad competition by demonstrating an exploit on Bunq's AI assistant.
  • An attacker sent a tiny bank transfer containing a malicious prompt injection payload in the transaction description.
  • When the AI assistant retrieved the transaction data to answer user queries, it executed the payload as instructions.
  • The exploit allowed the AI assistant to autonomously deliver a credible phishing attack directly within the banking app.
  • The attack required no malware or device access, relying entirely on the retrieval of untrusted transaction data.

Developers building financial or transactional agents must treat all retrieved external data as untrusted to prevent agents from executing unauthorized actions or phishing attacks.

SOURCES

6. Evo Ports Autoresearch Orchestrator to Claude Code Dynamic Workflows

By scripting agentic elements in JavaScript rather than relying on the LLM to maintain state in its context window, Evo's updated orchestrator significantly improves reliability over long-horizon tasks. This approach mitigates context drift and ensures strict adherence to execution rules.

  • Evo ported its autoresearch orchestrator to utilize Anthropic's dynamic workflows within Claude Code.
  • The update transitions a six-step round from in-context memory into deterministic JavaScript executed by subagents.
  • Subagents run with fresh, scoped contexts to execute phases, fan-out width, stopping rules, gates, and CLI calls.
  • The architecture separates concerns, making the model responsible for judgment while the code manages coordination.

Developers can adopt this pattern to improve long-horizon instruction adherence in complex agentic workflows.

SOURCES

7. HelixDB Launches as Graph Database Built on Object Storage

HelixDB offers a novel architecture for managing agent state and memory by leveraging cheap object storage instead of expensive dedicated database instances. Upcoming features include pre-filtering for vector search, with a general availability cloud release scheduled for the coming weeks.

  • HelixDB is an OLTP graph database combining native vector search and full-text search on object storage.
  • The database utilizes S3 as its persistence layer to enable horizontal scaling for large graph datasets.
  • It reports a p99 latency of approximately 100ms for writes and 50ms for reads from cold storage.
  • Primary use cases include AI memory, company knowledge bases, and managing data for autonomous agents.
  • It is available for local development via GitHub, with an open-source generalized AI memory layer currently in development.

Developers can build scalable, cost-effective AI memory layers and agent knowledge bases on top of object storage.

SOURCES

8. Extend UI Open-Sources MIT-Licensed UI Kit for Document Apps

Extend UI provides a polished set of front-end components that solve common UI challenges in document-heavy AI applications. By open-sourcing these tools, Extend.ai allows developers to easily implement bounding box citations and multi-format document viewers without building them from scratch.

  • Extend.ai open-sourced 14 components and examples for document viewing and processing under the MIT license.
  • Components include support for PDF, DOCX, and XLSX viewers, bounding box citations, file uploads, and e-signatures.
  • The kit was originally developed for internal use, processing millions of pages per day to handle edge cases.
  • The components are fully customizable and designed for building document processing agents and internal tooling.

Developers can drop these pre-built React components into their stacks to quickly build document processing agents, citation highlights, and user-facing intake flows.

SOURCES

9. Teleport Launches Cryptographic Identities for AI Agents

As AI agents increasingly interact with production infrastructure, traditional credential management poses severe security risks. Teleport's cryptographic identity system ensures that agents only hold the minimum necessary permissions for short durations, providing a complete audit trail of agent actions.

  • Teleport provides cryptographic identities specifically designed for AI agents to replace human-centric credentials.
  • The platform enables short-lived, least-privileged access to secure infrastructure.
  • It supports access control for databases, Kubernetes, and cloud environments with full auditability.
  • The solution eliminates the need for shared secrets and standing privileges.

Developers can secure their agentic workflows by eliminating standing privileges and shared secrets when agents access databases, Kubernetes, or cloud environments.

SOURCES

10. Claude Desktop on Windows 11 Spawns 1.8 GB Hyper-V VM on Launch

This resource leak affects developers who rely on Claude Desktop for local workflows. The persistent Hyper-V VM is spawned regardless of whether local agent execution is active, and the accumulation of thousands of stale session files can further impact system performance over time.

  • Claude Desktop on Windows 11 spawns a Hyper-V virtual machine (Vmmem) consuming 1.8 GB of RAM on launch.
  • The issue is triggered by the Hyper-V Host Compute Service via an RPC interface event on systems with VirtualMachinePlatform enabled.
  • Hyper-V Compute Admin logs show repeated invalid JSON document errors dating back to February 2026.
  • The application fails to clean up stale session files, accumulating thousands of files in the local-agent-mode-sessions directory.
  • Users can mitigate the issue by disabling VirtualMachinePlatform or manually terminating the vmwp and vmcompute processes.

Developers running Claude Desktop locally on Windows 11 may experience severe RAM degradation and accumulated stale session files unless they manually terminate the processes.

SOURCES

11. UC Berkeley Launches Agents’ Last Exam Benchmark for Long-Horizon Workflows

The Agents’ Last Exam (ALE) benchmark evaluates AI performance on long-horizon professional workflows across 55 industry sub-domains based on the U.S. federal occupational taxonomy. Operating through a Generalist Computer-Use Agent (GCUA) framework, models must navigate virtual machines and interact with desktop software. The benchmark features both 'Full' and 'Unlicensed' scoring tiers to separate tasks requiring proprietary software from those using free tools.

  • UC Berkeley's Center for Responsible, Decentralized Intelligence and 300 experts launched the Agents’ Last Exam (ALE) benchmark.
  • OpenAI's GPT-5.5 achieved the highest pass rate on the leaderboard at 24.0% using the Codex harness.
  • Anthropic's Claude Fable 5 ranked third with a 22.0% pass rate, while older models like Claude Opus 4.8 scored 0.0% on the hardest tier.
  • The benchmark uses a Generalist Computer-Use Agent (GCUA) framework requiring models to interact with virtual machines and desktop software.
  • To prevent contamination, only 10% of the 1,490 task instances are public, with the rest kept private and rotated.

Developers can use this benchmark to evaluate how effectively their agentic workflows and models navigate real-world virtual machines and desktop software.

SOURCES

12. Lemonade v10.7 Adds LMX-Omni Compatibility and CUDA Backends

Lemonade v10.7 significantly improves the local developer experience by expanding hardware acceleration and client compatibility. The addition of the 'lemonade bench' CLI tool also gives developers a standardized way to measure local LLM performance across multiple runtimes.

  • Lemonade version 10.7 introduces compatibility for LMX-Omni virtual models with Open WebUI and OpenAI clients.
  • The release adds CUDA backends for llama.cpp and stable-diffusion.cpp, plus Vulkan support for sd-cpp.
  • LMX-Omni virtual models are now GPU accelerated on AMD, Apple Silicon, Nvidia, and Intel systems.
  • A new 'lemonade bench' CLI tool collects LLM performance data across llama.cpp, FastFlowLM, and vLLM.
  • The open-source project is driven by six working groups, four of which are led by non-AMD employees.

Developers running local models can now leverage GPU acceleration for LMX-Omni models across AMD, Apple Silicon, Nvidia, and Intel hardware.

SOURCES

13. FlashMemory Technique Reduces DeepSeek-V4 KV Cache Footprint by 90%

FlashMemory-DeepSeek-V4 addresses the severe GPU memory bottlenecks associated with serving long-context LLMs. By dynamically predicting context needs and offloading non-critical KV cache chunks, the system preserves the backbone's core reasoning capabilities while improving downstream performance.

  • FlashMemory predicts which DeepSeek-V4 CSA KV-cache chunks future tokens will attend to, keeping only relevant chunks on-device.
  • The technique reduces the average physical KV cache footprint to 13.5% of the full-context baseline, saving over 90% of overhead at 500K context scales.
  • It utilizes Lookahead Sparse Attention (LSA) and a Neural Memory Indexer based on the DeepSeek-V4 architecture.
  • The indexer uses a backbone-free decoupled training strategy, allowing it to be trained independently without loading the full model.
  • Evaluations on LongBench-v2, LongMemEval, and RULER show a 0.6% average downstream accuracy improvement over the full-context baseline.

Developers running long-context models locally or on-premise can drastically reduce GPU memory bottlenecks, enabling ultra-long context scales up to 500K tokens.

SOURCES

Daily AI signal in your inbox

5 minutes a day. Free, unsubscribe anytime.

Daily AI signal in your inbox

5 minutes a day. Free, unsubscribe anytime.