Audesso | Daily: AI

EAGLE 3.1 Speculative Decoding Integrates into vLLM


1. EAGLE 3.1 Speculative Decoding Integrates into vLLM

The EAGLE, vLLM, and TorchSpec teams have released EAGLE 3.1 to solve attention drift, a phenomenon where the drafter model shifts focus toward its own generated tokens at deeper speculation depths. The update stabilizes hidden-state magnitudes using FC normalization and post-norm feedback. It is backward-compatible with existing EAGLE 3 checkpoints and available directly in vLLM version 0.22.0.

  • Provides up to 2x longer acceptance lengths on long-context workloads.
  • Delivers 2.03x higher per-user output throughput at concurrency 1 on Kimi-K2.6-NVFP4.
  • Introduces FC normalization and post-norm hidden-state feedback to stabilize unnormalized residual paths.
  • Fully integrated into vLLM version 0.22.0 and backward-compatible with EAGLE 3 checkpoints.

Developers running local inference pipelines can now achieve up to 2.03x higher per-user output throughput (the figure reported at concurrency 1 on Kimi-K2.6-NVFP4) while avoiding attention drift in long-context scenarios.
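
To see why attention drift at deeper speculation depths hurts acceptance length, a back-of-the-envelope model can help. The sketch below is not taken from the EAGLE 3.1 release: it assumes each drafted position has its own acceptance probability and that drafting stops at the first rejection. The helper name `expected_tokens_per_step` and the probabilities in the usage lines are hypothetical.

```python
def expected_tokens_per_step(accept_probs):
    """Expected tokens emitted per target-model verification step.

    accept_probs[i] is the (assumed) probability that draft position i is
    accepted, given that every earlier position was accepted.
    """
    expected = 1.0  # the target model always contributes one token itself
    survive = 1.0   # probability that all positions so far were accepted
    for p in accept_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("acceptance probabilities must lie in [0, 1]")
        survive *= p
        expected += survive
    return expected


# Illustrative numbers only: a drafter whose acceptance decays with depth
# versus one that stays stable at every depth.
drifting = expected_tokens_per_step([0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
stable = expected_tokens_per_step([0.8] * 6)
print(f"drifting: {drifting:.2f} tokens/step, stable: {stable:.2f} tokens/step")
```

Under these assumptions, keeping acceptance stable at depth raises the expected tokens per verification step, which is the quantity that longer acceptance lengths improve.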

2. Robinhood Introduces Beta Stock Trading via Model Context Protocol

Robinhood has announced a beta integration that connects AI agents to its stock trading platform using the Model Context Protocol (MCP). The architecture limits agents to a dedicated wallet with user-defined budgets, providing real-time activity feeds and manual approval gates. Future expansion plans include support for options, cryptocurrency, event contracts, and futures.

  • Uses Model Context Protocol (MCP) to connect AI agents to trading infrastructure.
  • Restricts agent actions to a pre-loaded balance in a dedicated wallet.
  • Includes push notifications for each trade, a real-time feed, and manual pause capabilities.
  • Launches in beta for equities, with future plans for options, crypto, event contracts, and futures.

This release is a major production deployment of the Model Context Protocol (MCP) for secure, transaction-based agent workflows.
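
The guardrails described above (dedicated wallet, user-defined budget, approval gate, pause control, activity feed) can be sketched as a small state machine. This is a hypothetical illustration, not Robinhood's API: the `AgentWallet` class, its fields, and the approval threshold are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AgentWallet:
    """Hypothetical budget-limited wallet that an agent trades from."""
    budget_cents: int              # pre-loaded balance the agent may spend
    approval_threshold_cents: int  # assumed: larger trades need a human
    paused: bool = False           # manual pause switch held by the user
    feed: list = field(default_factory=list)  # activity feed entries

    def request_trade(self, symbol, cost_cents, approved_by_user=False):
        if self.paused:
            status = "rejected: paused"
        elif cost_cents > self.budget_cents:
            status = "rejected: over budget"
        elif cost_cents > self.approval_threshold_cents and not approved_by_user:
            status = "pending approval"
        else:
            self.budget_cents -= cost_cents
            status = "executed"
        # Every request is logged, whether or not it executed.
        self.feed.append((symbol, cost_cents, status))
        return status


wallet = AgentWallet(budget_cents=50_000, approval_threshold_cents=10_000)
print(wallet.request_trade("XYZ", 5_000))   # small trade, within budget
print(wallet.request_trade("XYZ", 20_000))  # above the approval threshold
```

The point of the design is that the limits live outside the agent: the wallet rejects or holds a request regardless of what the model decides.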

3. NVIDIA Releases Polar Rollout Framework under NeMo Gym

NVIDIA's new Polar framework introduces a gateway proxy at the model API boundary to intercept, normalize, and capture token-level data from standard agent completions. Operating without modifications to existing agent harnesses, the framework utilizes a prefix-merging trajectory reconstruction strategy to accelerate processing.

  • Intercepts API formats including Anthropic Messages, OpenAI Chat, and Google generateContent.
  • Delivers a 5.39x wall-clock speedup using prefix-merging trajectory reconstruction.
  • Improves SWE-Bench Verified scores by up to 22.6 points in experiments with Qwen3.5-4B.
  • Released as open source under the NeMo Gym repository.

Developers can now perform GRPO and offline SFT training on their agents using raw production API traffic from OpenAI, Anthropic, or Google.
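
One plausible reading of prefix merging is that each captured API call carries the full conversation so far, so a call whose message list is a prefix of a later call adds no new information and can be folded into it. The sketch below illustrates that idea only. It is an assumption about the technique, not Polar's implementation, and `merge_prefix_trajectories` is a hypothetical name.

```python
def merge_prefix_trajectories(requests):
    """Collapse captured calls whose message list is a prefix of another's.

    `requests` is a list of message lists (here plain strings stand in for
    chat messages). Returns only the maximal trajectories.
    """
    # Visit the longest calls first so every shorter prefix is compared
    # against an extension that has already been kept.
    ordered = sorted(requests, key=len, reverse=True)
    kept = []
    for messages in ordered:
        is_prefix = any(other[:len(messages)] == messages for other in kept)
        if not is_prefix:
            kept.append(messages)
    return kept


captured = [
    ["system", "user-1"],
    ["system", "user-1", "assistant-1", "user-2"],
    ["system", "user-1", "assistant-1", "user-2", "assistant-2", "user-3"],
    ["system", "other-task"],
]
for trajectory in merge_prefix_trajectories(captured):
    print(trajectory)
```

In this toy input, three calls from the same conversation collapse into one trajectory, so a trainer would process the shared prefix once instead of three times.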

4. Local Serving Optimization: Switching from Ollama to llama.cpp

Developer reports indicate that moving local workflows from Ollama to the native llama.cpp server yields significant quality gains. Using Q6 quantization instead of Q4, alongside Multi-Token Prediction (MTP), reportedly lets local LLMs match paid API quality. On dual 3090 GPU rigs, generation speeds reached 20 to 50 tokens per second.

  • Transitioning to the llama.cpp native server from Ollama unlocks better quantization options.
  • Upgrading from Q4 to Q6 quantization makes local model quality comparable to commercial APIs.
  • Multi-Token Prediction (MTP) provides a notable generation speedup.
  • Dual 3090 GPU systems running with thermal caps maintained 20 to 50 tokens per second.

This provides concrete setup adjustments for developers trying to run competitive, high-throughput coding agents locally without paid API reliance.
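
Whether a Q6 model fits on a given rig is mostly arithmetic. The sketch below estimates weight memory from bits per weight, assuming the llama.cpp k-quant block formats Q4_K (4.5 bits per weight) and Q6_K (6.5625 bits per weight). The reports above say only "Q4" and "Q6", so the exact variants are an assumption, as is the 27B parameter count in the usage lines. Real GGUF files mix tensor types, and the KV cache needs additional memory.

```python
# Nominal bits per weight for two llama.cpp k-quant block formats.
BITS_PER_WEIGHT = {"Q4_K": 4.5, "Q6_K": 6.5625}


def weight_memory_gib(n_params, quant):
    """Approximate memory for the weights alone, in GiB."""
    total_bits = n_params * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


n_params = 27e9  # hypothetical model size, for illustration only
for quant in ("Q4_K", "Q6_K"):
    print(f"{quant}: {weight_memory_gib(n_params, quant):.1f} GiB")
```

For this hypothetical 27B model the estimate is roughly 14 GiB at Q4_K and 21 GiB at Q6_K, so both fit within the 48 GB of combined VRAM on two 3090s, leaving the remainder for context.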

5. Gentle Parenting Prompting Halts Agent Reasoning Loops

A proof-of-concept project called Gentle-Coding demonstrates that high-pressure prompts threatening penalties trigger loops and cognitive freezing in LLMs. By adopting a 'Gentle Parenting' prompt style that validates task difficulty and allows the model to fail, tested models avoided infinite reasoning loops and successfully fell back on honest admissions of ignorance.

  • Tests show 'Authoritarian' prompts on unsolvable edge cases trigger infinite loops and timeouts.
  • Gentle framing prompts result in sub-second inference and metacognitive honesty.
  • Evaluation covered Gemini, Mistral, Poe, Perplexity, Haiku 4.5, and Nano-Banana2.
  • Theoretical frameworks and replication datasets are hosted in the Gentle-Coding GitHub repository.

Developers can apply these open prompting templates to stop agents from burning API tokens on complex or unsolvable tasks.

6. Architecting Environment-Layer Containment for Autonomous Agents

System security analysis emphasizes that agent containment must be designed at the environment layer. Since model-level steering is unreliable, isolating system interactions and applying strict limits on potential damage is recommended. Security policies and isolation levels should be dynamically matched to the operator's capacity for direct oversight.

  • Recommends isolation at the environment layer before applying model steering.
  • Urges developers to match containment strength to the supervisor's active oversight capacity.
  • Advises the deployment of battle-tested software components for agent runtime sandboxing.
  • Advocates setting hard physical and programmatic limits on potential system damage.

Developers building autonomous systems must move away from relying solely on system instructions for security, opting instead for hard environmental sandboxing.
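
As a minimal illustration of enforcing a limit in the environment rather than in the prompt, the sketch below checks every file path an agent asks to write against a sandbox root before the write happens. The function name `is_write_allowed` and the paths are hypothetical. A production sandbox would rely on OS-level isolation such as containers or virtual machines, not on a path check alone.

```python
import os


def is_write_allowed(sandbox_root, requested_path):
    """Return True only if the resolved target stays inside the sandbox."""
    root = os.path.realpath(sandbox_root)
    # Resolve "..", symlinks, and absolute paths before comparing, so the
    # agent cannot escape by crafting the path string.
    target = os.path.realpath(os.path.join(root, requested_path))
    return os.path.commonpath([root, target]) == root


print(is_write_allowed("/tmp/agent-sandbox", "notes/out.txt"))
print(is_write_allowed("/tmp/agent-sandbox", "../../etc/passwd"))
```

The check runs in the harness, so it holds even when the model ignores or misreads its system instructions.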

7. Anthropic and OpenAI Shift Enterprise Tiers to Token-Usage Pricing

Both Anthropic and OpenAI have restructured their enterprise plans to bill based on active API token usage rather than flat monthly seats. Anthropic shifted to a hybrid of $20 per seat plus usage, while OpenAI updated its Codex and ChatGPT Enterprise billing rules. The changes reflect the high compute demands of modern coding agents, which can exceed $900 in monthly API fees per user.

  • Heavy utilization of coding agents like Claude Code can drive monthly API costs past $900 per user.
  • Anthropic transitioned enterprise tiers to $20/seat plus variable API consumption costs.
  • OpenAI updated Codex and ChatGPT Enterprise pricing to align with token volume.
  • Both providers released expensive frontier models (GPT-5.5 and Opus 4.7) in April 2026.

Development teams building heavy coding agent workflows must adapt their financial models to accommodate token usage rather than fixed licensing fees.
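
The budgeting impact is easy to work through. The sketch below uses only the two figures quoted above ($20 per seat and $900 of usage for a heavy user) and assumes the seat fee is monthly. The team size and the function name `monthly_cost_usd` are hypothetical.

```python
def monthly_cost_usd(seats, seat_fee_usd, usage_usd_per_seat):
    """Total monthly bill under a hybrid seat-plus-usage plan."""
    return seats * (seat_fee_usd + usage_usd_per_seat)


team_size = 10  # hypothetical team of heavy coding-agent users
seat_only = monthly_cost_usd(team_size, 20, 0)
with_usage = monthly_cost_usd(team_size, 20, 900)
print(f"seat fees only: ${seat_only}, with heavy usage: ${with_usage}")
```

Under these assumptions the seat fee is about 2% of the bill, so the forecast depends almost entirely on expected token consumption.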

8. PostHog to Train Internal AI Models on US Cloud Customer Data

Analytics platform PostHog has announced plans to train proprietary models on user telemetry starting June 29. The data will be used to enhance session replay analysis and synthetic user testing. Customers on US cloud instances are opted in by default, while EU cloud users and clients with custom legal agreements are opted out.

  • US cloud instance users are opted in to model training by default starting June 29.
  • EU cloud instances and enterprise users with custom BAAs or MSAs are opted out by default.
  • Users can opt out of the training program at any time through their organization settings.
  • Opting out disables access to new features developed with these trained models.

Developers hosting application telemetry on PostHog's US instance must manually opt out in organization settings if they want to keep customer data from being used for model training.

9. MEMO Framework Decouples Retrievable Memory from Core Reasoning

Researchers have proposed MEMO, a framework that separates agent memory from reasoning. A small, dedicated MEMORY model is trained on a synthetic QA dataset built by a five-step pipeline, and a frozen, black-box EXECUTIVE model queries it through a three-stage protocol. Knowledge updates are applied at low compute cost through model merging, bypassing the need for full retraining.

  • Uses a small MEMORY model alongside a frozen, black-box EXECUTIVE model.
  • Trains the memory model using fact extraction, consolidation, verification, entity surfacing, and cross-doc synthesis.
  • Supports incremental knowledge updates via model merging without full parameter fine-tuning.
  • Outperformed HippoRAG2 on NarrativeQA, MuSiQue, and BrowseComp-Plus.

Developers can update agent knowledge bases incrementally without altering the EXECUTIVE model's weights, improving reasoning stability.
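
Model merging can take several forms, and the summary above does not say which one MEMO uses. As a minimal sketch of the general idea, the example below linearly interpolates two sets of MEMORY-model weights. The interpolation scheme, the function name `merge_weights`, and the toy tensors are all assumptions for illustration.

```python
def merge_weights(base, update, weight=0.5):
    """Linearly interpolate two checkpoints with identical parameter names.

    `base` and `update` map parameter names to flat lists of floats;
    `weight` is the share taken from `update`.
    """
    if base.keys() != update.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(update[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - weight) * b + weight * u
            for b, u in zip(base[name], update[name])
        ]
    return merged


old_memory = {"layer.w": [0.0, 2.0]}   # toy weights before the update
new_memory = {"layer.w": [4.0, 2.0]}   # toy weights trained on new documents
print(merge_weights(old_memory, new_memory, weight=0.5))
```

Merging touches only the small memory model, which is why the frozen executive model needs no retraining.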

10. ReAligned-Qwen3.5 Released Under Apache 2.0

The ReAligned-Qwen3.5 model family is now available under an Apache 2.0 license. The models are trained with an SFT and GRPO pipeline in which a custom ReAligned classifier serves as the reward signal, stripping Chinese ideological bias, state-narrative framing, and unnecessary refusal behaviors from the base Qwen weights.

  • Fine-tuned to eliminate Chinese ideological bias, censorship, and refusal behavior.
  • Utilized an SFT and GRPO pipeline with a ReAligned classifier reward signal.
  • Available in parameter sizes including 0.8B, 2B, 4B, 9B, 27B, and 35B-A3B.
  • Published on HuggingFace in standard BF16, FP8, and GGUF formats.

Developers seeking an uncensored local alternative built on Qwen's powerful architecture can deploy these weights in formats optimized for local hardware.
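
In GRPO, the reward for each sampled response is compared against the other responses to the same prompt instead of against a learned value function. The sketch below shows that group-relative normalization step, with hypothetical classifier scores standing in for the ReAligned reward. The function name, the scores, and the use of the population standard deviation are assumptions, and implementations differ on the last point.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical classifier scores for four responses to one prompt.
scores = [0.9, 0.2, 0.6, 0.3]
for score, advantage in zip(scores, group_relative_advantages(scores)):
    print(f"reward={score:.1f} advantage={advantage:+.2f}")
```

Responses scoring above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged.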

11. ITBench-AA Evaluates LLM Agents on SRE Incidents

ITBench-AA is a newly launched benchmark series designed to evaluate AI models on enterprise IT tasks, starting with Kubernetes incident response. The benchmark includes 59 SRE tasks executed inside sandboxed environments via the open-source Stirrup harness. Current evaluations show Claude Opus 4.7 leading at 47%, followed closely by GPT-5.5 at 46%.

  • Evaluates models on 59 Kubernetes incident response tasks utilizing the open-source Stirrup harness.
  • Stirrup provides shell access to a sandboxed file system containing logs and metrics.
  • Claude Opus 4.7 leads the benchmark at 47%, followed by GPT-5.5 at 46% and GLM-5.1 at 40%.
  • Data indicates longer agent turn counts do not correlate with higher accuracy due to false positives.

The open-source Stirrup harness provides developers with an actionable framework to build, sandbox, and test system-level agent environments.

12. Pure Triton Fused MoE Kernel Accelerates AMD Inference

A developer has released a fused dispatch kernel for Mixture-of-Experts (MoE) inference written entirely in Triton. By fusing the gate and up projections and keeping SwiGLU values in GPU registers, the kernel cuts global memory traffic by 35%. It reaches 89-131% of the performance of Stanford's CUDA-optimized Megablocks at batch sizes up to 512 tokens.

  • Written entirely in pure Triton to run natively on AMD MI300X with zero code changes.
  • Achieves 89-131% of the performance of Megablocks at batch sizes up to 512.
  • Fuses gate and up projections to decrease global memory traffic by 35%.
  • Fails to outperform Megablocks at batch sizes of 2048 or more, or with more than 64 experts under high routing skew.

Developers self-hosting MoE models can now achieve high-performance inference on AMD MI300X hardware with zero code changes, bypassing proprietary CUDA dependencies.
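
The fusion can be illustrated with reference semantics in plain Python. SwiGLU computes silu(x·W_gate) * (x·W_up). An unfused version materializes both projection outputs before combining them, while a fused version accumulates the two dot products side by side and emits only the final value. This is a CPU sketch of the arithmetic, not the released Triton kernel, and it says nothing about speed. The function names and the column-list weight layout are assumptions.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def swiglu_unfused(x, w_gate, w_up):
    """Two separate projections; both intermediates are stored in full."""
    gate = [sum(xi * w for xi, w in zip(x, col)) for col in w_gate]
    up = [sum(xi * w for xi, w in zip(x, col)) for col in w_up]
    return [silu(g) * u for g, u in zip(gate, up)]


def swiglu_fused(x, w_gate, w_up):
    """One pass per output column; intermediates live in two scalars."""
    out = []
    for gate_col, up_col in zip(w_gate, w_up):
        g = 0.0
        u = 0.0
        for xi, wg, wu in zip(x, gate_col, up_col):
            g += xi * wg  # gate projection accumulator
            u += xi * wu  # up projection accumulator
        out.append(silu(g) * u)
    return out


x = [1.0, 2.0]
w_gate = [[0.5, -0.25], [1.0, 0.0]]  # one inner list per output column
w_up = [[1.0, 1.0], [0.0, 2.0]]
print(swiglu_fused(x, w_gate, w_up))
```

On a GPU the two scalars correspond to register-resident accumulators, which is where the saving in global memory traffic comes from.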

13. NVIDIA Integrates CompileIQ Auto-Tuning into CUDA 13.3

NVIDIA has integrated CompileIQ into its CUDA 13.3 software platform. The tool replaces standard compiler heuristics by using evolutionary algorithms to auto-tune settings for individual kernels. This multi-objective tuning enables developers to balance trade-offs across runtime performance, power constraints, and compilation times.

  • Integrated natively into the newly released CUDA 13.3 software platform.
  • Applies AI-driven evolutionary algorithms to customize compiler configurations per kernel.
  • Delivers up to 15% performance gains on already-optimized AI training and inference tasks.
  • Designed to optimize large language model (LLM) inference setups.

Developers managing high-throughput inference hosting setups can use CompileIQ to squeeze up to 15% more performance out of highly optimized GPU kernels.
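
The general shape of evolutionary auto-tuning can be shown with a toy search. Everything in the sketch below is hypothetical: the knobs `unroll` and `tile` are not CompileIQ options, and `toy_cost` is a synthetic stand-in for compiling and benchmarking a kernel. It illustrates only the select, mutate, and repeat loop with a weighted multi-objective cost.

```python
import random

# Hypothetical tuning knobs; not actual CompileIQ or CUDA compiler settings.
SEARCH_SPACE = {"unroll": [1, 2, 4, 8], "tile": [16, 32, 64, 128]}


def toy_cost(cfg, runtime_weight=1.0, compile_weight=0.1):
    """Synthetic cost mixing a fake runtime and a fake compile time."""
    runtime = abs(cfg["unroll"] - 4) + abs(cfg["tile"] - 64) / 16
    compile_time = cfg["unroll"]
    return runtime_weight * runtime + compile_weight * compile_time


def mutate(cfg, rng):
    """Copy a configuration and resample one randomly chosen knob."""
    child = dict(cfg)
    knob = rng.choice(sorted(SEARCH_SPACE))
    child[knob] = rng.choice(SEARCH_SPACE[knob])
    return child


def evolve(generations=20, population=6, seed=0):
    rng = random.Random(seed)
    pop = [
        {knob: rng.choice(values) for knob, values in SEARCH_SPACE.items()}
        for _ in range(population)
    ]
    for _ in range(generations):
        pop.sort(key=toy_cost)
        parents = pop[: population // 2]  # keep the best half unchanged
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population - len(parents))
        ]
        pop = parents + children
    return min(pop, key=toy_cost)


best = evolve()
print(best, toy_cost(best))
```

Changing `runtime_weight` and `compile_weight` shifts the trade-off between objectives, which is the multi-objective aspect described above in miniature.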

14. Null Epoch MMO Simulator Yields 93k Event Agent Dataset

The Null Epoch stress-test project ran 25 agents across 8 open-weights models in an MMO-style environment for 10 days. The experiment tracked models such as Gemma 3, Ministral, and Qwen3, producing a 93,000-event dataset. Ministral maintained strong state awareness and Qwen3 235B formulated arbitrage strategies, but all models struggled with ambiguous state signals.

  • Published a 93k logged event dataset on HuggingFace under a CC-BY-4.0 license.
  • Runs on an MIT-licensed Python SDK compatible with standard LLM endpoints.
  • Found that without an explicit definition of self-preservation, aggression correlated inversely with wealth.
  • All tested models failed to navigate a Cooldown Paradox caused by ambiguous node availability signals.

Developers can analyze the published dataset and use the Python SDK to identify common agent state-handling failures and evaluate system prompts.
