Audesso | Daily: AI

Huawei Open-Sources KVarN for 3-5x KV-Cache Quantization in vLLM

00:00 / --:--

← Back to home

Huawei Open-Sources KVarN for 3-5x KV-Cache Quantization in vLLM

1. Huawei Open-Sources KVarN for 3-5x KV-Cache Quantization in vLLM

Huawei has open-sourced KVarN, a native attention backend for vLLM designed to optimize KV-cache quantization for long-context and agentic workloads. KVarN compresses the KV cache by 3-5x using a four-stage process (Hadamard rotation, iterative variance normalization, and asymmetric round-to-nearest quantization) without requiring model changes or calibration. It achieves up to 1.3x the throughput of FP16 and 2.4x the throughput of TurboQuant while maintaining FP16-level reasoning accuracy.

  • KVarN delivers 3-5x more KV-cache capacity and up to 1.3x the throughput of FP16.
  • It is implemented as a native vLLM attention backend requiring no model changes or calibration.
  • The default configuration (kvarn_k4v2_g128) uses 4-bit keys and 2-bit values.
  • It achieves up to 2.4x higher throughput than TurboQuant while maintaining FP16-level reasoning accuracy.
  • The software is built on vLLM v0.22.0 and released under the Apache 2.0 License.

Developers running long-context or agentic workloads on vLLM can significantly increase serving capacity and throughput without retraining or calibrating their models.

SOURCES

2. Stanford and Lambda Labs Release OpenJarvis Local Agent Framework

Researchers at Stanford University and Lambda Labs have launched OpenJarvis, an open-source, local-first framework for running on-device AI agents. The framework uses a declarative configuration object called a "spec" to decompose agent systems into five swappable primitives. By utilizing an LLM-guided spec search with a cloud teacher model during optimization, OpenJarvis enables local models to run with zero cloud calls during inference, achieving performance within 3.2 percentage points of top cloud models at 800x lower marginal API cost.

  • OpenJarvis is an open-source, local-first framework released under the Apache 2.0 license.
  • It decomposes AI systems into five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning.
  • The framework uses LLM-guided spec search with a cloud teacher model to optimize local specs, requiring zero cloud calls during inference.
  • It supports 11 local models across four families, including Qwen3.5, Gemma4, Nemotron, and Granite.
  • It matched or exceeded cloud model performance on benchmarks like ToolCall-15 and PinchBench.
  • It includes built-in support for over 25 data connectors and 32 messaging channels.

Developers can build highly capable local agents that perform within 3.2 percentage points of top cloud models while reducing API costs by 800x and latency by 4x.

SOURCES

3. Anthropic Details OS-Level Sandboxing and Security for Claude Code

Anthropic has detailed its security containment strategies for its agentic products, including Claude Code and Claude Cowork. To protect against user misuse, model misbehavior, and external attacks, Claude Code utilizes OS-level sandboxing (Seatbelt on macOS and bubblewrap on Linux) to isolate execution, while Claude Cowork runs within full virtual machines. Anthropic emphasizes that security defenses must prioritize containment at the environment layer, noting that internal red-teaming demonstrated risks such as malicious prompts exfiltrating AWS credentials.

  • Claude Code utilizes OS-level sandboxing (Seatbelt on macOS, bubblewrap on Linux), reducing permission prompts by 84%.
  • Claude Code's auto mode catches approximately 83% of overeager agent behaviors before execution.
  • Claude Cowork employs a full virtual machine architecture (Apple's Virtualization framework or Windows HCS) to isolate the agent.
  • Anthropic received reports of vulnerabilities in Claude Code where project-local configuration was parsed before establishing a trust boundary.
  • An internal red-team exercise demonstrated that an employee could be phished into launching Claude Code with a malicious prompt capable of exfiltrating AWS credentials.
  • Anthropic advises prioritizing containment at the environment layer before steering behavior at the model layer.

Developers building or using agentic coding tools can learn how to secure their environments against malicious prompts and unauthorized credential exfiltration.

SOURCES

4. Anthropic Open-Sources Reference Implementation for Autonomous Vulnerability Discovery

Anthropic has released a reference implementation for autonomous vulnerability discovery and remediation powered by Claude. The open-source pipeline is designed to scan repositories, triage issues, and suggest patches, specifically targeting C/C++ memory vulnerabilities using Docker and AddressSanitizer (ASAN). To ensure safety during execution, the pipeline isolates autonomous agents using gVisor sandboxing, and it supports Claude APIs across Bedrock, Vertex, and Azure.

  • The repository provides a reference implementation for autonomous vulnerability discovery and remediation using Claude.
  • The pipeline uses gVisor sandboxing to isolate autonomous agents during execution.
  • It is configured for finding C/C++ memory vulnerabilities using Docker and AddressSanitizer (ASAN).
  • The process consists of seven stages: Build, Recon, Find, Verify, Dedupe, Report, and Patch.
  • It supports Claude APIs including Bedrock, Vertex, and Azure.
  • The repository is not maintained and does not accept contributions.

Developers can deploy a structured, sandboxed pipeline to automatically scan, triage, and patch C/C++ memory vulnerabilities in their codebases.

SOURCES

5. Boxes.dev Launches Cloud-Only Agentic Dev Environments for Claude Code

Founders Nick and Drew have launched boxes.dev, a cloud-only agentic development environment designed to run Claude Code and Codex agents on dedicated remote compute. By executing agents on cloud snapshots of a developer's environment, the platform resolves local resource constraints and git worktree management issues. The service includes a desktop app, a mobile app, scheduled automations, and a Slack integration.

  • Boxes.dev provides dedicated cloud computers for running Codex and Claude Code agents.
  • The platform aims to solve local development limitations like git worktree management and resource constraints.
  • It allows users to run agents on remote compute using snapshots of their full development environment.
  • Features include a desktop app, a mobile app, scheduled automations, and a Slack integration.

Developers can offload resource-intensive coding agents from their local machines and avoid git worktree conflicts by running agents on remote compute snapshots.

SOURCES

6. Miso Labs Releases MisoTTS 8B Open-Weights Text-to-Speech Model

Miso Labs has released MisoTTS, an 8-billion-parameter open-weights text-to-speech model under a modified MIT license. The model utilizes a residual vector quantization (RVQ) architecture, combining a 7.7B backbone for temporal prediction and a 300M decoder for depth prediction. MisoTTS conditions on both text and audio context to match a speaker's tone, achieving a claimed latency of 110ms for half-duplex, single-turn interactions.

  • MisoTTS is an 8B-parameter open-weights text-to-speech model released under a modified MIT license.
  • The model uses a residual vector quantization (RVQ) architecture, consisting of a 7.7B backbone and a 300M decoder.
  • It conditions on both text and audio context to respond to the speaker's tone.
  • Miso Labs claims a latency of 110ms, compared to 300ms for Sesame and 700ms for ElevenLabs.
  • The model is currently limited to half-duplex, single-turn interactions.

Developers can self-host a highly responsive, emotive TTS model with a claimed latency of 110ms, significantly faster than commercial alternatives.

SOURCES

7. Gradio 6.16.0 Released with Security Patches and Configurable Heartbeats

Gradio version 6.16.0 has been released, introducing several security patches and feature updates. The release addresses a path traversal vulnerability in `gr.FileExplorer`, an open-redirect bypass in OAuth, and SSRF vulnerabilities in Image, Gallery, and Audio post-processing. Additionally, it introduces a configurable session heartbeat via the `GRADIO_HEARTBEAT_INTERVAL` environment variable and updates the MCP endpoint to display a landing page in the browser.

  • Gradio 6.16.0 introduces a configurable heartbeat feature via the `GRADIO_HEARTBEAT_INTERVAL` environment variable.
  • The MCP endpoint has been updated to display a landing page when visited via a browser.
  • Security patches address path traversal in `gr.FileExplorer`, an open-redirect bypass in OAuth, and SSRF in Image, Gallery, and Audio post-processing.
  • The release includes bug fixes for Dataframe and Tabs browser freezes.

Developers using Gradio should update immediately to patch path traversal, open-redirect, and SSRF vulnerabilities while gaining better session control.

SOURCES

8. NVIDIA Releases LocateAnything 3B Local Model for UI Understanding

NVIDIA has released LocateAnything 3B, a lightweight model designed to run locally for UI automation and screen understanding. The model combines grounding, OCR, and UI comprehension to instantly locate objects, buttons, or text based on verbal descriptions, enabling developers to build local, screen-aware agentic workflows.

  • NVIDIA released the LocateAnything 3B model designed to run locally.
  • The model combines grounding, OCR, and UI understanding.
  • It instantly locates objects, buttons, or text based on verbal descriptions.

Developers can integrate this lightweight local model to build screen-aware agents and voice-controlled UI automation tools without relying on cloud APIs.

SOURCES

9. NVIDIA Releases Agentic Safety Dataset for Indirect Prompt Injections

NVIDIA has released an agentic safety dataset on Hugging Face to help developers evaluate the security of tool-using agents. The dataset contains 1,272 synthetic red-teaming records spanning nine enterprise domains, specifically designed to test whether agents can resist indirect prompt injections embedded within tool-returned data.

  • NVIDIA released an agentic safety dataset on Hugging Face.
  • The dataset consists of 1,272 synthetic red-teaming records.
  • It covers nine distinct enterprise domains.
  • It is designed to test tool-using agents against indirect prompt injections hidden in tool-returned data.

Developers can use this dataset to evaluate and harden their tool-using agents against malicious payloads hidden in external data sources.

SOURCES

10. BeeLlama v0.3.1 Integrates Upstream llama.cpp and Speeds Up Local Inference

BeeLlama versions 0.3.0 and 0.3.1 have been released, bringing architectural updates that align with upstream llama.cpp. The update introduces support for Gemma 4 12B, multi-GPU DFlash configurations, q6_0 KV cache, and new quantization options. Benchmarks on a single RTX 3090 demonstrate speedups of up to 4.93x for Qwen 3.6 27B and Gemma 4 31B models compared to baseline performance.

  • BeeLlama v0.3.0 and v0.3.1 align with upstream llama.cpp and integrate MTP and Gemma 4 12B support.
  • DFlash has been improved to handle multi-slot and multi-GPU configurations.
  • The release provides prebuilt binaries and Docker images for all major platforms.
  • It adds support for q6_0 KV cache and TQ3_1S/TQ4_1S model quantization options.
  • Benchmarks on an RTX 3090 show DFlash achieving up to 4.93x speedups for Qwen 3.6 27B and Gemma 4 31B models.

Developers running local models can leverage prebuilt binaries and Docker images to accelerate inference for Qwen 3.6 and Gemma 4 models.

SOURCES

Daily AI signal in your inbox

5 minutes a day. Free, unsubscribe anytime.