Audesso | Daily: AI

Model Context Protocol Release Candidate Introduces Stateless HTTP Core


1. Model Context Protocol Release Candidate Introduces Stateless HTTP Core

This release candidate represents the largest revision of the Model Context Protocol (MCP) since its initial launch. By redesigning the core protocol to be stateless, it simplifies deployment across cloud and HTTP-based serverless environments, making it easier to scale agent interactions. Developers should review the new authorization specifications and prepare for the breaking changes before the stable release.

  • Features a stateless core tailored for HTTP infrastructure.
  • Adds official support for extensions and implements OAuth/OpenID Connect-aligned authorization.
  • Introduces breaking changes and a new formal deprecation policy.
  • Final version of the specification is scheduled for release on July 28.

The move to a stateless HTTP core with OAuth/OpenID Connect-aligned authorization introduces breaking changes, so maintainers of custom MCP servers should plan their updates before the stable release.


2. Authentication Providers Launch Managed OAuth Security for MCP Servers

As task-specific AI agents become increasingly integrated into enterprise applications, securing their tool calls has become a priority. To address this, the industry is standardizing on OAuth 2.1 with PKCE for protected HTTP-based MCP deployments. Leading identity and integration providers have launched native tools—ranging from WorkOS’s enterprise-ready SSO integrations to Arcade's identity-based permission runtimes—allowing developers to implement secure, policy-compliant authentication for their agent fleets.

  • The Model Context Protocol (MCP) reached 97 million monthly combined Python and TypeScript downloads by late 2025.
  • MCP HTTP-based deployments require OAuth 2.1 with PKCE, HTTPS, and Protected Resource Metadata (RFC 9728).
  • WorkOS provides MCP-compatible OAuth integrated with SSO, SCIM, and Fine-Grained Authorization (FGA).
  • Auth0 by Okta made its 'Auth for MCP' generally available on May 6, 2026.
  • Other platforms like Stytch, Arcade, and Cloudflare's Agents SDK offer specialized edge-native and policy-enforced MCP support.

Securing agentic tool calls and MCP servers requires implementing standardized OAuth 2.1 authentication, which is now natively supported across major identity providers.
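To make the PKCE requirement above concrete, here is a minimal sketch of the client-side step defined in RFC 7636: generating a code verifier and its S256 code challenge. The function names are illustrative and not taken from any of the providers listed; the rest of the authorization flow (redirects, token exchange, metadata discovery) is omitted.

```python
import base64
import hashlib
import secrets


def _b64url(data: bytes) -> str:
    # Base64url without padding, as RFC 7636 requires.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def challenge_for(verifier: str) -> str:
    # S256 method: challenge = BASE64URL(SHA256(ASCII(verifier)))
    return _b64url(hashlib.sha256(verifier.encode("ascii")).digest())


def make_pkce_pair(random_bytes: int = 32) -> tuple[str, str]:
    # 32 random bytes encode to a 43-character verifier,
    # the minimum length RFC 7636 allows (43 to 128 characters).
    verifier = _b64url(secrets.token_bytes(random_bytes))
    return verifier, challenge_for(verifier)


if __name__ == "__main__":
    verifier, challenge = make_pkce_pair()
    print("code_verifier: ", verifier)
    print("code_challenge:", challenge)
```

The client sends the challenge with the authorization request and the verifier with the token request, so an intercepted authorization code is of no use without the verifier.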


3. WorkOS Releases Open auth.md Protocol for Agent Registration

The new auth.md protocol simplifies how autonomous agents and services discover and trust each other. By hosting a simple markdown file at a domain, services can publish supported registration flows, scopes, and credential management rules. This allows agents to register programmatically and receive credentials synchronously using existing OAuth standards.

  • Standardizes agent registration using a Markdown file hosted at the service's domain.
  • Built on top of existing OAuth standards and is entirely infrastructure-agnostic.
  • Features an 'Agent verified' flow utilizing ID-JAG for zero-human synchronous credential issuance.
  • Supports a 'User claimed' flow utilizing one-time passwords (OTP) to bind registrations to users.

Allows developers to expose standard registration endpoints for incoming AI agents without relying on proprietary authentication infrastructure.


4. Together AI Releases OSCAR for 2-bit KV Cache Quantization

Serving long-context models is often bottlenecked by the memory the KV cache requires. OSCAR (Offline Spectral Covariance-Aware Rotation) addresses this with attention-aware rotation matrices that steer quantization noise away from sensitive directions. By pairing INT2 compression of the history with a small BF16 buffer for recent and sink tokens, developers can extend context limits without large accuracy drops or extra hardware.

  • Achieves up to an 8x reduction in KV cache memory and up to a 3x increase in decode throughput at 100K context length.
  • Uses a mixed-precision layout: first 64 sink tokens and last 256 tokens in BF16, history tokens compressed to 2-bit INT2.
  • Maintains near-BF16 accuracy on models like Qwen3-32B and GLM-4.7-FP8.
  • Fully integrated with SGLang, supporting paged attention and prefix caching.
  • Pre-computed rotation matrices and clip thresholds are available in the RotationZoo repository.

Reduces the KV cache memory needed to run long-context LLMs locally or on dedicated endpoints by up to 8x, with minimal loss in reasoning accuracy.
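The "up to 8x" figure can be sanity-checked with back-of-the-envelope arithmetic on the mixed-precision layout described above. This is a rough sketch, not Together AI's accounting: it counts only bits per cached element and assumes that quantization metadata (scales, clip thresholds, rotation matrices) is negligible, so real savings will be somewhat lower.

```python
def kv_cache_compression_ratio(
    context_len: int,
    sink_tokens: int = 64,     # first tokens kept in BF16 (from the item above)
    recent_tokens: int = 256,  # last tokens kept in BF16 (from the item above)
    high_bits: int = 16,       # BF16
    low_bits: int = 2,         # INT2
) -> float:
    # Tokens kept at full precision; short contexts are not compressed at all.
    high = min(context_len, sink_tokens + recent_tokens)
    low = context_len - high
    baseline_bits = context_len * high_bits
    mixed_bits = high * high_bits + low * low_bits
    return baseline_bits / mixed_bits


if __name__ == "__main__":
    print(round(kv_cache_compression_ratio(100_000), 2))  # 7.82
```

At 100K tokens the 320 BF16 tokens are a small fraction of the cache, so the ratio approaches the 16-bit to 2-bit limit of 8x. At shorter contexts the BF16 buffer weighs more and the ratio drops.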


5. NuExtract3: Open-Weight 4B VLM for Structured Document Extraction

As the successor to the NuMarkdown model, NuExtract3 specializes in turning unstructured visual documents into clean, structured Markdown or data formats. Its low memory requirement makes it highly appealing for cost-conscious developers who want to run dedicated, self-hosted document processing pipelines locally or in serverless environments.

  • Released under an Apache-2.0 license and based on Qwen3.5-4B.
  • Designed for structured extraction from PDFs, screenshots, forms, tables, and invoices.
  • Requires as little as 4GB of VRAM to run.
  • Compatible with Safetensors, GGUF, and MLX weights.
  • Tested and compatible with vLLM, SGLang, and llama.cpp.

Provides a highly efficient, self-hostable alternative to commercial APIs for high-accuracy document parsing and OCR tasks.


6. Clerk Releases Open-Source CLI for Headless Auth in Agents

By shifting authentication management into a scriptable command-line interface, Clerk removes the need to log into a browser dashboard to manage tenant access. Because the CLI is open source and designed with agents in mind, it provides a clean pathway for developers to give their automated processes secure, granular control over identity boundaries.

  • Includes 'clerk init' for scaffolding, 'clerk config' for code settings, and 'clerk api' for headless operations.
  • Allows programmatically fetching users, organizations, and sessions.
  • Open source and optimized for integration into agentic harnesses.

Enables automated agents to execute identity management tasks programmatically without manual dashboard intervention.


7. Reasonix: Terminal-Based DeepSeek Coding Agent

Reasonix targets developers who prefer keeping their coding loops within the terminal. By optimizing agent interactions around DeepSeek's native prefix-caching behavior, the tool significantly reduces the recurrent prompt-processing fees typically associated with context-heavy, multi-turn programming tasks.

  • Engineered as a DeepSeek-native coding agent designed specifically for terminal environments.
  • Built around prefix-cache stability to sustain long-running developer sessions.
  • Optimized to minimize token costs during extended code editing.

Enables developers to run long, interactive terminal coding sessions at a low token cost by leveraging stable caching.


8. llama.cpp PR Optimizes Prompt Reprocessing for Agentic Coding

Interactive coding tools often rewrite past messages or modify prompt histories, which conventionally forces llama.cpp to waste cycles reprocessing tens of thousands of tokens. The pull request limits reprocessing to the altered sections of the prompt, which shortens wait times during agentic sessions. Developers running local workflows should also note that retaining model-generated 'thinking' tags helps maintain context-cache alignment.

  • Addresses the issue where agent tools like 'opencode' rewrite context, forcing re-processing of up to 70k tokens.
  • Ensures llama.cpp only reprocesses altered sections of the prompt context.
  • Notes that models stripping thinking/reasoning tags can also trigger full prompt re-processing.
  • Recommends enabling 'preserve thinking' (such as in Qwen 3.6) to avoid reasoning context losses.

Improves the interactive latency of local coding assistants that frequently rewrite conversational history or strip reasoning tags.
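To see why a rewritten history is so costly, consider plain prefix reuse, where cached work is valid only up to the first token that differs. The sketch below models that baseline behavior with hypothetical helper names; it is not the mechanism implemented in the pull request, which the summary above describes only at a high level.

```python
def common_prefix_len(cached: list[int], new: list[int]) -> int:
    # Count leading tokens that are identical in both sequences.
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_reprocess(cached: list[int], new: list[int]) -> int:
    # Under plain prefix reuse, everything after the first mismatch is recomputed.
    return len(new) - common_prefix_len(cached, new)


if __name__ == "__main__":
    cached = list(range(70_000))           # stand-in for a 70k-token prompt
    appended = cached + [1, 2, 3]          # normal turn: tokens added at the end
    rewritten = list(appended)
    rewritten[100] = -1                    # tool edits one early token

    print(tokens_to_reprocess(cached, appended))   # 3
    print(tokens_to_reprocess(cached, rewritten))  # 69903
```

A single early edit invalidates almost the whole cache in this model, which matches the symptom the bullets describe for tools that rewrite context or strip reasoning tags.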


9. llama.cpp CUDA Update Implements Fast Walsh-Hadamard Transform

Quantizing the KV cache is a popular way to fit long-context models onto consumer GPUs, but it can introduce computational overhead. This pull request targets that bottleneck directly on CUDA devices. The integration of the Fast Walsh-Hadamard Transform ensures that key-value quantization operations run faster, resulting in snappier local text generation.

  • Implements Fast Walsh-Hadamard Transform (FWHT) for CUDA-based KV-cache quantization.
  • Provides a 1-2% performance boost for prompt processing and a 7-9% boost for token generation.
  • Tested on NVIDIA RTX 5090 using gemma4 26B with 8-bit quantized keys and values (-ctk q8_0 -ctv q8_0).

Developers running local inference with a quantized KV cache on NVIDIA GPUs can expect token-generation gains of roughly 7-9% and prompt-processing gains of 1-2%, based on the reported RTX 5090 test.
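For readers unfamiliar with the transform named above, here is a plain Python reference of the unnormalized Fast Walsh-Hadamard Transform. It shows only the butterfly structure that makes the transform run in O(n log n); it is not the CUDA kernel from the pull request, and how llama.cpp applies the transform within KV-cache quantization is not shown here.

```python
def fwht(values: list[float]) -> list[float]:
    # Unnormalized transform; the input length must be a power of two.
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine pairs that are h apart: (a, b) -> (a + b, a - b).
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = [1, 0, 1, 0]
    y = fwht(x)
    print(y)                          # [2, 2, 0, 0]
    # Applying the transform twice returns the input scaled by n.
    print([v // len(x) for v in fwht(y)])  # [1, 0, 1, 0]
```

Because the transform uses only additions and subtractions and is its own inverse up to a factor of n, it maps well onto GPU hardware.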


10. OpenAI Launches Macro-Evaluation Workflow for Multi-Agent Systems

Debugging complex agent setups manually is notoriously difficult due to the non-deterministic nature of multi-step reasoning. OpenAI's new macro-evaluation approach solves this by aggregating execution metrics over high volumes of runs. Developers can now identify recurring failure paths, architectural bottlenecks, and systemic issues across their entire agent fleet rather than chasing individual edge-case bugs.

  • Focuses on analyzing macro patterns across entire populations of traces.
  • Moves away from evaluating isolated, individual agent failures.
  • Introduced by OpenAI to improve the predictability of multi-agent deployments.

Shifts agent evaluation away from brittle manual checks of individual failures to aggregate population-level analysis of execution traces.
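The population-level idea can be illustrated with a small aggregation over many traces. This is a hypothetical sketch: the trace shape (an "ok" flag plus a list of step names) and the function name are assumptions made for illustration and do not reflect OpenAI's tooling or data format.

```python
from collections import Counter


def failure_path_counts(traces: list[dict]) -> list[tuple[tuple[str, ...], int]]:
    # Group failed runs by their full step sequence and rank by frequency.
    counts: Counter = Counter()
    for trace in traces:
        if trace["ok"]:
            continue
        counts[tuple(trace["steps"])] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical traces; real ones would carry far more detail.
    traces = [
        {"ok": False, "steps": ["plan", "search", "timeout"]},
        {"ok": True, "steps": ["plan", "search", "answer"]},
        {"ok": False, "steps": ["plan", "search", "timeout"]},
        {"ok": False, "steps": ["plan", "bad_tool_args"]},
    ]
    for path, count in failure_path_counts(traces):
        print(count, " -> ".join(path))
```

Ranking failure paths by frequency surfaces the recurring problems first, which is the shift in emphasis the item describes.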

