Audesso | Daily: AI

Alibaba Launches Qwen3.7-Max with Anthropic API Compatibility

00:00 / --:--

← Back to home

Alibaba Launches Qwen3.7-Max with Anthropic API Compatibility

1. Alibaba Launches Qwen3.7-Max with Anthropic API Compatibility

Alibaba Cloud announced the proprietary Qwen3.7-Max reasoning model at the Alibaba Cloud Summit. Concentrating performance gains on coding and scientific reasoning, the model scored 56.6 on the Artificial Analysis Intelligence Index. In addition to text inputs and extended-thinking reasoning steps, its support for the Anthropic API protocol lets developers immediately deploy it as an alternative backend for tools like Claude Code.

  • Features a 1M token context window and a 64K maximum output limit.
  • Supports the Anthropic API protocol directly, enabling use in Claude Code.
  • Pricing is set at $2.50 per 1M input tokens and $7.50 per 1M output tokens.
  • Demonstrated 35 hours of continuous autonomous execution with 1,158 tool calls in internal tests.
  • Currently proprietary and accessible only via Chinese-based endpoints.

Developers can integrate a highly capable agentic model into existing Claude Code workflows simply by swapping to Chinese-based Qwen3.7-Max endpoints.

2. CopilotKit Releases AIMock and AG-UI Tools for Agent Development

Seattle-based startup CopilotKit has introduced three vendor-neutral tools aimed at productionizing agent workflows. Developers can use AIMock to handle schema drift detection, chaos testing, and record-and-replay behaviors without incurring token costs or managing actual API keys. Additionally, the Pathfinder MCP server makes local documentation, codebases, and Notion pages queryable using hybrid vector and keyword retrieval.

  • AIMock simulates 11 LLM providers, MCP, vector databases, and search endpoints using single JSON configurations.
  • The AG-UI protocol enables software agents to stream UI, sync application states, and request human-in-the-loop approvals.
  • Pathfinder is a self-hosted MCP server with pluggable embeddings for air-gapped knowledge retrieval.
  • AG-UI is supported by major providers like Google and Microsoft, and frameworks like PydanticAI and LangChain.

The new releases provide a streamlined, zero-dependency way to mock entire agent calls across 11 LLM providers, accelerating test environments.

SOURCES

3. Runtime Launches Open-Source Sandboxed Agent Environments

Runtime (YC P26) addresses the security risks and configuration complexity of deploying agent tools like Claude Code, Cursor, and Devin. By abstracting sandbox orchestration, it enables teams to share secure preview URLs of agent builds. The system's network egress controls and role-based access control prevent accidental data leaks during agent executions.

  • Snapshots full running environments (multi-service Docker Compose, Kafka, Redis, databases) in milliseconds.
  • Orchestrates across Daytona, E2B, EC2, and self-hosted Kubernetes sandboxes.
  • Includes a managed proxy for secret injection, command allow/deny lists, and egress controls.
  • The core of the platform is open source, and a hosted tier with compute-only pricing is available.

It allows developers to run untrusted agent code in highly complex environments without exposing local systems or production clusters.

SOURCES

4. Daytona Pivots to Agent-Native Compute with Ultra-Fast Sandboxes

Daytona has transitioned from human development environments to agent-focused compute, targeting the performance limits of modern container orchestrators. CEO Ivan Burazin claims that standard solutions like Kubernetes are inadequate for agent workloads, prompting a custom architecture built on bare-metal and stateful snapshot techniques. The service is positioned to act as a utility API for secure code execution.

  • Provides ultra-fast 60ms sandbox startups to run agentic code execution.
  • Capable of scaling to 50,000 startups in 75 seconds and handles 850,000 daily runs.
  • Avoids Kubernetes, opting for bare metal orchestration and stateful snapshots.
  • Approximately 50% of the platform's current usage is driven by reinforcement learning workloads.

Developers building LLM agents that run code can utilize 60ms startup environments designed specifically to handle high-volume execution and evaluations.

SOURCES

5. Docusign Introduces MCP Server for Claude and Gemini Integrations

Docusign has launched a suite of developer tools designed for agentic agreement workflows. This release allows common AI agents to interact directly with Docusign APIs under a unified governance and security context. App developers can utilize these tools to let their LLMs autonomously query past agreements, manage metadata, and draft or route documents.

  • Includes a dedicated Model Context Protocol (MCP) Server for Docusign capabilities.
  • Features an Agreement Manager API and an Agent Studio environment.
  • Supports bulk document ingestion and agent governance based on agreement history.
  • Enables Claude and Gemini models to trigger agreement actions directly via natural language.

Developers can now build natural-language agents that manage, ingest, and query Docusign agreements using standard frameworks.

SOURCES

6. Rmux Brings Playwright-Style SDK Automation to Terminals

RMUX acts as a programmable layer for local and remote command-line environments. By matching tmux keybindings and commands, it functions as a drop-in replacement while exposing an asynchronous API for external orchestration. The project enables developers to script terminal interactions, verify outputs, and manage parallel sessions programmatically.

  • Written in Rust and features a tmux-compatible CLI supporting roughly 90 commands.
  • Includes an async Rust SDK that provides stable pane IDs and locator-style waits.
  • Runs natively on Linux, macOS, and Windows via ConPTY without requiring WSL.

Developers building terminal-running agents can programmatically capture and drive console applications with stable pane IDs and structured state snapshots.

SOURCES

7. llama.cpp Fixes VRAM Leak in Multi-Token Prediction Server

A significant memory leak affecting the llama.cpp server when using Multi-Token Prediction (MTP) architectures has been patched. Previously, the server failed to release speculative decoders and draft configurations upon entering sleep cycles, steadily consuming VRAM. The update enforces a clean resource destruction order to guarantee full VRAM reclamation.

  • Pull request #23461 explicitly resets speculative decoders, draft context, and draft models.
  • Fixes a bug where resources in server_context_impl's destroy() function were leaked.
  • Resolves out-of-memory crashes triggered by repeated sleep and resume cycles of llama-server.

Developers running local Qwen 3.6 or other MTP models can pull the latest update to prevent out-of-memory errors caused by failed cleanup cycles.

SOURCES

8. ik_llama.cpp Speeds Up Local MTP Inference on 12GB GPUs

A local hardware benchmark has demonstrated substantial speedups for Multi-Token Prediction (MTP) inference when using ik_llama.cpp over standard llama.cpp. By pairing an RTX 4070 Super GPU with an iGPU for system monitor tasks, developers can utilize the full 12GB of VRAM to host a quantized 35B parameter model locally. The configuration achieves highly responsive outputs suitable for real-time coding assistants.

  • Achieved 110.24 tokens per second on an RTX 4070 Super 12GB using ik_llama.cpp.
  • Standard llama.cpp achieved 89.76 tokens per second on the same hardware setup.
  • Used a Qwen3.6-35B-A3B-IQ4_XS model quantized to 4.19bpw.
  • Requires using --fit-margin adjustments to manage tight VRAM allocations.

Developers running local model environments can achieve a 23% speed improvement over standard llama.cpp implementations.

SOURCES

9. Delta-Mem Adds Lightweight Working Memory to AI Agents

Delta-mem introduces an alternative memory structure for autonomous agents, addressing context-window scaling limits. Rather than relying on retrieval-augmented generation (RAG) for behavioural history, this method compresses dynamic interaction logs into a fast associative matrix. The approach leaves the core model frozen, allowing quick, lightweight state updates over long-horizon tasks.

  • Adds only 0.12% of the backbone model's parameters, compared to 76.40% for MLP memory baselines.
  • Implements an Online State of Associative Memory (OSAM) to update state without modifying frozen LLM weights.
  • Achieved 51.66% on benchmarks using a Qwen3-4B-Instruct backbone, beating the Context2LoRA baseline.
  • Code is available on GitHub and trained weights are hosted on Hugging Face.

Developers can equip agents with a lightweight behavioural memory adapter that maintains a fixed GPU memory footprint even at 32,000-token context lengths.

SOURCES

10. ByteDance Releases Lance 3B Unified Multimodal Model

ByteDance has released Lance, a 3B activated parameter dual-stream mixture-of-experts model trained from scratch. Lance uses Modality-Aware Rotary Positional Encoding (MaPE) to segregate its generation and understanding pathways cleanly. Although it demands a high-memory developer GPU to run locally, it offers unified multi-modal processing without swapping discrete models.

  • Unified architecture for understanding, generating, and editing both images and video.
  • Released under the Apache 2.0 license with weights available on Hugging Face.
  • Requires a GPU with at least 40 GB of VRAM and CUDA 12.4 or higher.
  • Scores 0.90 on GenEval and 85.11 on VBench, the highest among current unified models.

Provides an open-weights, Apache 2.0 alternative for building multi-modal video and image applications.

SOURCES

Daily AI signal in your inbox

5 minutes a day. Free, unsubscribe anytime.